Why There's No Such Thing as Clean Data
Dear Advisor: How do I get clean data?
June 2021, updated June 2022
Every company has data challenges (no joke!). Throughout my career as a data expert, two questions have come up again and again (from founders and data professionals):
“I hear data professionals spend 90% of their time doing data cleaning -- how do I get clean data to work with from the very beginning?” and
"So you have your PhD in Statistics. You're a data expert. Can we get your stamp of approval that our data's perfect?"
These are great questions! I’m here to disappoint you... because I'm here to tell you... clean/perfect data just doesn't exist -- and that’s actually a good thing (!). Here’s why:
Best Case Scenario: Incomplete Data is Expected
It's expected because it reflects the underlying issues that your customers are experiencing -- and that’s a good thing.
Example: Customers can’t place orders. If you’re not tracking the customer journey in your app/product, you won’t know if there’s a bug in the sign-in page/checkout process.
Example: An IoT device got disconnected, so its data stops coming in.
In all other cases -- clean data is not a good thing.
Typical Scenario 1: Uninspected Data
You’re tracking seemingly everything, but you haven't been checking it as it comes in, from the very beginning, so you don't know how clean or complete it is. Jeff Wilke argues that "uninspected data is always wrong".
Example: Did you know that you could spell Philadelphia 57 different ways (when you were applying for the PPP loan)?
Example: When I was a management consultant at Slalom, one of our clients was a major retailer who struggled to target their repeat customers -- because they only tracked purchases, not customers.
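To make the Philadelphia example concrete, here is a minimal sketch of how spelling variants could be grouped under one canonical name using Python's standard-library difflib. The list of canonical cities, the 0.8 similarity cutoff, and the sample inputs are all assumptions for illustration, not data from the PPP loan applications.

```python
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical canonical names; a real list would come from a reference table.
CANONICAL_CITIES = ["Philadelphia", "Pittsburgh", "Harrisburg"]


def normalize_city(raw, cutoff=0.8):
    """Return the closest canonical city name, or None if nothing is close enough."""
    # Collapse extra whitespace and standardize capitalization before matching.
    cleaned = " ".join(raw.strip().split()).title()
    matches = get_close_matches(cleaned, CANONICAL_CITIES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def group_variants(raw_values):
    """Group raw spellings under the canonical name they map to (None = needs review)."""
    groups = defaultdict(set)
    for raw in raw_values:
        groups[normalize_city(raw)].add(raw)
    return dict(groups)


# Made-up inputs for illustration only.
sample = [
    "Philadelphia", "philadelphia ", "Philadephia",
    "Phildelphia", "PHILADELPHIA", "Pittsburg", "Xyz",
]
variants = group_variants(sample)
for canonical, spellings in variants.items():
    print(canonical, sorted(spellings))
```

Anything that lands under None has no close match and is worth a manual look. Fuzzy matching can also merge names that are genuinely different, so the groups are a starting point for inspection rather than a final answer.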
Typical Scenario 2: (Partially) Incomplete Data
On the flip side, is it possible that you're not tracking all of the product and customer touch-points?
Typical Scenario 3: Biased Data
Or you're only tracking certain (customer) events/outcomes?
Example: Do you track prospective customers’ activity on your website, before they create a log-in/sign-in?
Example: Not tracking which survey/promotion got sent to which customer, so you can't evaluate ROI on Marketing.
Typical Scenario 4: No Data
Or you don't know how to get started?
Other Reasons for "Weirdness" in Data:
In my experience, weirdness in the data is most commonly due to:
different teams having different definitions, or
things not getting connected correctly, or
things not getting synced at the same time, or
significant product changes, or
users who are part of multiple A/B tests at the same time.
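The last item in the list above is one of the easier ones to check for. As a rough sketch, assuming you can export test assignments as (user, test, start date, end date) records, you could flag users whose tests overlap in time. The record layout, test names, and dates below are hypothetical.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations

# Hypothetical assignment records: (user_id, test_name, start_date, end_date).
assignments = [
    ("u1", "checkout_button", date(2022, 5, 1), date(2022, 5, 14)),
    ("u1", "pricing_page", date(2022, 5, 10), date(2022, 5, 20)),
    ("u2", "checkout_button", date(2022, 5, 1), date(2022, 5, 14)),
    ("u2", "pricing_page", date(2022, 6, 1), date(2022, 6, 10)),
]


def overlapping_tests(records):
    """Map each user to the pairs of tests whose date ranges overlap."""
    by_user = defaultdict(list)
    for user_id, test, start, end in records:
        by_user[user_id].append((test, start, end))

    overlaps = {}
    for user_id, tests in by_user.items():
        pairs = [
            (a[0], b[0])
            for a, b in combinations(tests, 2)
            # Two closed date ranges overlap when each starts before the other ends.
            if a[1] <= b[2] and b[1] <= a[2]
        ]
        if pairs:
            overlaps[user_id] = pairs
    return overlaps


flagged = overlapping_tests(assignments)
print(flagged)
```

Here only u1 is flagged, because u2's two tests ran a few weeks apart. Whether an overlap is actually a problem depends on whether the tests touch the same part of the product.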
Recommended Next Steps
We know everyone’s data has problems. How long will it take to fix my (fixable) data issues? That depends… on what’s missing, what’s available to capture/ingest, your developers’ skill set, etc.
What does that actually look like? Using the following steps, you'll end up asking yourself "how can our data answer the business question?" -- not "how clean should our data be before we start making data-driven decisions?"
First, here's my advice for getting started making data-driven decisions ASAP -- ideally, with your alpha users, but it's not too late to start now.
Check for quality. Even if you have lots of data, I recommend this process:
Evaluate 1 customer’s journey or 1 asset’s touchpoints
Evaluate all of the events in a small unit of time
Check that everything is captured -- and is as expected
Try to understand all the workflows in your product/platform -- and see if you’re able to track everything
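The process above can be sketched in a few lines. This is only an illustration: the expected workflow steps, the event names, and the (customer, event, timestamp) log layout are assumptions, and your own product's workflows will differ.

```python
from datetime import datetime, timedelta

# Hypothetical expected workflow for one customer journey.
EXPECTED_STEPS = ["sign_in", "add_to_cart", "checkout_started", "order_placed"]

# Made-up event log: (customer_id, event_name, timestamp).
events = [
    ("c1", "sign_in", datetime(2022, 6, 1, 9, 0)),
    ("c1", "add_to_cart", datetime(2022, 6, 1, 9, 5)),
    ("c1", "order_placed", datetime(2022, 6, 1, 9, 12)),
    ("c2", "sign_in", datetime(2022, 6, 1, 9, 7)),
]


def journey_gaps(log, customer_id, expected=EXPECTED_STEPS):
    """Return the expected steps that were never recorded for one customer."""
    seen = {name for cid, name, _ in log if cid == customer_id}
    return [step for step in expected if step not in seen]


def events_in_window(log, start, minutes):
    """Return every event in [start, start + minutes), sorted by time."""
    end = start + timedelta(minutes=minutes)
    return sorted((e for e in log if start <= e[2] < end), key=lambda e: e[2])


# Step 1: evaluate one customer's journey.
missing = journey_gaps(events, "c1")
# Step 2: evaluate all of the events in a small unit of time.
window = events_in_window(events, datetime(2022, 6, 1, 9, 0), 10)

print("Missing for c1:", missing)
for customer_id, name, ts in window:
    print(ts, customer_id, name)
```

In this made-up log, customer c1 placed an order without a recorded checkout_started event, which points to a tracking gap rather than a drop-off. A missing step on its own can mean either, so it is a prompt to go look, not a verdict.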
Remember to start small and iterate based on what you learned, working towards having real-time data quality monitoring, validation, and alerting in place in case something happens -- and not towards perfectly clean data.
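As one small starting point for that kind of monitoring, here is a hedged sketch of a check that could run on each incoming batch of records and report any required field that is missing too often. The field names, the sample batch, and the 5% threshold are assumptions; in practice the alerts would go to whatever channel your team already watches.

```python
def missing_rate(batch, field):
    """Fraction of records in the batch where the field is absent or None."""
    if not batch:
        # An empty batch has nothing to measure; it may deserve its own alert.
        return 0.0
    missing = sum(1 for record in batch if record.get(field) is None)
    return missing / len(batch)


def check_batch(batch, required_fields, max_missing_rate=0.05):
    """Return one alert message per required field that exceeds the threshold."""
    alerts = []
    for field in required_fields:
        rate = missing_rate(batch, field)
        if rate > max_missing_rate:
            alerts.append(
                f"{field}: {rate:.0%} missing (limit {max_missing_rate:.0%})"
            )
    return alerts


# Made-up batch of purchase records for illustration only.
batch = [
    {"order_id": 1, "customer_id": "c1", "amount": 20.0},
    {"order_id": 2, "customer_id": None, "amount": 35.5},
    {"order_id": 3, "amount": 12.0},
    {"order_id": 4, "customer_id": "c4", "amount": 8.0},
]
alerts = check_batch(batch, ["order_id", "customer_id", "amount"])
for message in alerts:
    print("ALERT:", message)
```

The threshold is deliberately above zero: the goal is to notice when something changes, not to demand perfectly clean data.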
Do you need an expert to help you execute these steps, to help your company make better data-driven decisions? Please reach out.
You may also like:
(Step 1) Dear Advisor: How do I get Started on a Data Strategy from the Beginning?
(Step 2) Dear Advisor: I’d like to make data-driven decisions. Where do I start?
Dear Advisor: Who should be my first data hire?
The Data Quality Flywheel, by Michael Kaminski: https://www.datafold.com/blog/the-data-quality-flywheel/ -- which mentions dbt for evaluating data quality
4 Data Quality Best Practices to Help You Get to Data ROI Faster, by Zoe Hawkins