Dear Advisor: How do I get clean data?
Or: Why it's good that clean data does not exist
Throughout my career as a data expert, there’s one question that comes up again and again (from founders and data professionals):
“I hear data professionals spend 90% of their time doing data cleaning -- how do I get clean data to work with from the very beginning?”
It's a great question! I’m here to tell you -- you can’t -- and that’s actually a good thing (!). Here’s why:
Best Case Scenario: Incomplete Data is Expected
It's expected because it reflects the underlying issues that your customers are experiencing -- and that’s a good thing.
Example: Customers can’t place orders. If you’re not tracking the customer journey in your app/product, you won’t know if there’s a bug in the sign-in page/checkout process.
Example: An IoT device gets disconnected.
In all other cases -- clean data is not a good thing.
Typical Scenario 1: Uninspected Data
You’re tracking seemingly everything, but you're not checking it as it comes in, from the very beginning, so you don't know how clean or complete it is. Jeff Wilke argues that "uninspected data is always wrong".
Example: Did you know that Philadelphia was spelled 57 different ways on PPP loan applications?
Example: When I was a management consultant at Slalom, one of our clients was a major retailer who struggled to target their repeat customers -- because they only tracked purchases, not customers.
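The Philadelphia example above is the kind of problem a quick inspection can surface. Here is a minimal Python sketch, assuming you have a list of raw city values and a known correct spelling to compare against. The function name find_spelling_variants, the 0.8 similarity cutoff, and the sample values are all made up for illustration; the only library call is difflib.get_close_matches from the Python standard library.

```python
from collections import Counter
from difflib import get_close_matches


def find_spelling_variants(values, canonical, cutoff=0.8):
    """Return a Counter of values that look like misspellings of `canonical`."""
    # Lowercase and collapse whitespace so "Philadelphia " and "philadelphia" match.
    normalized = [" ".join(v.lower().split()) for v in values]
    counts = Counter(normalized)
    target = canonical.lower()
    # get_close_matches keeps candidates whose similarity ratio is >= cutoff.
    close = get_close_matches(
        target, list(counts), n=max(len(counts), 1), cutoff=cutoff
    )
    # Drop the correct spelling itself; what remains are the suspected variants.
    return Counter({name: counts[name] for name in close if name != target})


# Hypothetical city values, invented for this example.
cities = [
    "Philadelphia",
    "Philadephia",
    "philadelphia ",
    "Phila delphia",
    "Philidelphia",
    "Pittsburgh",
]
variants = find_spelling_variants(cities, "Philadelphia")
print(variants)
```

The cutoff is a judgment call: set it too low and unrelated cities get flagged, too high and real misspellings slip through. Treat the result as a list to review by hand, not an automatic correction.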
Typical Scenario 2: (Partially) Incomplete Data
On the flip side, is it possible that you're not tracking all of the product and customer touch-points?
Typical Scenario 3: Biased Data
Or you're only tracking certain (customer) events/outcomes?
Example: Do you track prospective customers’ activity on your website, before they create a log-in/sign-in?
Example: Not tracking which survey/promotion got sent to which customer, so you can't evaluate ROI on marketing.
Typical Scenario 4: No Data
Or you don't know how to get started?
Recommended Next Steps
We know everyone’s data has problems. How long will it take to fix my (fixable) data issues? That depends… on what’s missing, what’s available to capture/ingest, the developer skill set, etc.
What does that actually look like? Using the following steps, you'll end up asking yourself "how can our data answer the business question?" -- not "how clean should my data be before we start making data-driven decisions?"
First, here's my advice for getting started making data-driven decisions ASAP -- ideally, with your alpha users, but it's not too late to start now.
Check for quality. Even if you have lots of data, I recommend this process:
Evaluate 1 customer’s journey or 1 asset’s touchpoints
Evaluate all of the events in a small unit of time
Check that everything is captured -- and is as expected
Try to understand all the workflows in product/platform -- and see if you’re able to track everything
Remember to start small and iterate based on what you learn, working towards having real-time data quality monitoring, validation, and alerting in place -- in case something goes wrong
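To make the process above concrete, here is a minimal Python sketch of the first few steps: pick one customer, pick one small window of time, and check whether every step of a workflow was captured. Everything here is an assumption for illustration: the event format (dicts with customer_id, event, and timestamp), the step names in EXPECTED_STEPS, and the function name missing_steps. Swap in whatever your own product actually logs.

```python
from datetime import datetime, timedelta

# Hypothetical steps of a checkout workflow; replace with your product's own.
EXPECTED_STEPS = ["sign_in", "add_to_cart", "checkout", "order_confirmed"]


def missing_steps(events, customer_id, start, window=timedelta(hours=1)):
    """List expected steps with no event for one customer inside one time window."""
    end = start + window
    # Collect the event names seen for this customer within [start, end).
    seen = {
        e["event"]
        for e in events
        if e["customer_id"] == customer_id and start <= e["timestamp"] < end
    }
    return [step for step in EXPECTED_STEPS if step not in seen]


# Made-up events, for illustration only.
events = [
    {"customer_id": "c1", "event": "sign_in", "timestamp": datetime(2024, 1, 1, 9, 0)},
    {"customer_id": "c1", "event": "add_to_cart", "timestamp": datetime(2024, 1, 1, 9, 5)},
    {"customer_id": "c1", "event": "checkout", "timestamp": datetime(2024, 1, 1, 9, 7)},
    {"customer_id": "c2", "event": "sign_in", "timestamp": datetime(2024, 1, 1, 9, 2)},
]

gaps = missing_steps(events, "c1", start=datetime(2024, 1, 1, 9, 0))
print(gaps)
```

A missing step is a prompt to investigate, not a verdict: the customer may have abandoned the cart, or the confirmation event may not be tracked at all. Once a check like this is useful for one customer, it can be run on a schedule and wired to an alert.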
Do you need an expert to help you execute these steps, to help your company make better data-driven decisions? Please reach out.
You may also like:
Dear Advisor: Who should be my first data hire?
The Data Quality Flywheel, by Michael Kaminski: https://www.datafold.com/blog/the-data-quality-flywheel/ -- which mentions dbt for evaluating data quality
4 Data Quality Best Practices to Help You Get to Data ROI Faster, by Zoe Hawkins