Dear Advisor: How do I get clean data?

Or: Why it's good that clean data does not exist

Throughout my career as a data expert, one question has come up again and again, from founders and data professionals alike:

“I hear data professionals spend 90% of their time doing data cleaning -- how do I get clean data to work with from the very beginning?”

It's a great question! I’m here to tell you -- you can’t -- and that’s actually a good thing (!). Here’s why:

Best Case Scenario: Incomplete Data is Expected

It's expected because it reflects the underlying issues that your customers are experiencing -- and that’s a good thing.

  • Example: Customers can’t place orders. If you’re not tracking the customer journey in your app/product, you won’t know if there’s a bug in the sign-in page/checkout process.

  • Example: An IoT device got disconnected, so a gap appears in its readings.

In all other cases -- clean data is not a good thing.

Typical Scenario 1: Uninspected Data

You’re tracking seemingly everything but, from the very beginning, you haven't been checking the data as it comes in, so you don't know how clean or complete it is. Jeff Wilke argues that "uninspected data is always wrong".

  • Example: Did you know that Philadelphia was spelled 57 different ways on PPP loan applications?

  • Example: When I was a management consultant at Slalom, one of our clients was a major retailer who struggled to target their repeat customers -- because they only tracked purchases, not customers.
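
To make the spelling example concrete, here is a minimal sketch of how you might inspect a free-text city field as it comes in. Everything in it is an assumption for illustration: the sample values are made up (they are not real PPP loan data), the list of canonical names and the 0.8 similarity threshold are arbitrary, and the helper names are hypothetical. It uses Python's standard `difflib` module to group raw spellings under the closest known city and sets aside anything it can't match for manual review.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical free-text values, invented for illustration only.
raw_cities = [
    "Philadelphia", "philadelphia ", "Phila.", "Philadelpia",
    "PHILADELPHIA", "Pittsburgh", "Philidelphia",
]

# Assumed list of known-good names to match against.
CANONICAL = ["Philadelphia", "Pittsburgh"]


def normalize(value):
    """Lower-case the value and keep letters only."""
    return "".join(ch for ch in value.casefold() if ch.isalpha())


def closest_canonical(value, threshold=0.8):
    """Return the most similar canonical name, or None if nothing is close enough."""
    cleaned = normalize(value)
    best_name, best_score = None, 0.0
    for name in CANONICAL:
        score = SequenceMatcher(None, cleaned, normalize(name)).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


def spelling_variants(values):
    """Group the distinct raw spellings under their closest canonical name."""
    variants = defaultdict(set)
    for value in values:
        variants[closest_canonical(value)].add(value)
    return dict(variants)


variants = spelling_variants(raw_cities)
for name, spellings in variants.items():
    label = name if name is not None else "(unmatched, review manually)"
    print(label, len(spellings), sorted(spellings))
```

Abbreviations such as "Phila." fall below the threshold and land in the unmatched group. That is the point of inspecting: a person decides what to do with them, rather than the pipeline guessing silently.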

Typical Scenario 2: (Partially) Incomplete Data

On the flip side, is it possible that you're not tracking all of the product and customer touch-points?

Typical Scenario 3: Biased Data

Or are you only tracking certain (customer) events/outcomes?

  • Example: Do you track prospective customers’ activity on your website, before they create a log-in/sign-in?

  • Example: Do you track which survey/promotion got sent to which customer, so that you can evaluate ROI on Marketing?

Typical Scenario 4: No Data

Or do you not know how to get started?

Recommended Next Steps

  • We know everyone’s data has problems. How long will it take to fix my (fixable) data issues? That depends… on what’s missing, what's available to capture/ingest, your developers' skill set, etc.

  • What does that actually look like? Using the following steps, you'll end up asking yourself "how can our data answer the business question?" -- not "how clean should my data be before we start making data-driven decisions?"

  1. First, get started making data-driven decisions ASAP -- ideally with your alpha users, but it's not too late to start now.

  2. Check for quality. Even if you have lots of data, I recommend this process:

      • Evaluate 1 customer’s journey or 1 asset’s touchpoints

      • Evaluate all of the events in a small unit of time

      • Check that everything is captured -- and is as expected

  3. Try to understand all the workflows in your product/platform -- and see if you’re able to track everything

  4. Remember to start small and iterate based on what you learn, working towards having real-time data quality monitoring, validation, and alerting in place -- in case something goes wrong
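
Here is one way the quality check in step 2 could look in practice. Treat it as a sketch under stated assumptions, not a definitive implementation: the event log, the field names (`customer_id`, `event`, `timestamp`), and the expected journey steps are all hypothetical, so swap in whatever your product actually emits. It follows one customer's journey, pulls all events in a small time window, and flags records with missing fields.

```python
from datetime import datetime, timedelta

# Hypothetical funnel and required fields; adjust to your own product.
EXPECTED_JOURNEY = ["sign_in", "add_to_cart", "checkout", "order_placed"]
REQUIRED_FIELDS = ("customer_id", "event", "timestamp")

# Made-up event log, for illustration only.
events = [
    {"customer_id": "c1", "event": "sign_in", "timestamp": datetime(2024, 1, 1, 9, 0)},
    {"customer_id": "c1", "event": "add_to_cart", "timestamp": datetime(2024, 1, 1, 9, 5)},
    {"customer_id": "c1", "event": "order_placed", "timestamp": datetime(2024, 1, 1, 9, 9)},
    {"customer_id": "c2", "event": "sign_in", "timestamp": None},
]


def missing_steps(events, customer_id):
    """Evaluate one customer's journey: which expected steps were never recorded?"""
    seen = {e["event"] for e in events if e.get("customer_id") == customer_id}
    return [step for step in EXPECTED_JOURNEY if step not in seen]


def events_in_window(events, start, length):
    """Evaluate all events in a small unit of time: [start, start + length)."""
    end = start + length
    return [
        e for e in events
        if e.get("timestamp") is not None and start <= e["timestamp"] < end
    ]


def incomplete_records(events):
    """Check that everything is captured: flag records with a missing field."""
    return [e for e in events if any(e.get(f) is None for f in REQUIRED_FIELDS)]


print("c1 is missing:", missing_steps(events, "c1"))
print("events in first 6 minutes:",
      len(events_in_window(events, datetime(2024, 1, 1, 9, 0), timedelta(minutes=6))))
print("incomplete records:", len(incomplete_records(events)))
```

In this made-up log, customer c1 placed an order without a recorded checkout event. That is exactly the kind of gap worth investigating: either tracking is broken on that page, or there is a path through the product you didn't know about. Once checks like these are useful by hand, running them on a schedule and alerting on the results is a natural next iteration.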

Do you need an expert to help you execute these steps, to help your company make better data-driven decisions? Please reach out.
