Why There's No Such Thing as Clean Data

Dear Advisor: How do I get clean data?

June 2021, updated June 2022

Every company has data challenges (no joke!). Throughout my career as a data expert, two questions have come up again and again, from founders and data professionals alike:

“I hear data professionals spend 90% of their time doing data cleaning -- how do I get clean data to work with from the very beginning?” and

“So you have your PhD in Statistics. You're a data expert. Can we get your stamp of approval that our data's perfect?”

These are great questions! I’m here to disappoint you... because I'm here to tell you... clean/perfect data just doesn't exist -- and that’s actually a good thing (!). Here’s why:

Best Case Scenario: Incomplete Data is Expected

It's expected because it reflects the underlying issues that your customers are experiencing -- and that’s a good thing.

  • Example: Customers can’t place orders. If you’re not tracking the customer journey in your app/product, you won’t know if there’s a bug in the sign-in page/checkout process.

  • Example: An IoT device got disconnected.

In all other cases -- clean data is not a good thing.

Typical Scenario 1: Uninspected Data

You’re tracking seemingly everything, but you haven't been checking it as it comes in from the very beginning, so you don't know how clean or complete it is. Jeff Wilke argues that "uninspected data is always wrong".

  • Example: Did you know that you could spell Philadelphia 57 different ways (when you were applying for the PPP loan)?

  • Example: When I was a management consultant at Slalom, one of our clients was a major retailer who struggled to target their repeat customers, because they only tracked purchases, not customers.
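To make the spelling example concrete, here is a minimal sketch of what "inspecting" a free-text field could look like. The sample values, the `normalize` and `variants_of` helpers, and the 0.8 similarity cutoff are all assumptions made for illustration; they are not taken from the PPP data or from any particular tool.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical free-text city values, as a form might capture them.
raw_cities = [
    "Philadelphia", "philadelphia ", "Phila",
    "Philadelpia", "PHILADELPHIA", "Pittsburgh",
]

def normalize(value):
    """Lowercase and collapse whitespace so trivial variants merge."""
    return " ".join(value.lower().split())

def variants_of(canonical, values, cutoff=0.8):
    """Count normalized values that look like spellings of `canonical`."""
    counts = Counter(normalize(v) for v in values)
    close = get_close_matches(
        normalize(canonical), list(counts), n=len(counts), cutoff=cutoff
    )
    return Counter({name: counts[name] for name in close})

variants = variants_of("Philadelphia", raw_cities)
for name, n in variants.most_common():
    print(f"{name!r}: {n}")
```

Note that an abbreviation like "Phila" falls below this cutoff and is not flagged, which is a reminder that a check like this supplements looking at the data and does not replace it.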

Typical Scenario 2: (Partially) Incomplete Data

On the flip side, is it possible that you're not tracking all of the product and customer touch-points?

Typical Scenario 3: Biased Data

Or you're only tracking certain (customer) events/outcomes?

  • Example: Do you track prospective customers’ activity on your website, before they create a log-in/sign-in?

  • Example: Do you track which survey/promotion got sent to which customer, so that you can evaluate the ROI on marketing?

Typical Scenario 4: No Data

Or you don't know how to get started?

Other Reasons for "Weirdness" in Data:

In my experience, weirdness in the data is most commonly due to:

  • different teams having different definitions, or

  • things not getting connected correctly, or

  • things not getting synced at the same time, or

  • significant product changes, or

  • users who are part of multiple A/B tests at the same time.
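The last item is one of the easier ones to check for. Below is a rough sketch, assuming you keep an assignment log with one row per user and experiment; the row layout, the experiment names, and the `overlapping_experiments` helper are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical assignment log: (user_id, experiment, start_date, end_date).
assignments = [
    ("u1", "new_checkout", date(2022, 5, 1), date(2022, 5, 14)),
    ("u1", "price_banner", date(2022, 5, 10), date(2022, 5, 20)),
    ("u2", "new_checkout", date(2022, 5, 1), date(2022, 5, 14)),
    ("u2", "price_banner", date(2022, 5, 15), date(2022, 5, 20)),
]

def overlapping_experiments(rows):
    """Return {user_id: [(exp_a, exp_b), ...]} for overlapping date ranges."""
    by_user = defaultdict(list)
    for user_id, experiment, start, end in rows:
        by_user[user_id].append((experiment, start, end))

    overlaps = {}
    for user_id, exps in by_user.items():
        pairs = [
            (a[0], b[0])
            for i, a in enumerate(exps)
            for b in exps[i + 1:]
            # Two closed date ranges overlap when each starts before the other ends.
            if a[1] <= b[2] and b[1] <= a[2]
        ]
        if pairs:
            overlaps[user_id] = pairs
    return overlaps

overlaps = overlapping_experiments(assignments)
print(overlaps)
```

Here only u1 is flagged, because u2's second test starts the day after the first one ends. Whether an overlap actually matters depends on whether the two tests touch the same part of the product.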

Recommended Next Steps

  • We know everyone’s data has problems. How long will it take to fix my (fixable) data issues? That depends… on what’s missing, what's available to capture/ingest, your developers' skill set, etc.

  • What does that actually look like? Using the following steps, you'll end up asking yourself "how can our data answer the business question?" -- not "how clean should my data be before we start making data-driven decisions?"

  1. First, here's my advice for getting started making data-driven decisions ASAP -- ideally, with your alpha users, but it's not too late to start now.

  2. Check for quality. Even if you have lots of data, I recommend this process:

      • Evaluate 1 customer’s journey or 1 asset’s touchpoints

      • Evaluate all of the events in a small unit of time

      • Check that everything is captured -- and is as expected.

  3. Try to understand all the workflows in your product/platform -- and see if you’re able to track everything

  4. Remember to start small and iterate based on what you learned, working towards having real-time data quality monitoring, validation, and alerting in place in case something happens -- and not towards perfectly clean data.
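As a starting point for step 2, the "one customer's journey" check can be sketched in a few lines. This assumes your events are available as records with a customer ID, an event name, and a timestamp; the funnel steps, field names, and the `audit_journey` helper are made up for illustration, so substitute whatever your product actually emits.

```python
from datetime import datetime

# Hypothetical funnel: events a completed order is expected to produce, in order.
EXPECTED_STEPS = ["sign_in", "add_to_cart", "checkout_started", "order_placed"]

# Hypothetical event log for a single customer.
events = [
    {"customer_id": "c42", "event": "sign_in", "ts": datetime(2022, 6, 1, 9, 0)},
    {"customer_id": "c42", "event": "add_to_cart", "ts": datetime(2022, 6, 1, 9, 3)},
    {"customer_id": "c42", "event": "order_placed", "ts": datetime(2022, 6, 1, 9, 7)},
]

def audit_journey(customer_id, event_log, expected_steps):
    """Report which expected steps are missing or out of order for one customer."""
    journey = sorted(
        (e for e in event_log if e["customer_id"] == customer_id),
        key=lambda e: e["ts"],
    )
    seen = [e["event"] for e in journey]
    missing = [step for step in expected_steps if step not in seen]
    # First occurrence of each expected step, in the order it actually happened.
    observed = list(dict.fromkeys(s for s in seen if s in expected_steps))
    in_order = observed == [s for s in expected_steps if s in observed]
    return {"missing": missing, "in_order": in_order, "n_events": len(journey)}

report = audit_journey("c42", events, EXPECTED_STEPS)
if report["missing"] or not report["in_order"]:
    # In a real setup this print would be an alert to whoever owns the tracking.
    print(f"ALERT c42: missing={report['missing']} in_order={report['in_order']}")
```

In this sample the order was placed but "checkout_started" was never recorded, which points to a tracking gap and not to a customer problem. Running the same check on a schedule across a sample of customers is one small step towards the monitoring and alerting described in step 4.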

Do you need an expert to help you execute these steps, to help your company make better data-driven decisions? Please reach out.
