Dear Advisor: How do I avoid the biggest Data/AI/ML mistakes others make?
This post was originally published in June 2020 and was updated in August 2020, September 2020, April 2021, and April 2022 for relevancy.
You keep hearing that you need big data, the media tells you that AI can accomplish anything, and vendors tell you that things just work out-of-the-box. You don't know what you don't know -- and you don't have the data and analytics expertise.
I've spent 10 years in the industry solving customer pain points with data and analytics to drive product-market fit, and have 4 inventions of novel AI algorithms under my belt -- inventions that brought in millions of dollars in revenue. I'm here to tell you that developing data products with high business impact is not trivial. And developing novel algorithms may take years (and a bit of luck!) to arrive at something that works for a very specific use case, when other options fail.
If you're not a data/analytics/AI expert, here are the most common recurring challenges I've seen when it comes to product scope and development of data-driven products.
Not answering the Business Question
Doing data processing or creating predictive models without understanding how the result will solve the business/customer's pain point, including how it will be used by the stakeholder to solve that pain point.
Assessment: Do you know how the output of the ML model will be used to specifically answer the business/customer's pain point?
Assessment: Are you trying to do ML without a question that the business needs help answering?
Implementing a data warehouse as the first data-related task, without understanding how the data will be used to help the business make better, data-driven decisions.
Assessment: Is there an existing POC deliverable that suggests there's high value in existing data -- and the workflow "just" needs to be more streamlined?
Missing Information -- all of the above and:
It’s impossible to get insights from data you didn’t collect. Are you tracking everything about your business?
Having mentored, advised, and consulted for 50+ companies that are (or are trying to be) data-driven, I'd argue that a company should have a (loose) data strategy ASAP -- ideally, with its first alpha users -- because it becomes exponentially harder and more expensive to catch up.
Assessment: Do you know how many active users there are on your platform today?
Assessment: What is your product market fit baseline?
(On the flip side) Tracking virtually everything about the business, but across many different data providers and vendors that don't talk to one another.
Assessment: Do you know how many new active users there are on your platform today?
Too early for ML -- all of the above and:
Starting with ML/AI (e.g. predictive analytics) to automate insights/forecasts, before understanding what's happening with customers and product historically and now (e.g. descriptive analytics).
Assessment: Is a Data Scientist one of your first technical hires?
Assessment: Do you currently know who cancelled your service within the last week, and how you acquired that customer in the first place?
Your MVP is still under development.
See: Dear Advisor: What should (not) be your AI roadmap? (or) Why You Don't Need AI in your SaaS MVP
Treating ML/analytics as a Silver Bullet -- all of the above and:
Doing analytics for the sake of analytics, without understanding how the customer actions generated the data you see -- and how the deliverable will solve the business/customer's pain point.
Assessment: Are you trying to do ML for a process that you don't understand?
Inadvertently scoping an MVP for a product/feature that would require beating the state of the art in ML.
Example: Many pitch decks that end with "... and we'll use AI to do this" :(
Buying a multi-year software vendor license without doing an internal POC to see if the software actually solves the problem you bought it for.
Assessment: What's not working now? What does the software need to provide in the short term? In the long term?
Executing on ML products as if they're software engineering tasks -- all of the above and:
Thinking you have clean data :)
Assuming that the data won't change :(
For possible causes of changes in the data, please see my blog post: ML in Production is doing WORSE than locally
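To make the "clean data" point concrete, here is a minimal sanity-check sketch. The column names, the `-1` sentinel, and the specific checks are illustrative assumptions, not a prescription -- the point is that even a toy dataset fails basic checks:

```python
import pandas as pd

def sanity_check(df: pd.DataFrame) -> dict:
    """Run a few basic data-quality checks and return the findings."""
    return {
        "n_rows": len(df),
        "n_duplicate_rows": int(df.duplicated().sum()),
        "null_counts": df.isna().sum().to_dict(),
        # Domain rule (assumed): age must be non-negative.
        "negative_ages": int((df["age"] < 0).sum()),
    }

# Toy "user events" data with typical real-world flaws.
events = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "age": [34, -1, -1, 27],          # -1 used as a sentinel for "unknown"
    "plan": ["pro", None, None, "free"],
})

report = sanity_check(events)
print(report)
```

Running a check like this on every new data delivery (not just once) is what catches the "data won't change" assumption when it eventually breaks.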
Developing biased models because of biased data or 4 other reasons -- or the 21 (sometimes contradictory) definitions of fairness.
Example: IBM, Amazon and Microsoft stop selling "general purpose" facial recognition (2020)
Example: Gender-recognition technology and its bias (2019)
Example: Amazon's AI recruiting tool showed bias against women (2018)
Not treating ML as data products that you scope down and iterate over, from proof-of-concept (POC) to v1, v2, etc.
Assessment: Do you have an ad-hoc/simple model that answers your business question, that you can compare + evaluate the next iteration of the ML model against?
Recommendation: For each POC, time-box the data exploration stage to help you scope down (and scope out) the next phase.
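One way to keep each iteration honest against a simple baseline is a walk-forward comparison on held-out history. A minimal sketch -- the toy signup numbers and the two models (a naive "tomorrow looks like today" baseline vs. a moving-average "v1") are assumptions for illustration:

```python
def naive_forecast(history):
    """Baseline: tomorrow looks like today."""
    return history[-1]

def moving_avg_forecast(history, window=3):
    """Candidate 'v1': average of the last `window` points."""
    return sum(history[-window:]) / window

def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

signups = [10, 12, 11, 13, 15, 14, 16, 18]   # toy weekly signups

naive_errs, ma_errs = [], []
for t in range(3, len(signups)):             # walk forward, one step at a time
    history, actual = signups[:t], signups[t]
    naive_errs.append(actual - naive_forecast(history))
    ma_errs.append(actual - moving_avg_forecast(history))

print("baseline MAE: ", mae(naive_errs))     # 1.8
print("candidate MAE:", mae(ma_errs))        # 2.2 -- the baseline wins here
```

Ship the next iteration only if it beats the baseline. In this toy run the naive baseline wins, which is exactly the signal the assessment above is asking for before investing in a fancier model.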
Asking for a guarantee on ML model performance (based on ML metrics, KPI, etc.):
Guarantee that offline model performance will be at least X -- which is impossible to guarantee because it depends on data quality
Guarantee that live model performance will be just as good as, or better than, the offline model -- which is also impossible to guarantee, because performance in production depends on many things staying the same, or
Guarantee that a live model will never need to be updated
Not knowing about the Hidden Technical Debt in Machine Learning Systems
Assessment: Do you set aside time to tackle software engineering and machine learning debt?
Not understanding what the algorithm is doing.
Assessment (start-up): Can you give a 1-2 sentence overview of what the algorithm is doing?
Assessment (data scientist): Can you give a 1-2 sentence non-technical overview of what the algorithm is doing? Why did you pick that one? And why did you choose the parameters you did?
Not executing on (aspects of) ML products as if they're software engineering tasks -- all of the above and:
Not testing and monitoring ML products in production
Assessment: How high do you score on the ML Test?
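A lightweight example of the kind of production monitoring the ML Test rewards: compare the live prediction distribution to what the model saw offline, and alert on large shifts. The z-score rule, threshold, and toy scores below are illustrative assumptions; real systems often use proper statistical tests or dedicated monitoring tools:

```python
from statistics import mean, stdev

def drift_alert(offline_scores, live_scores, z_threshold=3.0):
    """Return (alert, z): alert if the live mean drifts more than
    `z_threshold` offline standard deviations from the offline mean."""
    mu, sigma = mean(offline_scores), stdev(offline_scores)
    z = abs(mean(live_scores) - mu) / sigma
    return z > z_threshold, z

offline = [0.45, 0.50, 0.55, 0.50, 0.48, 0.52]  # model scores at evaluation time
live = [0.80, 0.85, 0.90]                       # scores observed in production

alert, z = drift_alert(offline, live)
print(f"drift alert: {alert} (z = {z:.1f})")    # prints: drift alert: True (z = 10.3)
```

Wiring a check like this into your monitoring is a cheap first step toward catching the "performance in production depends on many things staying the same" failure mode before your customers do.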
Difficulty hiring data professionals
See: Who should be my first data hire? -- to help you:
Align the job title with the description and requirements, and/or
Align the job title/description with how the role actually fills the needs of the team, including required software packages and a 30/60/90 plan of what the expectations and deliverables look like.
Nobody is perfect, has clean data, or has ML running with no downtime in production. Now that you know what to focus on, start small and iterate.
Do you need an expert to help you improve your product market fit and scale by leveraging data to make your customers happier? Please reach out.
Keywords: AI, ML, start-ups, data strategy, data products, customer understanding
You may also like:
What we talk about when we talk about The Algo, by Vicki Boykis
Why Hiring a Data Analyst Won’t Solve Your Business Problems, by Gabi Steele, Leah Weiss, Barr Moses
The Infinite Conflict of Hiring, by Gigi Levy-Weiss
Scaling Data: Data Informed to Data Driven to Data Led, by Brian Balfour