Six Steps To Check And Fix Data Quality

Zehra Cataltepe is the CEO of TAZI.AI, an adaptive, explainable machine learning platform. She has more than 100 papers and patents on ML.

If you often have the thought, “I don’t think that my data is good enough,” these six steps, based on my experience, can help:

1. Task And Urgency

First of all, data quality should always be measured in relation to the task at hand. What do you want to do with that data? Do you want to create a model to predict the purchasing or retention behavior of your customers? Do you need to determine where you are losing good clients to the competition? Do you want to predict the benefits of your investment in a new external dataset?

What kind of data is required, and at what level of quality, depends on the problem you are trying to solve. Do not think of the problem definition, including the label you want to predict, as carved in stone; it will evolve as you look deeper into your data and solutions.

How urgent is the task? If the variable that has a quality problem is essential for a model that you use daily, then the urgency is clear. If you measure not only the machine learning metrics of a model but also its business benefits, then you can measure the monetary impact when a variable has a quality problem. The more dollars at stake, the more urgent the task.

2. Continuous Monitoring

Data quality and behavior are not constant. They vary over time for each variable, and they also vary across your customer or product portfolio. You should continuously measure data quality in both time and space.
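As a minimal sketch of measuring quality "in both time and space," the snippet below computes the missing-value rate of one variable per month and per customer segment, then flags the groups that exceed a threshold. The field names (`month`, `segment`, `income`), the sample records and the 20% threshold are assumptions made up for illustration, not values from any real system.

```python
from collections import defaultdict

def missing_rate_by_group(records, field, time_key="month", segment_key="segment"):
    """Return {(time, segment): fraction of records where `field` is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        group = (rec[time_key], rec[segment_key])
        totals[group] += 1
        if rec.get(field) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

def flag_groups(rates, threshold=0.2):
    """List the groups whose missing rate exceeds the (assumed) threshold."""
    return sorted(group for group, rate in rates.items() if rate > threshold)

# Hypothetical records: field names and values are illustrative only.
records = [
    {"month": "2023-01", "segment": "retail", "income": 52000},
    {"month": "2023-01", "segment": "retail", "income": 48000},
    {"month": "2023-01", "segment": "corporate", "income": 91000},
    {"month": "2023-02", "segment": "retail", "income": None},
    {"month": "2023-02", "segment": "retail", "income": 50000},
    {"month": "2023-02", "segment": "corporate", "income": None},
]

rates = missing_rate_by_group(records, "income")
alerts = flag_groups(rates, threshold=0.2)
```

Here the January groups are clean, while both February groups are flagged. Running the same check on a schedule, for each variable that matters, is one simple way to notice when and where quality degrades.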

3. Remedy At Hand

You need quick remedies, a data bridge of sorts, for when some portion of the data is not good enough and a downstream automated process needs that data.

The remedy could be using sensor replacement, using ensembles of local models such as boosted decision trees, using rules or asking for human intervention. I have seen many creative solutions to fill in the missing values that only a domain expert would consider.
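One possible way to organize such remedies is an ordered fallback chain: use the original value if present, otherwise try each remedy in turn, and hand the record to a human when nothing works. The sketch below assumes a hypothetical temperature reading with a backup sensor and a "last known value" rule; all names are invented for illustration, and a model-based remedy could be added to the list in the same way.

```python
def fill_with_fallbacks(record, field, remedies):
    """Return (value, source) using the first remedy that yields a value.

    The original value is used as-is when present. If every remedy fails,
    the record is marked for human review instead of being guessed.
    """
    if record.get(field) is not None:
        return record[field], "original"
    for name, remedy in remedies:
        value = remedy(record)
        if value is not None:
            return value, name
    return None, "needs_human_review"

# Hypothetical remedies for a missing temperature reading.
def backup_sensor(record):
    return record.get("temperature_backup")

def last_known_value(record):
    return record.get("temperature_previous")

remedies = [
    ("backup_sensor", backup_sensor),        # sensor replacement
    ("last_known_value", last_known_value),  # simple rule
]

readings = [
    {"temperature": 21.5},
    {"temperature": None, "temperature_backup": 21.9},
    {"temperature": None, "temperature_previous": 20.7},
    {"temperature": None},
]
results = [fill_with_fallbacks(r, "temperature", remedies) for r in readings]
```

Returning the source alongside the value keeps a record of which remedy was applied, so downstream users can see which numbers were measured and which were filled in.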

4. Helping Business Experts With Better Data

Whatever the quality level of your data, the business experts in your organization are already making decisions based on that data. So, if you don’t think that your data is good enough, you need to quickly identify exactly where and when so that you can notify those users of the limitations of their data.

5. People Who Own The Data

Data quality is not really about the data; it is about the people. In my opinion, organizations that value their people, partners and vendors have better data than others, because supplying and keeping good data is not a luxury but a responsibility in today’s data-driven world.

People understand and fulfill these responsibilities better if they own the business as opposed to only working there. Continuously documenting your data quality and practices is a good idea, especially if turnover is high among the people who manage data quality.

6. Understand The Current Human Expert Decision Making With Just OK Or Bad Data

Finally, when expert business people make decisions, they look at the data and make corrections on the data and/or the models in their heads and then make decisions. Experts may change either data or their decision-making if there are specific data quality problems.

Data quality checks, fix mechanisms and analytics/machine learning models can benefit a lot from understanding exactly how human experts behave under different data-quality scenarios. If data quality and machine learning models are presented to human domain experts continuously, through understandable, accessible and interactive interfaces, then those experts may be able to articulate and document their decisions under low data quality scenarios.

Conclusion

There are a lot of different data quality problems and many ways to handle them. It’s important to keep learning about the data quality issues you face and about which solutions work and which don’t.

PS: I would like to share some technical details here to complete the story.

Some data quality measures, such as on-time availability and missing or malformed values for each data field (also called feature, variable or input), are commonly known and are independent of the problem you are solving.
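As an illustration of these problem-independent measures, the sketch below reports the fraction of missing and malformed values per feature. The feature names, the sample rows and the validation rules (ISO dates, non-negative spend) are assumptions chosen for the example; on-time availability is not covered here.

```python
from datetime import datetime

def quality_report(rows, validators):
    """Per-feature missing and malformed fractions.

    `validators` maps a feature name to a function that returns True for a
    well-formed value. A value counts as missing if it is None or "".
    """
    n = len(rows)
    if n == 0:
        return {}
    report = {}
    for feature, is_valid in validators.items():
        missing = malformed = 0
        for row in rows:
            value = row.get(feature)
            if value is None or value == "":
                missing += 1
            elif not is_valid(value):
                malformed += 1
        report[feature] = {"missing": missing / n, "malformed": malformed / n}
    return report

def is_iso_date(value):
    """True if value is a string in YYYY-MM-DD form."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False

def is_nonnegative_number(value):
    """True for ints and floats (not booleans) that are zero or greater."""
    return (isinstance(value, (int, float))
            and not isinstance(value, bool)
            and value >= 0)

# Hypothetical customer rows, for illustration only.
rows = [
    {"signup_date": "2022-11-03", "monthly_spend": 40.0},
    {"signup_date": "03/11/2022", "monthly_spend": 35.5},  # wrong date format
    {"signup_date": None, "monthly_spend": -5.0},          # missing, negative
    {"signup_date": "2023-01-15", "monthly_spend": None},  # missing spend
]
report = quality_report(
    rows,
    {"signup_date": is_iso_date, "monthly_spend": is_nonnegative_number},
)
```

In this sample, each feature has one missing and one malformed value out of four rows.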

On the other hand, understanding the relevance of data—i.e., the predictive power of data for the specific business problem—is a bit more involved. To measure relevance, you will need an expression of the business problem using a target (also called label, outcome or output), a classification/regression label (such as churn or customer lifetime value) as well as an understanding of the detection/prediction/prescription problem (such as how customers churned, who will churn next month and how to prevent churn).

When measuring relevance, use a nonlinear measure, such as mutual information. Linear correlation can miss important variables that affect your label nonlinearly. Prediction or prescription problems may require more or higher-quality data than detection problems: in detection mode you are only trying to understand how and why the label occurs, whereas prediction and prescription require you to forecast the future or decide what to do to change it.
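To make the difference concrete, here is a small sketch on synthetic data in which the label depends on the feature only through its magnitude. It compares Pearson correlation with a simple histogram-based estimate of mutual information. The data, the choice of four equal-width bins and the helper names are assumptions for illustration; binned estimates depend on the number of bins, and dedicated library estimators are usually preferable in practice.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson linear correlation coefficient between two numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def equal_width_bins(xs, n_bins):
    """Map each value to a bin index in 0..n_bins-1 (equal-width bins)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(xs)
    return [min(int((x - lo) / width), n_bins - 1) for x in xs]

def mutual_information(a, b):
    """Mutual information in bits between two discrete sequences,
    estimated from empirical frequencies."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )

# Synthetic example: the label is 1 when the feature is far from zero in
# either direction, so the relationship is strong but not linear.
xs = list(range(-50, 51))
labels = [1 if abs(x) > 25 else 0 for x in xs]

r = pearson(xs, labels)
mi = mutual_information(equal_width_bins(xs, 4), labels)
```

On this data the correlation is essentially zero, while the mutual information is well above zero, so a correlation-based screen would discard a variable that almost fully determines the label.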

Zehra Cataltepe

Forbes Councils Member

Forbes Technology Council

Also Published on Forbes: https://www.forbes.com/sites/forbestechcouncil/2023/02/08/six-steps-to-check-and-fix-data-quality/
