Data Quality For Business: Metrics To Measure And Understand
Author: Zehra Cataltepe, CEO of TAZI.AI
In many industries, including banking, insurance and telecommunications, data quality comes up in nearly every AI/GenAI conversation as one of the three most significant challenges, alongside ROI and security/privacy. With frequent changes in the market and M&As, ensuring that data remains accurate and consistent has become even more difficult. Data science and analytics teams typically spend a large portion of their time cleaning and preparing data, but business teams may not fully understand how that affects the value they get out of data.
This article introduces a rapid method for evaluating data quality that business teams can easily use. We will review the key metrics of data quality and how to measure them using two methodologies (or tools): data profiling and rapid AI (GenAI) model prototyping. If you are in the data quality business, data profiling won’t be a surprise to you, but the connection between data quality and rapid AI prototyping might be.
Key Metrics Of Data Quality
The most widely accepted metrics for evaluating data quality are accuracy, completeness, consistency, timeliness, uniqueness and validity. Here’s how to measure each:
1. Accuracy
How close is your data to reality? When people enter data or data comes through different continuously updated IT systems, data may (and often does) have errors, which may affect your analytics dashboards or results of your predictive AI solutions.
Using Data Profiling: Perform statistical checks (like minimum, maximum, mean, standard deviation, skewness and visual distribution plots) to identify anomalies or outliers that indicate inaccurate data.
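As a sketch of what such a statistical check can look like in practice, the snippet below (a hypothetical helper, not tied to any particular profiling tool) computes basic statistics on a column and flags values that sit far from the mean, such as a mis-keyed bill amount:

```python
from statistics import mean, stdev

def profile_outliers(values, z_threshold=3.0):
    """Compute basic profile statistics and flag z-score outliers (illustrative helper)."""
    mu, sigma = mean(values), stdev(values)
    stats = {"min": min(values), "max": max(values), "mean": mu, "stdev": sigma}
    # A value far from the mean (in units of standard deviation) is a candidate error.
    outliers = [v for v in values if sigma and abs(v - mu) / sigma > z_threshold]
    return stats, outliers

# Monthly bill amounts with one entry mis-keyed as 9000 instead of 90.
bills = [88, 92, 85, 95, 90, 9000, 87, 91]
stats, outliers = profile_outliers(bills, z_threshold=2.0)
print(outliers)  # the 9000 entry is flagged for review
```

A real profiler would add distribution plots and skewness on top of these numbers, but the principle is the same: surprising statistics point at inaccurate data.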
Data Accuracy Models: Quickly create AI models utilizing rapid AI prototyping in a few steps. These models flag data as inaccurate when predictions do not match the outcomes expected based on other data columns. For example, if you called people and offered discounts in the past, you would expect to see a reduction in churn; if you made more calls, you would expect to see more discounts offered. AI model explanations can surface such dependencies across multiple data columns, which you can then check against domain knowledge.
2. Completeness
Is all necessary data available? When data is only partially available, some of your dashboards or models will simply not work or will give inaccurate results. Usually, a null is easier to catch than an inaccurate value, unless you have automated imputation and are not checking for incomplete data.
Using Data Profiler: Check how many rows in your dataset have missing values (empty cells) and compare them to the total number of rows.
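A minimal version of this check, using a hypothetical `completeness` helper over a small made-up customer table, could look like:

```python
rows = [
    {"customer_id": 1, "email": "a@x.com", "phone": "555-0101"},
    {"customer_id": 2, "email": None,      "phone": "555-0102"},
    {"customer_id": 3, "email": "c@x.com", "phone": None},
    {"customer_id": 4, "email": None,      "phone": "555-0104"},
]

def completeness(rows):
    """Share of non-missing values per column (illustrative helper)."""
    total = len(rows)
    return {col: sum(r[col] is not None for r in rows) / total
            for col in rows[0].keys()}

print(completeness(rows))
# e.g., customer_id is fully populated, email is only half populated
```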
Imputation Models: You can create models to predict outcomes even when some data is missing. Using the data accuracy models, you can impute (i.e., fill in) the missing values. You can also create missing data models so that you understand where data is missing exactly, how you can prevent it and what imputation is needed. You should also model the errors between predicted and actual values of the data accuracy models because inaccurate data is worse than missing data.
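As a small illustration of imputation (much simpler than a full imputation model), the sketch below fills missing spend values with the mean of each customer segment, learned only from the rows that are present; the data is invented:

```python
from statistics import mean

records = [
    {"segment": "retail",    "spend": 100},
    {"segment": "retail",    "spend": 120},
    {"segment": "retail",    "spend": None},
    {"segment": "corporate", "spend": 900},
    {"segment": "corporate", "spend": None},
    {"segment": "corporate", "spend": 1100},
]

# Learn a per-segment mean from the known values...
seg_means = {}
for seg in {r["segment"] for r in records}:
    known = [r["spend"] for r in records
             if r["segment"] == seg and r["spend"] is not None]
    seg_means[seg] = mean(known)

# ...then fill each gap with its segment's mean.
for r in records:
    if r["spend"] is None:
        r["spend"] = seg_means[r["segment"]]

print([r["spend"] for r in records])
```

A model-based imputer would replace the per-segment mean with a prediction from the other columns, which is exactly where the data accuracy models above can be reused.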
3. Consistency
Are there conflicting values across datasets? While your customer’s data moves between departments, it can be challenging to keep track of consistent updates. Understanding consistency and timeliness issues may help you fix your data architecture.
Internal Consistency Check: Within a single dataset, aggregate data by unique identifiers (e.g., customer IDs) and verify consistency in the values. You can use data profiler results to check for distributional consistency. You can utilize feature engineering or data drift detection methods to check for variations in statistics (min, max, mean or mode).
Cross-System Consistency: Combine data from different systems using unique identifiers, then check for mismatches between the two datasets. Note that the mismatch could have different degrees, from corrupted to just delayed data.
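Both checks can be sketched in a few lines; the snippet below uses invented customer-city data, first looking for conflicting values within one dataset and then comparing two systems joined on the customer ID:

```python
from collections import defaultdict

# Internal check: the same customer ID appearing with conflicting cities.
rows = [("C1", "Ankara"), ("C2", "Izmir"), ("C2", "Bursa"), ("C3", "Istanbul")]
seen = defaultdict(set)
for cid, city in rows:
    seen[cid].add(city)
internal_conflicts = sorted(cid for cid, cities in seen.items() if len(cities) > 1)

# Cross-system check: join two systems on the ID and compare values.
crm     = {"C1": "Ankara", "C2": "Izmir", "C3": "Istanbul"}
billing = {"C1": "Ankara", "C2": "Bursa", "C4": "Antalya"}
cross_conflicts = sorted(k for k in crm.keys() & billing.keys()
                         if crm[k] != billing[k])

print(internal_conflicts, cross_conflicts)  # C2 is inconsistent in both checks
```

In practice you would also record the keys that exist in only one system, since a missing record may be corrupted data or simply delayed data.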
4. Timeliness
Is the data up-to-date and available when needed? Solutions like fraud detection, customer experience or communications require timely data; delays beyond a few milliseconds can make the results irrelevant.
Using Data Profiler: Utilize update timestamps to calculate the delay between data generation and availability (latency). The same tools, including the AI tools, used for consistency can also help measure timeliness, since timeliness is essentially consistency over time.
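The timestamp calculation itself is straightforward; here is a sketch over invented event data, flagging records whose availability lagged generation by more than a chosen budget (one minute in this example):

```python
from datetime import datetime, timedelta

# (generated_at, available_at) pairs from a hypothetical event log.
events = [
    ("2024-10-08 09:00:00", "2024-10-08 09:00:02"),
    ("2024-10-08 09:05:00", "2024-10-08 09:05:01"),
    ("2024-10-08 09:10:00", "2024-10-08 09:14:00"),  # a 4-minute lag
]
fmt = "%Y-%m-%d %H:%M:%S"
lags = [datetime.strptime(avail, fmt) - datetime.strptime(gen, fmt)
        for gen, avail in events]
late = [i for i, lag in enumerate(lags) if lag > timedelta(seconds=60)]
print(late)  # the index of the event that arrived too late
```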
5. Uniqueness
Are there duplicate records where unique identifiers are expected? Especially in transactional systems, where customers or process steps are contacted and audited several times, data gets duplicated. For example, thinking that you have two customers when there is only one can lead to many downstream problems.
Using Data Profiler: Count the number of unique values in columns that should have no duplicates (e.g., customer IDs). A data profiler can highlight duplicates, helping you enforce unique primary keys.
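The count itself is a one-liner; this sketch over an invented ID column reports both the duplicated IDs and the uniqueness ratio a profiler would surface:

```python
from collections import Counter

customer_ids = ["C1", "C2", "C3", "C2", "C4", "C1"]
counts = Counter(customer_ids)
duplicates = sorted(cid for cid, n in counts.items() if n > 1)
unique_ratio = len(counts) / len(customer_ids)  # 1.0 means no duplicates
print(duplicates, unique_ratio)
```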
Generative AI: This can be used on top of basic matching and distance-based systems as a means to de-duplicate complex records, such as names or addresses. You can describe your preferences via prompts in your rapid GenAI model prototyping tool.
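As a stand-in for the "basic matching and distance-based systems" that GenAI builds on, the sketch below uses the standard library's `difflib.SequenceMatcher` to find near-duplicate company names; the names and the 0.6 similarity threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

names = ["Acme Corporation", "ACME Corp.", "Globex Inc", "Acme Corp"]

def similar(a, b, threshold=0.6):
    """Character-level similarity after lowercasing (illustrative threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# All pairs of names that look like the same underlying entity.
pairs = [(a, b)
         for i, a in enumerate(names)
         for b in names[i + 1:]
         if similar(a, b)]
print(pairs)  # the three Acme variants match each other; Globex does not
```

A GenAI layer would go beyond this by resolving harder cases (abbreviations, translations, address variants) according to preferences you describe in prompts.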
6. Validity
Does the data follow the required formats and constraints? In addition to dates and names, data types (e.g., if a column contains only 0s and 1s, is it a flag, a string, an integer or a real number?) can be handled differently across systems, affecting accuracy and relevance.
Basic Validity Check: A data profiler can measure how many data points do not follow the expected format (e.g., invalid dates, emails or phone numbers) or fall outside expected ranges.
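A minimal format check can be expressed with a regular expression; the pattern below is a deliberately simplified email check for illustration (real-world email validation is considerably more involved):

```python
import re

emails = ["a@x.com", "not-an-email", "b@y.org", "c@@z.com"]

# Simplified pattern: local part, one "@", domain with at least one dot.
pattern = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
invalid = [e for e in emails if not pattern.match(e)]
print(invalid)  # the entries that fail the expected format
```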
Advanced Validity Check: You can use data drift detection to see where the data may have shifted over time, signaling potential validity issues. This helps identify mismatches between production data and the data your models were trained on.
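One simple drift score is the largest gap between the empirical distributions of a training sample and a production sample (the statistic behind the Kolmogorov-Smirnov test); the hand-rolled sketch below uses invented numbers where production has clearly shifted:

```python
def ks_statistic(sample_a, sample_b):
    """Max gap between the two empirical CDFs; near 0 means similar, near 1 means drifted."""
    points = sorted(set(sample_a) | set(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    return max(
        abs(sum(x <= p for x in sample_a) / n_a
            - sum(x <= p for x in sample_b) / n_b)
        for p in points
    )

training   = [10, 12, 11, 13, 12, 11, 10, 12]
production = [18, 20, 19, 21, 20, 19, 18, 20]  # distribution shifted upward
print(ks_statistic(training, production))
```

In a real pipeline you would compute such a score per column on a schedule and alert when it crosses a threshold you choose.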
Conclusion
We have outlined how data quality can be measured and improved using both data profiling and rapid AI/GenAI prototyping tools. These tools allow business and data teams to work together to assess and enhance the quality of their data efficiently. In Part 2, we will see how to prioritize data quality tasks based on the business value attached to them.
Forbes Councils Member
Also Published on Forbes: https://www.forbes.com/councils/forbestechcouncil/2024/10/08/data-quality-for-business-metrics-to-measure-and-understand/