How To Handle Data And Machine Learning Bias In Production
Zehra Cataltepe is the CEO of TAZI.AI, an adaptive, explainable machine learning platform. She has more than 100 papers and patents on ML.
Do you need to determine if there is bias in your dataset toward, for example, gender or race? Do you need to make sure that the machine learning models you use are unbiased, even if your data is biased? If you answered “yes” to either of these questions, this article is for you.
Introduction To Bias
Bias refers to a conscious or unconscious preference toward a particular group, often to the exclusion of others. For people within certain racial, ethnic, gender, ability and religious groups, bias results in discrimination and systemic barriers to opportunities and success. Data created in a biased world is inherently biased. Creating and deploying machine learning (ML) models always come with a significant risk of bias. Because of this, ML solution environments should provide human-usable explanations to detect and remedy bias.
Accountability and accessibility are essential in handling bias. Accountability is needed to make sure that whoever notices bias does something about it. Accessibility of ML systems in production allows bias to be handled in a timely manner. Accountability can be partially addressed by audit logs. Lowering the entry barrier to ML through automation and easy-to-use UI/UX can help with accessibility.
In this article, I’ll explain how data-related bias and model-related bias can be detected and handled via systematic explanations of data and ML models. Data-related bias is bias that already exists in the dataset. For example, in a customer churn prediction use case, 90% of the dataset could consist of white customers, leading to racial bias in the dataset. Model-related bias is bias that is produced within the model itself. In this case, since white customers make up 90% of the dataset, a model that aims to minimize overall error would predict churn better for white customers, resulting in racial bias in the model. Using this model to take actions to prevent churn would underserve non-white populations.
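One simple way to surface this kind of model-related bias is to compare a performance metric, such as recall, across groups. Below is a minimal sketch of that check; the DataFrame and the column names ("race", "churned"), as well as the `preds` array, are illustrative assumptions rather than part of any specific platform.

```python
# Sketch: checking a churn model's recall within each racial group.
# Assumes a pandas DataFrame `df` with illustrative columns "race" and
# "churned", plus model predictions in `preds` (same length as df).
import pandas as pd
from sklearn.metrics import recall_score

def recall_by_group(df: pd.DataFrame, group_col: str, y_true_col: str, preds) -> pd.Series:
    """Return the recall of the predictions within each group."""
    df = df.assign(_pred=preds)
    return df.groupby(group_col).apply(
        lambda g: recall_score(g[y_true_col], g["_pred"])
    )

# Large gaps between groups indicate model-related bias:
# print(recall_by_group(df, "race", "churned", preds))
```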
Data Bias Detection
The first and most common type of data-related bias happens when some variable values occur much more frequently than others in a dataset (representation bias). For example, in a clinical trial, 90% of the participants could be male.
Representation bias can be partially handled by resampling the data so that different groups are represented equally, as in the sketch below. However, when there is less information about the underrepresented groups, the ML model may still learn less about them.
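Here is a minimal resampling sketch under simple assumptions: a pandas DataFrame with an illustrative "gender" column, and naive oversampling with replacement so each group reaches the size of the largest one. More sophisticated techniques (stratified sampling, synthetic data) exist; this only illustrates the idea.

```python
# Sketch: naive oversampling of underrepresented groups so that all
# groups appear equally often. The column name "gender" is illustrative.
import pandas as pd

def oversample_groups(df: pd.DataFrame, group_col: str, seed: int = 0) -> pd.DataFrame:
    """Resample each group (with replacement) up to the size of the largest group."""
    target_size = df[group_col].value_counts().max()
    parts = [
        g.sample(n=target_size, replace=True, random_state=seed)
        for _, g in df.groupby(group_col)
    ]
    return pd.concat(parts).reset_index(drop=True)

# balanced_df = oversample_groups(trial_df, "gender")
```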
Data-related bias also occurs when variables are highly correlated with the target feature. To detect bias with respect to certain sensitive features, feature relevances (i.e., the correlation of each column with the target feature) can be calculated. The user can then exclude highly relevant sensitive features, such as gender or age, that might lead to bias. Note that linear correlation measures may not work well for datasets containing both discrete and continuous features, or for nonlinear relationships. Normalized mutual information can help in those cases.
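The sketch below ranks features by normalized mutual information with the target, assuming a pandas DataFrame with a discrete target column; continuous columns are binned so the same measure applies to mixed data types. The bin count of 10 and the column names are illustrative choices.

```python
# Sketch: ranking features by normalized mutual information (NMI) with the
# target. Continuous columns are discretized so NMI works for mixed types.
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score

def nmi_relevance(df: pd.DataFrame, target_col: str, n_bins: int = 10) -> pd.Series:
    """Return features sorted by NMI with the (discrete) target column."""
    y = df[target_col]
    scores = {}
    for col in df.columns.drop(target_col):
        x = df[col]
        if pd.api.types.is_numeric_dtype(x) and x.nunique() > n_bins:
            # bin continuous values into quantiles and use the bin codes as labels
            x = pd.qcut(x, q=n_bins, duplicates="drop").cat.codes
        scores[col] = normalized_mutual_info_score(y, x)
    return pd.Series(scores).sort_values(ascending=False)

# Sensitive features (e.g., "gender", "age") scoring high are candidates to exclude:
# print(nmi_relevance(churn_df, "churned"))
```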
Even when the sensitive and relevant features are removed, other features may be correlated with those sensitive features. For example, zip codes may be highly correlated with race, so even if race is removed from model building, keeping the zip code may still produce biased models. Clustering or grouping variables based on their correlations with each other may help detect and remove such proxy features. Another way to detect complex data bias is to build an ML model for each sensitive feature: the features that contribute most to predicting a sensitive feature should be excluded from the production ML models.
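A minimal sketch of that second approach follows: train a model to predict the sensitive feature itself and inspect which other features make that prediction easy. The estimator choice, the crude label encoding, and column names such as "race" and "zip_code" are assumptions for illustration only.

```python
# Sketch: finding proxies for a sensitive feature (e.g., zip code standing in
# for race) by training a model to predict the sensitive feature itself.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def proxy_features(df: pd.DataFrame, sensitive_col: str, top_k: int = 5) -> pd.Series:
    """Return the features that best predict the sensitive column."""
    X = df.drop(columns=[sensitive_col])
    # crude encoding of non-numeric columns; good enough for a relevance check
    X = X.apply(
        lambda c: c if pd.api.types.is_numeric_dtype(c) else c.astype("category").cat.codes
    )
    y = df[sensitive_col]
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(top_k)

# Features that predict "race" well (e.g., "zip_code") should also be dropped:
# print(proxy_features(churn_df.drop(columns=["churned"]), "race"))
```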
Machine Learning Model Bias Detection
For model-related bias, consider both the inputs to the ML model and the output predictions of the model. When the dataset is unbalanced, sensitive features might be too relevant to the target feature and cause bias. Some ML platforms assign automated class weights during model building to emphasize underrepresented classes.
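Where a platform does not automate this, class weighting is widely available in standard libraries. The sketch below shows one common option in scikit-learn; the estimator choice is illustrative, and `X_train`/`y_train` are assumed to exist.

```python
# Sketch: emphasizing underrepresented classes with class weights rather than
# letting the model optimize accuracy on the majority class alone.
from sklearn.linear_model import LogisticRegression

# "balanced" reweights each class inversely to its frequency in y_train.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
# model.fit(X_train, y_train)
```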
Machine learning model explanations also help with the detection and prevention of model-related bias. Local and global feature-importance methods, such as SHAP and LIME, provide information on how each feature's value affects the model outcome. For example, if increased age results in lower credit score predictions, then the model has age-related bias. However, it is difficult to determine exactly where in the model the bias lies. Easily interpreted surrogate model explanations, such as linear models or decision trees, can help: surrogate models approximate and explain the underlying ML model used for decision making and allow more granular detection of bias. A decision tree surrogate model contains automatically generated micro-segments of model predictions, each resembling a rule (e.g., if the agency type is silver and gender is male, then the customer will churn).
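Below is a minimal sketch of a decision-tree surrogate, assuming an already-trained black-box model and a feature DataFrame `X`; the depth of 3 is an arbitrary illustrative choice that keeps the rules readable.

```python
# Sketch: a shallow decision-tree surrogate that approximates a black-box
# model's predictions so its micro-segments (rules) can be inspected for bias.
from sklearn.tree import DecisionTreeClassifier, export_text

def fit_surrogate(black_box_model, X, max_depth: int = 3) -> DecisionTreeClassifier:
    """Train a shallow tree on the black-box model's own predictions."""
    y_hat = black_box_model.predict(X)
    surrogate = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    return surrogate.fit(X, y_hat)

# Each leaf reads like a rule; rules that split on sensitive features flag bias:
# print(export_text(fit_surrogate(model, X), feature_names=list(X.columns)))
```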
When bias (or any other problem) is detected on an ML model, the ease and speed of actions determine how fast it is resolved. Creating and sharing data and model explanations can help users take faster action.
Conclusion
The systematic detection and prevention of bias in data and machine learning models is possible. Hiring users from diverse backgrounds and AI-enabling them not only facilitates better detection and prevention of bias; it also helps remedy instances where bias detection systems or ML models fail or are hacked.