Missing Value Handling in Machine Learning
Machine learning is the art of taking known facts, that is, data, and transforming them into a prediction about something unknown. In this case, that means transforming the customer data of American Express, a credit card company, into a prediction of which customers will default on their loans. This data, however, is plagued with hundreds of thousands of missing values; a variable denoting whether a customer has defaulted in the past, for example, is recorded as either a one or no value at all. The goal of this research is to find a way to use the vast number of missing data points to improve a machine learning model’s accuracy, rather than setting them aside as unworkable. Using a small subset of American Express’s data, I have identified XGBoost, a publicly available machine learning library, as an effective tool: it can route around missing values when making a prediction without skewing the weight of the values that are actually present in a column full of gaps. Future work will seek to combine XGBoost’s capabilities with feature engineering on select under-performing variables.
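To make the idea of bypassing missing values concrete, here is a minimal sketch in plain Python. It is not XGBoost's code and not the model used in this research. It only illustrates the general approach XGBoost documents for missing values: at each tree split, rows with no value are never compared against the threshold, and the direction they are sent (left or right) is learned from the training data. The function names, the squared-error loss, and the toy numbers are all assumptions made for illustration.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_with_default(xs, ys):
    """Return (threshold, missing_goes_left, loss) for a single feature.

    Rows whose feature is None are all sent to one side of the split.
    Both sides are tried, and the one with the lower loss is kept.
    """
    present = [(x, y) for x, y in zip(xs, ys) if x is not None]
    missing_y = [y for x, y in zip(xs, ys) if x is None]
    best = None
    for t in sorted({x for x, _ in present}):
        left = [y for x, y in present if x < t]
        right = [y for x, y in present if x >= t]
        for missing_left in (True, False):
            l = left + missing_y if missing_left else left
            r = right if missing_left else right + missing_y
            if not l or not r:
                continue  # a split must leave rows on both sides
            loss = sse(l) + sse(r)
            if best is None or loss < best[2]:
                best = (t, missing_left, loss)
    return best


# Hypothetical toy data: one feature with gaps, and a 0/1 default label.
feature = [1.0, 2.0, None, 8.0, 9.0, None]
defaulted = [0, 0, 1, 1, 1, 1]

threshold, missing_goes_left, loss = best_split_with_default(feature, defaulted)
print(threshold, missing_goes_left, loss)
```

In this toy data both rows with a missing feature belong to customers who defaulted, so the sketch learns to send missing values to the right-hand (high-risk) side of the split. The gap itself carries information, and the observed values keep their influence because nothing is imputed into the column.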