Ryan Hartz

Missing Value Handling in Machine Learning

Machine learning is the art of taking known facts, that is, data, and transforming them into a prediction about something unknown. In this case, the task is to transform the customer data of American Express, a credit card company, into a prediction of which customers will default on their loans. This data, however, is plagued with hundreds of thousands of missing values. A variable denoting whether a customer has defaulted in the past, for example, is represented by either a one or no value at all. The goal of this research is to determine a method that uses the vast number of missing data points to improve a machine learning model’s accuracy, rather than setting them aside as unworkable. Using a small subset of American Express’s data, I have identified XGBoost, a publicly available machine learning library, as an effective tool: it can bypass missing values when making a prediction without skewing the weight of the actual values in a column fraught with missing entries. Future research will seek to combine XGBoost’s capabilities with feature engineering on select under-performing variables.
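To make the idea of "bypassing" missing values more concrete, here is a minimal sketch of the general technique XGBoost is known for: at each tree split, rows with a missing value are sent in a learned "default direction", chosen by whichever side gives the larger gain. This is a simplified, standard-library illustration and not XGBoost's actual code. The function names, the toy column (a past-default flag that is either 1.0 or missing), the labels, and the threshold are all invented for the example and do not come from the American Express data.

```python
def split_score(grad_sum, hess_sum, reg_lambda=1.0):
    """Score of one leaf: G^2 / (H + lambda), as in gradient-boosted trees."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def best_default_direction(values, labels, threshold, reg_lambda=1.0):
    """Pick the side ('left' or 'right') that rows with a missing value
    should follow for a single candidate split.

    values: feature values, with None standing in for a missing entry.
    labels: 0/1 targets (1 = defaulted), hypothetical toy data.
    """
    p = 0.5  # assumed initial prediction, so gradients are simple
    # Accumulate [gradient sum, hessian sum] for each group of rows.
    stats = {"left": [0.0, 0.0], "right": [0.0, 0.0], "missing": [0.0, 0.0]}
    for v, y in zip(values, labels):
        g, h = p - y, p * (1 - p)  # logistic-loss gradient and hessian
        if v is None:
            key = "missing"
        else:
            key = "left" if v < threshold else "right"
        stats[key][0] += g
        stats[key][1] += h

    total_g = sum(s[0] for s in stats.values())
    total_h = sum(s[1] for s in stats.values())
    parent = split_score(total_g, total_h, reg_lambda)

    gains = {}
    for direction in ("left", "right"):
        gl, hl = stats["left"]
        gr, hr = stats["right"]
        mg, mh = stats["missing"]
        # Only the present values define the split; the missing rows are
        # added to one side as a block, and both options are compared.
        if direction == "left":
            gl, hl = gl + mg, hl + mh
        else:
            gr, hr = gr + mg, hr + mh
        gains[direction] = 0.5 * (
            split_score(gl, hl, reg_lambda)
            + split_score(gr, hr, reg_lambda)
            - parent
        )

    best = max(gains, key=gains.get)
    return best, gains


# Toy column: 1.0 means "has defaulted before", None means no value recorded.
past_default = [1.0, 1.0, 1.0, None, None, None, None, None]
defaulted = [1, 1, 0, 0, 0, 0, 1, 0]

direction, gains = best_default_direction(past_default, defaulted, threshold=0.5)
print(direction, gains)
```

In this toy column every recorded value is 1.0, so the split is only useful if the missing rows go to the opposite side from the recorded ones. The missing entries are never imputed or dropped; their placement is decided by the data, which is one way to read the claim that the missing values can inform the model instead of being set aside.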
