Mehmet Simsek
5 min readNov 19, 2020

--

Dealing with Missing Data: Practical Imputation Methods

According to different resources, data scientists spend 80% of their time cleaning data rather than creating insights. Here are some reports that confirm this argument.

According to Crowdflower 2015 report, although there is no info about the time spent on it, 66.7% of the participants stated that data cleaning/organizing data is one of their most time-consuming tasks.

In 2016 report of the same organization, the answer for the question “What data scientists spend the most time doing?” was data cleaning, ranking 1st with 60%, and in 2017 the number was 51%.

Time Spent on Activities in Data Science

Finally according to Kaggle 2018 report, data cleaning and organizing tasks consume around 80% of the time on a given project.

Dealing with missing data constitutes the main task in data cleaning process. Python provides a great library, Pandas, for data manipulation. Pandas is a great data cleaning tool for finding, replacing or removing missing (or null) values in a dataframe. All of these methods can be used in different circumstances, however sometimes removing missing data can affect our machine learning algorithms negatively. Therefore caution is needed and several things should be taken into consideration, such as number of missing values, how big is the dataset etc., when dealing with missing values.

Want to read this story later? Save it in Journal.

Since it is the easiest way, there is a tendency to omit missing values in data cleaning process. However, instead of removing missing data, replacing/filling them with proper values can be a better option. Nevertheless, finding proper values for this action is a common challenge for data scientists.

Pandas library provides us with a variety of options for filling missing values. For the dataset below, we will use different fillna methods in accordance with the characteristics of respective columns.

Let’s have a look at our dataset.

As seen above, our dataset has some missing values. Let’s use visualization to analyze the dataset better.

Some part of the clylinders, gears, upholstery and km columns are colored in yellow, representing missing values. Since some of them are categorical and others are numerical data, we can’t use mean imputation for filling missing values for all columns. Different methods are required, considering characteristics of the columns. Here we will use mean, mode and specific values (with limit) to fill missing values.

1. Using mean():

Mean of a dataframe/column is a common way when replacing missing values. In our sample dataframe, we will use the mean of the km column together with groupby and transform methods.

Here groupby help us to calculate means of the cars according to their ages. Then transform enables us to fill missing km values with the mean of a car’s respective age group.

2. Using mode():

When median or mean cannot be used, mode is especially useful for filling categorical missing values. In this method, missing values is replaced by the most common value in the dataframe/column.

3. Filling with a specific value (with limit):

Sometimes due to the characteristics of the data, above methods can’t be applied. There can be several repeating values in our dataset and we can’t use aggregate methods or mode to replace NaN values. Hence, we can fill missing values considering the proportion of valid values at hand.

The number of missing values in the upholstery column is 465.

Rest of the 1534 rows consist of either “leather” (1280) or “cloth” (254) values. So as a proportion, around 80% of non-missing values in this column are “leather” and the rest are “cloth” (20%).

Taking this into consideration, we can replace missing values in this column by setting up a certain limit and using proportions of the number of “leather”/”cloth” values, in our case 4/1. In other words, around 370 of the 465 missing values should be filled with “leather” and 95 of them with “cloth”.

So to recap:

-Explore your dataset: Spend some time analyzing your dataset and with the help of columns, which don’t have missing values, try to segment your data to create useful patterns for calculations. Make use of groupby and transform functions in this regard.

-For the next step, choose the best method to fill your data, considering its affects on the desired outcome.

Do you want to learn imputation methods in SQL Server? If so, follow me for the next story.

More from Journal

There are many Black creators doing incredible work in Tech. This collection of resources shines a light on some of us:

--

--