MISSING DATA IN PANDAS
In this tutorial, we are going to learn how to handle missing data in pandas. Dealing with data that does not exist can be tricky, so pandas provides its own methods for removing or filling in missing values.
What is Missing Data in Pandas?
Sometimes you receive data in bulk that includes missing or unknown values in its rows or columns. Handling these values is an important step in pandas, because you usually have to deal with them before applying a machine learning algorithm; otherwise your code may fail or give misleading results. So, to reduce that risk, let’s learn two different ways of dealing with missing or unknown values in pandas:
- dropna() method
- fillna() method
dropna() Method: Missing Data in Pandas
Let’s work with the Titanic dataset, which you can find here. Now, let’s import the CSV file and look for missing (NaN) values.
Note: NaN stands for Not a Number, and pandas uses it to represent missing numerical data. You may also find datasets that contain None or Null; this simply means that the cell is empty or has no value at all.
import pandas as pd
df = pd.read_csv('train.csv')
df.head()
In the output, the missing entries appear as NaN. You can check the size of the dataset with df.shape, which returns the number of rows and columns.
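As a minimal sketch of the shape check, assume a small hand-built DataFrame in place of train.csv. The column names are borrowed from the Titanic data, but the four rows are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv: Titanic-style column names,
# made-up values, with a few entries deliberately left missing.
df = pd.DataFrame({
    "Name": ["Anna", "Ben", "Cara", "Dev"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# shape is a (rows, columns) tuple
print(df.shape)
```

On this stand-in frame the tuple has 4 rows and 3 columns; the real file will report its own dimensions.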
In this particular dataset, we have to deal with a lot of NaN values. So let’s learn how to drop such NaN values and clean our dataset.
Calling df.dropna() with no arguments drops every row that contains at least one NaN value.
To see how many rows are left after dropping, check the shape of the result with df.dropna().shape.
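The row-wise drop can be sketched on a small stand-in frame rather than the full dataset. The values below are hypothetical and only the column names follow the Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv (made-up values).
df = pd.DataFrame({
    "Name": ["Anna", "Ben", "Cara", "Dev"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# Keep only the rows that have no missing value in any column.
cleaned = df.dropna()

print(cleaned)
print(df.shape, "->", cleaned.shape)
```

In this made-up data only the last row is complete, so it is the only row that survives the drop.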
If you want to drop columns instead of rows, set the axis argument to 1, as in df.dropna(axis=1).
With axis=1, every column that has missing values (NaN) is dropped. This is how you can remove or drop the NaN values from your dataset.
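A column-wise drop might look like the following sketch, again on a hypothetical stand-in frame with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv (made-up values).
df = pd.DataFrame({
    "Name": ["Anna", "Ben", "Cara", "Dev"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# axis=1 drops every column that contains at least one NaN.
complete_columns = df.dropna(axis=1)

print(list(complete_columns.columns))
print(complete_columns.shape)
```

Here Age and Cabin each contain a missing value, so only the Name column is kept while all four rows remain.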
Note: dropna() returns a new DataFrame and leaves the original unchanged. To make the change permanent, assign the result back to a variable or set the inplace argument to True.
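The difference between the two behaviours can be sketched as follows, assuming the same kind of hypothetical stand-in frame (made-up values):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv (made-up values).
df = pd.DataFrame({
    "Name": ["Anna", "Ben", "Cara", "Dev"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# Without inplace, the result is a new frame and df keeps all its rows.
result = df.dropna()
shape_before = df.shape
print(shape_before, result.shape)

# With inplace=True, df itself is modified and the call returns None.
df.dropna(inplace=True)
print(df.shape)
```

Writing df = df.dropna() has the same end result as inplace=True and is often easier to read in longer scripts.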
fillna() Method: Missing Data in Pandas
Now, let’s look at how you can work around missing values without deleting whole rows and columns, by filling the gaps instead. You can do so by using the fillna() method. For example, df.fillna(0) replaces every missing value with 0.
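As a minimal sketch of filling with a number, assume a small frame with numeric columns only. The Age and Fare names follow the Titanic data, but the values are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric-only stand-in (made-up values).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, np.nan],
})

# Replace every missing value with 0; df itself is left unchanged.
filled = df.fillna(0)

print(filled)
```

The sketch sticks to numeric columns on purpose, so that filling with 0 does not mix numbers into a text column.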
After df.fillna(0), the missing values are replaced by zeros, so those cells are no longer empty. But data is not always numeric, so sometimes we need to fill missing values with strings as well. To do that, you can simply pass a string to the fillna() method, for example df.fillna("Not Known").
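Filling with a string can be sketched in the same way, this time assuming a small frame with text columns only (Titanic-style names, made-up values):

```python
import pandas as pd

# Hypothetical text-only stand-in (made-up values); None marks a missing cell.
df = pd.DataFrame({
    "Cabin": [None, "C85", None],
    "Embarked": ["S", None, "Q"],
})

# Replace every missing value with the same placeholder string.
filled = df.fillna("Not Known")

print(filled)
```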
The missing values are now replaced with the string “Not Known”. However, this is not always a good way of filling missing values, because a DataFrame may need some missing values replaced with a number and others with a string. To avoid the confusion, select a particular column and fill it on its own:
# Assign the filled column back so the change is kept in df
df['Age'] = df['Age'].fillna(0)
df
The missing values in the Age column are now replaced with 0. Similarly, to replace missing values with a string, we can apply the same method to the Cabin column:
# Assign the filled column back so the change is kept in df
df['Cabin'] = df['Cabin'].fillna("Not Known")
df
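As an alternative sketch, fillna() also accepts a dictionary that maps column names to fill values, so both columns can be handled in one call. The frame below is a hypothetical stand-in with made-up values, not the real train.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv (made-up values).
df = pd.DataFrame({
    "Name": ["Anna", "Ben", "Cara", "Dev"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# Each key is a column name, each value is the fill value for that column.
df = df.fillna({"Age": 0, "Cabin": "Not Known"})

print(df)
```

Columns that are not listed in the dictionary are left as they are, which keeps numeric and text placeholders from leaking into the wrong column.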