Missing Data In Pandas¶
- Real-world data is dirty. It is important to preprocess the data before analyzing the data.
In [1]:
#importing libraries
import pandas as pd
import numpy as np
Note: Missing data in the Pandas is represented by the value NaN (Not a Number)
In [2]:
s=pd.Series(["A",np.nan,"C","B",np.nan,"E",np.nan,np.nan,"D"])
s
Out[2]:
0 A 1 NaN 2 C 3 B 4 NaN 5 E 6 NaN 7 NaN 8 D dtype: object
To check the missing values in the data we use isnull() method.¶
In [3]:
s.isnull()
Out[3]:
0 False 1 True 2 False 3 False 4 True 5 False 6 True 7 True 8 False dtype: bool
In [4]:
s[s.isnull()]
Out[4]:
1 NaN 4 NaN 6 NaN 7 NaN dtype: object
Similarly we can also check the data which is not null using notnull() method.¶
In [5]:
s.notnull()
Out[5]:
0 True 1 False 2 True 3 True 4 False 5 True 6 False 7 False 8 True dtype: bool
In [6]:
s[s.notnull()]
Out[6]:
0 A 2 C 3 B 5 E 8 D dtype: object
If you want to remove the missing data, you can use the dropna method.¶
In [7]:
s.dropna()
Out[7]:
0 A 2 C 3 B 5 E 8 D dtype: object
2. Handling Missing data in DataFrame¶
In [8]:
df=pd.DataFrame([[12,32,13,55,78,54],[42,np.nan,54,np.nan,55,855],[np.nan,64,np.nan,45,np.nan,78],[55,447,552,np.nan,85,889]])
df
Out[8]:
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
0 | 12.0 | 32.0 | 13.0 | 55.0 | 78.0 | 54 |
1 | 42.0 | NaN | 54.0 | NaN | 55.0 | 855 |
2 | NaN | 64.0 | NaN | 45.0 | NaN | 78 |
3 | 55.0 | 447.0 | 552.0 | NaN | 85.0 | 889 |
By default, the dropna method removes rows with missing data.¶
In [9]:
df.dropna()
Out[9]:
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
0 | 12.0 | 32.0 | 13.0 | 55.0 | 78.0 | 54 |
Now, Let’s remove the columns with missing data.¶
In [10]:
df.dropna(axis=1)
Out[10]:
5 | |
---|---|
0 | 54 |
1 | 855 |
2 | 78 |
3 | 889 |
If you want to assign another value instead of missing data, you can use the fillna method.¶
In [11]:
df.fillna(0)
Out[11]:
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
0 | 12.0 | 32.0 | 13.0 | 55.0 | 78.0 | 54 |
1 | 42.0 | 0.0 | 54.0 | 0.0 | 55.0 | 855 |
2 | 0.0 | 64.0 | 0.0 | 45.0 | 0.0 | 78 |
3 | 55.0 | 447.0 | 552.0 | 0.0 | 85.0 | 889 |
Using the dictionary structure, you can assign the missing data in each column to a different value with the fillna method. Let me show that.¶
In [12]:
df.fillna({0:15,1:25,2:35,3:48,4:66,5:56})
Out[12]:
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
0 | 12.0 | 32.0 | 13.0 | 55.0 | 78.0 | 54 |
1 | 42.0 | 25.0 | 54.0 | 48.0 | 55.0 | 855 |
2 | 15.0 | 64.0 | 35.0 | 45.0 | 66.0 | 78 |
3 | 55.0 | 447.0 | 552.0 | 48.0 | 85.0 | 889 |
If you want to assign the value in the upper row to the missing data, you can use the method = “ffill” or “bfill”¶
Forward Fill¶
In [13]:
df.fillna(method="ffill")
Out[13]:
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
0 | 12.0 | 32.0 | 13.0 | 55.0 | 78.0 | 54 |
1 | 42.0 | 32.0 | 54.0 | 55.0 | 55.0 | 855 |
2 | 42.0 | 64.0 | 54.0 | 45.0 | 55.0 | 78 |
3 | 55.0 | 447.0 | 552.0 | 45.0 | 85.0 | 889 |
Backward Fill¶
In [14]:
df.fillna(method="bfill")
Out[14]:
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
0 | 12.0 | 32.0 | 13.0 | 55.0 | 78.0 | 54 |
1 | 42.0 | 64.0 | 54.0 | 45.0 | 55.0 | 855 |
2 | 55.0 | 64.0 | 552.0 | 45.0 | 85.0 | 78 |
3 | 55.0 | 447.0 | 552.0 | NaN | 85.0 | 889 |