Sunday , September 25 2022
Home / Data Science / Data Cleaning / Introduction To Data Cleaning Using Python Pandas Library. Learn How To Do Data Preprocessing Before Data Analysis.

Introduction To Data Cleaning Using Python Pandas Library. Learn How To Do Data Preprocessing Before Data Analysis.

1 – Handle Missing Data

Import Library

In [2]:
import numpy as np
import pandas as pd

Read CSV File

In [3]:
data = pd.read_csv('weather_data.csv')
data
Out[3]:
day temperature windspeed event
0 1/1/2017 32 6 Rain
1 1/2/2017 -99999 7 Sunny
2 1/3/2017 28 -99999 Snow
3 1/4/2017 -99999 7 0
4 1/5/2017 32 -99999 Rain
5 1/6/2017 31 2 Sunny
6 1/6/2017 34 5 0

Check Rows and Columns

In [4]:
data.shape
Out[4]:
(7, 4)

Data has 7 Rows and Columns

Check Data Type of Each Column

In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
day            7 non-null object
temperature    7 non-null int64
windspeed      7 non-null int64
event          7 non-null object
dtypes: int64(2), object(2)
memory usage: 352.0+ bytes

Replacing Single Value with NaN

In [6]:
data
Out[6]:
day temperature windspeed event
0 1/1/2017 32 6 Rain
1 1/2/2017 -99999 7 Sunny
2 1/3/2017 28 -99999 Snow
3 1/4/2017 -99999 7 0
4 1/5/2017 32 -99999 Rain
5 1/6/2017 31 2 Sunny
6 1/6/2017 34 5 0

Replace -99999 to NaN

In [7]:
data = data.replace(-99999,value=np.NaN)
data
Out[7]:
day temperature windspeed event
0 1/1/2017 32.0 6.0 Rain
1 1/2/2017 NaN 7.0 Sunny
2 1/3/2017 28.0 NaN Snow
3 1/4/2017 NaN 7.0 0
4 1/5/2017 32.0 NaN Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 0
In [8]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
day            7 non-null object
temperature    5 non-null float64
windspeed      5 non-null float64
event          7 non-null object
dtypes: float64(2), object(2)
memory usage: 352.0+ bytes

Now we have 2 missing values in temperature and windspeed column respectively

Replacing List of Values with Single Value

In [9]:
data
Out[9]:
day temperature windspeed event
0 1/1/2017 32.0 6.0 Rain
1 1/2/2017 NaN 7.0 Sunny
2 1/3/2017 28.0 NaN Snow
3 1/4/2017 NaN 7.0 0
4 1/5/2017 32.0 NaN Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 0
In [10]:
data = data.replace([32.0,7.0],value=99)
data
Out[10]:
day temperature windspeed event
0 1/1/2017 99.0 6.0 Rain
1 1/2/2017 NaN 99.0 Sunny
2 1/3/2017 28.0 NaN Snow
3 1/4/2017 NaN 99.0 0
4 1/5/2017 99.0 NaN Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 0

Replacing Per Column

In [11]:
data
Out[11]:
day temperature windspeed event
0 1/1/2017 99.0 6.0 Rain
1 1/2/2017 NaN 99.0 Sunny
2 1/3/2017 28.0 NaN Snow
3 1/4/2017 NaN 99.0 0
4 1/5/2017 99.0 NaN Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 0

Replace Temperature Column value 99, Windspeed Missing Value (NaN) and Event value 0 with value 100

In [12]:
data.replace({'temperature':99.0,'windspeed':np.nan,'event':'0'},100)
Out[12]:
day temperature windspeed event
0 1/1/2017 100.0 6.0 Rain
1 1/2/2017 NaN 99.0 Sunny
2 1/3/2017 28.0 100.0 Snow
3 1/4/2017 NaN 99.0 100
4 1/5/2017 100.0 100.0 Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 100

Replacing by using Mapping

In [13]:
data
Out[13]:
day temperature windspeed event
0 1/1/2017 99.0 6.0 Rain
1 1/2/2017 NaN 99.0 Sunny
2 1/3/2017 28.0 NaN Snow
3 1/4/2017 NaN 99.0 0
4 1/5/2017 99.0 NaN Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 0
In [14]:
data = data.replace({np.nan:69,'0':'Sunny'})
data
Out[14]:
day temperature windspeed event
0 1/1/2017 99.0 6.0 Rain
1 1/2/2017 69.0 99.0 Sunny
2 1/3/2017 28.0 69.0 Snow
3 1/4/2017 69.0 99.0 Sunny
4 1/5/2017 99.0 69.0 Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 Sunny

Advance Data Handling Techniques using Regular Expressions

Read Weather Data

In [15]:
data = pd.read_csv('weather2.csv')
data
Out[15]:
day temperature windspeed event
0 1/1/2017 32 F 6 mph Rain
1 1/2/2017 -99999 F 7 mph Sunny
2 1/3/2017 28 F -99999 mph Snow
3 1/4/2017 -99999 F 7 mph 0
4 1/5/2017 32 F -99999 mph Rain
5 1/6/2017 31 F 2 mph Sunny
6 1/6/2017 34 F 5 mph 0

Remove mph from windspeed & F from Temperature

In [16]:
data = data.replace({'temperature':'[A-Za-z]','windspeed':'[A-Za-z]'},'',regex=True)
data
Out[16]:
day temperature windspeed event
0 1/1/2017 32 6 Rain
1 1/2/2017 -99999 7 Sunny
2 1/3/2017 28 -99999 Snow
3 1/4/2017 -99999 7 0
4 1/5/2017 32 -99999 Rain
5 1/6/2017 31 2 Sunny
6 1/6/2017 34 5 0

Replacing Column Values with Another List of Values

In [17]:
d = {'score':['exceptional','average','good','poor','average','exceptional'],
  'student':['Karan','Arpit','Varun','Robin','Akshay','Ankush']}
In [18]:
data = pd.DataFrame(d)
data
Out[18]:
score student
0 exceptional Karan
1 average Arpit
2 good Varun
3 poor Robin
4 average Akshay
5 exceptional Ankush

Replace the value of Score Column with values 1,2,3,4 Respectively

In [19]:
data = data.replace(['poor','average','good','exceptional'],[1,2,3,4])
data
Out[19]:
score student
0 4 Karan
1 2 Arpit
2 3 Varun
3 1 Robin
4 2 Akshay
5 4 Ankush

About Machine Learning

Check Also

Combining and Merging in Pandas - Data Science Tutorials

Combining and Merging in Pandas – Data Science Tutorials

13- Combining and Merging Combining and Merging in Pandas¶The datasets you want to analyze can …

Leave a Reply

Your email address will not be published.