Data Transformation in Pandas¶
In [1]:
#import libraries
import pandas as pd
import numpy as np
Finding Duplicate Values¶
In [2]:
data=pd.DataFrame({"a":[1,5]*3,"b":[1,1,2,3,2,3]})
data
Out[2]:
a | b | |
---|---|---|
0 | 1 | 1 |
1 | 5 | 1 |
2 | 1 | 2 |
3 | 5 | 3 |
4 | 1 | 2 |
5 | 5 | 3 |
In [3]:
#To see whether the row is repeated or not, you can use the duplicated method.
data.duplicated()
Out[3]:
0 False 1 False 2 False 3 False 4 True 5 True dtype: bool
In [4]:
#To remove duplicate rows, you can use the drop_duplicates method which returns a data frame.
data.drop_duplicates()
Out[4]:
a | b | |
---|---|---|
0 | 1 | 1 |
1 | 5 | 1 |
2 | 1 | 2 |
3 | 5 | 3 |
Mapping¶
In [5]:
import pandas as pd
data={'Name':['A','B','C','D'],'Age':[42,54,20,63]}
df=pd.DataFrame(data,index=['value1','value2','value3','value4'])
df
Out[5]:
Name | Age | |
---|---|---|
value1 | A | 42 |
value2 | B | 54 |
value3 | C | 20 |
value4 | D | 63 |
Now, lets add a new column to show score
In [6]:
score={'A':24,'B':30,'C':36,'D':42}
# Converting the first character of the name to uppercase so that the names are same.
n=df["Name"].str.capitalize()
n
Out[6]:
value1 A value2 B value3 C value4 D Name: Name, dtype: object
In [7]:
#Now, adding the score variable to df dataset using map() method.
df["Score"]=n.map(score)
df
Out[7]:
Name | Age | Score | |
---|---|---|---|
value1 | A | 42 | 24 |
value2 | B | 54 | 30 |
value3 | C | 20 | 36 |
value4 | D | 63 | 42 |
Replacing¶
Rplacing is a method use to replace the values in the dataframe.
In [8]:
s=pd.Series([14,55,87,966,78,44])
s
Out[8]:
0 14 1 55 2 87 3 966 4 78 5 44 dtype: int64
In [9]:
#replacing the value.
s.replace(87,np.nan)
Out[9]:
0 14.0 1 55.0 2 NaN 3 966.0 4 78.0 5 44.0 dtype: float64
In [10]:
#replacing multiple values.
s.replace([78,14],[np.nan,566])
Out[10]:
0 566.0 1 55.0 2 87.0 3 966.0 4 NaN 5 44.0 dtype: float64
Renaming¶
In [11]:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.arange(12).reshape(3,4),index=[0,1,2],columns=["a","b","c","d"])
df
Out[11]:
a | b | c | d | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
In [12]:
#Create a variable and pass it into the dataset
s=pd.Series(["one","two","three"])
df.index=df.index.map(s)
df
Out[12]:
a | b | c | d | |
---|---|---|---|---|
one | 0 | 1 | 2 | 3 |
two | 4 | 5 | 6 | 7 |
three | 8 | 9 | 10 | 11 |
In [13]:
#capitalize the row and column names with the rename method
df.rename(index=str.title,columns=str.upper)
Out[13]:
A | B | C | D | |
---|---|---|---|---|
One | 0 | 1 | 2 | 3 |
Two | 4 | 5 | 6 | 7 |
Three | 8 | 9 | 10 | 11 |
we can also change the row or column names using the dictionary structure with the rename method.
In [14]:
df.rename(index={"one":"10"},columns={"d":"f"},inplace=True)
df
Out[14]:
a | b | c | f | |
---|---|---|---|---|
10 | 0 | 1 | 2 | 3 |
two | 4 | 5 | 6 | 7 |
three | 8 | 9 | 10 | 11 |
Finding Specific value in a dataset¶
In [15]:
data=pd.DataFrame(np.random.randn(1000,4))
data
Out[15]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.095845 | -0.579053 | 0.021126 | -0.615226 |
1 | 1.265209 | 1.606119 | -0.167518 | 0.022493 |
2 | 0.260408 | 0.015183 | 0.986117 | 0.203856 |
3 | -0.805561 | 0.224160 | 0.992170 | -1.046847 |
4 | -0.257810 | 0.598237 | 1.087677 | 1.733996 |
… | … | … | … | … |
995 | 0.172482 | 0.239283 | -0.460610 | 0.383162 |
996 | -0.012413 | 1.782691 | 0.706961 | -1.101175 |
997 | -0.420072 | -0.071259 | -1.346796 | 0.795517 |
998 | -0.151630 | 0.788102 | 1.759462 | 0.306519 |
999 | -2.626494 | 1.541084 | 0.057281 | 0.153461 |
1000 rows × 4 columns
In [16]:
data.head()
Out[16]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.095845 | -0.579053 | 0.021126 | -0.615226 |
1 | 1.265209 | 1.606119 | -0.167518 | 0.022493 |
2 | 0.260408 | 0.015183 | 0.986117 | 0.203856 |
3 | -0.805561 | 0.224160 | 0.992170 | -1.046847 |
4 | -0.257810 | 0.598237 | 1.087677 | 1.733996 |
In [17]:
#using the describe method to see summary statistics.
data.describe()
Out[17]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | -0.032380 | -0.003029 | 0.074722 | 0.029196 |
std | 0.991572 | 0.972940 | 0.937606 | 0.956663 |
min | -3.541948 | -3.488546 | -2.799998 | -3.595273 |
25% | -0.663346 | -0.640519 | -0.556768 | -0.658451 |
50% | -0.026216 | -0.038468 | 0.085684 | 0.041386 |
75% | 0.583050 | 0.647501 | 0.676943 | 0.691507 |
max | 3.232359 | 2.951305 | 3.066394 | 3.040598 |
let’s find values whose absolute value exceeds 3 in a column with 1 index.
In [18]:
#assigning the column with the 1st index of the data to the col variable.
col=data[1]
In [19]:
col[np.abs(col)>3]
Out[19]:
446 -3.126541 938 -3.488546 Name: 1, dtype: float64
In [20]:
#To find rows with at least one absolute value exceeding 3 in the entire data set
data[(np.abs(data)>3).any(1)]
Out[20]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
331 | -3.541948 | -1.260699 | 0.602065 | -0.489064 |
396 | -3.121986 | 0.180793 | 0.289388 | -0.949250 |
446 | 1.407799 | -3.126541 | 0.692714 | 0.296811 |
619 | -0.239821 | -1.725038 | -0.766128 | -3.595273 |
676 | -0.037124 | -0.861719 | 3.066394 | 1.267702 |
713 | 3.045511 | -0.058880 | -0.191928 | 0.909453 |
912 | 3.232359 | 0.292739 | -0.357451 | 0.677699 |
938 | -0.731734 | -3.488546 | 0.187923 | -1.593345 |
973 | 0.495800 | -1.063279 | -1.364449 | 3.040598 |