Combining and Merging in Pandas¶
The datasets you want to analyze can come from different sources. Before starting data analyses, you may want to merge these datasets. To combine datasets, you can use the merge, join, and concat methods in Pandas. Pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL.
- pd.merge(left, right, how=’inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True)
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Piyush', 'Gautam', 'Garry', 'Abhishek', 'Ankit'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
df1
id | Name | subject_id | |
---|---|---|---|
0 | 1 | Piyush | sub1 |
1 | 2 | Gautam | sub2 |
2 | 3 | Garry | sub4 |
3 | 4 | Abhishek | sub6 |
4 | 5 | Ankit | sub5 |
df2 = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Saurabh', 'Akshay', 'Rajat', 'Ankush', 'Sam'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
df2
id | Name | subject_id | |
---|---|---|---|
0 | 1 | Saurabh | sub2 |
1 | 2 | Akshay | sub4 |
2 | 3 | Rajat | sub3 |
3 | 4 | Ankush | sub6 |
4 | 5 | Sam | sub5 |
Merging DataFrame¶
To combine rows using one or more keys, use the merge or join procedures. Let’s make two datasets to demonstrate this.
pd.merge(df1,df2)
id | Name | subject_id |
---|
Merge Two DataFrames on Column ID¶
pd.merge(df1,df2,on='id')
id | Name_x | subject_id_x | Name_y | subject_id_y | |
---|---|---|---|---|---|
0 | 1 | Piyush | sub1 | Saurabh | sub2 |
1 | 2 | Gautam | sub2 | Akshay | sub4 |
2 | 3 | Garry | sub4 | Rajat | sub3 |
3 | 4 | Abhishek | sub6 | Ankush | sub6 |
4 | 5 | Ankit | sub5 | Sam | sub5 |
Merge Two DataFrames on Column Subject ID¶
pd.merge(df1,df2,on='subject_id')
id_x | Name_x | subject_id | id_y | Name_y | |
---|---|---|---|---|---|
0 | 2 | Gautam | sub2 | 1 | Saurabh |
1 | 3 | Garry | sub4 | 2 | Akshay |
2 | 4 | Abhishek | sub6 | 4 | Ankush |
3 | 5 | Ankit | sub5 | 5 | Sam |
Merge Two DataFrames on Column ID and Subject ID Together¶
(pd.merge(df1,df2,on=['id','subject_id']))
id | Name_x | subject_id | Name_y | |
---|---|---|---|---|
0 | 4 | Abhishek | sub6 | Ankush |
1 | 5 | Ankit | sub5 | Sam |
Inner Join (default)¶
pd.merge(df1,df2, on='subject_id')
id_x | Name_x | subject_id | id_y | Name_y | |
---|---|---|---|---|---|
0 | 2 | Gautam | sub2 | 1 | Saurabh |
1 | 3 | Garry | sub4 | 2 | Akshay |
2 | 4 | Abhishek | sub6 | 4 | Ankush |
3 | 5 | Ankit | sub5 | 5 | Sam |
pd.merge(df1,df2, on='subject_id', how='inner')
id_x | Name_x | subject_id | id_y | Name_y | |
---|---|---|---|---|---|
0 | 2 | Gautam | sub2 | 1 | Saurabh |
1 | 3 | Garry | sub4 | 2 | Akshay |
2 | 4 | Abhishek | sub6 | 4 | Ankush |
3 | 5 | Ankit | sub5 | 5 | Sam |
Left Join¶
pd.merge(df1,df2, on='subject_id', how='left')
id_x | Name_x | subject_id | id_y | Name_y | |
---|---|---|---|---|---|
0 | 1 | Piyush | sub1 | NaN | NaN |
1 | 2 | Gautam | sub2 | 1.0 | Saurabh |
2 | 3 | Garry | sub4 | 2.0 | Akshay |
3 | 4 | Abhishek | sub6 | 4.0 | Ankush |
4 | 5 | Ankit | sub5 | 5.0 | Sam |
Right Join¶
pd.merge(df1,df2, on='subject_id', how='right')
id_x | Name_x | subject_id | id_y | Name_y | |
---|---|---|---|---|---|
0 | 2.0 | Gautam | sub2 | 1 | Saurabh |
1 | 3.0 | Garry | sub4 | 2 | Akshay |
2 | NaN | NaN | sub3 | 3 | Rajat |
3 | 4.0 | Abhishek | sub6 | 4 | Ankush |
4 | 5.0 | Ankit | sub5 | 5 | Sam |
Outer Join¶
pd.merge(df1,df2, on='subject_id', how='outer')
id_x | Name_x | subject_id | id_y | Name_y | |
---|---|---|---|---|---|
0 | 1.0 | Piyush | sub1 | NaN | NaN |
1 | 2.0 | Gautam | sub2 | 1.0 | Saurabh |
2 | 3.0 | Garry | sub4 | 2.0 | Akshay |
3 | 4.0 | Abhishek | sub6 | 4.0 | Ankush |
4 | 5.0 | Ankit | sub5 | 5.0 | Sam |
5 | NaN | NaN | sub3 | 3.0 | Rajat |
Another Example¶
One to One Join¶
Perhaps the simplest type of merge expression is the one-to-one join, which is in many ways very similar to the column-wise concatenation
df1 = pd.DataFrame({'employee': ['Karan', 'Robin', 'Akshay', 'Arpit'],'group': ['IT', 'Sales', 'Sales', 'HR']})
df2 = pd.DataFrame({'employee': ['Akshay', 'Karan', 'Robin', 'Arpit'],'hire_date': [2018, 2017, 2015, 2014]})
df1
employee | group | |
---|---|---|
0 | Karan | IT |
1 | Robin | Sales |
2 | Akshay | Sales |
3 | Arpit | HR |
df2
employee | hire_date | |
---|---|---|
0 | Akshay | 2018 |
1 | Karan | 2017 |
2 | Robin | 2015 |
3 | Arpit | 2014 |
To combine this information into a single DataFrame, we can use the pd.merge() function:¶
df3 = pd.merge(df1, df2)
df3
employee | group | hire_date | |
---|---|---|---|
0 | Karan | IT | 2017 |
1 | Robin | Sales | 2015 |
2 | Akshay | Sales | 2018 |
3 | Arpit | HR | 2014 |
Many to One Joins¶
Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate.
df4 = pd.DataFrame({'group': ['IT', 'Sales', 'Sales', 'HR'],'supervisor': ['Keisha', 'Priya', 'Chetna','Atharv']})
df4
group | supervisor | |
---|---|---|
0 | IT | Keisha |
1 | Sales | Priya |
2 | Sales | Chetna |
3 | HR | Atharv |
pd.merge(df3, df4)
employee | group | hire_date | supervisor | |
---|---|---|---|---|
0 | Karan | IT | 2017 | Keisha |
1 | Robin | Sales | 2015 | Priya |
2 | Robin | Sales | 2015 | Chetna |
3 | Akshay | Sales | 2018 | Priya |
4 | Akshay | Sales | 2018 | Chetna |
5 | Arpit | HR | 2014 | Atharv |
The resulting DataFrame has an additional column with the “supervisor” information, where the information is repeated in one or more locations as required by the inputs.
Many to Many Joins¶
Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge.
df5 = pd.DataFrame({'group': ['IT', 'Sales', 'Sales', 'HR','HR','IT'],'skills': ['Python', 'Digital Marketing', 'Spreadsheets', 'Email-Marketing','Tele-Calling', 'Data Science']})
df5
group | skills | |
---|---|---|
0 | IT | Python |
1 | Sales | Digital Marketing |
2 | Sales | Spreadsheets |
3 | HR | Email-Marketing |
4 | HR | Tele-Calling |
5 | IT | Data Science |
pd.merge(df1, df5)
employee | group | skills | |
---|---|---|---|
0 | Karan | IT | Python |
1 | Karan | IT | Data Science |
2 | Robin | Sales | Digital Marketing |
3 | Robin | Sales | Spreadsheets |
4 | Akshay | Sales | Digital Marketing |
5 | Akshay | Sales | Spreadsheets |
6 | Arpit | HR | Email-Marketing |
7 | Arpit | HR | Tele-Calling |
The left_on and right_on keywords¶
At times you may wish to merge two datasets with different column names; for exam‐ ple, we may have a dataset in which the employee name is labeled as “name” rather than “employee”. In this case, we can use the left_on and right_on keywords to specify the two column names:
df1 = pd.DataFrame({'employee': ['Karan', 'Robin', 'Akshay', 'Arpit'],'group': ['IT', 'Sales', 'Sales', 'HR']})
df1
employee | group | |
---|---|---|
0 | Karan | IT |
1 | Robin | Sales |
2 | Akshay | Sales |
3 | Arpit | HR |
df2 = pd.DataFrame({'name': ['Karan', 'Robin', 'Akshay', 'Arpit'],'salary': [70000, 80000, 120000, 90000]})
df2
name | salary | |
---|---|---|
0 | Karan | 70000 |
1 | Robin | 80000 |
2 | Akshay | 120000 |
3 | Arpit | 90000 |
pd.merge(df1, df2, left_on='employee', right_on='name')
employee | group | name | salary | |
---|---|---|---|---|
0 | Karan | IT | Karan | 70000 |
1 | Robin | Sales | Robin | 80000 |
2 | Akshay | Sales | Akshay | 120000 |
3 | Arpit | HR | Arpit | 90000 |
The left_index and right_index keywords¶
Sometimes, rather than merging on a column, you would instead like to merge on an index. For example, your data might look like this:
df1 = pd.DataFrame({'employee': ['Karan', 'Robin', 'Akshay', 'Arpit'],'group': ['IT', 'Sales', 'Sales', 'HR']})
df2 = pd.DataFrame({'employee': ['Akshay', 'Karan', 'Robin', 'Arpit'],'hire_date': [2018, 2017, 2015, 2014]})
df1
employee | group | |
---|---|---|
0 | Karan | IT |
1 | Robin | Sales |
2 | Akshay | Sales |
3 | Arpit | HR |
df2
employee | hire_date | |
---|---|---|
0 | Akshay | 2018 |
1 | Karan | 2017 |
2 | Robin | 2015 |
3 | Arpit | 2014 |
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
df1a
group | |
---|---|
employee | |
Karan | IT |
Robin | Sales |
Akshay | Sales |
Arpit | HR |
df2a
hire_date | |
---|---|
employee | |
Akshay | 2018 |
Karan | 2017 |
Robin | 2015 |
Arpit | 2014 |
pd.merge(df1a, df2a, left_index=True, right_index=True)
group | hire_date | |
---|---|---|
employee | ||
Karan | IT | 2017 |
Robin | Sales | 2015 |
Akshay | Sales | 2018 |
Arpit | HR | 2014 |
Concatenate¶
df1 = pd.DataFrame({
'Name': ['Akash', 'Abhishek', 'Arpit', 'aayush', 'Ankit'],
'subject':['sub1','sub2','sub4','sub6','sub5'],
'Marks':[98,90,87,69,78]},
index=[1,2,3,4,5])
df1
Name | subject | Marks | |
---|---|---|---|
1 | Akash | sub1 | 98 |
2 | Abhishek | sub2 | 90 |
3 | Arpit | sub4 | 87 |
4 | aayush | sub6 | 69 |
5 | Ankit | sub5 | 78 |
df2= pd.DataFrame({
'Name': ['Bobby', 'Bhavnish', 'Bhavya', 'Bonny', 'Brett'],
'subject':['sub2','sub4','sub3','sub6','sub5'],
'Marks':[89,80,79,97,88]},
index=[1,2,3,4,5])
df2
Name | subject | Marks | |
---|---|---|---|
1 | Bobby | sub2 | 89 |
2 | Bhavnish | sub4 | 80 |
3 | Bhavya | sub3 | 79 |
4 | Bonny | sub6 | 97 |
5 | Brett | sub5 | 88 |
pd.concat([df1,df2])
Name | subject | Marks | |
---|---|---|---|
1 | Akash | sub1 | 98 |
2 | Abhishek | sub2 | 90 |
3 | Arpit | sub4 | 87 |
4 | aayush | sub6 | 69 |
5 | Ankit | sub5 | 78 |
1 | Bobby | sub2 | 89 |
2 | Bhavnish | sub4 | 80 |
3 | Bhavya | sub3 | 79 |
4 | Bonny | sub6 | 97 |
5 | Brett | sub5 | 88 |
Reset index¶
pd.concat([df1,df2],ignore_index=True)
Name | subject | Marks | |
---|---|---|---|
0 | Akash | sub1 | 98 |
1 | Abhishek | sub2 | 90 |
2 | Arpit | sub4 | 87 |
3 | aayush | sub6 | 69 |
4 | Ankit | sub5 | 78 |
5 | Bobby | sub2 | 89 |
6 | Bhavnish | sub4 | 80 |
7 | Bhavya | sub3 | 79 |
8 | Bonny | sub6 | 97 |
9 | Brett | sub5 | 88 |
Along the Axis = 0¶
pd.concat([df1,df2],ignore_index=True,axis=0)
Name | subject | Marks | |
---|---|---|---|
0 | Akash | sub1 | 98 |
1 | Abhishek | sub2 | 90 |
2 | Arpit | sub4 | 87 |
3 | aayush | sub6 | 69 |
4 | Ankit | sub5 | 78 |
5 | Bobby | sub2 | 89 |
6 | Bhavnish | sub4 | 80 |
7 | Bhavya | sub3 | 79 |
8 | Bonny | sub6 | 97 |
9 | Brett | sub5 | 88 |
Along the axis = 1¶
pd.concat([df1,df2],ignore_index=True,axis=1)
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
1 | Akash | sub1 | 98 | Bobby | sub2 | 89 |
2 | Abhishek | sub2 | 90 | Bhavnish | sub4 | 80 |
3 | Arpit | sub4 | 87 | Bhavya | sub3 | 79 |
4 | aayush | sub6 | 69 | Bonny | sub6 | 97 |
5 | Ankit | sub5 | 78 | Brett | sub5 | 88 |