Pandas join()

The join() method in Pandas allows us to combine two DataFrames based on their indexes.

Example

import pandas as pd

# dataFrames representing scores of two subjects for students
df1 = pd.DataFrame({'Math': [85, 90]}, index=['Alice', 'Bob'])
df2 = pd.DataFrame({'Physics': [92, 88]}, index=['Alice', 'Bob'])

# join the DataFrames result = df1.join(df2)
print(result) ''' Output Math Physics Alice 85 92 Bob 90 88 '''

join() Syntax

The syntax of the join() method in Pandas is:

df.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

join() Arguments

The join() function takes following arguments:

  • other - DataFrame to be join
  • on (optional) - column to join on the index in other
  • how (optional) - specifies how to join dataframes. Default is left
  • lsuffix & rsuffix (optional) - suffixes to use from left and right DataFrame columns when overlapping columns are encountered.
  • sort (optional) - sort the join keys. Default is False.

join() Return Value

The join() method returns a new DataFrame that results from joining two (or more) dataframes.


Example1: Join Two DataFrames

import pandas as pd

# dataFrame representing Math scores of students
math_scores = pd.DataFrame({
    'Math_Score': [85, 90, 78, 88],
}, index=['Alice', 'Bob', 'Charlie', 'David'])

# dataFrame representing Physics scores of students
physics_scores = pd.DataFrame({
    'Physics_Score': [92, 84, 88, 79, 86],
}, index=['Alice', 'Bob', 'Eva', 'Charlie', 'Frank'])

# join the two DataFrames using the default left join result = math_scores.join(physics_scores)
print(result)

Output

          Math_Score  Physics_Score
Alice            85           92.0
Bob              90           84.0
Charlie          78           79.0
David            88            NaN

In this example, math_scores and physics_scores are joined based on their student names (which are indices).

Here, Alice, Bob, and Charlie have both Math and Physics scores. However, David only has a Math score, thus, his Physics score is NaN.

And, Eva and Frank, who only appear in the Physics score list, are not included in the result because we used a default left join.

Note: If you were to use an outer join, Eva and Frank would also be included in the resulting DataFrame.


Example2: Perform Outer Join

import pandas as pd

# dataFrame representing Math scores of students
math_scores = pd.DataFrame({
    'Math_Score': [85, 90, 78, 88],
}, index=['Alice', 'Bob', 'Charlie', 'David'])

# dataFrame representing Physics scores of students
physics_scores = pd.DataFrame({
    'Physics_Score': [92, 84, 88, 79, 86],
}, index=['Alice', 'Bob', 'Eva', 'Charlie', 'Frank'])

# join the two DataFrames using an outer join result = math_scores.join(physics_scores, how='outer')
print(result)

Output

          Math_Score  Physics_Score
Alice          85.0           92.0
Bob            90.0           84.0
Charlie        78.0           79.0
David          88.0            NaN
Eva             NaN           88.0
Frank           NaN           86.0

Here,

  1. Alice, Bob, and Charlie have scores for both Math and Physics.
  2. David only has a score for Math.
  3. Eva and Frank only have scores for Physics.

The NaN values indicate the absence of scores in one of the subjects for the respective students.


Example 3: Perform Right Join

import pandas as pd

# dataFrame representing Math scores of students
math_scores = pd.DataFrame({
    'Math_Score': [85, 90, 78, 88],
}, index=['Alice', 'Bob', 'Charlie', 'David'])

# dataFrame representing Physics scores of students
physics_scores = pd.DataFrame({
    'Physics_Score': [92, 84, 88, 79, 86],
}, index=['Alice', 'Bob', 'Eva', 'Charlie', 'Frank'])
# join the two DataFrames using a right join result = math_scores.join(physics_scores, how='right')
print(result)

Output

            Math_Score  Physics_Score
Alice          85.0             92
Bob            90.0             84
Eva             NaN             88
Charlie        78.0             79
Frank           NaN             86

Here,

  1. All unique indices from the right DataFrame physics_scores are included in the output.
  2. Only Alice, Bob, Charlie, and Eva have matching scores in the left DataFrame math_scores, so their Math scores are displayed.
  3. David and Frank, who don't have matching Math scores, show NaN values in the Math_Score column.
  4. All students have Physics scores, so there are no NaN values in the Physics_Score column.

Example 4: Setting a New Column as Index

import pandas as pd

# DataFrame representing Math scores of students
math_scores = pd.DataFrame({
    'Student_Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Math_Score': [85, 90, 78, 88],
})

# DataFrame representing Physics scores of students
physics_scores = pd.DataFrame({
    'Student_Name': ['Alice', 'Bob', 'Eva', 'Charlie', 'Frank'],
    'Physics_Score': [92, 84, 88, 79, 86],
})

# join the two DataFrames using the 'Student_Name' column as the key result = math_scores.join(physics_scores.set_index('Student_Name'), on='Student_Name')
print(result)

Output

      Student_Name      Math_Score     Physics_Score
0        Alice               85             92.0
1          Bob               90             84.0
2      Charlie              78             79.0
3        David              88              NaN

Here, in the join() method, we specified on='Student_Name' to indicate that the join should be performed based on the Student_Name column.

Since the join is performed based on the Student_Name column, we can see the math and physics scores combined for each student.

Students Eva and Frank from the physics_scores DataFrame are not included in the result because they do not have matching Student_Name values in the math_scores DataFrame.


Example 5: Perform a Left Join With Suffixes

import pandas as pd

# create two DataFrames with overlapping column names
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 78],
})

df2 = pd.DataFrame({
    'ID': [2, 3, 4],
    'Name': ['Bob', 'Charlie', 'David'],
    'Score': [92, 84, 88],
})

# perform a left join with suffixes result_with_suffixes = df1.join(df2.set_index('ID'), on='ID', how='left', lsuffix='_left', rsuffix='_right')
print("Result with suffixes:") print(result_with_suffixes)

Output

Result with suffixes:
    ID     Name_left  Score_left     Name_right    Score_right
0   1     Alice             85            NaN             NaN
1   2       Bob             90            Bob              92.0
2   3   Charlie            78           Charlie            84.0

In the above example, the lsuffix='_left' and rsuffix='_right' parameters are used to specify custom suffixes for the columns from the left and right DataFrames in the result.

Hence, in the output we can see that the columns from the left DataFrame df1 have _left suffixes, and the columns from the right DataFrame df2 have _right suffixes to differentiate them.

The left join retains all rows from df1 and adds matching rows from df2.


Example 6: Sort the Join Keys

import pandas as pd

# create two DataFrames with overlapping column names
df1 = pd.DataFrame({
    'ID': [3, 1, 2],
    'Name': ['Charlie', 'Alice', 'Bob'],
    'Score': [78, 85, 90],
})

df2 = pd.DataFrame({
    'ID': [2, 4, 3],
    'Grade': ['B', 'A', 'C'],
})

# perform a left join without specifying sort (default: sort=False) result_without_sort = df1.join(df2.set_index('ID'), on='ID', how='left')
# perform a left join with sort=True result_with_sort = df1.join(df2.set_index('ID'), on='ID', how='left', sort=True)
print("Result without sort:") print(result_without_sort) print("\nResult with sort:") print(result_with_sort)

Output

Result without sort:
   ID    Name   Score  Grade
0   3  Charlie     78     C
1   1    Alice     85   NaN
2   2      Bob     90     B

Result with sort:
    ID    Name    Score     Grade
1   1    Alice      85       NaN
2   2      Bob      90        B
0   3  Charlie      78       C

Here, In the result with sorting sort=True, the result is sorted in ascending order based on the ID column.