Pandas drop_duplicates()

The drop_duplicates() method in Pandas is used to drop duplicate rows from a DataFrame.

Example

import pandas as pd

# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30]}
df = pd.DataFrame(data)

# drop duplicate rows based on all columns
result = df.drop_duplicates()

# display the result
print(result)

'''
Output

      Name  Age
0    Alice   25
1      Bob   30
3  Charlie   35
'''

drop_duplicates() Syntax

The syntax of the drop_duplicates() method in Pandas is:

df.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

drop_duplicates() Arguments

The drop_duplicates() method takes the following arguments:

  • subset (optional) - a column label or a list of labels to consider when identifying duplicates; by default, all columns are used
  • keep (optional) - specifies which duplicates to keep ('first', 'last', or False)
  • inplace (optional) - if True, modifies the original DataFrame in place; if False, returns a new DataFrame
  • ignore_index (optional) - if True, relabels the rows of the result 0, 1, ..., n - 1 instead of keeping the original index labels

drop_duplicates() Return Value

The drop_duplicates() method in Pandas returns a new DataFrame with duplicate rows removed. If inplace=True, it modifies the original DataFrame and returns None instead.
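
The difference between the two return behaviors can be checked with a short sketch. The DataFrame below is a small made-up sample, and the variable names deduped and returned are only illustrative.

```python
import pandas as pd

# small made-up sample with one repeated row
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 25]})

# default (inplace=False): a new DataFrame is returned and df is unchanged
deduped = df.drop_duplicates()
print(len(df), len(deduped))

# inplace=True: df itself is modified and the call returns None
returned = df.drop_duplicates(inplace=True)
print(returned)
print(len(df))
```

Because the in-place call returns None, writing df = df.drop_duplicates(inplace=True) would replace df with None, so the two styles should not be mixed.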


Example 1: Remove Duplicate Rows Across All Columns

import pandas as pd

# create a sample DataFrame with duplicate data
data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}

df = pd.DataFrame(data)

# drop duplicate rows based on all columns
# keeping the first occurrence
result = df.drop_duplicates()

# display the result
print(result)

Output

   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
4           4    David   21
6           5      Eve   22

In the above example, we have used the drop_duplicates() method to remove duplicate rows across all columns, keeping only the first occurrence of each unique row.

It removes the following duplicate rows:

  1. Row with Student_ID: 2, Name: Bob, Age: 19 (second occurrence of Bob)
  2. Row with Student_ID: 1, Name: Alice, Age: 18 (second occurrence of Alice)
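
To see which rows are treated as duplicates before dropping them, we can use the related duplicated() method, which returns a boolean Series. The sketch below reuses the same sample data; the variable name mask is only illustrative.

```python
import pandas as pd

data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}
df = pd.DataFrame(data)

# duplicated() marks a row as True when it repeats an earlier row
mask = df.duplicated()

# the rows that drop_duplicates() would remove (index 3 and 5 here)
print(df[mask])

# keeping the unmarked rows gives the same result as drop_duplicates()
print(df[~mask].equals(df.drop_duplicates()))
```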

Example 2: Drop Duplicate Rows Based on Subset of Columns

import pandas as pd

# create a sample DataFrame with duplicate data
data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}

df = pd.DataFrame(data)

# drop duplicate rows based on a subset of columns ('Student_ID' and 'Name')
# keeping the first occurrence, and modify original DataFrame in place
df.drop_duplicates(subset=['Student_ID', 'Name'], inplace=True)

# display the result
print(df)

Output

   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
4           4    David   21
6           5      Eve   22

In this example, we have used the drop_duplicates() method with the subset parameter set to ['Student_ID', 'Name'].

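This means that duplicates are identified and removed based on the combination of the Student_ID and Name columns only; the Age column is not compared. In this sample, the repeated rows also match on Age, so the output is the same as in Example 1.

Here, the inplace=True argument in the drop_duplicates() method means that the original DataFrame df is modified in place and the method returns None instead of a new DataFrame.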
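
The effect of subset is easier to see when rows agree on some columns but not on others. The following sketch uses made-up data in which the same student appears twice with different ages; the names all_columns and by_subset are only illustrative.

```python
import pandas as pd

# made-up data: the same student appears twice with different ages
df = pd.DataFrame({
    'Student_ID': [1, 2, 1],
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [18, 19, 20]
})

# no row is identical across all columns, so nothing is dropped
all_columns = df.drop_duplicates()

# rows 0 and 2 match on Student_ID and Name, so row 2 is dropped
by_subset = df.drop_duplicates(subset=['Student_ID', 'Name'])

print(len(all_columns), len(by_subset))
```

Here, the call without subset keeps all three rows, while the call with subset keeps only the rows at index 0 and 1.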


Example 3: Use of keep argument in drop_duplicates()

The keep argument specifies which duplicate values to keep. It can take one of the following values:

  1. 'first' - keep the first occurrence (default behavior).
  2. 'last' - keep the last occurrence.
  3. False - drop every row that has a duplicate, keeping none of the occurrences.

Let's look at an example.

import pandas as pd

# create a sample DataFrame with duplicate data
data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}

df = pd.DataFrame(data)

# keep the first occurrence of each duplicate
df_keep_first = df.drop_duplicates(keep='first')

print("Keep the first occurrence:")
print(df_keep_first)
print()

# keep the last occurrence of each duplicate
df_keep_last = df.drop_duplicates(keep='last')

print("Keep the last occurrence:")
print(df_keep_last)
print()

# remove all rows that have duplicates
df_remove_all = df.drop_duplicates(keep=False)

print("Remove all duplicates:")
print(df_remove_all)

Output

Keep the first occurrence:
   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
4           4    David   21
6           5      Eve   22

Keep the last occurrence:
   Student_ID     Name  Age
2           3  Charlie   20
3           2      Bob   19
4           4    David   21
5           1    Alice   18
6           5      Eve   22

Remove all duplicates:
   Student_ID     Name  Age
2           3  Charlie   20
4           4    David   21
6           5      Eve   22
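
Since keep depends on row order, a common pattern is to sort the DataFrame first and then keep the last row of each group. The sketch below assumes made-up data with hypothetical Attempt and Score columns, and keeps only the latest attempt of each student.

```python
import pandas as pd

# made-up data: each student has more than one attempt
df = pd.DataFrame({
    'Student_ID': [1, 2, 1, 2],
    'Attempt': [1, 1, 2, 2],
    'Score': [70, 65, 85, 60]
})

# sort so that the latest attempt comes last for each student,
# then keep only that last row per Student_ID
latest = (
    df.sort_values(['Student_ID', 'Attempt'])
      .drop_duplicates(subset='Student_ID', keep='last')
)

print(latest)
```

Here, the rows at index 2 and 3 are kept, because they hold the second attempt of each student.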

Example 4: Reset Index for the Resulting DataFrame

import pandas as pd

# create a sample DataFrame with duplicate data
data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}

df = pd.DataFrame(data)

# set ignore_index to True
df_deduplicated_ignore_index = df.drop_duplicates(subset=['Student_ID', 'Name'], ignore_index=True)

print("With ignore_index=True:")
print(df_deduplicated_ignore_index)
print()

# set ignore_index to False (Default)
df_deduplicated_default_index = df.drop_duplicates(subset=['Student_ID', 'Name'])

print("With ignore_index=False (Default):")
print(df_deduplicated_default_index)

Output

With ignore_index=True:
   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
3           4    David   21
4           5      Eve   22

With ignore_index=False (Default):
   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
4           4    David   21
6           5      Eve   22

Here,

  1. ignore_index=True results in a DataFrame whose index is reset to 0, 1, 2, and so on.
  2. ignore_index=False is the default behavior, which retains the original index labels of the rows that are kept.
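
The same reset index can also be obtained by calling reset_index(drop=True) on the deduplicated result. The sketch below compares the two approaches on the same sample data; the names with_flag and with_reset are only illustrative.

```python
import pandas as pd

data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}
df = pd.DataFrame(data)

# reset the index as part of drop_duplicates()
with_flag = df.drop_duplicates(ignore_index=True)

# reset the index in a separate step, discarding the old labels
with_reset = df.drop_duplicates().reset_index(drop=True)

# both approaches give the same DataFrame
print(with_flag.equals(with_reset))
```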