Pandas drop_duplicates()

The drop_duplicates() method in Pandas is used to drop duplicate rows from a DataFrame.

Example

import pandas as pd

# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30]}
df = pd.DataFrame(data)

# drop duplicate rows based on all columns
result = df.drop_duplicates()

# display the result
print(result)

'''
Output

      Name  Age
0    Alice   25
1      Bob   30
3  Charlie   35
'''

drop_duplicates() Syntax

The syntax of the drop_duplicates() method in Pandas is:

df.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

drop_duplicates() Arguments

The drop_duplicates() method takes the following arguments:

  • subset (optional) - a column label or list of column labels to consider when identifying duplicates; by default, all columns are used
  • keep (optional) - specifies which duplicates to keep ('first', 'last', or False)
  • inplace (optional) - if True, modifies the original DataFrame in place; if False, returns a new DataFrame
  • ignore_index (optional) - if True, relabels the index of the resulting DataFrame as 0, 1, ..., n - 1

drop_duplicates() Return Value

The drop_duplicates() method in Pandas returns a new DataFrame with duplicate rows removed. If inplace=True is passed, it modifies the original DataFrame instead and returns None.
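The effect of inplace on the return value can be checked with a minimal sketch. The small DataFrame here is made up for illustration, and the None return is what the pandas documentation describes for inplace=True.

```python
import pandas as pd

# hypothetical sample data with one repeated row
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 25]})

# default (inplace=False): a new DataFrame is returned, df is unchanged
result = df.drop_duplicates()
print(len(df), len(result))   # 3 2

# inplace=True: df itself is modified
# the call is documented to return None in this case
returned = df.drop_duplicates(inplace=True)
print(returned)
print(len(df))                # 2
```

Because the in-place call does not return the DataFrame, writing df = df.drop_duplicates(inplace=True) would replace df with the return value rather than the deduplicated data.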


Example 1: Remove Duplicate Rows Across All Columns

import pandas as pd

# create a sample DataFrame with duplicate data
data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}

df = pd.DataFrame(data)

# drop duplicate rows based on all columns
# keeping the first occurrence
result = df.drop_duplicates()

# display the result
print(result)

Output

   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
4           4    David   21
6           5      Eve   22

In the above example, we have used the drop_duplicates() method to remove duplicate rows across all columns, keeping only the first occurrence of each unique row.

It removes the following duplicate rows:

  1. Row at index 3 with Student_ID: 2, Name: Bob, Age: 19 (second occurrence of Bob)
  2. Row at index 5 with Student_ID: 1, Name: Alice, Age: 18 (second occurrence of Alice)

Example 2: Drop Duplicate Rows Based on Subset of Columns

import pandas as pd

# create a sample DataFrame with duplicate data
data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}

df = pd.DataFrame(data)

# drop duplicate rows based on a subset of columns ('Student_ID' and 'Name')
# keeping the first occurrence, and modify original DataFrame in place
df.drop_duplicates(subset=['Student_ID', 'Name'], inplace=True)

# display the result
print(df)

Output

   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
4           4    David   21
6           5      Eve   22

In this example, we have used the drop_duplicates() method with the subset parameter set to ['Student_ID', 'Name'].

This means that duplicates will be identified and removed based on the combination of the Student_ID and Name columns.

Here, the inplace=True argument means that the original DataFrame df is modified directly, and the method returns None instead of a new DataFrame.
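With the sample data above, every repeated Student_ID and Name pair also has the same Age, so the result is the same as comparing all columns. A minimal sketch with hypothetical data, where one student appears twice with different ages, shows a case in which subset changes the outcome:

```python
import pandas as pd

# hypothetical data: Bob appears twice with different ages
df = pd.DataFrame({
    'Student_ID': [1, 2, 2, 3],
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'Age': [18, 19, 20, 20],
})

# comparing all columns: the two Bob rows differ in Age, so nothing is dropped
all_columns = df.drop_duplicates()

# comparing only Student_ID and Name: the second Bob row counts as a duplicate
by_subset = df.drop_duplicates(subset=['Student_ID', 'Name'])

print(len(all_columns))          # 4
print(by_subset.index.tolist())  # [0, 1, 3]
```

Columns outside subset are still returned in the result; they are only ignored when deciding which rows count as duplicates.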


Example 3: Use of keep argument in drop_duplicates()

The keep argument specifies which of the duplicate rows to keep. It can take one of the following values:

  1. 'first' - keep the first occurrence and drop the rest (default behavior).
  2. 'last' - keep the last occurrence and drop the rest.
  3. False - drop every row that has a duplicate, keeping none of the occurrences.

Let's look at an example.

import pandas as pd

# create a sample DataFrame with duplicate data
data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}

df = pd.DataFrame(data)

# keep the first occurrence of each duplicate
df_keep_first = df.drop_duplicates(keep='first')

print("Keep the first occurrence:")
print(df_keep_first)
print()

# keep the last occurrence of each duplicate
df_keep_last = df.drop_duplicates(keep='last')

print("\nKeep the last occurrence:")
print(df_keep_last)
print()

# remove all duplicates
df_remove_all = df.drop_duplicates(keep=False)

print("\nRemove all duplicates:")
print(df_remove_all)

Output

Keep the first occurrence:
   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
4           4    David   21
6           5      Eve   22

Keep the last occurrence:
   Student_ID     Name  Age
2           3  Charlie   20
3           2      Bob   19
4           4    David   21
5           1    Alice   18
6           5      Eve   22

Remove all duplicates:
   Student_ID     Name  Age
2           3  Charlie   20
4           4    David   21
6           5      Eve   22
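One way to see which rows each keep value drops is the related DataFrame.duplicated() method, which accepts the same keep argument and returns a boolean Series marking the rows treated as duplicates. The sketch below reuses the sample data from this example; the dropped_by_keep dictionary is only a helper name introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
})

# index labels of the rows that each keep value would drop
dropped_by_keep = {}

for keep in ['first', 'last', False]:
    # True marks a row that drop_duplicates(keep=keep) removes
    mask = df.duplicated(keep=keep)
    dropped_by_keep[keep] = df[mask].index.tolist()
    print(keep, dropped_by_keep[keep])

# first [3, 5]
# last [0, 1]
# False [0, 1, 3, 5]
```

The dropped labels line up with the output above: keep='first' removes the later copies of Bob and Alice, keep='last' removes the earlier copies, and keep=False removes both copies of each.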

Example 4: Reset Index for the Resulting DataFrame

import pandas as pd

# create a sample DataFrame with duplicate data
data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}

df = pd.DataFrame(data)

# set ignore_index to True
df_deduplicated_ignore_index = df.drop_duplicates(subset=['Student_ID', 'Name'], ignore_index=True)

print("With ignore_index=True:")
print(df_deduplicated_ignore_index)
print()

# set ignore_index to False (Default)
df_deduplicated_default_index = df.drop_duplicates(subset=['Student_ID', 'Name'])

print("\nWith ignore_index=False (Default):")
print(df_deduplicated_default_index)

Output

With ignore_index=True:
   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
3           4    David   21
4           5      Eve   22

With ignore_index=False (Default):
   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
4           4    David   21
6           5      Eve   22

Here,

  1. ignore_index=True results in a DataFrame whose index is relabeled 0, 1, 2, and so on, with no gaps.
  2. ignore_index=False is the default behavior, which retains the original index labels of the remaining rows (here 0, 1, 2, 4, 6).
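As a rough check on what ignore_index=True does, the sketch below compares it with dropping duplicates first and then calling reset_index(drop=True). The small DataFrame is hypothetical, and treating the two approaches as interchangeable is an assumption that holds for simple cases like this one.

```python
import pandas as pd

# hypothetical data with one repeated row
df = pd.DataFrame({'Student_ID': [1, 2, 2, 3],
                   'Name': ['Alice', 'Bob', 'Bob', 'Charlie']})

# relabel the index as part of drop_duplicates()
a = df.drop_duplicates(ignore_index=True)

# drop duplicates first, then discard the old index in a second step
b = df.drop_duplicates().reset_index(drop=True)

print(a.index.tolist())  # [0, 1, 2]
print(a.equals(b))       # True
```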
