Pandas drop_duplicates()

The drop_duplicates() method in Pandas is used to drop duplicate rows from a DataFrame.

Example

import pandas as pd

# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30]}
df = pd.DataFrame(data)

# drop duplicate rows based on all columns
result = df.drop_duplicates()

# display the result 
print(result)

'''
Output

      Name  Age
0    Alice   25
1      Bob   30
3  Charlie   35

'''

drop_duplicates() Syntax

The syntax of the drop_duplicates() method in Pandas is:

df.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

drop_duplicates() Arguments

The drop_duplicates() method takes the following arguments:

subset (optional) - a column label or list of column labels to consider when identifying duplicates; by default, all columns are used
keep (optional) - specifies which duplicates to keep: 'first' (default), 'last', or False to drop all of them
inplace (optional) - if True, modifies the original DataFrame in place and returns None; if False, returns a new DataFrame
ignore_index (optional) - if True, relabels the index of the resulting DataFrame as 0, 1, ..., n - 1

drop_duplicates() Return Value

The drop_duplicates() method in Pandas returns a new DataFrame with duplicate rows removed. If inplace=True is passed, it modifies the original DataFrame instead and returns None.
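
As a small illustration of the return value, the sketch below uses a made-up three-row DataFrame (the variable names result and returned are chosen only for this example). It shows that the default call returns a new DataFrame and leaves the original untouched, while a call with inplace=True returns None.

```python
import pandas as pd

# hypothetical sample data for illustration
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 25]})

# default: a new DataFrame is returned and df is left unchanged
result = df.drop_duplicates()
print(len(df), len(result))   # 3 2

# inplace=True: df itself is modified and the call returns None
returned = df.drop_duplicates(inplace=True)
print(returned)               # None
print(len(df))                # 2
```

Because the in-place call returns None, assigning its result back to df would discard the DataFrame.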

Example 1: Remove Duplicate Rows Across All Columns

import pandas as pd

# create a sample DataFrame with duplicate data
data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}

df = pd.DataFrame(data)

# drop duplicate rows based on all columns
# keeping the first occurrence
result = df.drop_duplicates()

# display the result
print(result)

Output

   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
4           4    David   21
6           5      Eve   22

In the above example, we have used the drop_duplicates() method to remove duplicate rows across all columns, keeping only the first occurrence of each unique row.

It removes the following duplicate rows:

Row with Student_ID: 2, Name: Bob, Age: 19 (second occurrence of Bob)
Row with Student_ID: 1, Name: Alice, Age: 18 (second occurrence of Alice)
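
To check which rows count as duplicates before dropping them, one option is the duplicated() method, which returns a boolean Series. The sketch below reuses the data from this example; the names mask and removed are chosen only for illustration.

```python
import pandas as pd

data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}
df = pd.DataFrame(data)

# True marks a row that repeats an earlier row (keep='first' is the default)
mask = df.duplicated()

# the rows that drop_duplicates() would remove
removed = df[mask]
print(removed)

# selecting the unmarked rows gives the same result as drop_duplicates()
print(df[~mask].equals(df.drop_duplicates()))
```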

Example 2: Drop Duplicate Rows Based on Subset of Columns

import pandas as pd

# create a sample DataFrame with duplicate data
data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}

df = pd.DataFrame(data)

# drop duplicate rows based on a subset of columns ('Student_ID' and 'Name')
# keeping the first occurrence, and modify original DataFrame in place
df.drop_duplicates(subset=['Student_ID', 'Name'], inplace=True)

# display the result
print(df)

Output

   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
4           4    David   21
6           5      Eve   22

In this example, we have used the drop_duplicates() method with the subset parameter set to ['Student_ID', 'Name'].

This means that duplicates will be identified and removed based on the combination of the Student_ID and Name columns.

Here, the inplace=True argument means that the original DataFrame df is modified in place. The method returns None rather than a new DataFrame.
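
In the example above, the repeated rows match in every column, so using subset gives the same result as comparing all columns. The sketch below uses hypothetical data in which the two rows for Bob differ in Age, to show a case where subset changes the outcome.

```python
import pandas as pd

# hypothetical data: Bob appears twice with different ages
df = pd.DataFrame({
    'Student_ID': [1, 2, 2, 3],
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'Age': [18, 19, 20, 21]
})

# comparing all columns: no two rows are identical, so nothing is dropped
all_columns = df.drop_duplicates()

# comparing only Student_ID and Name: the second Bob row is dropped
by_subset = df.drop_duplicates(subset=['Student_ID', 'Name'])

print(len(all_columns))   # 4
print(len(by_subset))     # 3
```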

Example 3: Use of keep argument in drop_duplicates()

The keep argument specifies which of the duplicate rows to keep. It can take one of the following values:

'first' - keep the first occurrence (default behavior).
'last' - keep the last occurrence.
False - drop every row that has a duplicate, keeping none of the occurrences.

Let's look at an example.

import pandas as pd

# create a sample DataFrame with duplicate data
data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}

df = pd.DataFrame(data)

# keep the first occurrence of each duplicate
df_keep_first = df.drop_duplicates(keep='first')

print("Keep the first occurrence:")
print(df_keep_first)
print()

# keep the last occurrence of each duplicate
df_keep_last = df.drop_duplicates(keep='last')

print("Keep the last occurrence:")
print(df_keep_last)
print()

# remove all duplicates
df_remove_all = df.drop_duplicates(keep=False)

print("Remove all duplicates:")
print(df_remove_all)

Output

Keep the first occurrence:
   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
4           4    David   21
6           5      Eve   22

Keep the last occurrence:
   Student_ID     Name  Age
2           3  Charlie   20
3           2      Bob   19
4           4    David   21
5           1    Alice   18
6           5      Eve   22

Remove all duplicates:
   Student_ID     Name  Age
2           3  Charlie   20
4           4    David   21
6           5      Eve   22
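
The keep argument can also be combined with subset. As a minimal sketch, assuming a hypothetical table in which each student may have several score entries listed oldest first (the Score column and the name latest are not part of the example above), keep='last' retains only the most recent entry per student.

```python
import pandas as pd

# hypothetical data: each student may have several score entries, oldest first
df = pd.DataFrame({
    'Student_ID': [1, 2, 1, 3, 2],
    'Score': [70, 65, 85, 90, 75]
})

# keep only the last entry recorded for each Student_ID
latest = df.drop_duplicates(subset='Student_ID', keep='last')
print(latest)
```

The rows that remain keep their original index labels, so the result is ordered by where each student's last entry appeared in df.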

Example 4: Reset Index for the Resulting DataFrame

import pandas as pd

# create a sample DataFrame with duplicate data
data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}

df = pd.DataFrame(data)

# set ignore_index to True
df_deduplicated_ignore_index = df.drop_duplicates(subset=['Student_ID', 'Name'], ignore_index=True)

print("With ignore_index=True:")
print(df_deduplicated_ignore_index)
print()

# set ignore_index to False (Default)
df_deduplicated_default_index = df.drop_duplicates(subset=['Student_ID', 'Name'])

print("With ignore_index=False (Default):")
print(df_deduplicated_default_index)

Output

With ignore_index=True:
   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
3           4    David   21
4           5      Eve   22

With ignore_index=False (Default):
   Student_ID     Name  Age
0           1    Alice   18
1           2      Bob   19
2           3  Charlie   20
4           4    David   21
6           5      Eve   22

Here,

ignore_index=True relabels the rows of the resulting DataFrame as 0, 1, ..., n - 1.
ignore_index=False is the default behavior, which retains the original index labels of the rows that remain.
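
For comparison, the sketch below checks that passing ignore_index=True gives the same result as calling reset_index(drop=True) on the deduplicated DataFrame. It reuses the data from this example; the names a and b are chosen only for illustration.

```python
import pandas as pd

data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}
df = pd.DataFrame(data)

# relabel the index as part of the drop_duplicates() call
a = df.drop_duplicates(ignore_index=True)

# drop duplicates first, then discard the old index in a second step
b = df.drop_duplicates().reset_index(drop=True)

print(a.index.tolist())   # [0, 1, 2, 3, 4]
print(a.equals(b))        # True
```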
