In large datasets, we often encounter duplicate entries in tables. These duplicate entries can throw off our analysis and skew the results.
Pandas provides several methods to find and remove duplicate entries in DataFrames.
Find Duplicate Entries
We can find duplicate entries in a DataFrame using the duplicated() method. It returns True for each row that is a duplicate of an earlier row and False otherwise.
import pandas as pd
# create dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# check for duplicate entries
print(df.duplicated())
Output
0    False
1    False
2     True
3     True
4    False
dtype: bool
In the above example, we checked for duplicate entries in df using the duplicated() method. It returned a Series with boolean values indicating whether each row is a duplicate.
Here, we got True in the third and the fourth rows because they are duplicates of the first and the second rows respectively.
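The boolean Series returned by duplicated() can also be used for boolean indexing or counting. Here is a small sketch along those lines, reusing the same DataFrame (this is an extension, not part of the original example):
import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# select only the rows flagged as duplicates
print(df[df.duplicated()])

# count how many rows are duplicates
print(df.duplicated().sum())
Here, the selection would show the third and the fourth rows, and the count would be 2.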
Example: Find Duplicates Based on Columns
By default, duplicated() considers all columns. To find duplicates based on certain columns only, we can pass those column names as a list to the subset argument of duplicated().
import pandas as pd
# create dataframe
data = {
    'Name': ['John', 'Anna', 'Johnny', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Las Vegas', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# check for duplicate entries in columns Name and Age
print(df.duplicated(subset=['Name', 'Age']))
Output
0    False
1    False
2    False
3     True
4    False
dtype: bool
In this example, we checked for duplicate entries based on the Name and Age columns only.
If you look at the Name and Age columns, the fourth row is a duplicate of the second row. Hence, the boolean value of the fourth row is True in the output.
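Note that duplicated() flags only the later occurrences by default. If you also want the first occurrence marked, it accepts a keep=False argument. A small sketch reusing the same data (an assumption-free but optional extension of the example above):
import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'Johnny', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Las Vegas', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# keep=False marks every occurrence of a repeated Name and Age pair,
# including the first one
print(df.duplicated(subset=['Name', 'Age'], keep=False))
With keep=False, both the second and the fourth rows would be marked True, since they share the same Name and Age.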
Remove Duplicate Entries
We can remove duplicate entries in Pandas using the drop_duplicates() method. For example,
import pandas as pd
# create dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# remove duplicates
df.drop_duplicates(inplace=True)
print(df)
Output
   Name  Age         City
0  John   28     New York
1  Anna   24  Los Angeles
4  John   19      Chicago
In this example, we removed duplicate entries from df using drop_duplicates().
Here, inplace=True specifies that the changes are to be made to the original DataFrame.
Notice that drop_duplicates() keeps the first occurrence of each duplicate entry and removes the rest by default. Here, the first and the second rows are kept while the third and the fourth rows are removed.
To keep the last entry instead, we can pass the keep='last' argument. For example,
import pandas as pd
# create dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# remove duplicates, keep last entries
df.drop_duplicates(keep='last', inplace=True)
print(df)
Output
   Name  Age         City
2  John   28     New York
3  Anna   24  Los Angeles
4  John   19      Chicago
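The drop_duplicates() method accepts the same subset (and keep) arguments as duplicated(), so duplicates can also be removed based on specific columns. A short sketch, assuming we only care about repeated Name and Age pairs and reuse the data from the subset example above:
import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'Johnny', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Las Vegas', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# remove rows that repeat an existing Name and Age pair,
# keeping the first occurrence
df.drop_duplicates(subset=['Name', 'Age'], inplace=True)

print(df)
Here, only the fourth row would be dropped, since its Name and Age match the second row even though the City differs.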