Sometimes, a dataset can have inaccurate entries due to reasons such as human errors during data input, sourcing data from unreliable places, etc.
This can significantly undermine the quality and reliability of the data analysis performed on it.
Let's take a DataFrame containing data about students of an all-boys elementary school.
Name Age Gender Standard 0 John 8 M 3 1 Michael 9 M 4 2 Tom 7 M 12 3 Alex 80 F 3 4 Ryan 100 M 5
In the above DataFrame, we can see a few obvious mistakes like:
- The ages of two students are listed as 80 and 100, which is too old for primary school students.
- The gender of Alex is marked as F. Since this is an all-boys school, it is obviously a mistake.
- Tom is listed as being in the 12th standard, which is simply not possible in an elementary school context.
We can handle such wrong data in the following ways:
- Replace Individual Values
- Replace Values Based on a Condition
- Remove Wrong Values
Replace Individual Values
We can see that the value F for Gender
column is an obvious mistake. Let's replace F with M to rectify the error.
import pandas as pd
# create dataframe
data = {
'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
'Age': [8, 9, 7, 80, 100],
'Gender': ['M', 'M', 'M', 'F', 'M'],
'Standard': [3, 4, 12, 3, 5]
}
df = pd.DataFrame(data)
# replace F with M
df.loc[3, 'Gender'] = 'M'
print(df)
Output
Name Age Gender Standard 0 John 8 M 3 1 Michael 9 M 4 2 Tom 7 M 12 3 Alex 80 M 3 4 Ryan 100 M 5
In this example, we replaced the wrong value with the right value using df.loc[]
.
While this method is effective for small datasets, it becomes tedious as the size of the dataset grows.
Replace Values Based on a Condition
In cases where the values need to satisfy a particular condition, we can iterate through the values to check if the condition is satisfied and to make changes accordingly.
For example, the maximum age of students in this elementary school is 14. But there are two students whose ages are 80 and 100. This looks like a typo where an additional zero has been added unintentionally.
Let's fix the error.
import pandas as pd
# create dataframe
data = {
'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
'Age': [8, 9, 7, 80, 100],
'Gender': ['M', 'M', 'M', 'M', 'M'],
'Standard': [3, 4, 12, 3, 5]
}
df = pd.DataFrame(data)
# replace values based on conditions
for i in df.index:
age_val = df.loc[i, 'Age']
if (age_val > 14) and (age_val%10 == 0):
df.loc[i, 'Age'] = age_val/10
print(df)
Output
Name Age Gender Standard 0 John 8 M 3 1 Michael 9 M 4 2 Tom 7 M 12 3 Alex 8 M 3 4 Ryan 10 M 5
In the above example, we replaced ages greater than 14 which are a multiple of 10 by removing a zero from the original values.
We checked if the age is a multiple of 10 because it is highly likely that a higher age which is a multiple of 10 is a typo.
Remove Wrong Values
Some values can't be corrected. For example, the value 12 in Standard
doesn't make any sense as this is an elementary school. But we can't also correct it because we don't know the right standard.
So we need to remove this whole row and any other row with such values.
import pandas as pd
# create dataframe
data = {
'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
'Age': [8, 9, 7, 8, 10],
'Gender': ['M', 'M', 'M', 'M', 'M'],
'Standard': [3, 4, 12, 3, 5]
}
df = pd.DataFrame(data)
# remove mistaken values
for i in df.index:
if df.loc[i,'Standard'] > 8:
df.drop(i, inplace=True)
print(df)
Output
Name Age Gender Standard 0 John 8 M 3 1 Michael 9 M 4 3 Alex 8 M 3 4 Ryan 10 M 5
In this example, we removed the row(s) containing values greater than 8 in the Standard
column.