In a real world scenario, data are taken from various sources which causes inconsistencies in format of the data. For example, a column can have data of integer and string type as the data is copied from different sources.
Such inconsistencies can create challenges, making data analysis difficult or even impossible.
Let's look at an example.
import pandas as pd
# create dataframe
data = {
'Country': ['USA', 'Canada', 'Australia', 'Germany', 'Japan'],
'Date': ['2023-07-20', '2023-07-21', '2023-07-22', '2023-07-23', '2023-07-24'],
'Temperature': [25.5, '28.0', 30.2, 22.8, 26.3]
}
df = pd.DataFrame(data)
# calculate the mean temperature
mean_temperature = df['Temperature'].mean()
print(mean_temperature)
Output
TypeError: unsupported operand type(s) for +: 'float' and 'str'
Here, the Temperature
column contains data in an inconsistent format, with a mixture of float
and string
types, which is causing a TypeError
.
With Pandas, we can handle such issues by converting all the values in a column to a specific format.
Convert Data to Correct Format
We can remove inconsistencies in data by converting a column with inconsistencies to a specific format. For example,
import pandas as pd
# create dataframe
data = {
'Country': ['USA', 'Canada', 'Australia', 'Germany', 'Japan'],
'Date': ['2023-07-20', '2023-07-21', '2023-07-22', '2023-07-23', '2023-07-24'],
'Temperature': [25.5, '28.0', 30.2, 22.8, 26.3]
}
df = pd.DataFrame(data)
# convert temperature column to float
df['Temperature'] = df['Temperature'].astype(float)
# calculate the mean temperature
mean_temperature = df['Temperature'].mean()
print(mean_temperature)
Output
26.560000000000002
In this example, we converted all the values of Temperature
column to float using astype()
. This solves the problem of columns with mixed data.
Handling Mixed Date Formats
Another common example of inconsistency in data format that you often encounter in real life is mixed date formats.
Dates can be represented in various formats such as mm-dd-yyyy
, dd-mm-yyyy
, yyyy-mm-dd
etc. Also, different separators such as /
, -
, .
etc can be used.
We can handle this issue by converting the column containing dates to the DateTime
format.
Let's look at an example.
import pandas as pd
# create a sample dataframe with mixed date formats
df = pd.DataFrame({'date': ['2022-12-01', '01/02/2022', '2022-03-23', '03/02/2022', '3 4 2023', '2023.9.30']})
# convert the date column to datetime format
df['date'] = pd.to_datetime(df['date'], format='mixed', dayfirst=True)
print(df)
Output
date 0 2022-12-01 1 2022-02-01 2 2022-03-23 3 2022-02-03 4 2023-04-03 5 2023-09-30
In the above example, we converted the mixed date formats to a uniform yyyy-mm-dd
format. Here,
format='mixed'
: specifies that the format of each given date can be differentdayfirst=True
: specifies that the day should be considered before the month when interpreting dates