Categorical data is a type of data that represents categories or labels rather than numerical values.
In simple words, it is a way of classifying data into distinct categories, such as genders, country names, or education levels.
Categorical data is handy when we have data that naturally fit into predefined options.
Create Categorical Data Type in Pandas
In Pandas, the pd.Categorical() constructor is used to create categorical data from a given sequence of values.
import pandas as pd
data = ['red', 'blue', 'green', 'red', 'blue']
# create a categorical column
categorical_data = pd.Categorical(data)
print(categorical_data)
Output
['red', 'blue', 'green', 'red', 'blue']
Categories (3, object): ['blue', 'green', 'red']
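In the above example, Categorical() converts the data list into a Categorical object (not a Series).
The output includes the original data values and a list of the unique categories present in the data. Because no categories were specified, Pandas infers them from the data and lists them in sorted order.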
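As a minimal sketch of the alternative to inferred categories, the example below passes an explicit list through the categories parameter of pd.Categorical(). The colors name and the extra 'purple' value are illustrative and not part of the original example.

```python
import pandas as pd

data = ['red', 'blue', 'green', 'red', 'purple']

# pass an explicit category list instead of letting pandas infer it
colors = pd.Categorical(data, categories=['red', 'green', 'blue'])

# categories keep the order given above
print(colors.categories.tolist())

# 'purple' is not a listed category, so it is stored as missing (code -1)
print(colors.codes.tolist())
print(colors.isna().tolist())
```

Values that are not in the supplied list are treated as missing rather than raising an error, so it is worth checking isna() after building the object this way.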
Convert Pandas Series to Categorical Series
In Pandas, we can convert a regular Series to a categorical Series using either the astype() method or the dtype parameter of the pd.Series() constructor.
Using the astype() Function
import pandas as pd
# create a regular Series
data = ['red', 'blue', 'green', 'red', 'blue']
series1 = pd.Series(data)
# convert the Series to a categorical Series using .astype()
categorical_s = series1.astype('category')
print(categorical_s)
Output
0      red
1     blue
2    green
3      red
4     blue
dtype: category
Categories (3, object): ['blue', 'green', 'red']
Here, series1.astype('category') returns a new Series with the values of series1 converted to the category data type. The original series1 is left unchanged.
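When the set of categories should be fixed in advance, astype() can also take a pd.CategoricalDtype object instead of the string 'category'. The following is a small sketch under that assumption; the color_dtype name is illustrative.

```python
import pandas as pd

series1 = pd.Series(['red', 'blue', 'green', 'red', 'blue'])

# describe the categories up front, then cast the Series to that dtype
color_dtype = pd.CategoricalDtype(categories=['red', 'green', 'blue'])
categorical_s = series1.astype(color_dtype)

print(categorical_s.dtype)
print(categorical_s.cat.categories.tolist())
```

With this approach the categories appear in the order given to CategoricalDtype rather than in sorted order.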
Using the dtype Parameter Inside Series()
import pandas as pd
# create a categorical Series
data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")
print(cat_series)
Here, we have passed dtype="category" to Series() to create a categorical Series directly from the list.
The output has the same layout as in the previous example, this time with the values A, B, A, C, B and the categories ['A', 'B', 'C'].
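As a quick check that the two approaches are interchangeable, the sketch below builds the same data both ways and compares the results. The variable names via_astype and via_dtype are illustrative.

```python
import pandas as pd

data = ['A', 'B', 'A', 'C', 'B']

# approach 1: build a regular Series, then convert it
via_astype = pd.Series(data).astype('category')

# approach 2: request the category dtype at construction time
via_dtype = pd.Series(data, dtype="category")

# both Series hold the same values and the same categories
print(via_astype.tolist() == via_dtype.tolist())
print(via_astype.cat.categories.tolist() == via_dtype.cat.categories.tolist())
```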
Access Categories and Codes in Pandas
In Pandas, the cat accessor allows us to access categories and codes. Here are the attributes provided by the cat accessor for this purpose:
- categories - returns the unique categories present in the categorical variable
- codes - returns the integer codes representing the categories for each element in the categorical variable
Let's look at an example.
import pandas as pd
# create a categorical Series
data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")
# using .cat accessor
print(cat_series.cat.categories)
print(cat_series.cat.codes)
Output
Index(['A', 'B', 'C'], dtype='object')
0    0
1    1
2    0
3    2
4    1
dtype: int8
In the above example, first we have used cat_series.cat.categories
to access the unique categories present in cat_series.
In this case, the output will be Index(['A', 'B', 'C'], dtype='object')
, which are the distinct categories in the data.
Then, we have used cat_series.cat.codes
to access the integer codes corresponding to the categories in cat_series.
Let's see how we got the codes in the output:
0 0
1 1
2 0
3 2
4 1
Here,
- The element at index 0 of cat_series is A, which corresponds to category 0.
- The element at index 1 of cat_series is B, which corresponds to category 1.
- The element at index 2 of cat_series is A, which again corresponds to category 0.
- The element at index 3 of cat_series is C, which corresponds to category 2.
- The element at index 4 of cat_series is B, which again corresponds to category 1.
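The mapping described above can be sketched in code: each code is simply a position in the categories index, so looking the codes up in categories rebuilds the original values. The rebuilt name is illustrative.

```python
import pandas as pd

cat_series = pd.Series(['A', 'B', 'A', 'C', 'B'], dtype="category")

categories = cat_series.cat.categories
codes = cat_series.cat.codes

# each code is a position in the categories index
rebuilt = [categories[code] for code in codes]
print(rebuilt)
```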
Rename Categories in Pandas
We can rename the categories in Pandas using the cat.rename_categories()
method. For example,
import pandas as pd
# create a categorical Series
data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")
# create a dictionary for renaming categories
category_mapping = {"A": "Category A", "B": "Category B", "C": "Category C"}
# rename categories using .rename_categories(), which returns a new Series
cat_series_renamed = cat_series.cat.rename_categories(category_mapping)
print(cat_series_renamed)
Output
0    Category A
1    Category B
2    Category A
3    Category C
4    Category B
dtype: category
Categories (3, object): ['Category A', 'Category B', 'Category C']
In this example, the categories A, B, and C are renamed to Category A, Category B, and Category C respectively.
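Besides a dictionary, rename_categories() also accepts a list of new names or a callable. The sketch below shows both forms; the replacement names 'x', 'y', 'z' and the lowercase conversion are illustrative choices.

```python
import pandas as pd

cat_series = pd.Series(['A', 'B', 'A', 'C', 'B'], dtype="category")

# a list replaces the categories by position: A -> x, B -> y, C -> z
renamed_by_list = cat_series.cat.rename_categories(['x', 'y', 'z'])

# a callable is applied to every existing category
renamed_by_func = cat_series.cat.rename_categories(lambda c: c.lower())

print(renamed_by_list.tolist())
print(renamed_by_func.cat.categories.tolist())
```

A list must contain exactly as many names as there are existing categories, whereas a dictionary may rename only some of them.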
Add New Categories in Pandas
In Pandas, we can add new categories to the existing set of categories using the cat.add_categories()
method.
Let's look at an example.
import pandas as pd
# create a categorical Series
data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")
# add new categories and reassign the variable
new_categories = ['D', 'E']
cat_series = cat_series.cat.add_categories(new_categories)
print(cat_series)
Output
0    A
1    B
2    A
3    C
4    B
dtype: category
Categories (5, object): ['A', 'B', 'C', 'D', 'E']
Here, we added the new categories D and E to the categorical Series and assigned the result back to cat_series. The existing values are unchanged; only the set of allowed categories grows from three to five.
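A short sketch of what the added categories allow: they start out unused, and once they exist their values can be assigned to elements of the Series. The counts name is illustrative.

```python
import pandas as pd

cat_series = pd.Series(['A', 'B', 'A', 'C', 'B'], dtype="category")
cat_series = cat_series.cat.add_categories(['D', 'E'])

# the new categories exist but no element uses them yet
counts = cat_series.value_counts()
print(counts['D'], counts['E'])

# because 'D' is now a known category, it can be assigned to an element
cat_series.iloc[0] = 'D'
print(cat_series.tolist())
```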
Remove Categories in Pandas
To remove categories from a categorical variable in Pandas, we can use the cat.remove_categories()
method.
Let's look at an example.
import pandas as pd
# create a categorical Series
data = ['A', 'B', 'A', 'C', 'B']
cat_series = pd.Series(data, dtype="category")
# display the original categorical variable
print("Original Series:")
print(cat_series)
# remove specific categories
categories_to_remove = ["B", "C"]
cat_series_removed = cat_series.cat.remove_categories(categories_to_remove)
# display the modified categorical variable
print("\nModified Series:")
print(cat_series_removed)
Output
Original Series:
0    A
1    B
2    A
3    C
4    B
dtype: category
Categories (3, object): ['A', 'B', 'C']

Modified Series:
0      A
1    NaN
2      A
3    NaN
4    NaN
dtype: category
Categories (1, object): ['A']
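In this example, we have used the cat.remove_categories() method to remove the categories B and C from cat_series. Elements that belonged to a removed category are replaced with NaN, which is why indexes 1, 3, and 4 are missing in the modified Series.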
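A related case is a category that stays in the list even though no element uses it any more, for example after filtering. The sketch below uses cat.remove_unused_categories() for that; the names removed, only_a, and trimmed are illustrative.

```python
import pandas as pd

cat_series = pd.Series(['A', 'B', 'A', 'C', 'B'], dtype="category")

# removing categories turns their elements into NaN
removed = cat_series.cat.remove_categories(['B', 'C'])
print(removed.isna().sum())

# filtering keeps the full category list, even for values no longer present
only_a = cat_series[cat_series == 'A']
print(only_a.cat.categories.tolist())

# drop categories that no element uses any more
trimmed = only_a.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())
```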
Check if Categorical Variable is Ordered or Not
To check whether a categorical variable is ordered, we can use the ordered attribute. It is available directly on a Categorical object and through the cat accessor on a categorical Series. For example,
import pandas as pd
# create an ordered Categorical object
data = ['low', 'medium', 'high', 'low', 'medium']
ordered_cat_series = pd.Categorical(data, categories=['low', 'medium', 'high'], ordered=True)
# check if the categorical variable is ordered
is_ordered = ordered_cat_series.ordered
print("Is ordered:", is_ordered)
Output
Is ordered: True
In this example, ordered_cat_series.ordered
will be True
because the categorical variable ordered_cat_series was created with the ordered=True
parameter.
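For a categorical Series rather than a Categorical object, the same check goes through the cat accessor. The following sketch assumes a Series named sizes (an illustrative name) and uses cat.set_categories() to define the order.

```python
import pandas as pd

data = ['low', 'medium', 'high', 'low', 'medium']

# a Series created with dtype="category" is unordered by default
sizes = pd.Series(data, dtype="category")
print(sizes.cat.ordered)

# set the category order, then mark the Series as ordered
sizes = sizes.cat.set_categories(['low', 'medium', 'high'], ordered=True)
print(sizes.cat.ordered)
print(sizes.cat.categories.tolist())
```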
Note: Marking a categorical variable as ordered tells Pandas that its categories have a meaningful sequence, such as low < medium < high. Operations such as min(), max(), and comparisons with < and > then follow that sequence, which keeps analysis and visualization consistent with the intended ranking.
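As a brief illustration of what ordering provides, the sketch below sorts and compares an ordered categorical Series. The levels name is illustrative.

```python
import pandas as pd

levels = pd.Series(
    pd.Categorical(
        ['low', 'medium', 'high', 'low', 'medium'],
        categories=['low', 'medium', 'high'],
        ordered=True,
    )
)

# sorting follows the category order, not alphabetical order
print(levels.sort_values().tolist())

# min, max, and comparisons are defined for ordered categoricals
print(levels.min(), levels.max())
print((levels > 'low').tolist())
```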