Pandas crosstab()

The crosstab() method in Pandas allows us to create contingency tables, also known as cross-tabulations.

A contingency table helps us understand the relationship between two or more categorical variables within a dataset.

Example

import pandas as pd

# sample DataFrame
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

# create a cross-tabulation of Gender and Smoker
cross_tab = pd.crosstab(df['Gender'], df['Smoker'])

print(cross_tab)

'''
Output

Smoker  No  Yes
Gender         
Female   2    0
Male     1    2
'''

crosstab() Syntax

The syntax of the crosstab() method in Pandas is:

pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)

crosstab() Arguments

The crosstab() method has the following arguments:

  • index: the values to group by in the rows (a Series, an array-like object, or a list of them)
  • columns: the values to group by in the columns (same accepted types as index)
  • values (optional): an array of values to aggregate for each combination of index and columns; requires aggfunc
  • rownames (optional): the names to use for the row index levels
  • colnames (optional): the names to use for the column index levels
  • aggfunc (optional): the aggregation function to apply to values; requires values
  • margins (optional): whether to add a row and a column of subtotals
  • margins_name (optional): the label for the subtotal row and column when margins=True
  • dropna (optional): whether to leave out columns whose entries are all NaN
  • normalize (optional): whether to show proportions instead of counts; accepts True (or 'all'), 'index', or 'columns'.
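
As a small illustration of rownames and colnames, the sketch below passes unnamed NumPy arrays instead of DataFrame columns. The array names gender and smoker are made up for this example; the labels come entirely from rownames and colnames.

```python
import numpy as np
import pandas as pd

# unnamed arrays carry no labels of their own (illustrative data)
gender = np.array(['Male', 'Female', 'Male', 'Female', 'Male'])
smoker = np.array(['Yes', 'No', 'Yes', 'No', 'No'])

# rownames and colnames supply the axis labels
named_tab = pd.crosstab(gender, smoker, rownames=['Gender'], colnames=['Smoker'])

print(named_tab)
```

Without rownames and colnames, the axes of a table built from unnamed arrays get generic placeholder labels, so naming them explicitly keeps the output readable.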

crosstab() Return Value

The crosstab() method returns a DataFrame representing the cross-tabulation of the factors specified in index and columns.
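
Because the result is an ordinary DataFrame, the usual DataFrame operations apply to it. The following is a minimal sketch using the same sample data as the first example; the variable names are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
                   'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']})

result = pd.crosstab(df['Gender'], df['Smoker'])

# the result is a regular DataFrame
print(type(result))

# look up a single cell: the number of male smokers
male_smokers = result.loc['Male', 'Yes']
print(male_smokers)

# sum across the columns to get the number of people per gender
row_totals = result.sum(axis=1)
print(row_totals)
```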


Example 1: Basic Cross-Tabulation

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Employed': ['Yes', 'Yes', 'Yes', 'Yes', 'No']}

df = pd.DataFrame(data)

# create a basic cross-tabulation of Gender and Employed
cross_tab = pd.crosstab(df['Gender'], df['Employed'])

print(cross_tab)

Output

Employed  No  Yes
Gender            
Female      0    2
Male        1    2

In this example, we created a basic cross-tabulation of Gender and Employed to understand the distribution of employed and unemployed people among genders.
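
crosstab() is not limited to two variables. As a hedged sketch, passing a list of columns as index produces a table whose rows are grouped by more than one variable. The Smoker column is added to the sample data here purely for illustration.

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Employed': ['Yes', 'Yes', 'Yes', 'Yes', 'No'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

# group the rows by Gender and Employed, and the columns by Smoker
multi_tab = pd.crosstab([df['Gender'], df['Employed']], df['Smoker'])

print(multi_tab)
```

The rows of multi_tab are indexed by (Gender, Employed) pairs, so a single cell is selected with a tuple, for example multi_tab.loc[('Male', 'Yes'), 'Yes'].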


Example 2: Margins in crosstab()

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

# create a cross-tabulation with margins
cross_tab = pd.crosstab(df['Gender'], df['Smoker'], margins=True, margins_name='Total')

print(cross_tab)

Output

Smoker  No  Yes  Total
Gender                
Female   2    0      2
Male     1    2      3
Total    3    2      5

In this example, we included row and column margins in the cross-tabulation to show the totals for each row and column.
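
If margins_name is omitted, the totals are labeled 'All', which is the default shown in the syntax. A short sketch with the same data:

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

# margins=True without margins_name uses the default label 'All'
default_margins = pd.crosstab(df['Gender'], df['Smoker'], margins=True)

print(default_margins)

# the bottom-right cell holds the grand total
grand_total = default_margins.loc['All', 'All']
print(grand_total)
```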


Example 3: Normalized Cross-Tabulation

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

# create a normalized cross-tabulation of Gender and Smoker
cross_tab = pd.crosstab(df['Gender'], df['Smoker'], normalize=True)

print(cross_tab)

Output

Smoker   No  Yes
Gender          
Female  0.4  0.0
Male    0.2  0.4

In this example, we passed normalize=True, which divides each count by the total number of rows (5), so the table shows proportions instead of raw counts.
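
normalize also accepts 'index' and 'columns', which normalize within each row or each column rather than over the whole table. The sketch below reuses the same data; the variable names row_props and col_props are illustrative.

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

# each row sums to 1: the share of smokers and non-smokers within each gender
row_props = pd.crosstab(df['Gender'], df['Smoker'], normalize='index')

# each column sums to 1: the share of each gender among smokers and non-smokers
col_props = pd.crosstab(df['Gender'], df['Smoker'], normalize='columns')

print(row_props)
print(col_props)
```

Row-wise normalization answers "what fraction of each gender smokes?", while column-wise normalization answers "what fraction of smokers belongs to each gender?".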


Example 4: Aggregate Functions with crosstab()

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No'],
        'Age': [25, 30, 35, 40, 45]}

df = pd.DataFrame(data)

# create a cross-tabulation of Gender and Smoker with average Age as the aggregation
cross_tab = pd.crosstab(df['Gender'], df['Smoker'], values=df['Age'], aggfunc='mean')

print(cross_tab)

Output

Smoker    No   Yes
Gender            
Female  35.0   NaN
Male    45.0  30.0

In this example, we used aggfunc='mean' to calculate the mean age of smokers and non-smokers for each gender. The Female/Yes cell is NaN because the data contains no female smokers.
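
Other aggregation names can be passed in the same way. As a sketch under the same sample data, aggfunc='max' reports the oldest person in each group, and combinations with no rows can be detected with isna().

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No'],
        'Age': [25, 30, 35, 40, 45]}

df = pd.DataFrame(data)

# maximum Age for every Gender and Smoker combination
max_age = pd.crosstab(df['Gender'], df['Smoker'], values=df['Age'], aggfunc='max')

print(max_age)

# True marks the combinations that have no matching rows
print(max_age.isna())
```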