Pandas groupby()

The groupby() method in Pandas is used for grouping rows based on some columns and then performing an aggregation function.

Example

import pandas as pd

# creating a dictionary 
data = {
    'City': ['NY', 'LA', 'NY', 'LA', 'NY'],         
    'Temperature': [55, 78, 56, 76, 54],            
    'Humidity': [65, 50, 60, 49, 63]                
}

# convert dictionary to a dataframe named df
df = pd.DataFrame(data)

# group dataframe by City column and # calculate the mean for each group grouped = df.groupby('City').mean()
# display the grouped and aggregated data print(grouped) ''' Output Temperature Humidity City LA 77.0 49.5 NY 55.0 62.666667 '''

groupby() Syntax

The syntax of the groupby() method in Pandas is:

df.groupby(by=None, axis=0, level=None, sort=True, as_index=True, dropna=True)

groupby() Arguments

The groupby() method takes following arguments:

  • by - determines how to group the data
  • axis (optional) - specifies if we want to group by rows or columns
  • level (optional) - determines which level to use for grouping
  • sort (optional) - sorts the grouped data
  • as_index (optional) - if True, group labels become indices; if False, they remain columns
  • dropna (optional) - exclude null values if True.

groupby() Return Value

The groupby() method returns an object which represents data grouped by a key(s).


Example 1: Grouping by a Single Column in Pandas

import pandas as pd

# create a dictionary containing the data
data = {'Genre': ['Fiction', 'Non-Fiction', 'Fiction', 'Non-Fiction', 'Fiction'],
        'BooksSold': [150, 90, 80, 60, 200]}

# create a DataFrame using the data dictionary
df = pd.DataFrame(data)

# group the DataFrame by the Genre column and # calculate the total number of books sold for each genre grouped = df.groupby('Genre')['BooksSold'].sum()
# print the grouped data print(grouped)

Output

Genre
Fiction        430
Non-Fiction    150
Name: BooksSold, dtype: int64

In the above example, df.groupby('Genre')['BooksSold'].sum() is used to group by a single column and calculate sum.

This line does the following:

  • df.groupby('Category') - groups the df DataFrame by the unique values in the Genre column.
  • [BooksSold] - specifies that we are interested in the 'BooksSold' column within each group.
  • .sum() - calculates the sum of the 'BookSold' values for each group.

Example 2: Use of axis Argument in groupby()

import pandas as pd

# create a sample DataFrame
data = {
    'A': [1, 2, 1, 2],
    'B': [3, 4, 3, 4],
    'C': [5, 6, 7, 8]
}

df = pd.DataFrame(data)

# group along rows (axis=0) based on column A grouped_rows = df.groupby('A', axis=0)
# calculate the sum of each group sum_rows = grouped_rows.sum() print("Grouped along rows:") print(sum_rows)
# group along columns (axis=1) based on index labels grouped_cols = df.groupby(df.columns, axis=1)
# calculate the sum of each group sum_cols = grouped_cols.sum() print("\nGrouped along columns:") print(sum_cols)

Output

Grouped along rows:
A   B   C
       
1   6  12
2   8  14

Grouped along columns:
   A  B  C
0  1  3  5
1  2  4  6
2  1  3  7
3  2  4  8

In this example, we first create a DataFrame with columns A, B, and C.

We then use the groupby() method to group the data along rows based on the A column and along columns based on the index labels (columns A, B, and C).

The axis parameter is used to specify whether the grouping should be done along rows or columns.


Example 3: Use of level argument in groupby()

import pandas as pd

# create a DataFrame with a multi-level index
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50]}
index = pd.MultiIndex.from_tuples([('Group1', 'A'), ('Group1', 'B'), ('Group2', 'A'), ('Group2', 'B'), ('Group3', 'A')],
                                  names=['Group', 'Category'])
df = pd.DataFrame(data, index=index)

# display original DataFrame
print("Original DataFrame:")
print(df)

# group by the Group level and calculate the sum grouped = df.groupby(level='Group').sum()
# display the result print("\nGrouped by 'Group' level and summed:") print(grouped)

Output

Original DataFrame:
                  A   B
Group  Category        
Group1 A          1  10
       B          2  20
Group2 A          3  30
       B          4  40
Group3 A          5  50

Grouped by 'Group' level and summed:
        A   B
Group        
Group1  3  30
Group2  7  70
Group3  5  50

Here, we create a DataFrame with a multi-level index where the levels are named Group and Category.

We then use the groupby() method with the level argument set to Group to group the data by the Group level and calculate the sum for each group.

As a result, we get a new DataFrame with the groups as the index and the sums of columns A and B for each group.


Example 4: Sort the Grouped Data

import pandas as pd

# create a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 15, 20, 25, 30, 35]}
df = pd.DataFrame(data)

# group by Category column and sort the grouped data grouped = df.groupby('Category', sort=True)
# calculate the mean of Value for each group result = grouped['Value'].mean() print(result)

Output

Category
A    20.0
B    25.0
Name: Value, dtype: float64

In this example, we use the groupby() method to group the DataFrame by the Category column and specify sort=True, which means the groups will be sorted alphabetically based on the Category values.


Example 5: Use of as_index Argument in groupby()

The as_index() argument is used to specify whether grouping columns should be treated as index columns or not.

  • as_index=True - grouped columns become the index of the resulting DataFrame
  • as_index=False - grouped columns remain as regular columns in the resulting DataFrame

Let's look at an example.

import pandas as pd

data = {'Category': ['A', 'B', 'A', 'B', 'A'],
        'Value': [10, 20, 15, 25, 30]}

df = pd.DataFrame(data)

# group by Category and calculate the mean, with 'Category' as a regular column. result1 = df.groupby('Category', as_index=False).mean() print("When as_index=Flase:\n", result1)
print()
# group by Category and calculate the mean, with 'Category' becoming the index. result2 = df.groupby('Category', as_index=True).mean() print("\nWhen as_index=True:\n", result2)

Output

When as_index=Flase:
    Category     Value
0        A       18.333333
1        B       22.500000

When as_index=True:
          Value
Category           
A         18.333333
B         22.500000

Here,

  • when as_index=False, the Category column is not set as the index of the resulting grouped DataFrame, and it remains a regular column
  • when as_index=True, the Category column becomes the index of the resulting grouped DataFrame

Example 6: Grouping by a Multiple Column in Pandas

import pandas as pd

# create a DataFrame with student data
data = {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Grade': ['A', 'B', 'A', 'A', 'B'],
    'Score': [90, 85, 92, 88, 78],
    'Attendance': [95, 98, 92, 97, 88]
}

df = pd.DataFrame(data)

# define the aggregate functions to be applied to the Score and Attendance columns
agg_functions = {
    'Score': 'min',        # Calculate the minimum Score
    'Attendance': 'mean'  # Calculate the mean Attendance
}

# group the DataFrame by Gender and Grade, then apply the aggregate functions grouped = df.groupby(['Gender', 'Grade']).aggregate(agg_functions)
# print the resulting grouped DataFrame print(grouped)

Output

               Score  Attendance
Gender  Grade                   
Female  A       88      97.0
        B       85      98.0
Male    A       90      93.5
        B       78      88.0

In the above example,

  1. The DataFrame is grouped by Gender and Grade, creating multi-level row indices.
  2. The Score column shows the minimum score for each combination of Gender and Grade.
  3. The Attendance column shows the mean attendance for each combination of Gender and Grade.

Example 7: Use of dropna Argument in groupby()

The dropna argument specifies how the grouping operation should handle rows with missing values in the columns by which you are grouping your data.

  • When dropna=True - excludes rows with missing values in the grouping columns from the groups
  • When dropna=False - includes rows with missing values in the grouping columns in their own separate group

Let's look at an example.

import pandas as pd

data = {'Category': ['A', 'B', 'A', 'B', None],
        'Value': [10, 20, 15, 25, 30]}

df = pd.DataFrame(data)

# group by Category with dropna=True (default) grouped_true = df.groupby('Category', dropna=True).mean()
print("With grouped_true: ") print(grouped_true) print()
# group by Category with dropna=False grouped_false = df.groupby('Category', dropna=False).mean()
print("With grouped_false: ") print(grouped_false)

Output

With grouped_true: 
          Value
Category       
A          12.5
B          22.5
With grouped_false: 
           Value
Category       
A          12.5
B          22.5
NaN        30.0

Here, rows with None in the Category column are excluded from the grouping when dropna=True.

However, a separate group is created for the rows with None in the Category column when dropna=False.