Pandas cut()

The cut() method in Pandas is used for segmenting and sorting data values into bins.

Example

import pandas as pd

# create a list of ages
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# define the bins - age ranges
bins = [18, 25, 35, 60, 100]

# use cut() to categorize each age into the defined bins
categories = pd.cut(ages, bins)
print(categories)

'''
Output
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
'''

cut() Syntax

The syntax of the cut() method in Pandas is:

pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)

cut() Arguments

The cut() method takes the following arguments:

  • x - the input array to be binned
  • bins - the criteria to bin by: an integer number of equal-width bins, a sequence of bin edges, or an IntervalIndex
  • right (optional) - indicates whether bins include the rightmost edge
  • labels (optional) - specifies the labels for the returned bins
  • retbins (optional) - specifies whether to return the bins or not
  • precision (optional) - precision at which to store and display the bin labels
  • include_lowest (optional) - whether the first interval should be left-inclusive or not
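
The examples in this article pass bins as a list of edges. As a minimal sketch of the integer form, the snippet below asks for three equal-width bins and lets Pandas compute the edges from the data. The sample values are illustrative only.

```python
import pandas as pd

# illustrative sample data
data = [1, 7, 5, 4, 6, 3]

# bins=3 asks pandas to compute three equal-width bins from the data range
categories = pd.cut(data, bins=3)

# the computed intervals; the lowest edge is placed slightly below min(data)
# so that the minimum value is not left out of the first bin
print(categories.categories)
print(categories)
```

Because the edges are derived from the minimum and maximum of the data, every value lands in a bin and no NaN values are produced.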

cut() Return Value

The cut() method in Pandas returns a Categorical, a Series, or a NumPy array that represents the bin or category each original value in the input data belongs to. When retbins=True, it also returns the bin edges.
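
The type of the result depends on the type of the input. The following sketch, using made-up values and index labels, compares a plain list with a Series.

```python
import pandas as pd

# illustrative data and bin edges
ages = [20, 30, 50]
bins = [18, 25, 35, 60]

# a plain list comes back as a Categorical
from_list = pd.cut(ages, bins)
print(type(from_list).__name__)

# a Series comes back as a Series with category dtype, keeping its index
from_series = pd.cut(pd.Series(ages, index=["a", "b", "c"]), bins)
print(type(from_series).__name__)
print(from_series.dtype.name)
```

Passing a Series is usually the more convenient choice when the result will be stored as a new DataFrame column, since the index is preserved.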


Example 1: Categorizing Data Using cut()

import pandas as pd

# create a list of exam scores
scores = [88, 92, 75, 85, 78, 95, 64, 82, 90, 73, 67, 99]

# define the bins - grading ranges
bins = [0, 60, 70, 80, 90, 100]

# use cut() to categorize each score into the defined grading bins
grade_categories = pd.cut(scores, bins)
print(grade_categories)

Output

[(80, 90], (90, 100], (70, 80], (80, 90], (70, 80], ..., (80, 90], (80, 90], (70, 80], (60, 70], (90, 100]]
Length: 12
Categories (5, interval[int64, right]): [(0, 60] < (60, 70] < (70, 80] < (80, 90] < (90, 100]]

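In the above example, we have created a list named scores containing exam scores.

The bins are defined to represent different grading ranges: (0, 60], (60, 70], (70, 80], (80, 90], and (90, 100]. Each range excludes its left edge and includes its right edge, so for whole-number scores these correspond to 1-60, 61-70, 71-80, 81-90, and 91-100.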

Then we used pd.cut() to categorize each score into the corresponding grading bin.
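
A common follow-up step is to count how many values fall into each bin. The sketch below is one way to do this, reusing the scores and bins from this example and wrapping the result in a Series so that value_counts() is available.

```python
import pandas as pd

scores = [88, 92, 75, 85, 78, 95, 64, 82, 90, 73, 67, 99]
bins = [0, 60, 70, 80, 90, 100]

grade_categories = pd.cut(scores, bins)

# count the scores in each bin and list the bins in their natural order;
# bins with no scores are still listed, with a count of 0
counts = pd.Series(grade_categories).value_counts().sort_index()
print(counts)
```

Here the (0, 60] bin is reported with a count of 0 because no score falls into it.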


Example 2: Control Bin Boundaries Using right Argument in cut()

import pandas as pd

# create a list of data
data = [2, 4, 6, 8, 10]

# define the bins
bins = [0, 5, 10]

# use cut() with right=True (default)
categories_right_true = pd.cut(data, bins, right=True)
print("Bins closed on the right:")
print(categories_right_true)

# use cut() with right=False
categories_right_false = pd.cut(data, bins, right=False)
print("\nBins closed on the left:")
print(categories_right_false)

Output

Bins closed on the right:
[(0, 5], (0, 5], (5, 10], (5, 10], (5, 10]]
Categories (2, interval[int64, right]): [(0, 5] < (5, 10]]

Bins closed on the left:
[[0.0, 5.0), [0.0, 5.0), [5.0, 10.0), [5.0, 10.0), NaN]
Categories (2, interval[int64, left]): [[0, 5) < [5, 10)]

Here, with

  1. right=True, the bins are (0, 5] and (5, 10], indicating that the right edge (5 and 10) is included in the bin.
  2. right=False, the bins are [0, 5) and [5, 10), meaning the left edge (0 and 5) is included in the bin. Since the right edge 10 is excluded, the value 10 falls outside every bin and is returned as NaN.
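
To see which values were left unbinned with right=False, the result can be checked with pd.isna(). The sketch below also shows one possible workaround, raising the last edge to 11. That value is only an assumption chosen for illustration.

```python
import pandas as pd

data = [2, 4, 6, 8, 10]

# with right=False the last bin is [5, 10), so 10 is not covered
left_closed = pd.cut(data, [0, 5, 10], right=False)
print(pd.isna(left_closed).tolist())

# raising the last edge (11 is an arbitrary choice) brings 10 into [5, 11)
widened = pd.cut(data, [0, 5, 11], right=False)
print(pd.isna(widened).any())
```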

Example 3: Naming Bins in Pandas cut()

import pandas as pd

# create a list of data
data = [20, 35, 45, 60, 75, 90]

# define the bins
bins = [0, 25, 50, 75, 100]

# define custom labels for the bins
labels = ['Low', 'Medium', 'High', 'Very High']

# use cut() with custom labels
categories_with_labels = pd.cut(data, bins, labels=labels)
print(categories_with_labels)

Output

['Low', 'Medium', 'Medium', 'High', 'High', 'Very High']
Categories (4, object): ['Low' < 'Medium' < 'High' < 'Very High']

In this example, we have defined a list of custom labels: Low, Medium, High, and Very High, one for each bin.

Then we used pd.cut() to categorize the data into bins and assign the custom labels to these bins.
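
The labels argument also accepts False, in which case cut() returns integer bin positions instead of named categories. A minimal sketch with the same data and bins:

```python
import pandas as pd

data = [20, 35, 45, 60, 75, 90]
bins = [0, 25, 50, 75, 100]

# labels=False returns the position of each value's bin (0 for the first bin)
bin_positions = pd.cut(data, bins, labels=False)
print(bin_positions)
```

This form can be handy when the bin number is needed for further numeric processing rather than for display.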


Example 4: Extract Bin Information Using retbins Argument in cut()

import pandas as pd

# create a list of data
data = [10, 15, 20, 25, 30, 35, 40]

# define the bins
bins = [0, 20, 40]

# use cut() with retbins=True
categories, bin_edges = pd.cut(data, bins, retbins=True)
print("Binned Categories:")
print(categories)
print("\nBin Edges:")
print(bin_edges)

Output

Binned Categories:
[(0, 20], (0, 20], (0, 20], (20, 40], (20, 40], (20, 40], (20, 40]]
Categories (2, interval[int64, right]): [(0, 20] < (20, 40]]

Bin Edges:
[ 0 20 40]

In the above example, we used pd.cut() with retbins=True, so it returns two things: the binned categories and the array of bin edges.

The categories variable contains the binned data (each element of data categorized into the bins).

And the bin_edges variable contains the actual edges of the bins used in the process.
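
Returned edges are most useful when Pandas computes them, for example when bins is an integer. The sketch below captures such edges and reuses them on a second list so that both are binned identically. The names reference and new_values are hypothetical and used only for illustration.

```python
import pandas as pd

# hypothetical reference data used to establish the bins
reference = [10, 15, 20, 25, 30, 35, 40]

# let pandas compute two equal-width bins and hand back the edges
_, bin_edges = pd.cut(reference, bins=2, retbins=True)
print(bin_edges)

# apply the same edges to new values so both datasets share one binning
new_values = [12, 38]
new_categories = pd.cut(new_values, bin_edges)
print(new_categories)
```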


Example 5: Specify the Precision of the Bin Labels

import pandas as pd

# create a list of floating-point data
data = [10.123, 15.456, 20.789, 25.012, 30.345, 35.678, 40.901]

# define the bins
bins = [0, 20, 40, 60]

# use cut() with precision=2
categories = pd.cut(data, bins, precision=2)
print("Binned Categories with Two Decimal Precision:")
print(categories)

Output

Binned Categories with Two Decimal Precision:
[(0, 20], (0, 20], (20, 40], (20, 40], (20, 40], (20, 40], (40, 60]]
Categories (3, interval[int64, right]): [(0, 20] < (20, 40] < (40, 60]]

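Here, we used pd.cut() with precision=2. The precision argument sets the number of decimal places used to store and display the bin labels. Because the bin edges in this example are whole numbers, the labels are shown without decimals.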
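
The effect of precision is easier to see when Pandas computes fractional edges itself. As a rough sketch, the snippet below bins the same data into three equal-width bins at two different precisions. The choice of 1 and 4 is arbitrary.

```python
import pandas as pd

data = [10.123, 15.456, 20.789, 25.012, 30.345, 35.678, 40.901]

# three equal-width bins, with labels rounded to different precisions
coarse = pd.cut(data, bins=3, precision=1)
fine = pd.cut(data, bins=3, precision=4)

# the interval labels differ in how many decimals they carry
print(coarse.categories)
print(fine.categories)

# the bin each value is assigned to is the same in both cases
print(coarse.codes.tolist() == fine.codes.tolist())
```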


Example 6: Use of include_lowest Argument in cut()

import pandas as pd

# create a list of data
data = [20, 22, 24, 26, 28, 30]

# define the bins
bins = [20, 25, 30]

# use cut() with include_lowest=False (default)
categories_default = pd.cut(data, bins)
print("First bin exclusive of the lower edge:")
print(categories_default)

# use cut() with include_lowest=True
categories_include_lowest = pd.cut(data, bins, include_lowest=True)
print("\nFirst bin inclusive of the lower edge:")
print(categories_include_lowest)

Output

First bin exclusive of the lower edge:
[NaN, (20.0, 25.0], (20.0, 25.0], (25.0, 30.0], (25.0, 30.0], (25.0, 30.0]]
Categories (2, interval[int64, right]): [(20, 25] < (25, 30]]

First bin inclusive of the lower edge:
[(19.999, 25.0], (19.999, 25.0], (19.999, 25.0], (25.0, 30.0], (25.0, 30.0], (25.0, 30.0]]
Categories (2, interval[float64, right]): [(19.999, 25.0] < (25.0, 30.0]]

In this example, with

  • include_lowest=False - the first bin (20, 25] does not include the lower edge 20. Thus, the value 20 in the data is not included in any bin, resulting in NaN.
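  • include_lowest=True - the first bin is inclusive of the lower edge 20. Pandas achieves this by shifting the displayed left edge slightly below 20, which is why the bin appears as (19.999, 25.0] rather than [20, 25]. Therefore, the value 20 is included in the first bin, and there are no NaN values.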
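
If the shifted edge in the interval label is distracting, include_lowest=True can be combined with custom labels. The following sketch uses the same data and bins. The label strings are hypothetical.

```python
import pandas as pd

data = [20, 22, 24, 26, 28, 30]
bins = [20, 25, 30]

# without include_lowest, the value 20 is left unbinned
default = pd.cut(data, bins)
print(pd.isna(default).sum())

# with include_lowest=True and custom labels, every value gets a readable label
labelled = pd.cut(data, bins, include_lowest=True, labels=["20-25", "26-30"])
print(list(labelled))
```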