Pandas sample()

The sample() method in Pandas is used to randomly select a specified number of rows from a DataFrame.

Example

import pandas as pd

# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 22, 35, 28]}

df = pd.DataFrame(data)

# select 2 random rows from the DataFrame
sampled_rows = df.sample(n=2)
print(sampled_rows)

'''
Output

    Name  Age
4    Eva   28
0  Alice   25
'''

sample() Syntax

The syntax of the sample() method in Pandas is:

df.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

sample() Arguments

The sample() method takes the following arguments:

  • n (optional) - the number of rows to select; if both n and frac are omitted, one row is selected
  • frac (optional) - the fraction of the DataFrame to sample (between 0 and 1, or above 1 when replace=True); it cannot be combined with n
  • replace (optional) - a boolean that determines whether sampling is done with replacement (False by default)
  • weights (optional) - assigns a different selection probability to each row for weighted sampling
  • random_state (optional) - a seed, such as an integer, that makes the random selection reproducible
  • axis (optional) - the axis to sample from; rows by default, columns with axis=1
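As a minimal sketch of how n and frac relate, the snippet below calls sample() with no arguments and then with both n and frac. The variable names one_row and both_allowed are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
                   'Age': [25, 30, 22, 35, 28]})

# with neither n nor frac given, a single row is returned
one_row = df.sample()
print(len(one_row))

# n and frac are alternatives; passing both raises a ValueError
try:
    df.sample(n=2, frac=0.5)
    both_allowed = True
except ValueError as err:
    both_allowed = False
    print("ValueError:", err)
```

In practice, we pick n when we need an exact number of rows and frac when the sample size should scale with the DataFrame.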

sample() Return Value

The sample() method returns a new DataFrame containing the randomly selected rows (or columns, when sampling along axis=1) from the original DataFrame.
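The following sketch checks what comes back from sample(). It assumes a recent pandas release, where sample() also accepts an axis argument and axis=1 samples columns instead of rows. The names sampled and one_col are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, 32, 28, 22, 30],
                   'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']})

# sampling rows returns a new DataFrame; df itself is not modified
sampled = df.sample(n=2)
print(type(sampled).__name__)
print(len(df))

# the sampled rows keep the index labels they had in df
print(sampled.index.isin(df.index).all())

# sampling along axis=1 picks columns instead of rows
one_col = df.sample(n=1, axis=1)
print(one_col.shape)
```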


Example 1: Select Random Rows Using sample()

import pandas as pd

# create a DataFrame 
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 28, 22, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)

# select 3 random rows
sampled_rows = df.sample(n=3)

print("Selected 3 random rows:")
print(sampled_rows)

Output

Selected 3 random rows:
    Name  Age         City
3  David   22      Houston
4    Eve   30        Miami
1    Bob   32  Los Angeles

In the above example, we have used the sample() method with n=3 to randomly select 3 rows from the df DataFrame.

The sampled_rows variable contains those 3 randomly selected rows from df.

Note: Since sample() randomly selects rows, the output will be different each time we execute the code.
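As the output above shows, the sampled rows keep their original index labels (3, 4 and 1). The sketch below illustrates one possible way to renumber them with reset_index(); the name renumbered is only for illustration.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 28, 22, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)

sampled_rows = df.sample(n=3)

# by default sampling is done without replacement, so no row appears twice
print(sampled_rows.index.is_unique)

# drop the original labels and renumber the rows from 0
renumbered = sampled_rows.reset_index(drop=True)
print(renumbered)
```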


Example 2: Select Fraction of Rows Randomly

import pandas as pd

# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 28, 22, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)

# select 30% of the rows randomly
sampled_fraction = df.sample(frac=0.3)

print("Selected 30% of the rows randomly:")
print(sampled_fraction)

Output

Selected 30% of the rows randomly:
    Name  Age      City
0  Alice   25  New York
4    Eve   30     Miami

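Here, inside sample() we used the frac parameter with a value of 0.3 to randomly select 30% of the rows from the df DataFrame. Since 30% of 5 rows is 1.5, the count is rounded to a whole number, which is why the output contains 2 rows.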

The sampled_fraction variable contains that random subset of rows.
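To see how frac translates into a row count, here is a small sketch that tries several fractions on the same 5-row DataFrame. It assumes the behaviour of current pandas releases, where the count is the rounded product of frac and the number of rows. The names sizes and shuffled are only for illustration.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 28, 22, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)

# record how many rows each fraction produces
sizes = {}
for frac in (0.2, 0.3, 0.6, 1.0):
    sizes[frac] = len(df.sample(frac=frac))
    print(frac, "->", sizes[frac], "rows")

# frac=1 returns every row once, in random order (a shuffle)
shuffled = df.sample(frac=1)
print(sorted(shuffled.index) == list(df.index))
```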


Example 3: Sample With Replacement in Pandas

Sampling with replacement means that the same row can be selected more than once.

import pandas as pd

# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 28, 22, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)

# sample 5 random rows with replacement
sampled_with_replacement = df.sample(n=5, replace=True)

print("Sampled with replacement (allowing duplicates):")
print(sampled_with_replacement)

Output

Sampled with replacement (allowing duplicates):
    Name  Age         City
1    Bob   32  Los Angeles
1    Bob   32  Los Angeles
0  Alice   25     New York
3  David   22      Houston
1    Bob   32  Los Angeles

In this example, we set replace=True when using the sample() method with n=5.

This allows the same row to be selected multiple times in the sampled output, effectively creating duplicates in the result.
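One consequence of replace=True is that we can draw more rows than the DataFrame contains. The sketch below contrasts the two settings by requesting 8 rows from our 5-row DataFrame. The names oversample_ok and bigger are only for illustration.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 28, 22, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)

# without replacement, asking for more rows than exist raises a ValueError
try:
    df.sample(n=8)
    oversample_ok = True
except ValueError as err:
    oversample_ok = False
    print("ValueError:", err)

# with replacement, the same request succeeds
bigger = df.sample(n=8, replace=True)
print(len(bigger))

# 8 draws from 5 rows must repeat at least one row
print(bigger.index.duplicated().any())
```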


Example 4: Control Randomness With random_state Argument in sample()

import pandas as pd

# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 28, 22, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)

# select 3 random rows with a specific random seed for reproducibility
sampled_with_seed = df.sample(n=3, random_state=42)

print("Sampled with a specific random seed (for reproducibility):")
print(sampled_with_seed)

Output

Sampled with a specific random seed (for reproducibility):
      Name  Age         City
1      Bob   32  Los Angeles
4      Eve   30        Miami
2  Charlie   28      Chicago

Here, we set random_state=42 when using the sample() method to sample 3 random rows.

Setting a specific random seed (in our case, 42) ensures that the same random sample is generated whenever we use this seed.

This is useful when we want to reproduce the same random sample in different runs of our code.

Note: The choice of 42 as the seed value is arbitrary; we can use any integer value we like.
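A quick way to check the reproducibility claim is to draw two samples with the same seed and compare them. This is a minimal sketch; the names first and second are only for illustration.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 28, 22, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)

# two independent calls that share the same seed
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

# both calls select the same rows in the same order
print(first.equals(second))
```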


Example 5: Perform Weighted Sampling for Biased Data Selection

import pandas as pd

# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 28, 22, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)

# assign weights to each row
weights_list = [0.1, 0.2, 0.3, 0.2, 0.2]

# sample 2 random rows with weights
weighted_sample = df.sample(n=2, weights=weights_list)

print("Weighted sampling:")
print(weighted_sample)

Output

Weighted sampling:
      Name  Age     City
2  Charlie   28  Chicago
4      Eve   30    Miami

In the above example, we defined a list called weights_list that holds one weight per row. Each weight sets how likely that row is to be selected, so Charlie (0.3) is the most likely pick and Alice (0.1) the least likely.

Then we called sample() with n=2 and weights=weights_list to select 2 random rows from the df DataFrame according to those weights.
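To make the effect of weights easier to see, the sketch below uses extreme weights: rows with a weight of 0 can never be selected. It also passes a column name as weights, which pandas accepts when sampling rows; weights that do not sum to 1 are normalized. The names picked and by_age are only for illustration.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 32, 28, 22, 30],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)

# only Charlie and David have a non-zero weight,
# so a sample of 2 rows always consists of exactly these two
weights_list = [0, 0, 0.5, 0.5, 0]
picked = df.sample(n=2, weights=weights_list)
print(sorted(picked['Name']))

# use the Age column as weights: older people are more likely to be picked
by_age = df.sample(n=2, weights='Age')
print(by_age)
```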