How to split a DataFrame according to a boolean criterion

Overview

A DataFrame can be split according to a boolean criterion using a technique called boolean masking.

Boolean masking, also called boolean indexing, extracts a subset of a DataFrame's rows using a boolean vector: rows where the vector is True are kept, and the rest are dropped.

Let’s understand this concept with an example.

DataFrame

Consider the following DataFrame.

import pandas as pd
records = [{"student_name":"Maya Wells","gpa":4.5,"country":"USA"},{"student_name":"Olympia Woods","gpa":5.9,"country":"Australia"},{"student_name":"Kenneth Oneal","gpa":8.5,"country":"Germany"},{"student_name":"Tobias Garcia","gpa":3.0,"country":"Ukraine"},{"student_name":"Micah Mcgee","gpa":9.0,"country":"Austria"},{"student_name":"John Mack","gpa":5.0,"country":"USA"},{"student_name":"Jack Daniels","gpa":6.7,"country":"Australia"},{"student_name":"Sarah Daniels","gpa":1.3,"country":"Australia"},{"student_name":"John Wick","gpa":10.0,"country":"USA"},{"student_name":"Zelensky","gpa":1.0,"country":"Ukraine"},{"student_name":"Jack Som","gpa":8.6,"country":"Austria"}]
df = pd.DataFrame(records)
print(df)

Explanation

  • Line 1: The pandas module is imported.
  • Line 2: Sample records for the DataFrame are defined.
  • Line 3: A pandas DataFrame is created from the sample records.
  • Line 4: The DataFrame is printed.

The dataset contains each student's name, GPA, and country.

To split the dataset into students from the USA and students from elsewhere, we can use a boolean mask as follows:

mask = df['country'] == 'USA'

The mask above selects all students from the USA. To get all students who are not from the USA, we negate the mask with the ~ operator, i.e., ~mask.
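To make the mask less abstract, the sketch below prints what it holds for a three-row frame. The frame small_df and its values are made up for illustration; only the country column name comes from the example above.

```python
import pandas as pd

# Hypothetical three-row frame, used only to show what a mask looks like.
small_df = pd.DataFrame({
    "student_name": ["A", "B", "C"],
    "country": ["USA", "Germany", "USA"],
})

# Comparing a column with a value yields a boolean Series, one entry per row.
mask = small_df["country"] == "USA"
print(mask.tolist())     # [True, False, True]

# The ~ operator flips every entry, selecting the complementary rows.
print((~mask).tolist())  # [False, True, False]
```

Because the mask and its negation never overlap, every row ends up in exactly one of the two resulting parts.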

Splitting a DataFrame

import pandas as pd
records = [{"student_name":"Maya Wells","gpa":4.5,"country":"USA"},{"student_name":"Olympia Woods","gpa":5.9,"country":"Australia"},{"student_name":"Kenneth Oneal","gpa":8.5,"country":"Germany"},{"student_name":"Tobias Garcia","gpa":3.0,"country":"Ukraine"},{"student_name":"Micah Mcgee","gpa":9.0,"country":"Austria"},{"student_name":"John Mack","gpa":5.0,"country":"USA"},{"student_name":"Jack Daniels","gpa":6.7,"country":"Australia"},{"student_name":"Sarah Daniels","gpa":1.3,"country":"Australia"},{"student_name":"John Wick","gpa":10.0,"country":"USA"},{"student_name":"Zelensky","gpa":1.0,"country":"Ukraine"},{"student_name":"Jack Som","gpa":8.6,"country":"Austria"}]
df = pd.DataFrame(records)
mask = df['country'] == 'USA'
students_from_usa = df[mask]
students_not_from_usa = df[~mask]
print("Students from USA\n", students_from_usa)
print("-" * 5)
print("Students not from USA\n", students_not_from_usa)

Explanation

  • Line 1: The pandas module is imported.
  • Line 2: Sample records for the DataFrame are defined.
  • Line 3: A pandas DataFrame is created from the sample records.
  • Line 4: We define the mask by comparing the country column with 'USA'.
  • Line 5: We select the students from the USA with the mask.
  • Line 6: We select the students not from the USA by negating the mask.
  • Lines 7-9: Both parts are printed, separated by a line of dashes.
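When several splits are needed, the same pattern could be wrapped in a small helper. The sketch below is one way to do it: the function name split_by_mask is hypothetical (it is not part of pandas), the rows are a shortened copy of the records above, and the combined criterion in the second call (from the USA with a GPA of at least 5.0) is an assumption chosen only as an example.

```python
import pandas as pd

def split_by_mask(df, mask):
    """Return (rows where mask is True, rows where mask is False)."""
    return df[mask], df[~mask]

# Shortened copy of the sample records, for illustration only.
df = pd.DataFrame([
    {"student_name": "Maya Wells", "gpa": 4.5, "country": "USA"},
    {"student_name": "Kenneth Oneal", "gpa": 8.5, "country": "Germany"},
    {"student_name": "John Mack", "gpa": 5.0, "country": "USA"},
    {"student_name": "John Wick", "gpa": 10.0, "country": "USA"},
])

from_usa, not_from_usa = split_by_mask(df, df["country"] == "USA")

# Criteria can be combined with & (and) and | (or); each comparison needs
# its own parentheses because & and | bind more tightly than == and >=.
strong_usa, others = split_by_mask(
    df, (df["country"] == "USA") & (df["gpa"] >= 5.0)
)

# Every row lands in exactly one of the two parts.
print(len(from_usa), len(not_from_usa))  # 3 1
print(len(strong_usa), len(others))      # 2 2
```

Each part keeps the index labels of the original DataFrame. If a fresh 0-based index is wanted, reset_index(drop=True) can be called on the part.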
