How to split a DataFrame according to a boolean criterion

Overview

A DataFrame can be split according to a boolean criterion using a technique called boolean masking.

Boolean masking, also called boolean indexing, extracts a subset of a DataFrame's rows using a boolean vector: rows where the vector is True are kept, and all other rows are dropped.

Let’s understand this concept with an example.

DataFrame

Consider the following DataFrame.

import pandas as pd

records = [
    {"student_name": "Maya Wells", "gpa": 4.5, "country": "USA"},
    {"student_name": "Olympia Woods", "gpa": 5.9, "country": "Australia"},
    {"student_name": "Kenneth Oneal", "gpa": 8.5, "country": "Germany"},
    {"student_name": "Tobias Garcia", "gpa": 3.0, "country": "Ukraine"},
    {"student_name": "Micah Mcgee", "gpa": 9.0, "country": "Austria"},
    {"student_name": "John Mack", "gpa": 5.0, "country": "USA"},
    {"student_name": "Jack Daniels", "gpa": 6.7, "country": "Australia"},
    {"student_name": "Sarah Daniels", "gpa": 1.3, "country": "Australia"},
    {"student_name": "John Wick", "gpa": 10.0, "country": "USA"},
    {"student_name": "Zelensky", "gpa": 1.0, "country": "Ukraine"},
    {"student_name": "Jack Som", "gpa": 8.6, "country": "Austria"},
]
df = pd.DataFrame(records)
print(df)

The dataset contains each student's name, GPA, and country.

To split the dataset into students from the USA and students from other countries, we can build a boolean mask as follows:

mask = df['country'] == 'USA'

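The mask is a boolean Series with one value per row: True where the country is USA and False otherwise. Indexing with it, df[mask], returns all students from the USA. To get all students not from the USA, we negate the mask with the ~ operator, i.e. df[~mask].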
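To make the contents of the mask concrete, here is a minimal sketch. It uses a three-row subset of the records above (chosen here only to keep the output short) and prints the mask and its negation as plain lists.

```python
import pandas as pd

# Three-row subset of the records above, used only to keep the output short.
df = pd.DataFrame(
    {
        "student_name": ["Maya Wells", "Olympia Woods", "John Mack"],
        "country": ["USA", "Australia", "USA"],
    }
)

mask = df["country"] == "USA"  # boolean Series aligned with df's index
print(mask.tolist())           # [True, False, True]
print((~mask).tolist())        # [False, True, False]
```

Because the mask and its negation are exact complements, every row ends up in exactly one of df[mask] and df[~mask].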

Splitting a DataFrame

import pandas as pd

records = [
    {"student_name": "Maya Wells", "gpa": 4.5, "country": "USA"},
    {"student_name": "Olympia Woods", "gpa": 5.9, "country": "Australia"},
    {"student_name": "Kenneth Oneal", "gpa": 8.5, "country": "Germany"},
    {"student_name": "Tobias Garcia", "gpa": 3.0, "country": "Ukraine"},
    {"student_name": "Micah Mcgee", "gpa": 9.0, "country": "Austria"},
    {"student_name": "John Mack", "gpa": 5.0, "country": "USA"},
    {"student_name": "Jack Daniels", "gpa": 6.7, "country": "Australia"},
    {"student_name": "Sarah Daniels", "gpa": 1.3, "country": "Australia"},
    {"student_name": "John Wick", "gpa": 10.0, "country": "USA"},
    {"student_name": "Zelensky", "gpa": 1.0, "country": "Ukraine"},
    {"student_name": "Jack Som", "gpa": 8.6, "country": "Austria"},
]
df = pd.DataFrame(records)

# Boolean Series: True for rows whose country is USA.
mask = df['country'] == 'USA'

# Rows where the mask is True, and rows where it is False.
students_from_usa = df[mask]
students_not_from_usa = df[~mask]

print("Students from USA\n", students_from_usa)
print("-" * 5)
print("Students not from USA\n", students_not_from_usa)
