A DataFrame can be split on a boolean criterion using a technique called boolean masking.
Boolean masking, also called boolean indexing, extracts a subset of a DataFrame using a boolean vector: rows where the vector is True are kept, and rows where it is False are dropped.
Let’s understand this concept with an example.
Consider the following DataFrame.
import pandas as pd

# One dictionary per student
records = [
    {"student_name": "Maya Wells", "gpa": 4.5, "country": "USA"},
    {"student_name": "Olympia Woods", "gpa": 5.9, "country": "Australia"},
    {"student_name": "Kenneth Oneal", "gpa": 8.5, "country": "Germany"},
    {"student_name": "Tobias Garcia", "gpa": 3.0, "country": "Ukraine"},
    {"student_name": "Micah Mcgee", "gpa": 9.0, "country": "Austria"},
    {"student_name": "John Mack", "gpa": 5.0, "country": "USA"},
    {"student_name": "Jack Daniels", "gpa": 6.7, "country": "Australia"},
    {"student_name": "Sarah Daniels", "gpa": 1.3, "country": "Australia"},
    {"student_name": "John Wick", "gpa": 10.0, "country": "USA"},
    {"student_name": "Zelensky", "gpa": 1.0, "country": "Ukraine"},
    {"student_name": "Jack Som", "gpa": 8.6, "country": "Austria"},
]

# Build the DataFrame from the list of records and display it
df = pd.DataFrame(records)
print(df)
In the code above, the pandas module is imported, the records for the DataFrame are defined, and the DataFrame is created from records and printed.
The dataset is a student dataset that contains each student's name, GPA, and country.
Now if we want to split the dataset into students belonging to the USA and not belonging to the USA, we can use a boolean mask as follows:
mask = df['country'] == 'USA'
The mask above selects all students from the USA. To get all students who are not from the USA, negate the mask with the ~ operator, i.e. ~mask.
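To make the mask itself concrete, here is a minimal sketch that prints its contents. It assumes a reduced three-row version of the student dataset (the names small_df and small_mask are made up for this illustration) so that the whole mask fits on one line.

```python
import pandas as pd

# Reduced three-row version of the student dataset, for illustration only.
small_df = pd.DataFrame({
    "student_name": ["Maya Wells", "Olympia Woods", "John Mack"],
    "country": ["USA", "Australia", "USA"],
})

# Comparing a column to a value yields a boolean Series, one entry per row.
small_mask = small_df["country"] == "USA"
print(small_mask.tolist())     # [True, False, True]

# The ~ operator flips every entry of the mask.
print((~small_mask).tolist())  # [False, True, False]
```

Because the mask has exactly one boolean per row, indexing with it keeps the rows marked True, and indexing with its negation keeps the remaining rows.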
import pandas as pd

records = [
    {"student_name": "Maya Wells", "gpa": 4.5, "country": "USA"},
    {"student_name": "Olympia Woods", "gpa": 5.9, "country": "Australia"},
    {"student_name": "Kenneth Oneal", "gpa": 8.5, "country": "Germany"},
    {"student_name": "Tobias Garcia", "gpa": 3.0, "country": "Ukraine"},
    {"student_name": "Micah Mcgee", "gpa": 9.0, "country": "Austria"},
    {"student_name": "John Mack", "gpa": 5.0, "country": "USA"},
    {"student_name": "Jack Daniels", "gpa": 6.7, "country": "Australia"},
    {"student_name": "Sarah Daniels", "gpa": 1.3, "country": "Australia"},
    {"student_name": "John Wick", "gpa": 10.0, "country": "USA"},
    {"student_name": "Zelensky", "gpa": 1.0, "country": "Ukraine"},
    {"student_name": "Jack Som", "gpa": 8.6, "country": "Austria"},
]

df = pd.DataFrame(records)

# True for rows whose country is USA, False otherwise
mask = df['country'] == 'USA'

# Rows where the mask is True
students_from_usa = df[mask]

# Rows where the mask is False (the negated mask is True)
students_not_from_usa = df[~mask]

print("Students from USA\n", students_from_usa)
print("-" * 5)
print("Students not from USA\n", students_not_from_usa)
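In the code above, the pandas module is imported and the DataFrame is created from records.
The boolean mask is defined as the country column equals USA.
The students from the USA are retrieved using mask.
The students not from the USA are retrieved using ~mask.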
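The same split can be wrapped in a small helper so it works with any boolean criterion. The sketch below is illustrative only: split_by_mask is a hypothetical name and not a pandas function, and the records are a shortened version of the student dataset.

```python
import pandas as pd


def split_by_mask(frame, mask):
    """Return (rows where mask is True, rows where mask is False).

    Hypothetical helper for illustration; not part of pandas.
    """
    return frame[mask], frame[~mask]


# Shortened version of the student records, for brevity.
df = pd.DataFrame([
    {"student_name": "Maya Wells", "gpa": 4.5, "country": "USA"},
    {"student_name": "Olympia Woods", "gpa": 5.9, "country": "Australia"},
    {"student_name": "John Wick", "gpa": 10.0, "country": "USA"},
    {"student_name": "Zelensky", "gpa": 1.0, "country": "Ukraine"},
])

from_usa, not_from_usa = split_by_mask(df, df["country"] == "USA")

# Every row lands in exactly one of the two subsets.
print(len(from_usa), len(not_from_usa))  # 2 2
```

Since mask and ~mask are complementary, the two returned subsets never overlap and together contain every row of the original DataFrame.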