A Polars DataFrame is a data structure for fast, efficient data manipulation. It is similar to a pandas DataFrame, and the Polars library itself is written in Rust.
Key takeaways:
DataFrame.partition_by() in Polars splits a DataFrame into separate DataFrames based on column values.
It accepts up to five parameters: by, more_by, maintain_order, include_key, and as_dict.
The result can be a list or a dictionary of partitioned DataFrames.
Using maintain_order=False means the order of the resulting partitions is no longer guaranteed to follow the input data.
Setting as_dict=True returns the partitioned DataFrames as a dictionary.
Partitioning by multiple columns creates finer partitions, one for each unique combination of values.
The function is efficient for data processing, filtering, and parallel tasks on large datasets.
Polars is a library written in Rust, inspired by pandas, for efficient and fast DataFrame manipulation. DataFrame.partition_by() is a function in the library that creates separate DataFrames based on column values. Let’s look into the details of the function.
DataFrame.partition_by() function
This function takes a maximum of five parameters and returns a list or a dictionary.
df.partition_by(by, *more_by, maintain_order=True, include_key=True, as_dict=False)
by: This parameter specifies the column name to group the dataset by.
more_by: This is an optional argument specifying additional column names to group the dataset by.
maintain_order: This is an optional argument ensuring the result is in the same order as the input data. The default bool value is True.
include_key: This is an optional argument specifying whether to include the column(s) used to group by. The default bool value is True.
as_dict: This is an optional argument specifying whether to return the result as a dictionary. The default bool value is False.
Returns
list: A list of DataFrames partitioned by the specified column name.
dict: A dictionary of DataFrames partitioned by the specified column name.
Let’s start by importing the Polars library.
import polars as pl
Next, we can define a simple data frame about the different types of fruits in a supermarket and how ripe they are.
Fruits | Level of Ripeness |
Apples | 1 |
Grapes | 2 |
Bananas | 2 |
Apples | 3 |
Bananas | 1 |
Grapes | 3 |
Make a DataFrame for this table using the Polars library.
df = pl.DataFrame({
    "Fruits": ["Apples", "Grapes", "Bananas", "Apples", "Bananas", "Grapes"],
    "Level of Ripeness": [1, 2, 2, 3, 1, 3]
})
Let’s see how df.partition_by() works when we partition the DataFrame by the “Fruits” column.
# Import the polars library as pl
import polars as pl

# Create our DataFrame
df = pl.DataFrame({
    "Fruits": ["Apples", "Grapes", "Bananas", "Apples", "Bananas", "Grapes"],
    "Level of Ripeness": [1, 2, 2, 3, 1, 3]
})

# Partition the DataFrame based on "Fruits"
print("Dataset after partitioning: ")
partitioned_df = df.partition_by("Fruits")
print(partitioned_df)
Here is a breakdown of the code above.
Import: We import the polars library as pl.
DataFrame creation: We create our DataFrame df using pl.DataFrame, provided by the polars library. We give it a dictionary as input.
Partitioning: The call to df.partition_by("Fruits") splits the dataset into three DataFrames because the "Fruits" column has three unique values. Each resulting DataFrame therefore contains the rows for exactly one fruit.
Let’s look at the impact of changing the function parameters.
maintain_order = False
In the code below, we partition the dataset by the “Fruits” column and set maintain_order to False.
partitioned_df = df.partition_by("Fruits", maintain_order=False)
With this setting, the order of the “Fruits” values is not guaranteed to be maintained, so the resulting DataFrames may come back in a different order than in the input.
as_dict = True
In the code below, we partition the dataset by the “Fruits” column and set as_dict to True.
partitioned_df = df.partition_by(["Fruits"], as_dict=True)
If we want our DataFrames returned as a dictionary, we can set the as_dict parameter to True. However, to avoid a deprecation warning, we pass our column as a list.
To partition by multiple columns, let’s add another column to our DataFrame. We have to do this because if we partition a two-column DataFrame by both of its columns with include_key=False, no columns are left over, and the result is a list of empty DataFrames.
Therefore, let’s add another column called “Price in $,” which contains the price of the fruit based on its ripeness level.
# Import the polars library as pl
import polars as pl

# Create our DataFrame
df = pl.DataFrame({
    "Fruits": ["Apples", "Grapes", "Bananas", "Apples", "Bananas", "Grapes"],
    "Level of Ripeness": [1, 2, 2, 3, 1, 3],
    "Price in $": [4, 5, 2, 1, 3, 2]
})

# Partition the DataFrame based on "Fruits" and "Level of Ripeness"
print("Dataset after partitioning: ")
partitioned_df = df.partition_by("Fruits", "Level of Ripeness", include_key=False)
print(partitioned_df)
The number of partitioned DataFrames has increased to 6 because the DataFrame is now grouped by both the “Fruits” column and the “Level of Ripeness” column, and every combination of the two is unique in this dataset. Moreover, since we set include_key=False, the result does not include the “Fruits” and “Level of Ripeness” columns.
df.partition_by() is a helpful function for data processing tasks such as data filtering, grouping, aggregation, parallel processing, and more. It is simple to understand, efficient, and works for large datasets!