What is the DataFrame.partition_by() method in Polars?

Key takeaways:
DataFrame.partition_by() in Polars splits DataFrames based on column values.
It accepts up to five parameters: by, more_by, maintain_order, include_key, and as_dict.
The result can be a list or dictionary of partitioned DataFrames.
Using maintain_order=False randomizes the order of the data in the resulting partitions.
Setting as_dict=True returns the partitioned DataFrames as a dictionary.
Partitioning by multiple columns creates more refined partitions based on the unique combinations of values.
The function is efficient for data processing, filtering, and parallel tasks on large datasets.

Polars is a library written in Rust, inspired by pandas, for efficient and fast data frame manipulation. DataFrame.partition_by() is a new function implemented in the library for creating separate DataFrames, based on column value. Let’s look into the details of the function.

The `DataFrame.partition_by()` function

This function takes a maximum of 5 parameters and returns a list or a dictionary.

Syntax

Parameters

by: This parameter specifies the column name to group the dataset.
more_by: This is an optional argument specifying additional column names to group the dataset.
maintain_order: This is an optional argument ensuring the result is in the same order as the input data. The default bool value is True.
include_key: This is an optional argument specifying whether to include the column(s) used to group by. The default bool value is True.
as_dict: This is an optional argument specifying whether to return the result as a dictionary. The default bool value is False.

Returns

list: A list of data frames partitioned by the specified column name.
dict: A dictionary of DataFrames partitioned by the specified column name.

Code

Let’s start by importing the Polars library.

Explanation

Here is a line-by-line breakdown of the code above.

Line 2: We import the polar library as pl.
Lines 5–8: Here, we create our DataFrame df using the function pl.dataframe provided by polars library. We give the function a dictionary as input.
Line 14: This line partitions the dataset into three different DataFrames since we are partitioning the dataset based on the number of unique values in the column "Fruits". Therefore, our result is three different data frames with one unique fruit.

Let’s look at the impact of changing the function parameters.

Using `maintain_order = False`

In the code below, we will partition the dataset by the “Fruits” column, along with making maintain_order as False.

If we want to return our data frames in the form of a dictionary, we can make the as_dict parameter as True . However, due to a deprecation warning, we input our column as a list.

Partitioning by multiple columns

To do this, let’s append another Price column to our data frame. We have to do this because if we partition a dataset containing two columns into two columns, our resulting answer will be a list of empty DataFrames.

Therefore, let’s add another column called “Price,” which contains the price of the fruit based on its ripeness level.

The partitioned data frames have increased to 6, as the data frame is grouped by the “Fruits” column and the “Level of Ripeness” column. Moreover, since we made include_key = False , the result does not include the “Fruits” and “Level of Ripeness” columns.

Conclusion

The df.partition_by() is a helpful function used in data processing tasks such as data filtering, grouping, aggregation, parallel processing, and more. It is simple to understand, efficient, and works for large datasets!

Frequently asked questions

Haven’t found what you were looking for? Contact Us

What is Polars DataFrame?

Polars DataFrame is a data structure for efficient, fast data manipulation, similar to pandas, written in Rust.

Are Polars faster than pandas?

Yes, Polars is generally faster than pandas, especially for large datasets, due to its multi-threaded processing and optimized memory usage.

How to define a schema in Polars?

A schema in Polars can be defined by using the pl.DataFrame function and passing a dictionary where the keys are column names and the values are lists of data or using the schema parameter to specify data types explicitly.