What is the DataFrame.partition_by() method in Polars?

Key takeaways:

  • DataFrame.partition_by() in Polars splits DataFrames based on column values.

  • It accepts up to five parameters: by, more_by, maintain_order, include_key, and as_dict.

  • The result can be a list or dictionary of partitioned DataFrames.

  • Using maintain_order=False means the order of the resulting partitions is no longer guaranteed to follow the input data.

  • Setting as_dict=True returns the partitioned DataFrames as a dictionary.

  • Partitioning by multiple columns creates more refined partitions based on the unique combinations of values.

  • The function is efficient for data processing, filtering, and parallel tasks on large datasets.

Polars is a DataFrame library written in Rust, similar in purpose to pandas, for fast and efficient data manipulation. DataFrame.partition_by() is a Polars method that splits a DataFrame into separate DataFrames, one for each unique value (or combination of values) in the given column(s). Let’s look into the details of the function.

The DataFrame.partition_by() function

This function takes a maximum of 5 parameters and returns a list or a dictionary.

Syntax

df.partition_by(by, *more_by, maintain_order=True, include_key=True, as_dict=False)
Syntax of the function.

Parameters

  • by: This parameter specifies the column name (or a list of column names) to group the dataset by.

  • more_by: These are optional additional column names to group the dataset by, passed as extra positional arguments.

  • maintain_order: This is an optional argument ensuring that the partitions are returned in the same order as the groups appear in the input data. The default bool value is True.

  • include_key: This is an optional argument specifying whether to include the column(s) used to group by. The default bool value is True.

  • as_dict: This is an optional argument specifying whether to return the result as a dictionary. The default bool value is False.

Returns

  • list: A list of data frames partitioned by the specified column name.

  • dict: A dictionary of DataFrames partitioned by the specified column name.

Code

Let’s start by importing the Polars library.

import polars as pl

Next, we can define a simple data frame about the different types of fruits in a supermarket and how ripe they are.

Fruits     Level of Ripeness
Apples     1
Grapes     2
Bananas    2
Apples     3
Bananas    1
Grapes     3

Make a DataFrame for this table using the Polars library.

df = pl.DataFrame(
    {
        "Fruits": ["Apples", "Grapes", "Bananas", "Apples", "Bananas", "Grapes"],
        "Level of Ripeness": [1, 2, 2, 3, 1, 3],
    }
)
Creation of dataframe of Fruits and their levels of ripeness.

Let’s see how df.partition_by() works when we partition the DataFrame by the “Fruits” column.

# Import the Polars library as pl
import polars as pl
# Create our DataFrame
df = pl.DataFrame(
    {
        "Fruits": ["Apples", "Grapes", "Bananas", "Apples", "Bananas", "Grapes"],
        "Level of Ripeness": [1, 2, 2, 3, 1, 3],
    }
)
# Partition the DataFrame based on "Fruits"
print("Dataset after partitioning: ")
partitioned_df = df.partition_by("Fruits")
print(partitioned_df)

Explanation

Here is a line-by-line breakdown of the code above.

  • Line 2: We import the Polars library as pl.

  • Lines 4–9: Here, we create our DataFrame df using the pl.DataFrame constructor provided by the Polars library. We give it a dictionary as input.

  • Line 12: This line partitions the dataset into three different DataFrames, one for each unique value in the "Fruits" column. Each resulting DataFrame therefore contains the rows for a single fruit.

Let’s look at the impact of changing the function parameters.

Using maintain_order = False

In the code below, we partition the dataset by the “Fruits” column and set maintain_order to False.

partitioned_df = df.partition_by("Fruits", maintain_order=False)

With this setting, Polars no longer guarantees that the partitions come back in the order in which each fruit first appears in the input. The order is not deliberately randomized; it is simply unspecified and may differ between runs.

Using as_dict = True

In the code below, we partition the dataset by the “Fruits” column and set as_dict to True.

partitioned_df = df.partition_by(["Fruits"], as_dict=True)

Setting as_dict to True returns the partitions as a dictionary keyed by the partition value. Some Polars versions emit a deprecation warning when a single column name is passed as a plain string together with as_dict=True, so we pass the column as a list. With a list, the dictionary keys are tuples such as ("Apples",).

Partitioning by multiple columns

To do this, let’s add a third column to our data frame. Our data frame currently has only two columns, so if we partitioned by both of them with include_key set to False, every resulting DataFrame would have no columns left.

Therefore, let’s add another column called “Price in $”, which contains the price of the fruit based on its ripeness level.

# Import the Polars library as pl
import polars as pl
# Create our DataFrame
df = pl.DataFrame(
    {
        "Fruits": ["Apples", "Grapes", "Bananas", "Apples", "Bananas", "Grapes"],
        "Level of Ripeness": [1, 2, 2, 3, 1, 3],
        "Price in $": [4, 5, 2, 1, 3, 2],
    }
)
# Partition the DataFrame based on "Fruits" and "Level of Ripeness"
print("Dataset after partitioning: ")
partitioned_df = df.partition_by("Fruits", "Level of Ripeness", include_key=False)
print(partitioned_df)

The number of partitioned data frames has increased to 6, because the data frame is now grouped by each unique combination of the “Fruits” and “Level of Ripeness” columns. Moreover, since we set include_key=False, the result does not include the “Fruits” and “Level of Ripeness” columns; only the “Price in $” column remains.

Conclusion

df.partition_by() is a helpful method for data processing tasks such as filtering, per-group processing, and handing separate chunks of data to parallel workers. It is simple to use, efficient, and works well on large datasets.

Frequently asked questions



What is Polars DataFrame?

Polars DataFrame is a data structure for efficient, fast data manipulation, similar to pandas, written in Rust.


Is Polars faster than pandas?

Yes, Polars is generally faster than pandas, especially for large datasets, due to its multi-threaded processing and optimized memory usage.


How to define a schema in Polars?

By default, Polars infers the schema from the data passed to pl.DataFrame, for example a dictionary where the keys are column names and the values are lists of data. To specify the data types explicitly, pass the schema parameter.


Are Polars DataFrames immutable?

Largely, yes. Most Polars operations return a new DataFrame and leave the original unchanged, although a few methods (such as extend and insert_column) modify the DataFrame in place.



Copyright ©2025 Educative, Inc. All rights reserved