What is the vaex library?

Vaex is a Python library for working with tabular data whose performance-critical parts are implemented in C++. It stands out for its ability to handle very large datasets efficiently, relying on memory mapping and lazy evaluation rather than loading everything into memory, which makes it a valuable tool for data scientists and analysts working with big data. This Answer shows how vaex handles data through a DataFrame interface similar to that of pandas and Polars. See the comparison between pandas and vaex for a detailed analysis.

Vaex logo

Import the vaex library

Use the following command to import the library:

import vaex as vx
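If the import fails, the package is probably not installed in the current environment. The following sketch uses only the standard library to check for it before importing. The helper name is_installed is hypothetical and chosen for illustration, and the install hint assumes the package is published on PyPI under the name vaex.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


# Check for vaex without importing it, so a missing package does not raise.
if is_installed("vaex"):
    print("vaex is available")
else:
    print("vaex is missing; try: pip install vaex")
```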

Create a DataFrame

The cornerstone of vaex is the DataFrame, which represents a two-dimensional table. A DataFrame is a data structure made up of rows and columns, similar to a database table or an Excel spreadsheet. It can be built from a dictionary of lists in which each list has its own identifier or key, like “first name” or “last name.” We can create a DataFrame in vaex as follows:

import vaex as vx

data = {"column1": [1, 2, 3], "column2": ["foo", "bar", "baz"]}

df = vx.from_dict(data)
print(" === Vaex DataFrame === ")
print(df)

In the code above:

  • Line 1: Import the vaex library under the alias vx for convenience in the code.

  • Line 3: Define a Python dictionary named data, representing tabular data. It consists of two columns, "column1" with integer values and "column2" with string values.

  • Line 5: Use vaex’s from_dict function to create a DataFrame (i.e., df) from the provided dictionary (i.e., data). This DataFrame has two columns, "column1" and "column2", with the corresponding data.

  • Lines 6–7: Print the vaex DataFrame (i.e., df) to the console.
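To make the input format concrete, the following sketch uses plain Python only (no vaex calls) to show how a dictionary of lists maps onto the rows of a table. The helper name rows_from_columns is hypothetical and exists only for illustration; it is not part of vaex.

```python
# Plain-Python illustration of the column-oriented input passed to from_dict.
# rows_from_columns is a hypothetical helper, not a vaex function.

def rows_from_columns(columns):
    """Turn a dict of equal-length lists into a list of row dicts."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    names = list(columns)
    # zip(*...) walks the columns in parallel, yielding one tuple per row.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


data = {"column1": [1, 2, 3], "column2": ["foo", "bar", "baz"]}
rows = rows_from_columns(data)
for row in rows:
    print(row)
```

Each key becomes a column name, and the values at the same position across the lists form one row, which is why all lists need the same length.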

Basic operations

The following snippet continues from the DataFrame df created above and shows some of the basic operations that we can perform using vaex in Python:

# Select column(s): Choose specific column(s) with ease
df_new = df[["column1"]]
print(" === Select === ")
print(df_new)
print("\n")

# Filter rows: Apply conditions to filter rows
df_filtered = df[df["column1"] > 1]
print(" === Filter === ")
print(df_filtered)
print("\n")

# Group and aggregate: Aggregating data based on specific column(s)
df_grouped = df.groupby("column2").agg({"column1": "sum"})
print(" === Group === ")
print(df_grouped)
print("\n")

# Sort the DataFrame effortlessly
df_sorted = df.sort("column1")
print(" === Sort === ")
print(df_sorted)
print("\n")

# Join DataFrames: Perform various join operations
df1 = vx.from_dict({"key": ["Alpha", "Beta", "Gamma"], "value": [10, 20, 30]})
df2 = vx.from_dict({"key": ["Beta", "Gamma", "Delta"], "value": [40, 50, 60]})

df_joined = df1.join(df2, on="key", rsuffix='_df2')

print(" === Join === ")
print(df_joined)

In the code above:

  • Lines 1–5: Select only "column1" from the existing DataFrame (i.e., df). This creates a new DataFrame (i.e., df_new) containing only the specified column. The result is then printed.

  • Lines 7–11: Filter rows based on a condition. It creates a new DataFrame (i.e., df_filtered) containing only the rows where the value in "column1" is greater than 1. The result is then printed.

  • Lines 13–17: Group the DataFrame by "column2" and aggregate the values in "column1" using the sum. The result is stored in df_grouped and then it is printed.

  • Lines 19–23: Sort the DataFrame based on the values in "column1" in ascending order. The result is stored in df_sorted and then printed.

  • Lines 25–32: Create two additional DataFrames (i.e., df1 and df2), and then join them on the "key" column. Because both DataFrames have columns with the same names, the rsuffix='_df2' argument appends the suffix '_df2' to the columns that come from the second DataFrame (i.e., df2). By default, vaex performs a left join, so every row of df1 is kept. The result is stored in df_joined and then printed.
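To illustrate what the join step computes, here is a minimal plain-Python sketch of a left join with a right-hand suffix, applied to the same data. It does not call vaex. The helper left_join and the names left_table and right_table are hypothetical, None stands in for a missing value, and the sketch assumes that keys in the right-hand table are unique. The exact columns and missing-value markers that vaex prints may differ.

```python
# Plain-Python sketch of a left join with a right-hand suffix.
# left_join is a hypothetical helper, not a vaex function.

def left_join(left, right, on, rsuffix):
    """Join two dict-of-lists tables, keeping every row of the left table."""
    # Map each right-hand key to its row index (assumes unique keys).
    right_index = {key: i for i, key in enumerate(right[on])}
    result = {name: list(values) for name, values in left.items()}
    for name, values in right.items():
        # Suffix right-hand columns whose names clash with left-hand ones.
        out_name = name + rsuffix if name in left else name
        result[out_name] = [
            values[right_index[key]] if key in right_index else None
            for key in left[on]
        ]
    return result


left_table = {"key": ["Alpha", "Beta", "Gamma"], "value": [10, 20, 30]}
right_table = {"key": ["Beta", "Gamma", "Delta"], "value": [40, 50, 60]}
joined = left_join(left_table, right_table, on="key", rsuffix="_df2")
for name, values in joined.items():
    print(name, values)
```

In this sketch, "Alpha" has no match in the right-hand table, so its suffixed columns hold None, while "Delta" appears only on the right and is dropped.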

Conclusion

In conclusion, vaex is a swift and memory-efficient data analysis library in Python, making it a valuable addition to the data science toolkit. Keep on exploring its versatility for diverse tasks, from basic manipulations to complex analytics.

Copyright ©2025 Educative, Inc. All rights reserved