How to load a parquet file in pandas

What is Parquet?

Apache Parquet is a column-oriented data file format that is open source and designed for data storage and retrieval. It offers high-performance data compression and encoding schemes for handling large amounts of complex data.

In pandas, the read_parquet function loads a parquet file into a data frame.

Note: Refer to "What is pandas in Python" to learn more about pandas.

Syntax

Here’s the syntax of read_parquet:

pandas.read_parquet(path, engine='auto', columns=None, storage_options=None, use_nullable_dtypes=False, **kwargs)

Parameters

  • path: The path to the parquet file. It can also point to a directory containing multiple files, or be a valid file URL. Valid URL schemes are http, ftp, s3, gs, and file.
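  • engine: This parameter indicates which parquet library to use. Available options are auto, pyarrow, or fastparquet. With auto, pandas tries pyarrow first and falls back to fastparquet if pyarrow is unavailable.
  • columns: A list of the column names to be read into the data frame. The default, None, reads all columns.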
  • storage_options: Extra options for a certain storage connection, such as host, port, username, password, and so on.
  • use_nullable_dtypes: This is a boolean parameter. If True, the resulting data frame uses dtypes that have pd.NA as the missing value indicator. This parameter is deprecated as of pandas 2.0 in favor of dtype_backend.

Code

Let’s see an example of the read_parquet function in Python.

import pandas as pd

df = pd.read_parquet('data.parquet', engine='pyarrow')
print(df)

cols = ["Name"]
df1 = pd.read_parquet('data.parquet', columns=cols)
print(df1)

Explanation

  • Line 1: The pandas library is imported.
  • Line 3: The parquet file data.parquet is loaded into a pandas data frame, df, using the read_parquet function with the pyarrow engine.
  • Line 4: df is printed.
  • Line 6: We define cols, the list of columns to be read into the data frame.
  • Line 7: data.parquet is read into a pandas data frame called df1 using the read_parquet function, passing cols as the columns to be read. All other columns are skipped.
  • Line 8: df1 is printed.
