How to load an ORC file in pandas

What is ORC Format?

ORC stands for Optimized Row Columnar. It is a highly efficient columnar data format used to read, write, and process data in Hive. An ORC file is made up of stripes, each of which comprises index data, row data, and a stripe footer.

The pandas read_orc function loads an ORC file into a DataFrame.

Note: Refer to What is pandas in Python to learn more about pandas.

Syntax

pandas.read_orc(path, columns=None, **kwargs)

Parameters

  • path: This is the location of the ORC file. It can be a string, a path object, or a file-like object. A string can also be a valid URL. The accepted URL schemes include http, ftp, s3, gs, and file.
  • columns: These are the columns to be read into the DataFrame. If it is None (the default), all columns are read.

Code example

Let’s look at the code below:

import pandas as pd
import pyarrow.orc  # pandas relies on pyarrow to read and write ORC files
# Create a DataFrame and write it to an ORC file
df = pd.DataFrame(data={"Name": ["John", "Kelly"], "Age": [3, 4]})
df.to_orc("df.orc")
# Read the ORC file into a DataFrame
df = pd.read_orc("df.orc")
print(df)
# Read only the selected columns from the ORC file
cols = ["Name"]
df1 = pd.read_orc("df.orc", columns=cols)
print(df1)

Code explanation

  • Lines 1-2: The pandas and pyarrow packages are imported.
  • Lines 4-5: A DataFrame is created and written to a file named df.orc.
  • Line 7: The df.orc file is read into a pandas DataFrame called df using the read_orc method.
  • Line 8: The df DataFrame is printed.
  • Line 10: We define the columns, cols, to be read into the DataFrame.
  • Line 11: The df.orc file is read into a pandas DataFrame called df1 using the read_orc method, passing cols as the columns to be read. All other columns are skipped.
  • Line 12: The df1 DataFrame is printed.
