ORC stands for Optimized Row Columnar. It is a highly efficient columnar file format used to read, write, and process data in Hive. ORC files are made of data stripes, each of which comprises an index, row data, and a footer.
The read_orc method is used to load an ORC file into a DataFrame.
Note: Refer to What is pandas in Python to learn more about pandas.
pandas.read_orc(path, columns=None, **kwargs)
path: The location of the ORC file, given as a string, a path object, or a file-like object. A string can also be a URL; the accepted URL schemes are http, ftp, s3, gs, and file.
columns: The columns to read into the DataFrame. If omitted, all columns are read.

Let’s look at the code below:
import pandas as pd
import pyarrow.orc  # pandas relies on pyarrow to read and write ORC

# Creating an ORC file
df = pd.DataFrame(data={"Name": ["John", "Kelly"], "Age": [3, 4]})
df.to_orc("./df.orc")

# Reading an ORC file
df = pd.read_orc("df.orc")
print(df)

# Selecting a column from an ORC file
cols = ["Name"]
df1 = pd.read_orc("df.orc", columns=cols)
print(df1)
The pandas and pyarrow packages are imported.
A DataFrame is created and written to an ORC file called df.orc.
The df.orc file is read into a pandas DataFrame called df using the read_orc method.
df is printed.
The list of columns, cols, to be read into the DataFrame is defined.
The df.orc file is read into a pandas DataFrame called df1 using the read_orc method, passing cols as the columns to be read and skipping the other columns.
df1 is printed.