pandas is a powerful open-source Python library for exploratory data analysis (EDA) and for manipulating large datasets. It provides efficient tools for slicing, filtering, aggregating, and pivoting data.
Overall, pandas is an essential tool for anyone working with data, whether they are a data scientist, analyst, or researcher.
Data slicing is a powerful technique that simplifies the analysis of large and complex datasets. This technique breaks down massive amounts of data into smaller, more manageable subsets, enabling us to extract meaningful insights more efficiently. By focusing on specific subsets of data, data slicing helps identify specific patterns and trends, facilitating the elimination of noise and irrelevant data.
To begin slicing data in pandas, the first step is to import the pandas library. Once imported, we will create a sample DataFrame df with four rows and four columns.
The code below imports the pandas library and creates a sample DataFrame to slice:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})
print(df)
Line 1: We import the pandas library.
Lines 2–5: We create a sample DataFrame df by calling the pd.DataFrame() constructor with a dictionary that maps column names to lists of values.
Line 6: We print the sample DataFrame to the console using the print() function.
After creating the sample DataFrame, there are several techniques available in pandas to perform slicing operations. These include the reindex method, the [] notation, and the .loc[] and .iloc[] indexers. Each of these has its own benefits and limitations, depending on the requirements of the data analysis task. We'll explore each technique in detail and demonstrate how it can be used to slice columns in the DataFrame.
Slicing a single column using reindex
The reindex method conforms a DataFrame to a given set of labels. Passing a list to its columns parameter returns a new DataFrame that contains only those columns, in the order listed. A label that does not exist in the original DataFrame does not raise an error; it produces a column filled with NaN.
The code below selects the column b from the original DataFrame df and stores it in the new DataFrame df_slice.
df_slice = df.reindex(columns=['b'])
print(df_slice)
Line 1: We create a new variable df_slice to store a subset of the original DataFrame df by using the reindex method. The columns parameter of reindex is set to ['b'], which means that the new DataFrame df_slice will contain only the b column from the original DataFrame.
Line 2: We print the new DataFrame df_slice to the console using the print() function.
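One behavior of reindex worth checking before relying on it is how it treats a label that is not in the DataFrame. The sketch below rebuilds the sample DataFrame and requests a hypothetical column "z" that is not part of the original example; it then compares the result with plain bracket selection, assuming a reasonably recent pandas version.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})

# "z" is a hypothetical label that does not exist in df.
with_missing = df.reindex(columns=["b", "z"])
print(with_missing)  # column "z" is created and holds only NaN

# Bracket selection with the same list is stricter and raises KeyError.
try:
    df[["b", "z"]]
    raised = False
except KeyError:
    raised = True
print(raised)
```

If a silently created NaN column would hide a typo in a column name, bracket selection or .loc is the safer choice.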
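Slicing multiple columns using reindex
Slicing multiple columns using reindex is useful when we want to extract several columns at once. The columns in the result follow the order of the list we pass, which may differ from their order in the original DataFrame.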
The code below selects the columns c and a from the original DataFrame df and stores them in the new DataFrame df_slice.
df_slice = df.reindex(columns=['c', 'a'])
print(df_slice)
Line 1: We create a new variable df_slice to store a subset of the original DataFrame df by using the reindex method. The columns parameter of reindex is set to ['c','a'], which means that the new DataFrame df_slice will contain two columns, c followed by a, from the original DataFrame.
Line 2: We print the new DataFrame df_slice to the console using the print() function.
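To make the ordering behavior concrete, here is a small sketch that selects the same two columns twice: once in the requested order and once in the order they appear in df. The names wanted, requested_order, and original_order are illustrative only and are not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})

wanted = ["c", "a"]  # illustrative list of columns to keep

# reindex returns the columns in the order of the list we pass.
requested_order = df.reindex(columns=wanted)
print(list(requested_order.columns))  # ['c', 'a']

# To keep the order used in df, filter df.columns instead.
in_df_order = [col for col in df.columns if col in wanted]
original_order = df.reindex(columns=in_df_order)
print(list(original_order.columns))  # ['a', 'c']
```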
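Slicing using the [ ] notation
With this simple method, we pass column names as strings inside square brackets. Single brackets with one name, as in df['c'], return a one-dimensional Series. Double brackets, as in df[['c', 'd']], pass a list of names and return a two-dimensional DataFrame.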
The code below selects the columns c and d from the original DataFrame df using the [[ ]] notation and stores them in the new DataFrame df_slice.
df_slice = df[['c', 'd']]
print(df_slice)
Line 1: We create a new variable df_slice to store a subset of the original DataFrame df by using the [] notation. We pass the list ['c','d'] inside the brackets, which means that the new DataFrame df_slice will contain two columns, c and d, from the original DataFrame.
Line 2: We print the new DataFrame df_slice to the console using the print() function.
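The difference between the single and double bracket forms is easiest to see by inspecting the type and shape of each result. The following is a minimal sketch using the same sample data; the variable names single and double are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})

single = df["c"]    # one column name -> Series (1-D)
double = df[["c"]]  # list of column names -> DataFrame (2-D)

print(type(single).__name__)       # Series
print(type(double).__name__)       # DataFrame
print(single.shape, double.shape)  # (4,) (4, 1)
```

Even with a single name in the list, the double bracket form keeps the result two-dimensional, which matters when later code expects a DataFrame.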
Slicing using .loc[ ] with step size 2
The pandas library includes an indexer called .loc[ ] that enables label-based slicing of a DataFrame. With it, we can access a specific group of rows and columns from a DataFrame using their labels.
The code below creates a new DataFrame df_slice by selecting every second column between a and d from the original DataFrame df, using the .loc indexing syntax with a step size of 2. With the columns a, b, c, and d, this selects a and c.
df_slice = df.loc[:, 'a':'d':2]
print(df_slice)
Line 1: We create a new variable named df_slice to store a subset of a pandas DataFrame. The : on the left side of the comma specifies that we want to select all the rows of the DataFrame, and 'a':'d':2 on the right side of the comma specifies that we want to select columns with labels between, and including, a and d, but only every second column.
Line 2: We print the subset of the original DataFrame, which contains only the columns a and c. Column d falls inside the label range but is skipped by the step of 2.
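Because a stepped label slice is easy to misread, it helps to print the column labels it actually returns. The sketch below uses the same sample data and also shows that a label slice without a step includes both endpoints; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})

# Start at "a", stop at "d" (inclusive), take every second column.
every_second = df.loc[:, "a":"d":2]
print(list(every_second.columns))  # ['a', 'c']

# Without a step, both endpoint labels are included.
inclusive = df.loc[:, "b":"d"]
print(list(inclusive.columns))  # ['b', 'c', 'd']
```

To pick two non-adjacent columns such as a and d, a list of labels, as in df.loc[:, ['a', 'd']], is more direct than a stepped slice.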
Slicing using .iloc[] with step size 1
pandas also includes an indexer called .iloc[] that allows integer-position-based slicing of a DataFrame. It is particularly helpful when the DataFrame's labels are not numeric or when we do not know the labels in advance.
The code below creates a new DataFrame df_slice by selecting the columns at positions 0, 1, and 2 from the original DataFrame df, using the .iloc indexing syntax with a step size of 1.
df_slice = df.iloc[:, 0:3:1]
print(df_slice)
Line 1: We create a new variable named df_slice to store a subset of a pandas DataFrame. The : on the left side of the comma specifies that we want to select all the rows of the DataFrame, and 0:3:1 on the right side of the comma specifies that we want to select columns with integer positions between 0 (inclusive) and 3 (exclusive), in steps of 1.
Line 2: We print the subset of the original DataFrame, which contains only the first three columns.
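Positional slices in .iloc follow the usual Python slicing rules, so a few variations are worth trying alongside the example above. This is a minimal sketch on the same sample data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})

# A step of 1 is the default, so these two selections are identical.
first_three = df.iloc[:, 0:3:1]
shorthand = df.iloc[:, :3]
print(first_three.equals(shorthand))  # True

# Every second column by position.
every_other = df.iloc[:, ::2]
print(list(every_other.columns))  # ['a', 'c']

# Negative positions count from the end: the last two columns.
last_two = df.iloc[:, -2:]
print(list(last_two.columns))  # ['c', 'd']
```

Unlike label slices with .loc, the stop position here is exclusive, which is why 0:3 returns three columns rather than four.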