pandas is a powerful open-source Python library for exploratory data analysis (EDA) and for manipulating large datasets. It provides efficient tools for slicing, filtering, aggregating, and pivoting data.
This makes pandas an essential tool for anyone working with data, whether they are a data scientist, analyst, or researcher.
Data slicing simplifies the analysis of large and complex datasets by breaking them down into smaller, more manageable subsets. Focusing on a specific subset makes it easier to identify patterns and trends and to set aside noise and irrelevant data.
To begin slicing data in pandas, the first step is to import the pandas library. Once imported, we will create a sample DataFrame df with four rows and four columns.
The code below imports the pandas library and creates a sample DataFrame to slice:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6],
                   "d": [4, 5, 6, 7]})
print(df)
Line 1: We import the pandas library.
Lines 2–5: We create a sample DataFrame df by calling the pd.DataFrame() constructor with a dictionary that maps column names to lists of values.
Line 6: We print the sample DataFrame to the console using the print() statement.
After creating the sample DataFrame, we can slice it in several ways: the reindex() method, the [] notation, and the .loc[] and .iloc[] indexers. Each has its own benefits and limitations, depending on the requirements of the data analysis task. We'll explore each technique in turn and demonstrate how to use it to slice columns in the DataFrame.
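
As a quick preview, the sketch below selects the same two columns, b and c, with each of the four techniques and checks that the results match. It rebuilds the sample DataFrame so that it runs on its own; the variable names (via_reindex and so on) are illustrative only.

```python
import pandas as pd

# Same sample DataFrame as in the article.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6], "d": [4, 5, 6, 7]})

via_reindex = df.reindex(columns=["b", "c"])  # conform to the given labels
via_brackets = df[["b", "c"]]                 # list of labels in [[ ]]
via_loc = df.loc[:, "b":"c"]                  # label slice, both ends included
via_iloc = df.iloc[:, 1:3]                    # position slice, end excluded

# All four produce the same two-column DataFrame for this input.
print(via_reindex.equals(via_brackets))
print(via_brackets.equals(via_loc))
print(via_loc.equals(via_iloc))
```

The four calls agree here because every requested label exists and the columns are contiguous; they differ in how they handle missing labels, endpoints, and ordering.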
reindex (single column)
Slicing a column using reindex can be useful when we want to conform a DataFrame to a specific set of column labels, for example to select only certain columns in a chosen order. Note that reindex does not raise an error for a label that does not exist in the DataFrame; it adds that column filled with NaN.
The code below selects the column b from the original DataFrame df and stores it in the new DataFrame df_slice.
df_slice = df.reindex(columns=['b'])
print(df_slice)
Line 1: We create a new variable df_slice to store the subset of the DataFrame from the original DataFrame df by using the reindex method. The columns parameter of the reindex method is set to ['b'], which means that the new DataFrame df_slice will only contain the b column from the original DataFrame.
Line 2: We print the new DataFrame df_slice to the console using the print() statement.
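
One behavior of reindex worth checking before relying on it for selection is what happens when a requested label is absent. The sketch below uses "z" as a hypothetical column name that is not in df, and shows the optional fill_value parameter as one way to choose a default other than NaN.

```python
import pandas as pd

# Same sample DataFrame as in the article.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6], "d": [4, 5, 6, 7]})

# "z" is a hypothetical label that does not exist in df.
with_missing = df.reindex(columns=["b", "z"])
print(with_missing)

# No error is raised: the missing column is created and filled with NaN.
print(with_missing["z"].isna().all())

# fill_value supplies a default for labels that are not found.
with_default = df.reindex(columns=["b", "z"], fill_value=0)
print(with_default)
```

If a misspelled column name should fail loudly instead, the [] notation or .loc[] raises a KeyError for a missing label.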
reindex (multiple columns)
Slicing multiple columns using reindex can be useful when we want to extract several columns and arrange them in the order we specify, which need not match their order in the original DataFrame.
The code below selects the columns c and a from the original DataFrame df and stores them in the new DataFrame df_slice.
df_slice = df.reindex(columns=['c', 'a'])
print(df_slice)
Line 1: We create a new variable df_slice to store the subset of the DataFrame from the original DataFrame df by using the reindex method. The columns parameter of the reindex method is set to ['c','a'], which means that the new DataFrame df_slice will contain two columns, c and a, from the original DataFrame.
Line 2: We print the new DataFrame df_slice to the console using the print() statement.
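
To make the ordering behavior concrete, the short sketch below compares the column order of the result with that of the original DataFrame. It assumes the same sample data as above.

```python
import pandas as pd

# Same sample DataFrame as in the article.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6], "d": [4, 5, 6, 7]})

df_slice = df.reindex(columns=["c", "a"])

# The result follows the order of the requested list, not the original order.
print(list(df_slice.columns))  # ['c', 'a']

# reindex returns a new object; the original DataFrame is unchanged.
print(list(df.columns))        # ['a', 'b', 'c', 'd']
```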
[ ] notation
With this simple approach, we pass column names directly to the DataFrame. A single name as a string inside [ ], such as df['c'], returns a one-dimensional Series, while a list of names inside [[ ]], such as df[['c','d']], returns a two-dimensional DataFrame.
The code below selects the columns c and d from the original DataFrame df using the [[ ]] notation and stores them in the new DataFrame df_slice.
df_slice = df[['c', 'd']]
print(df_slice)
Line 1: We create a new variable df_slice to store the subset of the original DataFrame df by using the [] notation. We pass the list ['c','d'] inside the brackets, which means that the new DataFrame df_slice will contain two columns, c and d, from the original DataFrame.
Line 2: We print the new DataFrame df_slice to the console using the print() statement.
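
The difference between single and double brackets is easy to miss, so the sketch below inspects the type and shape of each result. The names single and double are illustrative only.

```python
import pandas as pd

# Same sample DataFrame as in the article.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6], "d": [4, 5, 6, 7]})

single = df["c"]    # one label as a string
double = df[["c"]]  # a list containing one label

print(type(single).__name__, single.shape)  # Series (4,)
print(type(double).__name__, double.shape)  # DataFrame (4, 1)
```

Double brackets are the safer choice when later code expects a DataFrame, even if only one column is selected.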
.loc[ ] with step size 2
The pandas library includes an indexer called .loc[ ] that enables label-based slicing of a DataFrame. With it, we can access a specific group of rows and columns from a DataFrame using their labels.
The code below creates a new DataFrame df_slice by selecting the columns a and c from the original DataFrame df, using the .loc indexing syntax with a step size of 2.
df_slice = df.loc[:, 'a':'d':2]
print(df_slice)
Line 1: We create a new variable named df_slice to store a subset of a pandas DataFrame. The : on the left side of the comma specifies that we want to select all the rows of the DataFrame, and 'a':'d':2 on the right side of the comma specifies that we want to select columns with labels between, and including, a and d but only for every second column.
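Line 2: We print the subset of the original DataFrame, which contains only the columns a and c. Column d lies within the slice range but is skipped because, with a step of 2, the selection moves from a to c, and the next step would go past d.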
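
Label slices in .loc[] include both endpoints, and a step counts columns starting from the first label. The sketch below checks which columns a few slices return on the sample data; the variable names are illustrative only.

```python
import pandas as pd

# Same sample DataFrame as in the article.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6], "d": [4, 5, 6, 7]})

# Every second column from a through d: a, then c (d is skipped).
every_second = df.loc[:, "a":"d":2]
print(list(every_second.columns))  # ['a', 'c']

# Without a step, both endpoints are included.
inclusive = df.loc[:, "b":"d"]
print(list(inclusive.columns))     # ['b', 'c', 'd']

# Non-adjacent columns can be requested with a list of labels instead.
picked = df.loc[:, ["a", "d"]]
print(list(picked.columns))        # ['a', 'd']
```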
.iloc[ ] with step size 1
pandas also includes an indexer called .iloc[] that allows integer position-based slicing of a DataFrame. It is particularly helpful when the DataFrame's labels are not numeric or when we don't know the labels in advance.
The code below creates a new DataFrame df_slice by selecting columns 0, 1, and 2 from the original DataFrame df, using the .iloc indexing syntax with a step size of 1.
df_slice = df.iloc[:, 0:3:1]
print(df_slice)
Line 1: We create a new variable named df_slice to store a subset of a pandas DataFrame. The : on the left side of the comma specifies that we want to select all the rows of the DataFrame, and 0:3:1 on the right side of the comma specifies that we want to select columns with integer positions between 0 (inclusive) and 3 (exclusive), in steps of 1.
Line 2: We print the subset of the original DataFrame, which contains only the first three columns.
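
To illustrate position-based slicing when labels are not numeric, the sketch below rebuilds the sample DataFrame with string row labels. The labels w, x, y, and z are an assumption made for this example and are not part of the original data.

```python
import pandas as pd

# Sample data from the article, with hypothetical string row labels added.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5],
                   "c": [3, 4, 5, 6], "d": [4, 5, 6, 7]},
                  index=["w", "x", "y", "z"])

# Positions 0, 1, and 2: the end position 3 is excluded.
first_three = df.iloc[:, 0:3:1]
print(list(first_three.columns))  # ['a', 'b', 'c']

# Rows are also selected by position, regardless of their labels.
top_two = df.iloc[0:2, :]
print(list(top_two.index))        # ['w', 'x']

# Negative positions count from the end, as with Python lists.
last_column = df.iloc[:, -1]
print(last_column.name)           # d
```

Unlike .loc[], which includes the end label, .iloc[] follows Python's usual half-open slicing, so 0:3 stops before position 3.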