The dropna() function in pandas removes rows or columns that contain missing (NaN) values, cleaning the data for analysis.
Key takeaways:
dropna() in pandas removes rows or columns with missing values (NaN).
The function improves data quality by ensuring only complete data is used for analysis.
Parameters like axis, how, thresh, subset, and inplace allow flexible data cleaning.
The default behavior removes rows with any missing value.
Specifying axis=1 drops columns with NaN values.
Using how="all" removes rows or columns where all values are NaN.
The thresh parameter keeps only rows or columns that have at least the given number of non-NaN values.
The subset parameter restricts the NaN check to specific columns or rows.
The inplace=True option modifies the original DataFrame without returning a new one.
Handling missing values is a crucial part of data cleaning, because gaps in the data can distort results and lead to misinterpretation or unsound conclusions. pandas provides the dropna() function for this purpose; it drops the rows or columns that contain missing (NaN) values. Removing incomplete records can improve the quality of statistical analysis and predictive models and make data visualizations cleaner. Applied carefully, dropna() helps ensure your data is clean and analysis-ready.
Let’s explore how this function works with the help of a diagram.
In the above diagram, we apply the dropna() function to a DataFrame without specifying any parameters. This triggers the function’s default behavior, which removes every row containing a NaN value. The result is a new DataFrame without the incomplete row, so only complete records are retained for analysis.
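The default behavior described above can be sketched in a few lines. The DataFrame and its column names here are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical sample data; None becomes NaN in the numeric column.
scores = pd.DataFrame({
    "student": ["Ava", "Ben", "Cleo"],
    "score": [88.0, None, 92.0],
})

# With no arguments, dropna() drops every row that has at least one NaN
# and returns a new DataFrame; `scores` itself is left untouched.
complete = scores.dropna()

print(complete)
print(len(scores), "rows before,", len(complete), "rows after")
```

Note that the surviving rows keep their original index labels, so the result is indexed 0 and 2 rather than renumbered.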
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
All the parameters are optional; their default values are shown in the syntax above. The parameters give precise control over how missing values are handled. Below is a detailed breakdown:
axis: Sets the axis along which missing values are identified and dropped.
    axis=0 (default): Drops the rows with missing values.
    axis=1: Drops the columns with missing values.
how: Defines the criterion for dropping a row or column.
    "any" (default): Drops the row or column if any of its values is NaN.
    "all": Drops the row or column only if all of its values are NaN.
thresh: Sets the minimum number of non-missing values required to retain a row or column.
subset: Specifies the labels along the other axis to examine for missing values; for example, the columns to check when dropping rows.
inplace: Decides whether to modify the DataFrame in place or return a new DataFrame.
    True: Alters the original DataFrame.
    False (default): Returns a new DataFrame with the changes applied.
These parameters give you the flexibility to tailor the dropna() function to the specific needs of your data cleaning process.
Let’s walk through the dropna() function and each of its parameters with the help of code:
The import statement and dataset are hidden after the first code snippet in all subsequent code snippets to facilitate understanding of the code.
axis parameter of dropna()
The code snippet below uses the default, axis=0, as we have not passed a value for axis.
    import pandas as pd

    dataset = {'Name': ["John", "Mary", "Kane", "Duke"],
               'Height': [152, None, 137, 121],
               'Weight': [66, 81, 54, None]}
    df = pd.DataFrame(dataset)

    print("Before cleaning:-")
    print(df)
    print('\n')  # Printing an empty line
    new_df = df.dropna()
    print("After cleaning:-")
    print(new_df)
Let’s understand how the above code works:
Line 1: Importing the pandas library, in which dropna() is defined.
Lines 3–5: Making a dataset with some missing values.
Line 6: Converting the dataset to a pandas DataFrame.
Line 9: Displaying the DataFrame with NaN values.
Line 11: Creating a new DataFrame with the rows containing a NaN value removed; the original DataFrame is left unchanged.
Line 13: Displaying the cleaned dataset.
Now, let’s specify axis=1 in our function and observe its behaviour.

    print("Before cleaning:-")
    print(df)
    print('\n')  # Printing an empty line
    new_df = df.dropna(axis=1)
    print("After cleaning:-")
    print(new_df)

As we have set axis=1 in the function, every column containing a NaN has been dropped. Both Height and Weight contain a NaN, so only the Name column remains.
how parameter of dropna()
Let’s specify how="any" and observe the output.

    print("Before cleaning:-")
    print(df)
    print('\n')  # Printing an empty line
    new_df = df.dropna(how="any")
    print("After cleaning:-")
    print(new_df)

As we specified how="any", an entire row is dropped if it contains any NaN value. Therefore, the two rows containing one NaN value each were removed.
Let’s specify how="all" and see the behavior change.

    print("Before cleaning:-")
    print(df)
    print('\n')  # Printing an empty line
    new_df = df.dropna(how="all")
    print("After cleaning:-")
    print(new_df)

As we specified how="all", a row is dropped only if all of its values are NaN. As there was no row where all the values were NaN, the DataFrame remained unchanged.
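To see how="all" actually remove something, the data needs a row with no values at all. The following is a minimal sketch using a hypothetical DataFrame (the sensor column names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the row at index 1 has no values at all.
readings = pd.DataFrame({
    "sensor_a": [1.5, np.nan, 3.0],
    "sensor_b": [np.nan, np.nan, 2.5],
})

# Only the row whose values are all NaN (index 1) is removed;
# index 0 survives because sensor_a is present there.
trimmed = readings.dropna(how="all")
print(trimmed)
```

With how="any" instead, the row at index 0 would be dropped as well, since it has one missing value.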
thresh parameter of dropna()
Let’s pass a specific value for the thresh parameter and observe the function.

    print("Before cleaning:-")
    print(df)
    print('\n')  # Printing an empty line
    new_df = df.dropna(thresh=3)
    print("After cleaning:-")
    print(new_df)

As we specified thresh=3 in the above code snippet, rows containing fewer than 3 non-NaN values are dropped. Therefore, the rows at indices 1 and 3 were dropped, as they contained only 2 non-NaN values.
We can further improve our understanding of this parameter by specifying thresh=4.

    print("Before cleaning:-")
    print(df)
    print('\n')  # Printing an empty line
    new_df = df.dropna(thresh=4)
    print("After cleaning:-")
    print(new_df)

Our DataFrame has only 3 columns, so no row can have 4 non-NaN values. Every row is dropped, and the result is an empty DataFrame.
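The thresh parameter also works along columns when combined with axis=1. The sketch below uses a hypothetical survey DataFrame (not the article's dataset) to keep only columns that are mostly filled in:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with columns of varying completeness.
survey = pd.DataFrame({
    "age": [25, 31, 47, 52],
    "income": [np.nan, 58000, np.nan, np.nan],
    "city": ["Oslo", None, "Lima", "Pune"],
})

# Keep only columns with at least 3 non-missing values.
# "income" has just 1, so it is dropped; "city" has 3 and stays.
dense_columns = survey.dropna(axis=1, thresh=3)
print(dense_columns.columns.tolist())
```

A threshold like this can be a gentler choice than how="any" when a column is missing only a few entries.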
subset parameter of dropna()
Now we will specify a subset of columns to be checked for NaN values when dropping rows. Let’s try with subset=["Height"].

    print("Before cleaning:-")
    print(df)
    print('\n')  # Printing an empty line
    new_df = df.dropna(subset=["Height"])
    print("After cleaning:-")
    print(new_df)

As we specified subset=["Height"], only the row containing a NaN value in the Height column was dropped. Although another row contained a NaN value in the Weight column, it was not dropped, as Weight was not passed in the subset parameter.
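The subset parameter can also take several columns and be combined with how. As a sketch, the snippet below rebuilds the same dataset under a different variable name (people, chosen here so the snippet stands alone) and compares the two criteria:

```python
import pandas as pd

# The article's dataset, rebuilt so this snippet is self-contained.
people = pd.DataFrame({
    "Name": ["John", "Mary", "Kane", "Duke"],
    "Height": [152, None, 137, 121],
    "Weight": [66, 81, 54, None],
})

# Drop a row if either Height or Weight is missing (how="any" is the default).
both_required = people.dropna(subset=["Height", "Weight"])

# Drop a row only if Height and Weight are both missing (no such row here).
either_suffices = people.dropna(subset=["Height", "Weight"], how="all")

print(both_required.index.tolist())
print(either_suffices.index.tolist())
```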
inplace parameter of dropna()
Let’s observe the default behaviour without passing the inplace parameter. The default is inplace=False.

    print("Before cleaning:-")
    print(df)
    print('\n')  # Printing an empty line
    df.dropna()
    print("After cleaning:-")
    print(df)

No change occurred in the original DataFrame. This happened because we didn’t set the inplace parameter to True, and we discarded the new DataFrame that dropna() returned.
Let’s specify inplace=True and observe the changes in the DataFrame.

    print("Before cleaning:-")
    print(df)
    print('\n')  # Printing an empty line
    df.dropna(inplace=True)
    print("After cleaning:-")
    print(df)

We notice that once we set inplace=True, the original DataFrame has been modified.
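One pitfall worth sketching: with inplace=True, dropna() returns None rather than a DataFrame, at least in the pandas versions this sketch assumes, so assigning its result to a variable does not give you the cleaned data. The variable names below are illustrative:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["John", "Mary"],
    "Height": [152, None],
})

# With inplace=True the method mutates `people` directly and returns None,
# so `result` does not hold a DataFrame.
result = people.dropna(inplace=True)

print(result)
print(len(people))
```

If you prefer to avoid in-place mutation, reassigning the returned DataFrame, as in people = people.dropna(), achieves the same effect.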
The dropna() function in pandas is an essential tool for handling missing values in datasets. It cleans data by removing rows or columns containing NaN values, making accurate analysis easier. The function is highly customizable, offering the axis, how, thresh, subset, and inplace parameters for tailoring the cleaning to specific needs. With these options, dropna() helps ensure that analysis and predictions work only with complete records.