Rows with `null` values often need to be removed from a `DataFrame` before further processing.

The `dropna()` method of the PySpark `DataFrame` API removes rows that contain null values.

## Syntax

```
DataFrame.dropna(how='any', thresh=None, subset=None)
```

## Parameters

- `how`: This parameter accepts two values, `any` and `all`. With `any` (the default), a row is dropped if it contains at least one null. With `all`, a row is dropped only if all of its values are null.
- `thresh`: If specified, rows with fewer than `thresh` non-null values are dropped. This overrides the `how` parameter.
- `subset`: This is the list of column names to check for nulls. By default, all columns are considered.

All of the parameters above are optional.
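
The way the three parameters combine can be sketched in plain Python. This is only a model of the rule described above, not Spark's implementation: the function name `dropna_rows` is hypothetical, rows are represented as dictionaries, and only `None` is treated as null.

```python
def dropna_rows(rows, how="any", thresh=None, subset=None):
    """Plain-Python model of the dropna() rule; each row is a dict."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    kept = []
    for row in rows:
        # subset limits which columns are checked; the default is all columns.
        cols = list(row) if subset is None else list(subset)
        non_null = sum(row[c] is not None for c in cols)
        if thresh is not None:
            required = thresh        # thresh takes precedence over how
        elif how == "any":
            required = len(cols)     # every checked column must be non-null
        else:
            required = 1             # one non-null value is enough to keep the row
        if non_null >= required:
            kept.append(row)
    return kept


rows = [
    {"name": "James", "company": "Educative", "department": "Engg"},
    {"name": "Michael", "company": "Google", "department": None},
    {"name": "Robert", "company": None, "department": "Marketing"},
    {"name": None, "company": None, "department": None},
]

print([r["name"] for r in dropna_rows(rows)])                      # how="any"
print([r["name"] for r in dropna_rows(rows, how="all")])
print([r["name"] for r in dropna_rows(rows, thresh=2)])
print([r["name"] for r in dropna_rows(rows, subset=["company"])])
```

In this model, `how="any"` is equivalent to requiring every checked column to be non-null, and `how="all"` is equivalent to requiring at least one non-null value, which is why a single threshold comparison covers all three cases.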

## Return value

This method returns a new `DataFrame` without the rows that matched the drop criteria. The original `DataFrame` is not modified.

## Dropping with `how=any`

Let's look at the code below:

```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('answers').getOrCreate()

data = [("James", "Educative", "Engg", "USA"),
    ("Michael", "Google", None, "Asia"),
    ("Robert", None, "Marketing", "Russia"),
    ("Maria", "Netflix", "Finance", "Ukraine")
]

columns = ["emp name", "company", "department", "country"]
df = spark.createDataFrame(data=data, schema=columns)

df_any = df.dropna(how="any")

df_any.show(truncate=False)
```

## Code explanation

- **Lines 1–2:** The `pyspark` module and the `SparkSession` class are imported.
- **Line 4:** A `SparkSession` with the application name `answers` is created.
- **Lines 6–12:** The sample rows and the column names for the `DataFrame` are defined. Some rows contain `None` values.
- **Line 13:** A Spark `DataFrame` is created.
- **Line 15:** The `dropna()` method is invoked on the `DataFrame` with the `how` parameter set to `any`. Every row that contains at least one null is dropped, which leaves only the rows for James and Maria.
- **Line 17:** The new `DataFrame` obtained after dropping those rows is printed.

## Dropping with `how=all`

Let's look at the code below:

```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('answers').getOrCreate()

data = [("James", "Educative", "Engg", "USA"),
    ("Michael", "Google", None, "Asia"),
    ("Robert", None, "Marketing", "Russia"),
    ("Maria", "Netflix", "Finance", "Ukraine"),
    (None, None, None, None)
]

columns = ["emp name", "company", "department", "country"]
df = spark.createDataFrame(data=data, schema=columns)

df_all = df.dropna(how="all")

df_all.show(truncate=False)
```


## Code explanation

- **Lines 1–2:** The `pyspark` module and the `SparkSession` class are imported.
- **Line 4:** A `SparkSession` with the application name `answers` is created.
- **Lines 6–13:** The sample rows and the column names for the `DataFrame` are defined. Some rows contain `None` values, and the last row consists entirely of `None` values.
- **Line 14:** A Spark `DataFrame` is created.
- **Line 16:** The `dropna()` method is invoked on the `DataFrame` with the `how` parameter set to `all`. Only the last row, whose values are all null, is dropped.
- **Line 18:** The new `DataFrame` obtained after dropping that row is printed.
