How to select columns based on regex in PySpark

Overview

Given a PySpark DataFrame, we can select the columns whose names match a regex using the DataFrame.colRegex() method.

Syntax

DataFrame.colRegex(colName: str)

The method returns the matching column(s) as a Column object, which can be passed to select().

Parameter

  • colName: This is a string containing the column name specified as a regex. The regex must be enclosed in backticks (`), as shown in the example below.

Example

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [("James","Smith","USA","CA"),
("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),
("Maria","Jones","USA","FL")
]

columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data = data, schema = columns)
df.select(df.colRegex("`.+name$`")).show()

Note: If you see any warnings in the output, please ignore them.
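
Running the example selects only the two columns whose names end in name, so the output should look like this:

+---------+--------+
|firstname|lastname|
+---------+--------+
|    James|   Smith|
|  Michael|    Rose|
|   Robert|Williams|
|    Maria|   Jones|
+---------+--------+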

Explanation

  • Lines 1–2: We import pyspark and SparkSession.

  • Line 4: We create a SparkSession with the application name edpresso.

  • Lines 6–10: We define the dummy data for the DataFrame.

  • Line 12: We define the columns for the dummy data.

  • Line 13: We create a Spark DataFrame with the dummy data defined in lines 6–10 and the columns defined in line 12.

  • Line 14: We select a subset of the columns using a regex. The regex .+name$ matches column names that end with the string name, so colRegex() retrieves the firstname and lastname columns. A variation is sketched below.

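As a variation (a hypothetical pattern, not part of the original example), the same approach can select the two location columns of the DataFrame built above by matching their names with an alternation:

# Select the columns named exactly "country" or "state".
df.select(df.colRegex("`^(country|state)$`")).show()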