Given a PySpark DataFrame, we can select the columns whose names match a regular expression using PySpark's colRegex() method.
DataFrame.colRegex(colName: str)
colName: This represents the column name, specified as a regex string.

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [
    ("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")]

columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data = data, schema = columns)
df.select(df.colRegex("`.+name$`")).show()
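If the snippet runs successfully, show() should print only the two columns whose names end in name. The output should look roughly like this (exact padding may differ):

+---------+--------+
|firstname|lastname|
+---------+--------+
|    James|   Smith|
|  Michael|    Rose|
|   Robert|Williams|
|    Maria|   Jones|
+---------+--------+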
Note: If you see any warnings in the output, please ignore them.
Lines 1–2: We import pyspark and SparkSession.
Line 4: We create a SparkSession with the application name edpresso.
Lines 6–10: We define the dummy data for the DataFrame.
Line 12: We define the columns for the dummy data.
Line 13: We create a Spark DataFrame with the dummy data in lines 6–10 and the columns defined in line 12.
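As an optional check (not part of the original snippet), we could inspect the full DataFrame at this point, before the selection in line 14:

# Optional check, not in the original snippet:
# inspect the schema and all four columns before the regex selection
df.printSchema()
df.show()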
Line 14: We select a subset of the columns using a regex. The pattern .+name$ matches column names ending with name, so the colRegex() method returns the firstname and lastname columns.
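The same approach works with other patterns. As a small sketch (reusing the df defined above; note that the backticks around the pattern are required for it to be parsed as a regex rather than a literal column name), an alternation can select an explicit set of columns:

# Sketch reusing the df from above: select the location columns
# by matching either column name with an alternation pattern.
df.select(df.colRegex("`(country|state)`")).show()

Here, the regex matches the country and state columns, so only those two appear in the result.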