Apache Spark >> String
In this article, we will look at how to work with strings in Apache Spark.
String is probably the type we deal with most frequently, and PySpark provides many functions for manipulating it.
※Based on Apache Spark 3.2.1
Prepare data
We need to prepare some sample data first for demonstration.
Download the sample CSV file (student_scores.csv) and place it under the data/ directory.
The detailed data structure is as follows.
column | summary |
---|---|
id | student’s ID |
first_name | student’s first name |
last_name | student’s last name |
birth | student’s birthday |
gender | student’s gender |
class | student’s class |
subject | the subject the student is learning |
score | the student's score for the subject |
Read the CSV file
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .getOrCreate()

# Define the schema explicitly so that schema inference is not needed
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DateType

schema = StructType([
    StructField("id", StringType(), False),
    StructField("first_name", StringType(), False),
    StructField("last_name", StringType(), False),
    StructField("birth", DateType(), True),
    StructField("gender", StringType(), True),
    StructField("class", StringType(), True),
    StructField("subject", StringType(), True),
    StructField("score", IntegerType(), True)
])

# Read the CSV file with the explicit schema
df = spark.read \
    .format("csv") \
    .option("header", True) \
    .schema(schema) \
    .load("data/student_scores.csv")
df.show()
df.printSchema()
Lowercase and uppercase
You can use lower() and upper() to convert a string to lowercase or uppercase.
import pyspark.sql.functions as F
df.select("Subject", F.lower("Subject").alias("Subject_lower"), F.upper("Subject").alias("Subject_upper")).show()
# OR
df.select(F.col("Subject"), F.lower(F.col("Subject")).alias("Subject_lower"), F.upper(F.col("Subject")).alias("Subject_upper")).show()
# OR
df.select(F.expr("Subject"), F.lower(F.expr("Subject")).alias("Subject_lower"), F.upper(F.expr("Subject")).alias("Subject_upper")).show()
Regular expressions
Regular expressions are frequently used when manipulating strings, both for searching and for replacing.
In PySpark, the two functions regexp_extract() and regexp_replace() cover these use cases.
For example, we can use the regexp_extract() function to extract the year from the birth column as follows.
First parameter: the target column name, a Column object (using col()), or an expression (using expr()).
Second parameter: the regular expression.
Third parameter: the number of the group you want to extract.
The alias() method gives the result a new column name.
If the specified group does not match, an empty string is returned.
df.select(F.regexp_extract("birth", r'^(\d{4}).*$', 1).alias("birth_year")).show()
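To make the empty-string behavior concrete, here is a minimal sketch: the pattern below is deliberately chosen so that it never matches a birth value.
# A minimal sketch: this pattern never matches a date like '2000-01-07',
# so regexp_extract() returns an empty string rather than null
df.select(
    "birth",
    F.regexp_extract("birth", r'^(X{4}).*$', 1).alias("no_match")
).show()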
Next, let's use the regexp_replace() function to replace '-' in the birth column with '/'.
For example, '2000-01-07' will become '2000/01/07'.
First parameter: the target column name, a Column object (using col()), or an expression (using expr()).
Second parameter: the regular expression to match.
Third parameter: the replacement string.
df.select(F.regexp_replace(F.col("birth"), r'-', r'/').alias("birth_slash")).show()
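regexp_replace() also supports capture groups in the replacement string via Java-style $1, $2 references. Here is a small sketch (the date reordering is just an illustration, not part of the original example):
# A sketch: reorder 'yyyy-MM-dd' into 'dd/MM/yyyy' using capture groups
df.select(
    F.regexp_replace("birth", r'^(\d{4})-(\d{2})-(\d{2})$', r'$3/$2/$1').alias("birth_dmy")
).show()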
Removing or adding spaces
In pyspark.sql.functions, there are many functions for manipulating strings.
trim() removes spaces on both sides of a string, ltrim() removes leading spaces, and rtrim() removes trailing spaces.
df.select(
    F.lit(" abc ").alias("origin"),
    F.trim(F.lit(" abc ")).alias("origin_trim"),
    F.ltrim(F.lit(" abc ")).alias("origin_ltrim"),
    F.rtrim(F.lit(" abc ")).alias("origin_rtrim")
).show()
rpad() and lpad() pad a string on the right or left with spaces or other characters until it reaches the specified length.
df.select(
    F.lit("cde").alias("origin"),
    F.lpad(F.lit("cde"), 5, " ").alias("origin_lpad_spaces"),
    F.lpad(F.lit("cde"), 5, "*").alias("origin_lpad_symbols"),
    F.rpad(F.lit("cde"), 5, " ").alias("origin_rpad_spaces"),
    F.rpad(F.lit("cde"), 5, "*").alias("origin_rpad_symbols")
).show()
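One detail worth knowing: per the Spark documentation, if the input string is already longer than the target length, lpad() and rpad() truncate it to that length. A short sketch:
# A sketch: inputs longer than the target length are shortened to it
df.select(
    F.lpad(F.lit("abcdef"), 3, "*").alias("lpad_truncated"),  # "abc"
    F.rpad(F.lit("abcdef"), 3, "*").alias("rpad_truncated")   # "abc"
).show()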
Checking for a substring
We can use the instr() function in pyspark.sql.functions to check whether the values of the first_name column contain the character 'a'. It returns the position as a 1-based index.
If 'a' is contained, the result will be greater than or equal to 1. Otherwise, 0 will be returned.
First, let's check the positions for all rows.
df.select("first_name", F.instr("first_name", "a").alias("position")).show()
Next, let's filter all the rows whose first_name column contains the character 'a'.
We first build a boolean expression indicating whether 'a' is contained, add it as a new column, and then use that column to filter the rows.
# Check if the values of Column 'first_name' contain character 'a'
check_containing = F.instr(F.col("first_name"), "a") >= 1
df.withColumn("containA", check_containing) \
.where("containA") \
.show(100, False)
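As a side note, the Column.contains() method offers an equivalent way to express this filter; a minimal sketch:
# A sketch using Column.contains(), which returns a boolean directly
df.where(F.col("first_name").contains("a")).show()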
If you want to learn more about adding columns, you can read the article below.
Apache Spark » Column and Row
https://thats-it-code.com/apachespark/apachespark__column-and-row/
Split one column into multiple columns
We can use the split() function in pyspark.sql.functions to split the value of one column into an array of values, which we can then turn into new columns.
pyspark.sql.functions.split
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.split.html
For example, we want to split the birth column into three columns (year, month, day).
birth |
---|
2000-01-07 |
👇
year | month | day |
---|---|---|
2000 | 01 | 07 |
We split the birth column and add three new columns by referencing the results.
pyspark.sql.functions.split(str, pattern, limit=-1)
str: the column to split
pattern: a regular expression
limit: controls how many times the pattern is applied. The default -1 means the pattern is applied as many times as possible; a positive value caps the resulting array at that many elements (see the sketch after the example below).
split_results = F.split("birth", "-")
df.withColumn("year", split_results[0]) \
    .withColumn("month", split_results[1]) \
    .withColumn("day", split_results[2]) \
    .show()
Conclusion
We can change a string's case by using lower() and upper().
We can use regexp_extract() to extract some patterns from the column values.
We can use regexp_replace() to replace some patterns in the column values.
We can use trim(), ltrim(), rtrim() to remove spaces from the column values.
We can use rpad(), lpad() to pad the column values with spaces or other characters to the specified length.
We can use instr() to check whether some substrings are in the column values or not.
We can use split() to split one column into multiple columns.