Apache Spark >> String

2022-04-13 Big data


In this article, we will talk about how to work with Strings in Apache Spark.
The String type is probably the type we deal with most frequently.
There are many functions for manipulating Strings.
※Based on Apache Spark 3.2.1

Apache Spark Strings

Prepare data

We need to prepare some sample data first for demonstration.

Download the CSV file below

Download student_scores.csv

The detailed data structure is as follows.

column      summary
id          the student's ID
first_name  the student's first name
last_name   the student's last name
birth       the student's birthday
gender      the student's gender
class       the student's class
subject     the subject the student is learning
score       the student's score for the subject

Read the CSV file

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .getOrCreate()

# Define the schema 
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DateType
schema = StructType([
    StructField("id", StringType(), False),
    StructField("first_name", StringType(), False),
    StructField("last_name", StringType(), False),
    StructField("birth", DateType(), True),
    StructField("gender", StringType(), True),
    StructField("class", StringType(), True),
    StructField("Subject", StringType(), True),
    StructField("score", IntegerType(), True)
])

# Read the csv
df = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", False) \
    .schema(schema) \
    .load("data/student_scores.csv")
df.show()


df.printSchema()


LowerCase, UpperCase

You can use lower() and upper() to convert strings to lowercase or uppercase.

import pyspark.sql.functions as F
df.select("Subject", F.lower("Subject").alias("Subject_lower"), F.upper("Subject").alias("Subject_upper")).show()
# OR
df.select(F.col("Subject"), F.lower(F.col("Subject")).alias("Subject_lower"), F.upper(F.col("Subject")).alias("Subject_upper")).show()
# OR
df.select(F.expr("Subject"), F.lower(F.expr("Subject")).alias("Subject_lower"), F.upper(F.expr("Subject")).alias("Subject_upper")).show()
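
A common use of lower() is case-insensitive comparison. The sketch below normalizes first_name before comparing it with a lowercase literal; the value 'anna' is only a hypothetical example, not taken from the dataset.

# A minimal sketch: normalize the column with lower() so that 'Anna',
# 'ANNA' and 'anna' would all be matched ('anna' is a hypothetical value).
df.where(F.lower("first_name") == "anna").show()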


Regular expressions

Regular expressions are frequently used when manipulating strings, for example to search for patterns or to replace them.
In PySpark, we can use the two functions regexp_extract() and regexp_replace() to apply regular expressions to column values.

For example, we can use the regexp_extract() method to extract the year from the birth column as follows.

First parameter: The target column name, a Column object (using col()), or an expression (using expr()).
Second parameter: The regular expression.
Third parameter: The group number you want to extract.

The alias() method gives the result a new column name.
If the specified group does not match, an empty string is returned.

df.select(F.regexp_extract("birth", r'^(\d{4}).*$', 1).alias("birth_year")).show()
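
As mentioned above, an unmatched group returns an empty string. The short sketch below demonstrates this with a hypothetical literal value instead of a real column.

# A minimal sketch: the pattern has no match in the literal value,
# so regexp_extract() returns an empty string '' (not null).
df.select(
    F.regexp_extract(F.lit("no digits here"), r'(\d{4})', 1).alias("no_match")
).show()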


Next, let's use the regexp_replace() method to replace '-' in the birth column with '/'.
For example, ‘2000-01-07’ will be replaced with ‘2000/01/07’.

First parameter: The target column name, Column Object(using col()) or expression(using expr()).
Second parameter: The old pattern.
Third parameter: The new pattern.

df.select(F.regexp_replace(F.col("birth"), r'-', r'/').alias("birth_slash")).show()
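
regexp_replace() also supports capture groups in the replacement string (Java regex syntax: $1, $2, ...). As a rough sketch, the example below reorders the parts of the birth column into month/day/year; the output column name birth_mdy is just an illustrative choice.

# A minimal sketch: capture year, month and day, then reference them in the
# replacement with $1, $2, $3 (e.g. '2000-01-07' becomes '01/07/2000').
df.select(
    F.regexp_replace("birth", r'^(\d{4})-(\d{2})-(\d{2})$', r'$2/$3/$1').alias("birth_mdy")
).show()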


Removing or adding spaces

In pyspark.sql.functions, there are many functions for manipulating strings.

ltrim() removes spaces before the words, rtrim() removes spaces after the words, and trim() removes both.

df.select(
          F.lit(" abc ").alias("origin"), 
          F.trim(F.lit(" abc ")).alias("origin_trim"), 
          F.ltrim(F.lit(" abc ")).alias("origin_ltrim"), 
          F.rtrim(F.lit(" abc ")).alias("origin_rtrim")
         ).show()


lpad() and rpad() can be used to add spaces or other characters on the left or right to pad the string to the specified length.

df.select(
          F.lit("cde").alias("origin"), 
          F.lpad(F.lit("cde"), 5, " ").alias("origin_lpad_spaces"), 
          F.lpad(F.lit("cde"), 5, "*").alias("origin_lpad_symbols"), 
          F.rpad(F.lit("cde"), 5, " ").alias("origin_rpad_spaces"), 
          F.rpad(F.lit("cde"), 5, "*").alias("origin_rpad_symbols"), 
         ).show()
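
A typical use of lpad() is zero-padding identifiers to a fixed width. The sketch below assumes the id values are short numeric strings; the width of 5 is just an example value.

# A minimal sketch: left-pad id with zeros to a fixed width of 5,
# e.g. '42' would become '00042' (the width 5 is an assumed example).
df.select("id", F.lpad("id", 5, "0").alias("id_padded")).show()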


Containing substring or not

We can use the instr() function in pyspark.sql.functions to check if the values of Column 'first_name' contain the character 'a'. The position (1-based index) will be returned. If 'a' is contained, the result will be greater than or equal to 1. Otherwise, 0 will be returned.
Firstly, let’s check all the results.

df.select("first_name", F.instr("first_name", "a").alias("position")).show()


Next, let's filter all the rows whose first_name column contains the character 'a'.
First, we build a boolean expression that indicates whether each value contains 'a'.
Then we add the boolean flag as a new column and use it to filter the rows.

# Check if the values of Column 'first_name' contain character 'a'
check_containing = F.instr(F.col("first_name"), "a") >= 1 
df.withColumn("containA", check_containing) \
    .where("containA") \
    .show(100, False)
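
If you do not need to keep the boolean flag, the same condition can be passed to where() directly, without adding an intermediate column.

# A minimal sketch: filter on the instr() condition directly.
df.where(F.instr("first_name", "a") >= 1).show(100, False)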


If you want to learn more about adding columns, you can read the article below.

Apache Spark » Column and Row
https://thats-it-code.com/apachespark/apachespark__column-and-row/

Split one column into multiple columns

We can use the split() method in pyspark.sql.functions to split the value of one column into multiple values and convert them to new columns.

pyspark.sql.functions.split
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.split.html

For example, we want to split the birth column into three columns (year, month, day).

birth
2000-01-07

👇

year month day
2000 01 07

We split the birth column and add three new columns by referencing the results.

pyspark.sql.functions.split(str, pattern, limit=-1)

str: The column to split
pattern: Regular expression
limit: How many times the pattern is applied. The default value is -1, which means the pattern will be applied as many times as possible.

split_results = F.split("birth", "-")
df.withColumn("year", split_results[0]) \
    .withColumn("month", split_results[1]) \
    .withColumn("day", split_results[2]).show()


Conclusion

We can change a string's case by using lower() and upper().
We can use regexp_extract() to extract patterns from the column values.
We can use regexp_replace() to replace patterns in the column values.
We can use trim(), ltrim(), and rtrim() to remove spaces from the column values.
We can use rpad() and lpad() to add spaces or other characters on the left or right to pad the column values to the specified length.
We can use instr() to check whether a substring is contained in the column values or not.
We can use split() to split one column into multiple columns.
