How to Select Rows Containing Specified String
Table of Contents
In this article, we will talk about how to extract rows containing specified string or satisfying some conditions.
Firstly, we prepare data.
Preparing data
import pandas as pd
df = pd.DataFrame({
"name": ["Kevin", "Jack", "Mary", "Bob", "Robert", "Amy"],
"score": [80, 90, 95, 93, 88, 81],
"class": ["A", "B", "A", "A", "B", "B"],
"Interests": ["Writing, Sports", "Sports, Traveling", "Violin, Reading", "Reading, Blogging","Learning, Reading, Writing","Volunteering, Piano"]
}, index=["K", "J", "M", "B", "R", "A"])
df
How to select rows from DataFrame
We can select rows from DataFrame by one condition (==,!=, >, <) or multiple conditions combination (&, |). You can read the article below to understand how to select rows from DataFrame in detail.
Pandas » How to Filter Data
https://thats-it-code.com/pandas/pandas__how-to-filter-data/
Select all students of class ‘A’
df[df["class"] == 'A']
Select all students with score greater than 80
df[df["score"] > 80]
Select all students of class ‘A’ with score greater than 80.
df[(df["class"] == 'A') & (df["score"] > 80)]
Select rows with specified column has exactly the same value (==)
We can use == to select data with exactly same value.
For example, we want to select persons whose interests are Reading, Blogging.
df[df["Interests"] == "Reading, Blogging"]
Select rows with those columns contain specified string (str.contains())
We can use str.contains() method in filter condition to select rows with those columns
contain some string.
For Example, we want to select students who are interested in Reading.
df[df["Interests"].str.contains("Reading")]
When there are NaN, ValueError will occur when selecting.
Firstly, we add one row whose Interests is NaN.
import numpy as np
new_row = pd.DataFrame({"name": "Tom", "score": 85, "class": 'C', "Interests": np.nan}, index=["T"])
df2 = pd.concat([df, new_row], axis=0)
df2
If we execute code above, we will get ValueError.
df2[df2["Interests"].str.contains("Reading")]
So to avoid above error, we can firstly fill NaN with empty string and apply selecting operation.
You can read the article below to find how to fill NaN value in Pandas.
How to Process Null Values in Pandas
https://thats-it-code.com/pandas/how-to-process-null-values/
df2["Interests"] = df2["Interests"].fillna("")
df2[df2["Interests"].str.contains("Reading")]
Use case option in contains to specify case sensitive or case insensitive
For example, we can select above rows by using lower case reading.
df[df["Interests"].str.contains("reading", case=False)]
Use regex expression in contains
We can also use regex expression in contains() method.
For example, we can use .*V.* to select students who are interested in Voilin and Volunteering.
df[df["Interests"].str.contains(".*V.*")]
By default regex mode is open. If we want to turn it off, we can specify regex=False.
df[df["Interests"].str.contains(".*V.*", regex=False)]
Use str.startswith() to find rows whose column starts with a pattern.
For example, we can find students whose interests start with Learning as below.
df[df["Interests"].str.startswith("Learning")]
Use str.endswith() to find rows whose column ends with a pattern.
For example, we can find students whose interests end with ing as below.
df[df["Interests"].str.endswith("ing")]
Use str.match() to find rows whose column values match regex pattern.
For example, we can find students who have three interests as below.
[^,] means character except comma, + means at least occur one time.
df[df["Interests"].str.match("[^,]+,[^,]+,[^,]+")]
Conclusion
- We can use == to select rows whose column value exactly equals specified value.
- We can use str.contains() method to select rows whose column values contain some value or regex pattern.
- We can also use str.startswith(), str.endswith(), str.match() method to select rows by partially matched some patterns.