Pandas >> Select Rows From a DataFrame Based on Column Values

2023-03-02 Pandas

Table of Contents

In this tutorial, we will talk about how to select rows based on column values or relations between columns in Pandas, use boolean indexing with logical operators or query method, and methods using regular expression like str.match(), str.contains() and python re module.

Pandas

Basic

To select rows from a Pandas DataFrame based on column values, you can use boolean indexing. Here’s an example:

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
   'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
   'Age': [25, 30, 35, 40, 45],
   'Gender': ['F', 'M', 'M', 'M', 'F']
})

# Select rows where Age is greater than or equal to 35
result = df[df['Age'] >= 35]

# Display the result
print(result)

In this example, we first create a sample DataFrame with columns ‘Name’, ‘Age’, and ‘Gender’. We then use boolean indexing to select rows where the ‘Age’ column is greater than or equal to 35. The resulting DataFrame is stored in the result variable and printed to the console.

You can also use other comparison operators such as == for equality, < for less than, > for greater than, <= for less than or equal to, and != for not equal to.

You can also use multiple conditions by combining them using the & operator for “and” and the | operator for “or”. Here’s an example:

# Select rows where Age is greater than or equal to 35 and Gender is 'M'
result = df[(df['Age'] >= 35) & (df['Gender'] == 'M')]

# Display the result
print(result)

This will select rows where the ‘Age’ column is greater than or equal to 35 and the ‘Gender’ column is ‘M’. The resulting DataFrame is stored in the result variable and printed to the console.

Using query()

Using the query() method: This method allows you to write a SQL-like query to filter rows based on column values. Here’s an example:

# Select rows where Age is greater than or equal to 35 using the query() method
result = df.query('Age >= 35')

# Display the result
print(result)

This will select rows where the ‘Age’ column is greater than or equal to 35 using the query() method.

You can also use the query() method with a parameter. The query() method can accept a parameter that represents variables in the current scope. Here’s an example:

# Define a variable
age_threshold = 35

# Select rows where Age is greater than or equal to age_threshold using the query() method
result = df.query('Age >= @age_threshold')

# Display the result
print(result)

In this example, we define a variable age_threshold that is set to 35. We then use the query() method with the expression Age >= @age_threshold. The @ symbol indicates that age_threshold is a variable in the current scope, rather than a column name in the DataFrame.

This method can be useful if you have a specific threshold value that you want to use to filter rows, and you want to avoid hard-coding the value in the query string. By defining a variable in the current scope, you can easily modify the threshold value without modifying the query string.

Using isin()

Using the isin() method: This method allows you to select rows where a column value is in a list of values. Here’s an example:

# Select rows where Name is 'Bob' or 'Charlie' using the isin() method
result = df[df['Name'].isin(['Bob', 'Charlie'])]

# Display the result
print(result)

This will select rows where the ‘Name’ column is either ‘Bob’ or ‘Charlie’ using the isin() method.

Using loc[]

Using the loc[] method: This method allows you to select rows based on both row and column labels. Here’s an example:

# Select rows where Age is greater than or equal to 35 using the loc[] method
result = df.loc[df['Age'] >= 35]

# Display the result
print(result)

This will select rows where the ‘Age’ column is greater than or equal to 35 using the loc[] method.

Complex expression

You can use the boolean indexing technique along with some complex logical expressions that combine multiple conditions.

# Create a sample DataFrame
df = pd.DataFrame({
   'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
   'Age': [25, 30, 35, 40, 45],
   'Gender': ['F', 'M', 'M', 'M', 'F'],
   'Salary': [50000, 70000, 80000, 90000, 60000]
})

# Define a complex condition to select rows based on Name, Age, Gender and Salary
result = df[(df['Name'].str.contains('a')) & (df['Age'] >= 30) & ((df['Gender'] == 'M') | (df['Salary'] >= 80000))]

# Display the result
print(result)

In this example, we define a complex condition to select rows based on whether the ‘Name’ column contains the letter ‘a’, the ‘Age’ column is greater than or equal to 30, and either the ‘Gender’ column is ‘M’ OR the ‘Salary’ column is greater than or equal to 80000. We use the str.contains() method to check whether the ‘Name’ column contains the letter ‘a’. We combine the conditions using the & and | operators and enclose each condition in parentheses to specify the order of operations. The resulting DataFrame is stored in the result variable and printed to the console.

Select by regular expression

str.match()

The str.match() method in Pandas is used to extract only the rows from a DataFrame where the values in a particular column match a regular expression pattern.

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
   'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
   'Age': [25, 30, 35, 40, 45],
   'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco', 'Seattle']
})

# Select rows where the 'City' column starts with the letter 'N'
result = df[df['City'].str.match('^N')]

# Display the result
print(result)

In this example, we use the str.match() method to select rows where the ‘City’ column starts with the letter ‘N’. We pass the regular expression pattern ‘^N’ as an argument to the method to match only strings that start with the letter ‘N’. The resulting DataFrame is stored in the result variable and printed to the console.

Note that the str.match() method only matches patterns at the beginning of the string. If you want to match patterns anywhere in the string, you should use the str.contains() method instead. Also, like with str.contains(), you can adjust the regular expression pattern to fit your specific needs and criteria.

str.contains()

To select rows in a Pandas DataFrame using regular expressions (regex), you can use the str.contains() method.

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
   'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
   'Age': [25, 30, 35, 40, 45],
   'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco', 'Seattle']
})

# Select rows where the 'City' column contains the string 'New' or 'San'
result = df[df['City'].str.contains('New|San')]

# Display the result
print(result)

In this example, we use the str.contains() method to select rows where the ‘City’ column contains the string ‘New’ or ‘San’. We pass the regular expression pattern ‘New|San’ as an argument to the method to match either string. The resulting DataFrame is stored in the result variable and printed to the console.

You can adjust the regular expression pattern to fit your specific needs and criteria. Note that the str.contains() method is case-sensitive by default, but you can set the case parameter to False to make it case-insensitive.

Using re module

You can also use the Python re module to select rows based on a regular expression pattern. Here’s an example:

import pandas as pd
import re

# Create a sample DataFrame
df = pd.DataFrame({
   'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
   'Age': [25, 30, 35, 40, 45],
   'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco', 'Seattle']
})

# Select rows where the 'City' column contains the string 'New' or 'San'
regex_pattern = re.compile('New|San')
result = df[df['City'].apply(lambda x: bool(regex_pattern.search(x)))]

# Display the result
print(result)

In this example, we first import the re module to create a regular expression pattern to match strings containing either ‘New’ or ‘San’. We then use the apply() method to apply the regular expression pattern to each value in the ‘City’ column of the DataFrame. We use the search() method of the regular expression pattern to search for a match in each value.

Finally, we use the bool() function to convert the search result to a Boolean value (i.e., True if there’s a match and False otherwise). The resulting Boolean values are used to select only the rows where there is a match in the ‘City’ column. The resulting DataFrame is stored in the result variable and printed to the console.

Note that using the apply() method can be slower than using the str.contains() method directly, especially for large DataFrames. Therefore, it’s recommended to use the str.contains() method whenever possible.

Subscribe and be the FIRST reader of our latest articles

* indicates required

Contact us