GroupBy

2022-04-14 Pandas

In this article, we will talk about how to use GroupBy in Pandas.
Sometimes, we need to group the rows into buckets to execute some operations like aggregation.

Pandas GroupBy

Prepare data

We need to prepare some sample data first for demonstration.
You can download the sample data file below.

Download the CSV file below

Download student_scores.csv

The data structure is as follows.

column	summary
id	student’s ID
first_name	student’s first name
last_name	student’s last name
birth	student’s birthday
gender	student’s gender
class	student’s class
subject	the subjects that students are learning
score	student’s scores of subjects

Read the CSV file

import pandas as pd

df = pd.read_csv("data/student_scores.csv", header=0)
df

Pandas Groupby

Basic

We can use the groupby() method of DataFrame to group data.

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

groupby()

The by parameter can accept one column or multiple columns.
For example, we can group by one column.

df.groupby("Subject")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002365393E800>

df.groupby(["class", "Subject"])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002365393D750>

General operation on DataFrameGroupBy object

The groupby() method will return a DataFrameGroupBy object.
DataFrameGroupBy object can do these operations as follow.
For example, we can apply some general aggregation operations on DataFrameGroupBy object like mean(), max(), min(), count() and etc.
The statistics functions will be applied on all the numeric columns automatically.

In the DataFrame, there are two numeric columns: id and score.

df.groupby("Subject").mean()

Pandas Groupby

As you can see, the id column is also treated as number type. This is not we want to see. We can extract one column as SeriesGroupBy or multiple columns as DataFrameGroupBy.

For example, we can extract the score column only as SeriesGroupBy and get its mean value by using mean() method.

df.groupby("Subject")["score"].mean()

Pandas Groupby

Actually, we don’t want to treat the id column as number. So we need to cast the id column in advance.

df["id"] = df["id"].astype(str)
df.groupby("Subject").mean()

As you can see, only the results of the Score column are output this time.
Pandas Groupby

Get a group by name
We can use get_group() method to get a specified group.
The parameter is one value of the group column.
The result will return a DataFrame with whose group column values are all the specified parameter.

df.groupby("Subject").get_group("Geometry")

Pandas Groupby

Iterate the results of groupby()
We can iterate through the results of groupby() in the general Python loop way.
For each loop, we can get the group name and corresponding DataFrames.

for name, group in df.groupby("Subject"):
    print(f"name: {name}")
    print(f"group: {group}")
    print(f"------------------------------------------")

Pandas Groupby

We can also iterate through the groups by using the groups attribute of the groupby result.
The structure of groups attribute is a dict as follows.
Key: Group name
Value: Index objects storing the row index in the original DataFrame.
For example, the first element of df.groupby(“Subject”).groups will be as below.

‘Calculus’: Int64Index([0, 3, 6, 9, 12, 15, 18, 21, 24])

for name, group in df.groupby("Subject").groups.items():
    print(f"name: {name}")
    print(f"group: {df.loc[group]}")
    print(f"------------------------------------------")

Pandas Groupby

Get the the number of groups
We can use the len() method to get the number of groups.

subjects = df.groupby("Subject")
len(subjects)

Result: 3

Get the record number of each group
We can use the size() method to get the record number of each group.

subjects.size()

Subject
Calculus 9
Geometry 9
Statistics 9
dtype: int64

Get the first row of each group
We can use the first() method to get the first row of each group.

subjects.first()

Pandas Groupby

Get the last row of each group
We can use the last() method to get the last row of each group.

subjects.last()

Pandas Groupby

Aggregate

We can apply single function on all columns (if type is supported) except column for group.

df.groupby("Subject").count()
# OR
df.groupby("Subject").agg("count")
# OR
df.groupby("Subject").aggregate("count")

Pandas Groupby

We can also apply different function on each column. A mapping dictionary is needed.
Dictionay format is as follows.
Key: Column name
Value: Aggregation function name

df.groupby("Subject").agg({
    "class": "count", 
    "score": "mean"})

Pandas Groupby

Apply custom function

We can use apply() method to apply our own function on the grouped results.
The apply()’s parameter is the custom function and the parameter of custom function is the DataFrame of each group.
For example, we want to count the number of student whose socre is greater than or equal to 90.
We can define a custom function first and specify it in the apply() method.

def get_level_A_count(df): # df is the DataFrame of each group
    return len(df[df["score"] >= 90])

# Set the custom function as the parameter of apply()
df.groupby("Subject").apply(get_level_A_count)

Pandas Groupby

Conclusion

We talked about the basic grouping way for single column (groupby()) and multiple columns(groupby([])).
And apply general aggregation functions. (mean, min, max and etc.)
We use get_group() to get a specified group.
We iterate through the grouped results using the general python way for … in
We talked about aggreation way on groupby results.
Finally, we talked about how to use apply() to apply custom functions on the groupby results.

If you want to learn more about aggregation in Pandas, you can read the article below.

Pandas » usage of agg(), aggregate()
https://thats-it-code.com/pandas/pandas__usage-of-agg-aggregate/

That's it ! Code Snippets

Pandas >> GroupBy

Table of Contents

Prepare data

Basic

Aggregate

Apply custom function

Conclusion

Contact us

Pandas >> GroupBy

Table of Contents

Prepare data

Basic

Aggregate

Apply custom function

Conclusion

Subscribe and be the FIRST reader of our latest articles

Contact us