Pandas >> GroupBy

2022-04-14 Pandas

Table of Contents

In this article, we will talk about how to use GroupBy in Pandas.
Sometimes, we need to group the rows into buckets to execute some operations like aggregation.

Pandas GroupBy

Prepare data

We need to prepare some sample data first for demonstration.
You can download the sample data file below.

Download the CSV file below

Download student_scores.csv

The data structure is as follows.

column summary
id student’s ID
first_name student’s first name
last_name student’s last name
birth student’s birthday
gender student’s gender
class student’s class
subject the subjects that students are learning
score student’s scores of subjects

Read the CSV file

import pandas as pd

df = pd.read_csv("data/student_scores.csv", header=0)
df

Pandas Groupby

Basic

We can use the groupby() method of DataFrame to group data.

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

  • groupby()

    The by parameter can accept one column or multiple columns.
    For example, we can group by one column.

df.groupby("Subject")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002365393E800>

df.groupby(["class", "Subject"])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002365393D750>

  • General operation on DataFrameGroupBy object

    The groupby() method will return a DataFrameGroupBy object.
    DataFrameGroupBy object can do these operations as follow.
    For example, we can apply some general aggregation operations on DataFrameGroupBy object like mean(), max(), min(), count() and etc.
    The statistics functions will be applied on all the numeric columns automatically.

In the DataFrame, there are two numeric columns: id and score.

df.groupby("Subject").mean()

Pandas Groupby

As you can see, the id column is also treated as number type. This is not we want to see. We can extract one column as SeriesGroupBy or multiple columns as DataFrameGroupBy.

For example, we can extract the score column only as SeriesGroupBy and get its mean value by using mean() method.

df.groupby("Subject")["score"].mean()

Pandas Groupby

Actually, we don’t want to treat the id column as number. So we need to cast the id column in advance.

df["id"] = df["id"].astype(str)
df.groupby("Subject").mean()

As you can see, only the results of the Score column are output this time.
Pandas Groupby

  • Get a group by name
    We can use get_group() method to get a specified group.
    The parameter is one value of the group column.
    The result will return a DataFrame with whose group column values are all the specified parameter.
df.groupby("Subject").get_group("Geometry")

Pandas Groupby

  • Iterate the results of groupby()
    We can iterate through the results of groupby() in the general Python loop way.
    For each loop, we can get the group name and corresponding DataFrames.
for name, group in df.groupby("Subject"):
    print(f"name: {name}")
    print(f"group: {group}")
    print(f"------------------------------------------")

Pandas Groupby

We can also iterate through the groups by using the groups attribute of the groupby result.
The structure of groups attribute is a dict as follows.
Key: Group name
Value: Index objects storing the row index in the original DataFrame.
For example, the first element of df.groupby(“Subject”).groups will be as below.

‘Calculus’: Int64Index([0, 3, 6, 9, 12, 15, 18, 21, 24])

for name, group in df.groupby("Subject").groups.items():
    print(f"name: {name}")
    print(f"group: {df.loc[group]}")
    print(f"------------------------------------------")

Pandas Groupby

  • Get the the number of groups
    We can use the len() method to get the number of groups.
subjects = df.groupby("Subject")
len(subjects)

Result: 3

  • Get the record number of each group
    We can use the size() method to get the record number of each group.
subjects.size()

Subject
Calculus 9
Geometry 9
Statistics 9
dtype: int64

  • Get the first row of each group
    We can use the first() method to get the first row of each group.
subjects.first()

Pandas Groupby

  • Get the last row of each group
    We can use the last() method to get the last row of each group.
subjects.last()

Pandas Groupby

Aggregate

We can apply single function on all columns (if type is supported) except column for group.

df.groupby("Subject").count()
# OR
df.groupby("Subject").agg("count")
# OR
df.groupby("Subject").aggregate("count")

Pandas Groupby

We can also apply different function on each column. A mapping dictionary is needed.
Dictionay format is as follows.
Key: Column name
Value: Aggregation function name

df.groupby("Subject").agg({
    "class": "count", 
    "score": "mean"})

Pandas Groupby

Apply custom function

We can use apply() method to apply our own function on the grouped results.
The apply()’s parameter is the custom function and the parameter of custom function is the DataFrame of each group.
For example, we want to count the number of student whose socre is greater than or equal to 90.
We can define a custom function first and specify it in the apply() method.

def get_level_A_count(df): # df is the DataFrame of each group
    return len(df[df["score"] >= 90])

# Set the custom function as the parameter of apply()
df.groupby("Subject").apply(get_level_A_count)

Pandas Groupby

Conclusion

  • We talked about the basic grouping way for single column (groupby()) and multiple columns(groupby([])).
  • And apply general aggregation functions. (mean, min, max and etc.)
  • We use get_group() to get a specified group.
  • We iterate through the grouped results using the general python way for … in
  • We talked about aggreation way on groupby results.
  • Finally, we talked about how to use apply() to apply custom functions on the groupby results.

If you want to learn more about aggregation in Pandas, you can read the article below.

Pandas » usage of agg(), aggregate()
https://thats-it-code.com/pandas/pandas__usage-of-agg-aggregate/

Subscribe and be the FIRST reader of our latest articles

* indicates required

Contact us