Pandas >> GroupBy
Table of Contents
In this article, we will talk about how to use GroupBy in Pandas.
Sometimes, we need to group the rows into buckets to execute some operations like aggregation.
Prepare data
We need to prepare some sample data first for demonstration.
You can download the sample data file below.
Download the CSV file below
The data structure is as follows.
column | summary |
---|---|
id | student’s ID |
first_name | student’s first name |
last_name | student’s last name |
birth | student’s birthday |
gender | student’s gender |
class | student’s class |
subject | the subjects that students are learning |
score | student’s scores of subjects |
Read the CSV file
import pandas as pd
df = pd.read_csv("data/student_scores.csv", header=0)
df
Basic
We can use the groupby() method of DataFrame to group data.
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
-
groupby()
The by parameter can accept one column or multiple columns.
For example, we can group by one column.
df.groupby("Subject")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002365393E800>
df.groupby(["class", "Subject"])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002365393D750>
-
General operation on DataFrameGroupBy object
The groupby() method will return a DataFrameGroupBy object.
DataFrameGroupBy object can do these operations as follow.
For example, we can apply some general aggregation operations on DataFrameGroupBy object like mean(), max(), min(), count() and etc.
The statistics functions will be applied on all the numeric columns automatically.
In the DataFrame, there are two numeric columns: id and score.
df.groupby("Subject").mean()
As you can see, the id column is also treated as number type. This is not we want to see. We can extract one column as SeriesGroupBy or multiple columns as DataFrameGroupBy.
For example, we can extract the score column only as SeriesGroupBy and get its mean value by using mean() method.
df.groupby("Subject")["score"].mean()
Actually, we don’t want to treat the id column as number. So we need to cast the id column in advance.
df["id"] = df["id"].astype(str)
df.groupby("Subject").mean()
As you can see, only the results of the Score column are output this time.
- Get a group by name
We can use get_group() method to get a specified group.
The parameter is one value of the group column.
The result will return a DataFrame with whose group column values are all the specified parameter.
df.groupby("Subject").get_group("Geometry")
- Iterate the results of groupby()
We can iterate through the results of groupby() in the general Python loop way.
For each loop, we can get the group name and corresponding DataFrames.
for name, group in df.groupby("Subject"):
print(f"name: {name}")
print(f"group: {group}")
print(f"------------------------------------------")
We can also iterate through the groups by using the groups attribute of the groupby result.
The structure of groups attribute is a dict as follows.
Key: Group name
Value: Index objects storing the row index in the original DataFrame.
For example, the first element of df.groupby(“Subject”).groups will be as below.
‘Calculus’: Int64Index([0, 3, 6, 9, 12, 15, 18, 21, 24])
for name, group in df.groupby("Subject").groups.items():
print(f"name: {name}")
print(f"group: {df.loc[group]}")
print(f"------------------------------------------")
- Get the the number of groups
We can use the len() method to get the number of groups.
subjects = df.groupby("Subject")
len(subjects)
Result: 3
- Get the record number of each group
We can use the size() method to get the record number of each group.
subjects.size()
Subject
Calculus 9
Geometry 9
Statistics 9
dtype: int64
- Get the first row of each group
We can use the first() method to get the first row of each group.
subjects.first()
- Get the last row of each group
We can use the last() method to get the last row of each group.
subjects.last()
Aggregate
We can apply single function on all columns (if type is supported) except column for group.
df.groupby("Subject").count()
# OR
df.groupby("Subject").agg("count")
# OR
df.groupby("Subject").aggregate("count")
We can also apply different function on each column. A mapping dictionary is needed.
Dictionay format is as follows.
Key: Column name
Value: Aggregation function name
df.groupby("Subject").agg({
"class": "count",
"score": "mean"})
Apply custom function
We can use apply() method to apply our own function on the grouped results.
The apply()’s parameter is the custom function and the parameter of custom function is the DataFrame of each group.
For example, we want to count the number of student whose socre is greater than or equal to 90.
We can define a custom function first and specify it in the apply() method.
def get_level_A_count(df): # df is the DataFrame of each group
return len(df[df["score"] >= 90])
# Set the custom function as the parameter of apply()
df.groupby("Subject").apply(get_level_A_count)
Conclusion
- We talked about the basic grouping way for single column (groupby()) and multiple columns(groupby([])).
- And apply general aggregation functions. (mean, min, max and etc.)
- We use get_group() to get a specified group.
- We iterate through the grouped results using the general python way for … in
- We talked about aggreation way on groupby results.
- Finally, we talked about how to use apply() to apply custom functions on the groupby results.
If you want to learn more about aggregation in Pandas, you can read the article below.
Pandas » usage of agg(), aggregate()
https://thats-it-code.com/pandas/pandas__usage-of-agg-aggregate/