Pandas >> Optimization

2022-04-27 Pandas


In this article, let’s go through some optimization techniques in Pandas together.


Data preparation

To make the effect of each optimization clearly measurable, we need to prepare enough data.
Here, we create a DataFrame with 100,000 rows.

import pandas as pd

df = pd.DataFrame(
    [
        ["123", "abc", "123456.78", "2022-01-02 01:02:01"],
        ["234", "def", "223456.78", "2022-02-03 02:03:02"],
        ["567", "ghi", "323456.78", "2022-03-04 06:04:03"],
        ["890", "jkl", "423456.78", "2022-04-11 12:05:04"],
        ["123", "mno", "523456.78", "2022-05-25 19:06:05"],
    ] * 20000,
    columns=["col_1", "col_2", "col_3", "col_4"]
)
len(df)

100000
We generated 100,000 rows.

Memory

Firstly, let’s go through some memory-related optimization techniques.
All the columns of the data prepared above are strings.
Memory usage:  25.0MB.

df.info(memory_usage='deep')


string -> datetime

Let’s convert col_4 to the datetime type (Timestamp).
Memory usage:  18.5MB  
Reduction:  6.5MB(26%)↓.

df["col_4"] = pd.to_datetime(df["col_4"])
df.info(memory_usage='deep')


string -> number

Next, let’s convert col_1 to integer, and convert col_3 to float.
Memory usage:  7.1MB  
Reduction:  17.9MB(71.6%)↓.

import numpy as np
df = df.astype({"col_1": np.int16, "col_3": np.float32})

df.info(memory_usage='deep')


string -> category

We can convert columns with a small number of distinct values to the category type.
Memory usage:  1.4MB  
Reduction:  23.6MB(94.4%)↓.

df["col_2"] = df["col_2"].astype("category")
df.info(memory_usage='deep')


To recap, the memory size is reduced by 94.4% by converting data types.

For more about data type conversions, you can reference the article below.

Pandas » Data Type Conversions
https://thats-it-code.com/pandas/pandas__data-type-conversions/
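
As a recap, the three conversions above can also be bundled into a single helper.
The sketch below is only an illustration: the function name optimize_dtypes is not from this article, and it assumes it is applied to the original all-string DataFrame from the data preparation step.

import numpy as np
import pandas as pd

def optimize_dtypes(df):
    out = df.copy()
    # Parse the timestamp strings into datetime64 values
    out["col_4"] = pd.to_datetime(out["col_4"])
    # Downcast the numeric string columns to compact numeric types
    out = out.astype({"col_1": np.int16, "col_3": np.float32})
    # Encode the low-cardinality text column as a category
    out["col_2"] = out["col_2"].astype("category")
    return out

optimized = optimize_dtypes(df)
optimized.info(memory_usage='deep')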

Computing

For example, we want to add up col_1 and col_3 and assign the result to a new column named total.
We can achieve this in the following 5 ways.

  • General Row-by-Row Loop
  • iterrows()
  • itertuples()
  • apply()
  • vectorization

General Row-by-Row Loop

Firstly, let’s loop over the data row by row and add col_1 and col_3.
Note that each %%timeit below runs in its own notebook cell, so only the code in that cell is timed.

# Copy the DataFrame to keep the original DataFrame unchanged
temp_df = df.copy()
# List to store the computing results.  
new_col = []
%%timeit
# Measure the execution time of the row-by-row loop

for i in range(len(temp_df)):
    # Save the computing result of each row to a list
    new_col.append(temp_df.loc[i, "col_1"] + temp_df.loc[i, "col_3"])
# Assign the result to a new column named total  
temp_df["total"] = pd.Series(new_col)

1.52 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

iterrows()

Next, let’s use the iterrows() method of DataFrame to loop the data.
iterrows() yields the row index and the row as a Series object on each iteration.

# Copy the DataFrame to keep the original DataFrame unchanged
temp_df = df.copy()
# List to store the computing results.  
new_col = []
%%timeit

for i, row in temp_df.iterrows():
    new_col.append(row["col_1"] + row["col_3"]) 
temp_df["total"] = pd.Series(new_col)

2.88 s ± 36.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

itertuples()

Next, let’s use the itertuples() method of DataFrame to loop the data.
itertuples() will iterate over DataFrame rows as namedtuples.

# Copy the DataFrame to keep the original DataFrame unchanged
temp_df = df.copy()
# List to store the computing results.  
new_col = []
%%timeit

# With the default index=True, the first element of each namedtuple is the index;
# index=False gives only the column values, so row[0] is col_1 and row[2] is col_3.
for row in temp_df.itertuples(index=False):
    new_col.append(row[0] + row[2])
temp_df["total"] = pd.Series(new_col)

426 ms ± 138 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

apply()

We can also use the apply() method of DataFrame to reach our goal.

# The function to be applied to each row
def add_cols(row):
    return row["col_1"] + row["col_3"]

# Copy the DataFrame to keep the original DataFrame unchanged
temp_df = df.copy()
%%timeit
# axis=1 means the function is applied to each row.
temp_df["total"] = temp_df.apply(add_cols, axis=1)

817 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

vectorization

Finally, let’s add the two columns directly with the built-in vectorized operation.

# Copy the DataFrame to keep the original DataFrame unchanged
temp_df = df.copy()
%%timeit
# Use the built-in adding operation
temp_df["total"] = temp_df["col_1"] + temp_df["col_3"]

211 µs ± 3.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Let’s summarize all the results.
As the test results show, the fastest way is vectorization, so let’s use the built-in operations as much as possible.

If we have to loop row by row, itertuples() should be the first choice.
itertuples() is much faster than iterrows(); in this test, it is almost 7 times faster.

apply() is slower than itertuples(). The general row-by-row loop and iterrows() are very slow, and we should avoid them.
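
To make the comparison easier to scan, here is a purely illustrative summary built from the mean timings reported above (your numbers will vary from run to run).

# Mean execution times reported above, in seconds (illustrative)
timings = pd.Series(
    {
        "row-by-row loop": 1.52,
        "iterrows()": 2.88,
        "itertuples()": 0.426,
        "apply()": 0.817,
        "vectorization": 0.000211,
    },
    name="seconds",
)
print(timings.sort_values())
# Relative slowdown compared to vectorization
print((timings / timings.min()).round(1).sort_values())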


I/O

Next, let’s compare the I/O performance of various file formats.
To compare the writing and reading performance, we create a DataFrame with 1,000,000 rows and 50 columns.

df = pd.DataFrame(
    [
        ["abc123"]*50,
    ] * 1000000,
    columns=[f"col_{i}" for i in range(1, 51)]
)

Writing files

Let’s compare the writing performance first.
To write the h5 file, we need to install the tables library.

pip install tables

To write the parquet file, we need to install pyarrow or fastparquet.
Here we will install the pyarrow library.

pip install pyarrow

Well, let’s start comparing.

%%timeit
df.to_csv("data/100000rows-data.csv", index=False)

11.7 s ± 190 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df.to_pickle("data/100000rows-data.pkl")

876 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df.to_hdf("data/100000rows-data.h5", key='df', mode='w')

1.94 s ± 93.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df.to_parquet('data/100000rows-data.parquet.gzip', compression='gzip') 

4.24 s ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df.to_feather('data/100000rows-data.feather')

2.13 s ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Let’s see the results in a chart.
We will find that the CSV format takes the longest time and pickle takes the shortest.
So if the output file format doesn’t matter, such as for intermediate results, we should use pickle or HDF5 to speed up the writing.
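
If you want to reproduce such a chart yourself, a minimal sketch using pandas’ built-in plotting (which requires matplotlib) could look like this; the values are the mean writing times measured above.

import matplotlib.pyplot as plt

# Mean writing times measured above, in seconds
write_times = pd.Series(
    {"csv": 11.7, "pickle": 0.876, "hdf5": 1.94, "parquet (gzip)": 4.24, "feather": 2.13}
)
write_times.sort_values().plot.barh(title="Writing time by file format")
plt.xlabel("seconds")
plt.tight_layout()
plt.show()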

Reading files

Similarly, let’s compare the reading performance of the 5 file formats mentioned above.

%%timeit
pd.read_csv("data/100000rows-data.csv", header=0)

3.63 s ± 184 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.read_pickle("data/100000rows-data.pkl")

1.55 s ± 65.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.read_hdf("data/100000rows-data.h5")

1.91 s ± 70.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.read_parquet('data/100000rows-data.parquet.gzip')

2.61 s ± 56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.read_feather('data/100000rows-data.feather')

1.38 s ± 55.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Let’s also use a chart to compare the results.
For reading, feather is the fastest, with pickle close behind; taking writing and reading together, pickle still performs best overall.
So let’s use pickle whenever possible for reading and writing intermediate results after loading the original file, such as a CSV.
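
In practice, that workflow could look like the minimal sketch below: read the original CSV once, then keep intermediate results in pickle for faster round trips (the intermediate file name is just a placeholder).

# Read the original file once
df = pd.read_csv("data/100000rows-data.csv", header=0)

# ... heavy preprocessing here ...

# Save and reload the intermediate result with the much faster pickle format
df.to_pickle("data/intermediate-result.pkl")    # hypothetical intermediate file
df = pd.read_pickle("data/intermediate-result.pkl")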


Conclusion

  • We can convert object columns to more suitable types to reduce the memory usage of a DataFrame significantly.
  • Let’s use vectorization as much as possible.
  • Let’s use pickle whenever possible if the file format doesn’t matter.
  • There are many more techniques for improving performance.
