Pandas >> Optimization
In this article, let’s go through some optimization techniques in Pandas together.
Data preparation
In order to see a measurable difference after optimization, we need to prepare enough data.
Here, we create a DataFrame with 100,000 rows.
import pandas as pd
df = pd.DataFrame(
    [
        ["123", "abc", "123456.78", "2022-01-02 01:02:01"],
        ["234", "def", "223456.78", "2022-02-03 02:03:02"],
        ["567", "ghi", "323456.78", "2022-03-04 06:04:03"],
        ["890", "jkl", "423456.78", "2022-04-11 12:05:04"],
        ["123", "mno", "523456.78", "2022-05-25 19:06:05"],
    ] * 20000,
    columns=["col_1", "col_2", "col_3", "col_4"],
)
len(df)
100000
We generated 100 thousand rows.
Memory
Firstly, let’s go through some memory-related optimization techniques.
All the columns of the data prepared above are strings.
Memory usage: 25.0MB.
df.info(memory_usage='deep')
string -> datetime
Let’s convert col_4 to the datetime type (Timestamp).
Memory usage: 18.5MB
Reduction: 6.5MB(26%)↓.
df["col_4"] = pd.to_datetime(df["col_4"])
df.info(memory_usage='deep')
string -> number
Next, let’s convert col_1 to integer, and convert col_3 to float.
Memory usage: 7.1MB
Reduction: 17.9MB(71.6%)↓.
import numpy as np
df = df.astype({"col_1": np.int16, "col_3": np.float32})
df.info(memory_usage='deep')
string -> category
We can convert columns that have only a small number of distinct values to the category type.
Memory usage: 1.4MB
Reduction: 23.6MB(94.4%)↓.
df["col_2"] = df["col_2"].astype("category")
df.info(memory_usage='deep')
To recap, the memory size is reduced by 94.4% by converting data types.
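To see where the remaining memory goes, we can also check the per-column usage; a quick sketch using memory_usage(deep=True), which reports the size of each column in bytes:
# Per-column memory usage in bytes (deep=True also counts the contents of object columns)
df.memory_usage(deep=True)
# Total usage in megabytes
df.memory_usage(deep=True).sum() / 1024 ** 2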
For more about data type conversions, you can refer to the article below.
Pandas » Data Type Conversions
https://thats-it-code.com/pandas/pandas__data-type-conversions/
Computing
Suppose we want to add col_1 and col_3 and assign the result to a new column named total.
We can achieve this in the following 5 ways.
- General Row-by-Row Loop
- iterrows()
- itertuples()
- apply()
- vectorization
General Row-by-Row Loop
Firstly, let’s loop the data row by row and add col_1 and col_3.
# Copy the DataFrame to keep the original DataFrame unchanged
temp_df = df.copy()
# List to store the computing results.
new_col = []
# Calculating execution time
%%timeit
for i in range(len(temp_df)):
    # Save the computing result of each row to a list
    new_col.append(temp_df.loc[i, "col_1"] + temp_df.loc[i, "col_3"])
# Assign the result to a new column named total
temp_df["total"] = pd.Series(new_col)
1.52 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
iterrows()
Next, let’s use the iterrows() method of DataFrame to loop the data.
iterrows() yields the row index and the row as a Series object on each iteration.
# Copy the DataFrame to keep the original DataFrame unchanged
temp_df = df.copy()
# List to store the computing results.
new_col = []
%%timeit
for i, row in temp_df.iterrows():
    new_col.append(row["col_1"] + row["col_3"])
temp_df["total"] = pd.Series(new_col)
2.88 s ± 36.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
itertuples()
Next, let’s use the itertuples() method of DataFrame to loop the data.
itertuples() will iterate over DataFrame rows as namedtuples.
# Copy the DataFrame to keep the original DataFrame unchanged
temp_df = df.copy()
# List to store the computing results.
new_col = []
%%timeit
# By default, index=True -> for index, row in ...:
for row in temp_df.itertuples(index=False):
    new_col.append(row[0] + row[2])
temp_df["total"] = pd.Series(new_col)
426 ms ± 138 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
apply()
We can also use the apply() method of DataFrame to reach our goal.
# The function definition to be applied
def print_col(row):
    return row["col_1"] + row["col_3"]
# Copy the DataFrame to keep the original DataFrame unchanged
temp_df = df.copy()
%%timeit
# axis=1 means the function is applied to each row
temp_df["total"] = temp_df.apply(print_col, axis=1)
817 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vectorization
Finally, let’s add the two columns with the built-in vectorized operation.
# Copy the DataFrame to keep the original DataFrame unchanged
temp_df = df.copy()
%%timeit
# Use the built-in adding operation
temp_df["total"] = temp_df["col_1"] + temp_df["col_3"]
211 µs ± 3.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Let’s summarize all the results.
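Method                    Time per loop
------------------------  ----------------
General row-by-row loop   1.52 s ± 17.7 ms
iterrows()                2.88 s ± 36.7 ms
itertuples()              426 ms ± 138 ms
apply()                   817 ms ± 2.23 ms
vectorization             211 µs ± 3.7 µs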
As the test results show, the fastest approach is vectorization, so we should use the built-in vectorized operations as much as possible.
If we have to loop row by row, itertuples() should be the first choice.
itertuples() is much faster than iterrows(); in this test it is almost 7 times faster.
apply() is slower than itertuples(). The general row-by-row loop and iterrows() are very slow, and we should avoid them.
I/O
Next, let’s compare the I/O performance of various file formats.
To compare the writing and reading performance, we create a DataFrame with 1,000,000 rows and 50 columns.
df = pd.DataFrame(
    [
        ["abc123"] * 50,
    ] * 1000000,
    columns=[f"col_{i}" for i in range(1, 51)],
)
Writing files
Let’s compare the writing performance first.
To write the h5 file, we need to install the tables library.
pip install tables
To write the parquet file, we need to install pyarrow or fastparquet.
Here we will install the pyarrow library.
pip install pyarrow
Well, let’s start comparing.
%%timeit
df.to_csv("data/100000rows-data.csv", index=False)
11.7 s ± 190 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df.to_pickle("data/100000rows-data.pkl")
876 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df.to_hdf("data/100000rows-data.h5", key='df', mode='w')
1.94 s ± 93.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df.to_parquet('data/100000rows-data.parquet.gzip', compression='gzip')
4.24 s ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df.to_feather('data/100000rows-data.feather')
2.13 s ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Let’s put the writing results side by side.
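Format            Write time
----------------  -----------------
CSV               11.7 s ± 190 ms
pickle            876 ms ± 20.5 ms
HDF5 (.h5)        1.94 s ± 93.6 ms
parquet (gzip)    4.24 s ± 19.4 ms
feather           2.13 s ± 14.8 ms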
We can see that the CSV format takes the longest time and pickle takes the shortest.
So if the output file format doesn’t matter, for example for intermediate results, we should use pickle or HDF5 (.h5) to speed up the writing.
Reading files
Similarly, let’s compare the reading performance of the 5 file formats mentioned above.
%%timeit
pd.read_csv("data/100000rows-data.csv", header=0)
3.63 s ± 184 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
pd.read_pickle("data/100000rows-data.pkl")
1.55 s ± 65.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
pd.read_hdf("data/100000rows-data.h5")
1.91 s ± 70.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
pd.read_parquet('data/100000rows-data.parquet.gzip')
2.61 s ± 56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
pd.read_feather('data/100000rows-data.feather')
1.38 s ± 55.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Let’s also compare the reading results side by side.
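Format            Read time
----------------  -----------------
CSV               3.63 s ± 184 ms
pickle            1.55 s ± 65.7 ms
HDF5 (.h5)        1.91 s ± 70.4 ms
parquet (gzip)    2.61 s ± 56 ms
feather           1.38 s ± 55.6 ms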
Pickle and feather again perform best (in this test, feather is even slightly faster than pickle for reading), while CSV is again the slowest.
So after reading the original file (such as a CSV), let’s store intermediate results in pickle whenever possible to speed up subsequent reads and writes.
Other techniques
- Like SQL, filter the data first and then merge (join) it (a sketch follows this list)
- Use the chunksize parameter in read_csv() to read a large CSV file (also shown in the sketch below)
- Use multiprocessing.Pool to implement parallel processing
- Use Numba
  Numba allows you to write a pure Python function which can be JIT compiled to native machine instructions, similar in performance to C, C++ and Fortran, by decorating your function with @jit.
  https://pandas.pydata.org/docs/user_guide/enhancingperf.html#numba-jit-compilation
- Use Dask to process large datasets in parallel
  A Dask DataFrame is a large parallel DataFrame composed of many smaller pandas DataFrames, split along the index.
  https://docs.dask.org/en/stable/dataframe.html
- Use the Pandas API on Spark to run pandas code on a Spark cluster
  https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html
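To make the first two points concrete, here is a minimal sketch. The orders/customers files, their columns, and the filter conditions are hypothetical examples; the chunked read reuses the CSV file written earlier.
import pandas as pd

# Hypothetical files and columns, for illustration only
orders = pd.read_csv("data/orders.csv")        # columns: id, amount, status
customers = pd.read_csv("data/customers.csv")  # columns: id, name, country

# Filter first, then merge: joining two smaller DataFrames is much cheaper
open_orders = orders[orders["status"] == "open"]
jp_customers = customers[customers["country"] == "JP"]
merged = open_orders.merge(jp_customers, on="id", how="inner")

# Read a large CSV in chunks instead of loading it all at once
chunks = []
for chunk in pd.read_csv("data/100000rows-data.csv", chunksize=100_000):
    # Process each chunk (filter, aggregate, ...) and keep only what is needed
    chunks.append(chunk[chunk["col_1"] == "abc123"])
result = pd.concat(chunks, ignore_index=True)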
Conclusion
- We can convert object columns to more suitable types to reduce the memory usage of a DataFrame significantly.
- Let’s use vectorization as much as possible.
- Let’s use pickle whenever possible if the file format doesn’t matter.
- There are many other techniques to improve performance, such as Numba, Dask, and the Pandas API on Spark.