Pandas: Fast or Slow?
Pandas is a widely used package for data analysis. A clear understanding of when pandas is fast and when it is slow helps in using the package well.
Pandas is fast
Because pandas is built on NumPy, it is usually faster than plain Python code.
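As a rough illustration of this claim, the sketch below sums one column with an explicit Python loop and with the vectorized `Series.sum`. The data, the column name `A` and the helper names are made up for the example, and the timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 random floats in a single column "A".
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.random(100_000)})
values_list = df["A"].tolist()


def python_sum(values):
    """Sum with an explicit Python-level loop."""
    total = 0.0
    for v in values:
        total += v
    return total


def pandas_sum(frame):
    """Sum with the vectorized Series method."""
    return frame["A"].sum()


# Time five runs of each approach.
t_python = timeit.timeit(lambda: python_sum(values_list), number=5)
t_pandas = timeit.timeit(lambda: pandas_sum(df), number=5)
print(f"python loop: {t_python:.4f}s, pandas sum: {t_pandas:.4f}s")
```

Both functions return the same total up to floating-point rounding; only the time taken differs.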
Pandas is slow
Pandas is fairly fast compared with plain Python; compared with NumPy, however, it can be slow. A detailed comparison can be found in [1]. Here I’d like to cite some key findings from [1]:
df.iterrows is slower than iterating over df.values [1][2]
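def iterrows_function(df):
    # Yields (index, Series) pairs; a new Series is built for every row.
    for index, row in df.iterrows():
        pass

def itertuples_function(df):
    # Yields one namedtuple per row.
    for row in df.itertuples():
        pass

def df_values(df):
    # Iterates over the rows of the underlying NumPy array.
    for row in df.values:
        pass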
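To try the comparison yourself, here is a minimal timing harness around the three functions above. The DataFrame shape, the column names and the `best_time` helper are assumptions made for this sketch, not part of the benchmark in [1].

```python
import timeit

import numpy as np
import pandas as pd


def iterrows_function(df):
    for index, row in df.iterrows():
        pass


def itertuples_function(df):
    for row in df.itertuples():
        pass


def df_values(df):
    for row in df.values:
        pass


def best_time(func, frame, repeats=3):
    """Best wall-clock time of `repeats` single runs (hypothetical helper)."""
    return min(timeit.repeat(lambda: func(frame), number=1, repeat=repeats))


# Hypothetical test data: 5,000 rows and 4 float columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5_000, 4)), columns=list("ABCD"))

timings = {
    "iterrows": best_time(iterrows_function, df),
    "itertuples": best_time(itertuples_function, df),
    "values": best_time(df_values, df),
}
for name, seconds in timings.items():
    print(f"{name:>10}: {seconds:.5f}s")
```

Taking the minimum over a few repeats is a common way to reduce noise from other processes; the absolute numbers depend on the machine and on the pandas version.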
This is further confirmed by the benchmarks in [2].
Why?
From [2]: "The first approach, [sum_square(row[0], row[1]) for _, row in df.iterrows()], uses list comprehension along with the method iterrows, and is the slowest by a long shot. This is because it is effectively using a simple for loop and incurring the heavy overhead of using the pandas Series object in each iteration. It is rarely necessary to use a Series object in your transformation function, so you should almost never use the iterrows method."
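The quoted point can be checked with a small sketch. The body of `sum_square` is not shown in this post, so the definition below is an assumption, as is the random test data. The same function is applied once per row through iterrows and once to whole columns.

```python
import numpy as np
import pandas as pd


def sum_square(a, b):
    # Assumed definition; the original body of sum_square is not shown here.
    return (a + b) ** 2


# Hypothetical data: two integer columns, labelled 0 and 1 by default.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(1_000, 2)))

# Row by row: every iteration constructs a pandas Series for the row.
slow = [sum_square(row[0], row[1]) for _, row in df.iterrows()]

# Whole columns at once: a single vectorized call.
fast = sum_square(df[0], df[1])

# Each row handed out by iterrows really is a Series object.
_, first_row = next(df.iterrows())
print(type(first_row))
print(slow[:5], fast.head().tolist())
```

Both approaches give the same values; the vectorized call simply avoids creating a Series object a thousand times.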
NumPy sort is faster than pandas sort [1]
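import numpy as np

def pandas_sort(df):
    return df["A"].sort_values()  # pandas sort on the Series

def numpy_sort(df):
    return np.sort(df["A"])  # NumPy sort applied to the Series

def numpy_values_sort(df):
    return np.sort(df["A"].values)  # NumPy sort on the underlying array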
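A short, hedged harness for the three sort functions: it first checks that they produce the same ordering and then times them. The column `A` follows the snippet above; the data size and the random seed are assumptions for this sketch.

```python
import timeit

import numpy as np
import pandas as pd


def pandas_sort(df):
    return df["A"].sort_values()


def numpy_sort(df):
    return np.sort(df["A"])


def numpy_values_sort(df):
    return np.sort(df["A"].values)


# Hypothetical data: 100,000 random floats, no missing values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.random(100_000)})

# All three variants should agree on the sorted values.
reference = numpy_values_sort(df)
same_as_pandas = np.array_equal(pandas_sort(df).to_numpy(), reference)
same_as_numpy = np.array_equal(np.asarray(numpy_sort(df)), reference)
print("results agree:", same_as_pandas and same_as_numpy)

# Time ten runs of each variant.
for func in (pandas_sort, numpy_sort, numpy_values_sort):
    seconds = timeit.timeit(lambda: func(df), number=10)
    print(f"{func.__name__:>18}: {seconds:.4f}s")
```

Note that `pandas_sort` returns a Series that keeps the original index labels, while the NumPy variants return a plain array, so part of the gap is extra bookkeeping rather than the sort itself.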
Why is NumPy so fast? [3]
According to [3], when simply adding two lists element by element, NumPy's running time appears to stay nearly constant, as if it were O(1), whereas a regular Python loop has O(n) complexity.
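The experiment can be approximated as follows. This is a sketch with made-up sizes, not the benchmark from [3]. Strictly speaking, the NumPy addition still touches every element, so it is O(n) with a much smaller constant factor; at these sizes the growth is just hard to see next to the Python loop.

```python
import timeit

import numpy as np


def loop_add(a, b):
    """Element-wise addition of two Python lists."""
    return [x + y for x, y in zip(a, b)]


def numpy_add(a, b):
    """Element-wise addition of two NumPy arrays."""
    return a + b


# Hypothetical sizes; larger values make the trend clearer but take longer.
for n in (1_000, 10_000, 100_000):
    list_a, list_b = list(range(n)), list(range(n))
    arr_a, arr_b = np.arange(n), np.arange(n)
    t_loop = timeit.timeit(lambda: loop_add(list_a, list_b), number=10)
    t_numpy = timeit.timeit(lambda: numpy_add(arr_a, arr_b), number=10)
    print(f"n={n:>7}: loop {t_loop:.5f}s, numpy {t_numpy:.5f}s")
```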
Some of the obvious answers include:
NumPy is primarily written in C, which is faster than Python.
NumPy arrays are homogeneous (all array elements have a fixed data type such as np.float32 or np.uint8, whereas Python lists have no such restriction), thus allowing numbers to be stored in contiguous memory locations for faster access (exploiting locality of reference).
Still, the above reasons aren’t enough to explain how the processing time doesn’t scale with array size. [3]
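The homogeneity and contiguity mentioned above can be inspected directly on an array. A small sketch, using an arbitrary five-element float32 array:

```python
import numpy as np

# A small array with one fixed element type.
arr = np.arange(5, dtype=np.float32)

print(arr.dtype)                  # every element shares this type
print(arr.itemsize)               # bytes per element
print(arr.strides)                # bytes to step to the next element
print(arr.flags["C_CONTIGUOUS"])  # stored as one contiguous block
print(arr.nbytes)                 # total size of the data buffer

# A Python list, by contrast, may mix arbitrary object types.
mixed = [1, 2.5, "three"]
print([type(x).__name__ for x in mixed])
```

Because every element occupies the same 4 bytes, the position of element i is just an offset of 4 * i bytes into the buffer, which is what makes contiguous, cache-friendly access possible.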
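The secret is SIMD, short for single instruction, multiple data. In short: if the work can be done in parallel, do it in parallel; if not, do it sequentially. This explains why the addition appears to run in O(1) rather than O(n) in the measurements of [3]. Strictly, it is still O(n), but several elements are processed per instruction, so the constant factor is much smaller. Details are below: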
1. Identify whether it’s a reduce operation, i.e. combining the input arguments and returning a single aggregated result (which it’s not in our case).
2. Try adding them using run_binary_simd_add_FLOAT() (‘binary’ here means an operation on 2 inputs, which can be arrays, scalars, or a combination of both).
3. If the call to the simd_add function fails, then it uses a standard loopy element-wise addition. [3]
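The three steps can be mimicked in plain Python to make the control flow concrete. This is only an illustration of the try-fast-path-then-fall-back pattern, not NumPy's actual C implementation: `try_simd_add`, `add_arrays` and the eligibility rule (contiguous float32 arrays of equal shape) are hypothetical names and assumptions chosen for this sketch, and it handles 1-D arrays only.

```python
import numpy as np


def try_simd_add(a, b, out):
    """Hypothetical fast path: returns True if it handled the addition."""
    # Assumption for illustration: the fast path only accepts contiguous
    # float32 arrays of identical shape.
    eligible = (
        a.dtype == np.float32
        and b.dtype == np.float32
        and a.shape == b.shape
        and a.flags["C_CONTIGUOUS"]
        and b.flags["C_CONTIGUOUS"]
    )
    if eligible:
        np.add(a, b, out=out)  # stands in for the SIMD kernel
    return eligible


def add_arrays(a, b, is_reduce=False):
    """Return (result, name of the path taken) for 1-D inputs."""
    # Step 1: a reduce operation is routed elsewhere.
    if is_reduce:
        return np.add.reduce(a), "reduce"
    out = np.empty(a.shape, dtype=np.result_type(a, b))
    # Step 2: try the fast path first.
    if try_simd_add(a, b, out):
        return out, "simd"
    # Step 3: fall back to a plain element-wise loop.
    for i in range(a.shape[0]):
        out[i] = a[i] + b[i]
    return out, "loop"


a32 = np.array([1, 2, 3], dtype=np.float32)
b32 = np.array([4, 5, 6], dtype=np.float32)
print(add_arrays(a32, b32))

a64 = a32.astype(np.float64)
b64 = b32.astype(np.float64)
print(add_arrays(a64, b64))
```

The float32 inputs take the fast path, while the float64 inputs fall through to the loop; both return the same sums.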
References
[1] https://github.com/mm-mansour/Fast-Pandas
[2] https://towardsdatascience.com/how-to-make-your-pandas-operation-100x-faster-81ebcd09265c
[4] https://medium.com/geekculture/simple-tricks-to-speed-up-pandas-by-100x-3b7e705783a8
[5] https://medium.com/@tommerrissin/is-pandas-really-that-slow-cff4352e4f58
[6] Sofia Heisler, "No More Sad Pandas: Optimizing Pandas Code for Speed and Efficiency", PyCon 2017