Pandas internals

A deeper understanding of pandas

Jimmy (xiaoke) Shen
3 min read · Mar 3, 2023

What is pandas

pandas is a popular Python package for processing data. This article collects some useful tutorials on pandas internals to help data scientists and engineers gain a deeper understanding of the package.

pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way towards this goal. From [1]

Source code[1]

As of 02/07/2023 it has 36.8K stars, pretty good! From [1]

Most of the source code is in Python

Source code distribution From [1]

An example exploring a pandas DataFrame

This example is slightly modified from [2]

Basic checking

>>> import pandas as pd
>>> df = pd.DataFrame({"foo": [1, 3, 7], "bar": [.55, .768, .90], "foobar": [100, 230, 450]})
>>> print(df.dtypes)
foo         int64
bar       float64
foobar      int64
dtype: object
>>> df
   foo    bar  foobar
0    1  0.550     100
1    3  0.768     230
2    7  0.900     450

More advanced checking

>>> df._data
BlockManager
Items: Index(['foo', 'bar', 'foobar'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(1, 2, 1), 1 x 3, dtype: float64
NumericBlock: slice(0, 4, 2), 2 x 3, dtype: int64
>>> df._data.blocks
(NumericBlock: slice(1, 2, 1), 1 x 3, dtype: float64, NumericBlock: slice(0, 4, 2), 2 x 3, dtype: int64)
>>> df._data.blocks[1].values
array([[  1,   3,   7],
       [100, 230, 450]])
>>> type(df._data.blocks[1].values)
<class 'numpy.ndarray'>
>>> df._data.blocks[1].values.data
<memory at 0x7f899145fad0>
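
The block listing above can be mapped back to column names. What follows is a minimal sketch rather than part of the original example. It assumes that newer pandas releases expose the block manager through the private attribute `_mgr` (with `_data` as the older name) and that each block records its column positions in `mgr_locs`. These are private internals, so attribute names and the exact grouping into blocks may differ between versions. The `block_columns` dictionary is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 3, 7], "bar": [0.55, 0.768, 0.90], "foobar": [100, 230, 450]})

# Assumption: `_mgr` is the private block manager in newer pandas,
# `_data` is the older alias used in the transcript above.
mgr = df._mgr if hasattr(df, "_mgr") else df._data

# Hypothetical helper structure: dtype name -> column names stored under that dtype.
block_columns = {}
for block in mgr.blocks:
    # mgr_locs holds the column positions this block covers,
    # e.g. slice(0, 4, 2) means positions 0 and 2.
    positions = block.mgr_locs.as_array
    names = list(df.columns[positions])
    block_columns.setdefault(str(block.dtype), []).extend(names)
    print(block.dtype, block.values.shape, names)
```

Read this way, `slice(0, 4, 2)` in the output above selects positions 0 and 2, that is `foo` and `foobar`, which share the int64 block, while `slice(1, 2, 1)` selects `bar`, which sits alone in the float64 block.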

Even more checking

>>> bytes_ = df._data.blocks[1].values.data.tobytes()
>>> bytes_
b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x00\x00\x00\x00d\x00\x00\x00\x00\x00\x00\x00\xe6\x00\x00\x00\x00\x00\x00\x00\xc2\x01\x00\x00\x00\x00\x00\x00'
>>> print("".join("{:08b}".format(byte) for byte in bytes_))
000000010000000000000000000000000000000000000000000000000000000000000011000000000000000000000000000000000000000000000000000000000000011100000000000000000000000000000000000000000000000000000000011001000000000000000000000000000000000000000000000000000000000011100110000000000000000000000000000000000000000000000000000000001100001000000001000000000000000000000000000000000000000000000000
>>> len(bytes_)
48
>>> df._data.blocks[1].values.strides
(24, 8)

[Figure: how pandas stores data internally. Image from [3]]

For a look at how pandas stores data internally, the figure above gives a very nice illustration.
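
The byte string and strides printed earlier can also be decoded by hand. The sketch below uses a plain NumPy array as a stand-in for the int64 block from the example, and `element_at` is a hypothetical helper introduced only for illustration. With strides of (24, 8), moving one row forward skips 24 bytes (three 8-byte integers) and moving one column forward skips 8 bytes.

```python
import sys
import numpy as np

# Stand-in for df._data.blocks[1].values from the example above.
values = np.array([[1, 3, 7], [100, 230, 450]], dtype=np.int64)
raw = values.tobytes()                   # 6 values * 8 bytes = 48 bytes
row_stride, col_stride = values.strides  # bytes to step per row / per column


def element_at(i, j):
    """Hypothetical helper: read element (i, j) straight from the raw bytes."""
    offset = i * row_stride + j * col_stride
    chunk = raw[offset:offset + values.itemsize]
    # tobytes() uses the machine's native byte order, so decode with the same.
    return int.from_bytes(chunk, byteorder=sys.byteorder, signed=True)


print(values.strides)    # (24, 8)
print(element_at(0, 1))  # 3
print(element_at(1, 2))  # 450
```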

If you want to understand how pandas organizes the data:

Under the hood, pandas groups the columns into blocks of values of the same type. Here’s a preview of how pandas stores the first twelve columns of our dataframe. From [3]

Each type has a specialized class in the pandas.core.internals module. Pandas uses the ObjectBlock class to represent the block containing string columns, and the FloatBlock class to represent the block containing float columns. For blocks representing numeric values like integers and floats, pandas combines the columns and stores them as a NumPy ndarray. The NumPy ndarray is built around a C array, and the values are stored in a contiguous block of memory. Due to this storage scheme, accessing a slice of values is incredibly fast. From [3]
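
One way to see why slicing such a block is cheap: basic slicing of a NumPy ndarray returns a view onto the same memory instead of copying it. The sketch below is an illustration with a plain NumPy array standing in for a consolidated int64 block, and the variable names are chosen here, not taken from pandas. Whether pandas itself returns a view or a copy in a given operation depends on the version and the operation.

```python
import numpy as np

# Stand-in for a consolidated int64 block: one row per DataFrame column.
block = np.array([[1, 3, 7], [100, 230, 450]], dtype=np.int64)

one_column = block[1]     # all values of one DataFrame column
some_rows = block[:, 1:3]  # DataFrame rows 1..2 across both columns

# Both slices reuse the block's memory; nothing is copied.
print(np.shares_memory(block, one_column))  # True
print(np.shares_memory(block, some_rows))   # True

# A single column is one contiguous run of bytes; the row slice is strided.
print(one_column.flags["C_CONTIGUOUS"])  # True
print(some_rows.flags["C_CONTIGUOUS"])   # False
```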

References

[1] pandas source code

[2] Marc Garcia: Demystifying pandas internals

[3] Tutorial: Using Pandas with Large Data Sets in Python

[4] Why Python is Slow: Looking Under the Hood

[5] Jeffrey Tratner: Pandas Under The Hood: Peeking behind the scenes of a high performance data analys

[6] Stephen Simmons: Pandas from the Inside
