Pandas internals
A deeper understanding of pandas
What is pandas?
pandas is a popular package for processing data. This article collects some useful material on pandas internals to help data scientists and engineers gain a deeper understanding of the package.
pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way towards this goal. From [1]
Source code [1]
Most of the source code is written in Python; the performance-critical parts are implemented in Cython and C.
An example exploring a pandas DataFrame
This example is slightly modified from [2].
Basic checking
>>> import pandas as pd
>>> df = pd.DataFrame({"foo": [1, 3, 7], "bar": [.55, .768, .90] , "foobar": [100, 230, 450]})
>>> print(df.dtypes)
foo int64
bar float64
foobar int64
dtype: object
>>> df
foo bar foobar
0 1 0.550 100
1 3 0.768 230
2 7 0.900 450
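More advanced checking
Note that _data is a private attribute. Newer pandas versions expose the same object as _mgr, and the block class names printed below depend on the pandas version.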
>>> df._data
BlockManager
Items: Index(['foo', 'bar', 'foobar'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(1, 2, 1), 1 x 3, dtype: float64
NumericBlock: slice(0, 4, 2), 2 x 3, dtype: int64
>>> df._data.blocks
(NumericBlock: slice(1, 2, 1), 1 x 3, dtype: float64, NumericBlock: slice(0, 4, 2), 2 x 3, dtype: int64)
>>> df._data.blocks[1].values
array([[ 1, 3, 7],
[100, 230, 450]])
>>> type(df._data.blocks[1].values)
<class 'numpy.ndarray'>
>>> df._data.blocks[1].values.data
<memory at 0x7f899145fad0>
Even more checking
>>> bytes_ = df._data.blocks[1].values.data.tobytes()
>>> bytes_
b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x00\x00\x00\x00d\x00\x00\x00\x00\x00\x00\x00\xe6\x00\x00\x00\x00\x00\x00\x00\xc2\x01\x00\x00\x00\x00\x00\x00'
>>> print("".join("{:08b}".format(byte) for byte in bytes_))
000000010000000000000000000000000000000000000000000000000000000000000011000000000000000000000000000000000000000000000000000000000000011100000000000000000000000000000000000000000000000000000000011001000000000000000000000000000000000000000000000000000000000011100110000000000000000000000000000000000000000000000000000000001100001000000001000000000000000000000000000000000000000000000000
>>> len(bytes_)
48
>>> df._data.blocks[1].values.strides
(24, 8)
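The strides (24, 8) mean that moving to the next row of the block skips 24 bytes (three int64 values), while moving to the next value within a row skips 8 bytes (one int64). The following is a minimal sketch of that arithmetic using only the standard library. It assumes the little-endian int64 layout visible in the byte dump above, and element_at is a hypothetical helper introduced only for illustration, not part of pandas or NumPy.

```python
import struct

# The six int64 values of the integer block, stored row by row:
# row 0 holds column "foo", row 1 holds column "foobar".
rows = [[1, 3, 7], [100, 230, 450]]

ITEMSIZE = 8                          # bytes per int64
STRIDES = (3 * ITEMSIZE, ITEMSIZE)    # (24, 8), as reported by .strides

# Pack the values as little-endian signed 64-bit integers (assumed layout).
buffer = struct.pack("<6q", *[value for row in rows for value in row])


def element_at(buf, i, j):
    """Hypothetical helper: read element (i, j) using the strides."""
    offset = i * STRIDES[0] + j * STRIDES[1]
    return struct.unpack_from("<q", buf, offset)[0]


print(len(buffer))               # 48 bytes in total: 6 values * 8 bytes
print(element_at(buffer, 0, 1))  # 3   (offset 0*24 + 1*8 = 8)
print(element_at(buffer, 1, 2))  # 450 (offset 1*24 + 2*8 = 40)
```

Because every value has the same size and the rows sit back to back, any element can be located with one multiplication and one addition per axis, with no per-element lookup.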
The figure above gives a very nice illustration of how pandas stores its data internally.
To understand how pandas organizes the data:
Under the hood, pandas groups the columns into blocks of values of the same type. Here’s a preview of how pandas stores the first twelve columns of our dataframe. From [3]
Each type has a specialized class in the pandas.core.internals module. Pandas uses the ObjectBlock class to represent the block containing string columns, and the FloatBlock class to represent the block containing float columns. For blocks representing numeric values like integers and floats, pandas combines the columns and stores them as a NumPy ndarray. The NumPy ndarray is built around a C array, and the values are stored in a contiguous block of memory. Due to this storage scheme, accessing a slice of values is incredibly fast. From [3]
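The grouping idea described above can be imitated with public pandas and NumPy calls. The following is only a conceptual sketch, not how pandas implements its blocks: group_columns_by_dtype is a hypothetical helper that collects the columns of each dtype and stacks them into a single 2-D ndarray, one row per column. It reuses the small DataFrame from the example earlier in this article.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"foo": [1, 3, 7],
                   "bar": [.55, .768, .90],
                   "foobar": [100, 230, 450]})


def group_columns_by_dtype(frame):
    """Hypothetical helper: stack same-dtype columns into one 2-D array.

    This imitates the idea of a block; it is not pandas' implementation.
    """
    names_by_dtype = {}
    for name, dtype in frame.dtypes.items():
        names_by_dtype.setdefault(str(dtype), []).append(name)

    blocks = {}
    for dtype, names in names_by_dtype.items():
        # One row per column, stored in a single contiguous array.
        values = np.vstack([frame[name].to_numpy() for name in names])
        blocks[dtype] = (names, values)
    return blocks


blocks = group_columns_by_dtype(df)
for dtype, (names, values) in blocks.items():
    print(dtype, names, values.shape)

# Basic slicing of an ndarray returns a view on the same memory,
# which is why taking a slice does not require copying the values.
int_values = blocks["int64"][1]
foobar_row = int_values[1]
print(np.shares_memory(foobar_row, int_values))
```

In this sketch the two integer columns end up in one array of shape (2, 3) and the float column in an array of shape (1, 3), mirroring the "2 x 3" and "1 x 3" blocks shown in the output earlier in this article.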
References
[1] source code
[2] Demystifying pandas internals — Marc Garcia
[3] Tutorial: Using Pandas with Large Data Sets in Python
[4] Why Python is Slow: Looking Under the Hood
[5] Jeffrey Tratner: Pandas Under The Hood: Peeking behind the scenes of a high performance data analys