Python floating-point sums give different results

Feb 17, 2023

What is the problem?

As discussed in [1], summing the same column in different ways can give slightly different results:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.read_csv("https://github.com/pandas-dev/pandas/files/5750979/data_source.zip")
>>> df.head()
   Unnamed: 0  cont      rul_c
0           0    20  375.00000
1           1    20 1756.00000
2           2    20   74.00000
3           3    20 1007.00000
4           4    20   55.00000
>>> df.shape
(6000001, 3)
>>> df.reset_index(inplace = True)
>>> print ("Regular sum: %s\n" % df["rul_c"].sum())
Regular sum: 30880496049.429993

>>> print ("Regular sum on filtered column: %s\n" % df[df["cont"] == 20]["rul_c"].sum())
Regular sum on filtered column: 30880496049.429993

>>> print ("GroupBy sum:\n%s" % df.groupby("cont")["rul_c"].sum())
GroupBy sum:
cont
20   30880496049.43000
Name: rul_c, dtype: float64
>>> df.groupby("cont")["rul_c"].agg(np.sum)
cont
20   30880496049.43000
Name: rul_c, dtype: float64

From the above, we can observe that the plain sum and the groupby sum show different values:

df["rul_c"].sum() = 30880496049.429993
df.groupby("cont")["rul_c"].agg(np.sum) = 30880496049.43000

More explorations

>>> np.sum(df["rul_c"])
30880496049.429993
>>> df["rul_c"].sum()
30880496049.429993
>>> sum(df["rul_c"].values.tolist())
30880496049.45165
>>> import math
>>> math.fsum(df["rul_c"].values.tolist())
30880496049.43

Why

This is related to how floating-point numbers work: each value is stored as an approximation, and every addition rounds its result. When many numbers are accumulated, different summation methods round at different points, so their results need not agree exactly. Specifically:

For floating point numbers the numerical precision of sum (and np.add.reduce) is in general limited by directly adding each number individually to the result causing rounding errors in every step. However, often numpy will use a numerically better approach (partial pairwise summation) leading to improved precision in many use-cases. This improved precision is always provided when no axis is given. When axis is given, it will depend on which axis is summed. Technically, to provide the best speed possible, the improved precision is only used when the summation is along the fast axis in memory. Note that the exact precision may vary depending on other parameters. In contrast to NumPy, Python's math.fsum function uses a slower but more precise approach to summation. Especially when summing a large number of lower precision floating point numbers, such as float32, numerical errors can become significant. In such cases it can be advisable to use dtype="float64" to use a higher precision for the output. (From [2])
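The reason the summation method matters at all is that floating-point addition is not associative: grouping the same numbers differently changes where rounding happens. A minimal illustration, using made-up values rather than the dataset above:

```python
# Floating-point addition is not associative: the grouping changes the rounding.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

print(left_first)                 # 0.6000000000000001
print(right_first)                # 0.6
print(left_first == right_first)  # False
```

With three numbers the difference is one unit in the last place; over six million rows, such differences can accumulate into the digits seen above.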

Based on the above description, I also tried math.fsum as the aggregation function. The displayed result is the same as for the other groupby sums:

>>> import math
>>> df.groupby("cont")["rul_c"].agg(math.fsum)
cont
20   30880496049.43000
Name: rul_c, dtype: float64

Based on the above analysis, the results "30880496049.43000" and "30880496049.429993" are both quite accurate. However, "30880496049.45165", from Python's built-in sum over a list, is noticeably less accurate, which should be caused by "adding each number individually to the result causing rounding errors in every step".
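The effect of "adding each number individually" can be reproduced on a tiny scale. The sketch below uses an explicit loop (so it does not depend on how a given Python version implements the built-in sum) and ten copies of 0.1 as a stand-in for real data:

```python
import math

values = [0.1] * 10  # illustrative data, not the dataset above

# Sequential accumulation: every += rounds the running total.
naive = 0.0
for v in values:
    naive += v

print(naive)              # 0.9999999999999999
print(math.fsum(values))  # 1.0
```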

Partial pairwise summation
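The idea is to split the input in half, sum each half separately, and add the two partial sums, so that rounding errors grow roughly with the logarithm of the number of elements instead of linearly. The following is a pure-Python sketch of that idea only; NumPy's actual implementation is written in C and differs in its details, and the names pairwise_sum, naive_sum, and block_size are hypothetical.

```python
import math


def naive_sum(values):
    """Add the values one by one, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values, block_size=8):
    """Illustrative pairwise summation: recurse on halves, loop on small blocks."""
    n = len(values)
    if n <= block_size:
        # Small blocks are summed sequentially to limit recursion overhead.
        return naive_sum(values)
    mid = n // 2
    return pairwise_sum(values[:mid], block_size) + pairwise_sum(values[mid:], block_size)


values = [0.1] * 1_000_000       # illustrative data
reference = math.fsum(values)    # used here as the accurate reference

naive_error = abs(naive_sum(values) - reference)
pairwise_error = abs(pairwise_sum(values) - reference)
print("naive error:   ", naive_error)
print("pairwise error:", pairwise_error)
```

On this input the pairwise error should come out far smaller than the sequential one, which is consistent with np.sum landing much closer to math.fsum than the built-in sum did above.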

Kahan summation
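Kahan (compensated) summation keeps a second variable that tracks the low-order bits lost in each addition and feeds them back into the next step. The sketch below is the textbook form of the algorithm, not something taken from NumPy or pandas; the function name kahan_sum is an assumption for illustration.

```python
import math


def kahan_sum(values):
    """Textbook Kahan summation (illustrative sketch)."""
    total = 0.0
    compensation = 0.0  # estimate of the rounding error accumulated so far
    for v in values:
        y = v - compensation            # apply the correction to the next value
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total


values = [0.1] * 1_000_000  # illustrative data
reference = math.fsum(values)

naive = 0.0
for v in values:
    naive += v

naive_error = abs(naive - reference)
kahan_error = abs(kahan_sum(values) - reference)
print("naive error:", naive_error)
print("Kahan error:", kahan_error)
```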

math fsum

The algorithm’s accuracy depends on IEEE-754 arithmetic guarantees and the typical case where the rounding mode is half-even. On some non-Windows builds, the underlying C library uses extended precision addition and may occasionally double-round an intermediate sum causing it to be off in its least significant bit.[3]
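One way to check what math.fsum delivers is to compare it with exact rational arithmetic from the standard fractions module. The sketch below assumes a typical IEEE-754 build; per the caveat quoted above, the last bit could differ on some platforms. The input values are made up to provoke cancellation.

```python
import math
from fractions import Fraction

# Small values mixed with two huge values that cancel each other.
values = [0.1] * 10 + [1e16, 1.0, -1e16]

# Exact sum of the stored doubles, rounded once at the end.
exact = float(sum(Fraction(v) for v in values))

# Sequential accumulation, rounding after every addition.
naive = 0.0
for v in values:
    naive += v

print("exact, rounded once:", exact)
print("math.fsum:          ", math.fsum(values))
print("sequential loop:    ", naive)
```

Here math.fsum should agree with the exactly computed value, while the sequential loop loses the small terms once the running total reaches 1e16.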

Python, NumPy, and pandas versions

Python 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:50:38) 
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> np.__version__
'1.20.3'
>>> pd.__version__
'1.3.2'