Python floating-point sums give different results
What is the problem?
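As discussed in [1], summing the same column of floating-point numbers in different ways gives slightly different results: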
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.read_csv("https://github.com/pandas-dev/pandas/files/5750979/data_source.zip")
>>> df.head()
Unnamed: 0 cont rul_c
0 0 20 375.00000
1 1 20 1756.00000
2 2 20 74.00000
3 3 20 1007.00000
4 4 20 55.00000
>>> df.shape
(6000001, 3)
>>> df.reset_index(inplace = True)
>>> print ("Regular sum: %s\n" % df["rul_c"].sum())
Regular sum: 30880496049.429993
>>> print ("Regular sum on filtered column: %s\n" % df[df["cont"] == 20]["rul_c"].sum())
Regular sum on filtered column: 30880496049.429993
>>> print ("GroupBy sum:\n%s" % df.groupby("cont")["rul_c"].sum())
GroupBy sum:
cont
20 30880496049.43000
Name: rul_c, dtype: float64
>>> df.groupby("cont")["rul_c"].agg(np.sum)
cont
20 30880496049.43000
Name: rul_c, dtype: float64
From the above, we can observe that the two results differ in their last digits:
df["rul_c"].sum() = 30880496049.429993
df.groupby("cont")["rul_c"].agg(np.sum) = 30880496049.43000
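To put these differences in perspective, the sketch below measures them in units of the spacing between adjacent float64 values at this magnitude. It only uses the numbers printed in this post, typed in as literals; the variable names are made up for the illustration.

```python
import math

regular_sum = 30880496049.429993   # printed by df["rul_c"].sum()
fsum_result = 30880496049.43       # printed by math.fsum(...)
builtin_sum = 30880496049.45165    # printed by the built-in sum(...)

# math.frexp returns (m, e) with x == m * 2**e and 0.5 <= m < 1.
# A float64 carries 53 significant bits, so neighbouring values
# near x are 2**(e - 53) apart.
_, exponent = math.frexp(fsum_result)
spacing = 2.0 ** (exponent - 53)

print("spacing between adjacent doubles:", spacing)
print("pandas sum vs fsum, in spacings:", (fsum_result - regular_sum) / spacing)
print("built-in sum vs fsum, in spacings:", (builtin_sum - fsum_result) / spacing)
```

The first gap is only a couple of representable values wide, while the built-in sum is off by thousands of them.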
More explorations
>>> np.sum(df["rul_c"])
30880496049.429993
>>> df["rul_c"].sum()
30880496049.429993
>>> sum(df["rul_c"].values.tolist())
30880496049.45165
>>> import math
>>> math.fsum(df["rul_c"].values.tolist())
30880496049.43
Why
This comes down to floating-point arithmetic. Each floating-point number is only an approximation of the value it represents, and every addition rounds its result, so when many numbers are accumulated, a different summation method can give a slightly different result. Specifically:
For floating point numbers the numerical precision of sum (and np.add.reduce) is in general limited by directly adding each number individually to the result causing rounding errors in every step. However, often numpy will use a numerically better approach (partial pairwise summation) leading to improved precision in many use-cases. This improved precision is always provided when no axis is given. When axis is given, it will depend on which axis is summed. Technically, to provide the best speed possible, the improved precision is only used when the summation is along the fast axis in memory. Note that the exact precision may vary depending on other parameters. In contrast to NumPy, Python's math.fsum function uses a slower but more precise approach to summation. Especially when summing a large number of lower precision floating point numbers, such as float32, numerical errors can become significant. In such cases it can be advisable to use dtype="float64" to use a higher precision for the output. (From [2])
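The reason the summation strategy matters at all is that floating-point addition is not associative: grouping the same numbers differently changes where the rounding happens. A minimal illustration with three values (not taken from the dataset above):

```python
# Same three numbers, two different groupings.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

print(left_first)                 # 0.6000000000000001
print(right_first)                # 0.6
print(left_first == right_first)  # False
```

A groupby sum, a column sum and a plain loop may all group the additions differently, so small differences in the last digits are expected.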
Based on the description above, I tried math.fsum as the aggregation function, and the result displayed is still 30880496049.43000:
>>> import math
>>> df.groupby("cont")["rul_c"].agg(math.fsum)
cont
20 30880496049.43000
Name: rul_c, dtype: float64
Based on the analysis above, both "30880496049.43000" and "30880496049.429993" are accurate results. The value "30880496049.45165", however, is noticeably less accurate, which is most likely caused by "adding each number individually to the result causing rounding errors in every step".
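The effect of "adding each number individually" can be reproduced without pandas. The sketch below uses an explicit loop (the helper name running_sum is made up here) rather than the built-in sum, because newer Python versions (3.12 onward) changed the built-in to a more accurate algorithm for floats. The input is a synthetic list, not the dataset from this post.

```python
import math

def running_sum(values):
    """Add each value to a running total; every addition rounds the result."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [0.1] * 1_000_000

naive = running_sum(values)
reference = math.fsum(values)  # correctly rounded sum of the same inputs

print("running total:", naive)
print("math.fsum:    ", reference)
print("difference:   ", naive - reference)
```

The rounding errors of the individual additions do not cancel out, so the running total drifts away from the reference as more values are added.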
Partial pairwise summation
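As a rough sketch of the idea, pairwise summation splits the input in half, sums each half recursively and adds the two partial sums, so each value passes through far fewer roundings than in a single running total. This is a simplified pure-Python illustration, not NumPy's actual implementation; the function name and the block_size of 8 are assumptions made for the example.

```python
import math

def pairwise_sum(values, block_size=8):
    """Sum by recursive halving; short blocks fall back to a plain loop."""
    n = len(values)
    if n <= block_size:
        total = 0.0
        for v in values:
            total += v
        return total
    mid = n // 2
    # Each half is summed on its own, then the two partial sums are added.
    return pairwise_sum(values[:mid], block_size) + pairwise_sum(values[mid:], block_size)

values = [0.1] * 1_000_000
reference = math.fsum(values)

print("pairwise error:", abs(pairwise_sum(values) - reference))
```

The "partial" in the name refers to the plain loop over short blocks, which keeps the overhead of the recursion low.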
Kahan summation
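Kahan (compensated) summation keeps a second variable that captures the low-order bits lost in each addition and feeds them back into the next one. A minimal sketch of the textbook algorithm, with a made-up function name and a synthetic input:

```python
import math

def kahan_sum(values):
    """Compensated summation: carry the rounding error of each step forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation            # apply the correction from the last step
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total

values = [0.1] * 1_000_000
reference = math.fsum(values)

print("Kahan error:", abs(kahan_sum(values) - reference))
```

It still makes a single pass over the data, at the cost of a few extra floating-point operations per element.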
math.fsum
The algorithm’s accuracy depends on IEEE-754 arithmetic guarantees and the typical case where the rounding mode is half-even. On some non-Windows builds, the underlying C library uses extended precision addition and may occasionally double-round an intermediate sum causing it to be off in its least significant bit.[3]
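A small example of what math.fsum buys you, using three values chosen so that a running total loses information completely. The values are illustrative only, and the platform caveat quoted above still applies to the last bit.

```python
import math

values = [1e100, 1.0, -1e100]

# Running total: 1.0 is far below the precision of 1e100, so it is
# rounded away in the first addition and never comes back.
running_total = 0.0
for v in values:
    running_total += v

print(running_total)                # 0.0
print(math.fsum(values))            # 1.0
print(math.fsum(reversed(values)))  # 1.0, the order does not matter
```

math.fsum keeps track of the exact partial sums internally and rounds only once at the end, which is why it is slower but does not depend on the order of the inputs.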
Python and np, pd versions
Python 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:50:38)
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> np.__version__
'1.20.3'
>>> pd.__version__
'1.3.2'
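Takeaway
When summing a large number of floating-point values, use the pandas sum, np.sum or math.fsum. Avoid the plain Python sum([…]), at least on the Python 3.8 used here; the built-in sum switched to a more accurate algorithm for floats in Python 3.12.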
References
[1] https://github.com/pandas-dev/pandas/issues/38778
[2] https://numpy.org/doc/stable/reference/generated/numpy.sum.html
[3] https://docs.python.org/3/library/math.html#math.fsum