Pandas float64 rounding issues


Adam

Jun 12, 2014, 11:33:35 AM
to pyd...@googlegroups.com
Hello,

I'm using pandas data structures in conjunction with other structures in a library for spectroscopy, so this may not be a pandas issue per se, but I'm almost positive it is.  Basically, I'm reading wavelengths from a CSV file where they only go out to two decimal places (e.g. 450.05).

I read these into a dataframe, and then store one column (with the same index) as a reference spectrum.  Thus, I have two dataframes (full spectra, single reference spectrum), with identical indices, derived from the same original index.  A lot is going on under the hood, but somewhere along the line, some of the wavelengths are being rounded differently.  For example, here are the same three elements in two of the indexes:

[480.23 480.6 480.96] [480.23 480.59999999999997 480.96]



These are from Float64Index structures.  I really have no clue if this discrepancy is occurring under the hood from anything I've done, or if it's perhaps an issue involving read_csv() and Float64Index.  Has anyone seen this type of problem before, and could you maybe explain what the cause and resolution were in your case?  I'm using 0.13.1.

The real problem here is that when I add or subtract dataframes, these indices are not aligned, so the result is NaNs.
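For anyone wanting to reproduce the symptom, here is a minimal sketch (the series values are made up; it uses the index values quoted above and runs against a modern pandas):

```python
import pandas as pd

# Two indexes that print identically at 2 decimals but differ in the last bits
idx_a = pd.Index([480.23, 480.6, 480.96])
idx_b = pd.Index([480.23, 480.59999999999997, 480.96])

s_a = pd.Series([1.0, 2.0, 3.0], index=idx_a)
s_b = pd.Series([1.0, 2.0, 3.0], index=idx_b)

# Arithmetic aligns on the union of the labels; the two near-480.6 labels
# don't match exactly, so each contributes a NaN row to the result
result = s_a - s_b
print(result)
```

The result has four rows instead of three, with NaN at both versions of the 480.6 label.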

Thanks

Stephan Hoyer

Jun 12, 2014, 1:06:04 PM
to pyd...@googlegroups.com
This is almost certainly not pandas' fault, but rather a fact of life when doing math with floating-point numbers. You are not guaranteed to get exactly the same result from applying mathematically equivalent operations (e.g., multiplying by 1.0 vs. adding 0).
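The classic one-line demonstration of this (plain Python, no pandas needed):

```python
# Two mathematically equivalent quantities, two slightly different floats:
# 0.1 + 0.2 is not exactly the same double as the literal 0.3
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
```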

The simplest solution would be to always round your floats to the desired precision before putting them in an index (e.g., using np.around).
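A sketch of that approach, with made-up series values, assuming the data is only meaningful to two decimal places:

```python
import numpy as np
import pandas as pd

# Two noisy copies of the same wavelength grid
wl_a = [480.23, 480.6, 480.96]
wl_b = [480.23, 480.59999999999997, 480.96]

# Round to the known precision of the data before building each index
s_a = pd.Series([1.0, 2.0, 3.0], index=np.around(wl_a, 2))
s_b = pd.Series([1.0, 1.0, 1.0], index=np.around(wl_b, 2))

# The labels now match bit-for-bit, so arithmetic aligns cleanly
print(s_a + s_b)
```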

I do agree that this limits the utility of Float64Index. More generally, I think this showcases the need for an Index class to represent discretizations of continuous variables, so you could query for 480.6 and it would return the items labeled by the range [480.4, 480.8]. Here's a GitHub issue where I elaborate on this more:
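Later pandas versions grew something close to this in spirit: `Index.get_indexer` accepts `method="nearest"` with a `tolerance`, so small float noise can be absorbed at lookup time. A sketch (made-up values; this API postdates the 0.13.1 in the thread):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0],
              index=pd.Index([480.23, 480.59999999999997, 480.96]))

# Tolerant nearest-neighbour lookup: querying 480.6 finds the
# noisy 480.5999... label, since it lies within the tolerance
pos = s.index.get_indexer([480.6], method="nearest", tolerance=0.01)
print(s.iloc[pos])
```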

Cheers,
Stephan


--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Adam Hughes

Jun 12, 2014, 6:24:47 PM
to pyd...@googlegroups.com
Thanks for clearing that up.  I will stick with np.around().  Since the loss of precision shows up only in the very low decimal places, I had naively assumed that Float64Index would kind of handle this; otherwise, like you said, is there really a benefit to having 64-bit floats as indices?  What is a use case where indices need that kind of precision?  I'm not a computer scientist, so there are probably many obvious reasons I'm overlooking.

Anyway, thanks for the help.  It was a long-outstanding pain point that was hard for me to track down, and I'll be very happy to have it fixed.

Adam Hughes

Jun 12, 2014, 6:43:41 PM
to pyd...@googlegroups.com
PS,

Has anyone else gotten an error when trying to use np.round() or np.around() on a Float64Index directly?  It looks like they are looking for a method called "round", as in:

index.round()  

which raises an error for me.
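A workaround that sidesteps the missing method is to round the index's underlying array and rebuild the index from it (recent pandas versions do support rounding an index directly, but this sketch avoids relying on that):

```python
import numpy as np
import pandas as pd

idx = pd.Index([480.23, 480.59999999999997, 480.96])

# Round the underlying ndarray, then rebuild the index from the result
rounded = pd.Index(np.around(idx.values, decimals=2))
print(rounded)
```

After this, 480.6 is an exact label again and lookups behave as expected.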

Paul Hobson

Jun 17, 2014, 11:52:55 AM
to pyd...@googlegroups.com
I think the issue is with floats in general, not just 64-bit floats.

Adam Hughes

Jun 18, 2014, 1:09:22 AM
to pyd...@googlegroups.com

Ya, Stephan clarified that for me.  There is still a bug, however, with calling np.round on a Float64Index.
