Dear all,
I am using pandas to handle huge amounts of data with Python. Most of the time it works like a charm. Thank you for that.
Recently, however, I discovered a performance issue that appeared after upgrading my Python packages (among other upgrades, pandas v0.18.1 --> v0.19.1 and numpy v1.11.1 --> v1.12.0).
My application was significantly slower after this upgrade, and I could trace the slowdown to .ix and .loc calls. Some of those .ix and .loc calls took about a second each.
More specifically, I use .ix and/or .loc to fetch rows from a dataframe df that has md5 hash values as its index.
Although I do not believe the size of the dataframe is the cause of the issue, I want to mention that it is a couple of GB in size, with tens of millions of rows.
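A minimal sketch of this kind of lookup, which might help to reproduce and time the behaviour. The column names, the row count and the helper `time_loc` are hypothetical and only stand in for the real data; only `.loc` is used, since `.ix` is not available in newer pandas releases.

```python
import hashlib
import time

import numpy as np
import pandas as pd

# The slowdown appeared after an upgrade, so record the versions in use.
print("pandas", pd.__version__, "numpy", np.__version__)

# Hypothetical stand-in for the real data: md5 hex digests as the index.
n_rows = 100000  # the real frame has tens of millions of rows
keys = [hashlib.md5(str(i).encode()).hexdigest() for i in range(n_rows)]
df = pd.DataFrame(
    {"value": np.arange(n_rows), "label": ["x"] * n_rows},
    index=keys,
)


def time_loc(frame, key, repeats=100):
    """Return the mean wall-clock time (seconds) of one frame.loc[key] call."""
    start = time.perf_counter()
    for _ in range(repeats):
        frame.loc[key]
    return (time.perf_counter() - start) / repeats


# Fetch a single row by its md5 key, then time the same lookup.
row = df.loc[keys[42]]
print("mean .loc time: %.6f s" % time_loc(df, keys[42]))
```

Running the same script under both sets of package versions should show whether the per-lookup time really differs between them.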
Downgrading both pandas and numpy to their previous versions fixed the problem for me (note: I only downgraded those two packages).
Another issue I noticed is that changing the dtype of any of the columns also causes .loc and .ix to be slow, for instance when I changed the dtype from object to category or bool.
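The dtype observation could be checked with a sketch along these lines. The frame, its column names (`flag`, `group`) and the helper `time_loc` are made up for illustration and are not the real data; the timings it prints depend on the installed versions.

```python
import hashlib
import time

import numpy as np
import pandas as pd

# Hypothetical frame with md5 keys as index and two object-dtype columns.
n_rows = 50000
keys = [hashlib.md5(str(i).encode()).hexdigest() for i in range(n_rows)]
df = pd.DataFrame(
    {
        "flag": np.array([True, False] * (n_rows // 2), dtype=object),
        "group": np.array(["a", "b"] * (n_rows // 2), dtype=object),
    },
    index=keys,
)

# Change the dtypes: object -> bool and object -> category.
converted = df.astype({"flag": bool, "group": "category"})


def time_loc(frame, key, repeats=100):
    """Return the mean wall-clock time (seconds) of one frame.loc[key] call."""
    start = time.perf_counter()
    for _ in range(repeats):
        frame.loc[key]
    return (time.perf_counter() - start) / repeats


# Compare the same row lookup before and after the dtype change.
print("object dtypes:    %.6f s" % time_loc(df, keys[1]))
print("converted dtypes: %.6f s" % time_loc(converted, keys[1]))
```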
Are these problems known issues? I could not find anything related to the current versions of the packages. For now, I have worked around the problem by downgrading the packages.
Nevertheless, I thought this might be something you should know about.
Again, I want to say thank you for the wonderful work you've done here! It has made my work a lot easier!
Best Regards!