Vastly speeding up .isin() method

ter...@gmail.com

May 17, 2016, 4:56:10 PM

to PyData

The Series method .isin is much slower than normal Pandas Series filtering.

Normal filtering:

%timeit df.A == 'a'
1000 loops, best of 3: 294 µs per loop

A similar .isin call with two filter values, however, is far slower:

%timeit df.A.isin(['a', 'b'])
10 loops, best of 3: 28.2 ms per loop

This is an unreasonable difference.

Compare with the method below, which is functionally equivalent to the .isin method:

from functools import reduce
from operator import or_
%timeit reduce(or_, (df.A == i for i in ('a', 'b')))
1000 loops, best of 3: 1.61 ms per loop

This is much faster than .isin in its present form. Maybe this could be further improved by using Cython?

So, could .isin be replaced by a method based on the above faster method?
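To make the comparison concrete, here is a minimal sketch of the reduce-based approach wrapped in a function and checked against .isin. The helper name isin_via_reduce and the sample data are made up for illustration; this is not how pandas implements .isin.

```python
from functools import reduce
from operator import or_

import numpy as np
import pandas as pd


def isin_via_reduce(series, values):
    """Hypothetical helper: OR together one equality mask per lookup value."""
    masks = (series == v for v in values)
    # Start from an all-False mask so an empty `values` still returns a mask.
    return reduce(or_, masks, pd.Series(False, index=series.index))


# Assumed sample data: a column of random single-letter strings.
df = pd.DataFrame({"A": np.random.choice(list("abcde"), size=10_000)})

result = isin_via_reduce(df.A, ("a", "b"))
expected = df.A.isin(["a", "b"])

# Both approaches should select exactly the same rows for this data.
assert (result.to_numpy() == expected.to_numpy()).all()
```

One difference I'm aware of: the two are not equivalent for missing values, since s == NaN is always False while .isin([NaN]) matches NaN entries.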

Regards
Terji

Stephan Hoyer

May 17, 2016, 9:43:18 PM

to pyd...@googlegroups.com

Please share the complete setup for your example data -- I'm having trouble reproducing this.

Keep in mind that the important case to optimize is looking up a large number of elements in a large DataFrame. That said, we are always looking to improve performance. A GitHub issue would be a good place to continue this discussion.
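As a rough sketch of why the number of lookup values matters: the reduce-based approach does one full pass over the Series per lookup value, so its cost should grow with the number of values. The sizes, the seed and the helper name isin_via_reduce below are arbitrary assumptions, and the timings will vary by machine and pandas version.

```python
import timeit
from functools import reduce
from operator import or_

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Assumed test data: 100,000 integers drawn from [0, 10,000).
s = pd.Series(rng.integers(0, 10_000, size=100_000))


def isin_via_reduce(series, values):
    """Hypothetical helper: one equality comparison per lookup value."""
    return reduce(or_, (series == v for v in values))


for n_values in (2, 20, 200):
    lookup = list(range(n_values))
    # Sanity check: both approaches select the same rows.
    assert isin_via_reduce(s, lookup).equals(s.isin(lookup))
    t_isin = timeit.timeit(lambda: s.isin(lookup), number=3) / 3
    t_reduce = timeit.timeit(lambda: isin_via_reduce(s, lookup), number=3) / 3
    print(f"{n_values:>4} values: isin {t_isin * 1e3:.2f} ms, "
          f"reduce {t_reduce * 1e3:.2f} ms")
```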


Miki Tebeka

May 18, 2016, 1:24:49 AM

to PyData

A list is not the best data structure for membership checks. Switching to a set (%timeit df.A.isin({'a', 'b'})) reduced the time from 189 ms to 738 µs on my machine (%timeit df.A == 'a' took 644 µs).
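For anyone who wants to check the list-versus-set difference outside IPython, here is a small sketch using the standard timeit module. The sample data is an assumption, and whether the set is actually faster may depend on the pandas version, so treat the printed numbers as local measurements only.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Assumed sample data: 100,000 random single-letter strings.
df = pd.DataFrame({"A": rng.choice(list("abcdefghij"), size=100_000)})

# The container type should not change the result, only (possibly) the speed.
as_list = df.A.isin(["a", "b"])
as_set = df.A.isin({"a", "b"})
assert as_list.equals(as_set)

for label, values in (("list", ["a", "b"]), ("set", {"a", "b"})):
    seconds = timeit.timeit(lambda: df.A.isin(values), number=20) / 20
    print(f"isin({label}): {seconds * 1e3:.3f} ms per call")
```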

ter...@gmail.com

May 18, 2016, 12:40:54 PM

to PyData

Sorry, I'll post a complete example. I haven't opened a GitHub issue before, but I could take it there. Anyway, an example is:

from functools import reduce
from operator import or_

import numpy as np
import pandas as pd

# size must be an integer; 1e6 is a float
s = pd.Series(np.random.randint(100, size=10**6))

%timeit s.isin({1,2})
10 loops, best of 3: 96.2 ms per loop

%timeit reduce(or_, (s==i for i in {1,2}))
100 loops, best of 3: 4.79 ms per loop

For smaller datasets the reduce method is not faster and may even be slower. E.g., at size 1000, .isin is actually faster for me.
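A sketch of how the size dependence could be measured in a plain script, taking the best of a few runs per size. The sizes and the time_call helper are arbitrary choices for illustration, and I'd expect the crossover point to differ between machines and pandas versions.

```python
import timeit
from functools import reduce
from operator import or_

import numpy as np
import pandas as pd


def time_call(func, repeats=5):
    """Hypothetical helper: best wall-clock time of `repeats` single calls."""
    return min(timeit.repeat(func, number=1, repeat=repeats))


values = {1, 2}
timings = {}
for size in (1_000, 100_000, 1_000_000):
    s = pd.Series(np.random.randint(100, size=size))
    t_isin = time_call(lambda: s.isin(values))
    t_reduce = time_call(lambda: reduce(or_, (s == i for i in values)))
    timings[size] = (t_isin, t_reduce)
    print(f"size {size:>9,}: isin {t_isin * 1e3:.3f} ms, "
          f"reduce {t_reduce * 1e3:.3f} ms")
```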

Regards
