contains / "in" function for strings and string columns

mkeb...@gmail.com

Feb 11, 2014, 5:23:52 AM
to num...@googlegroups.com
Hello,


Using Adam Ryan's recent patch for the minimum and maximum functions as a model, I have created a similar patch that implements a "contains(str1, str2)" function for string columns, but I am having trouble making it work:

https://gist.github.com/mrkafk/8932259


The patch: https://github.com/mrkafk/numexpr/commit/c89f632e5007d82abfc344652cc611df60f89047

I'm not even sure whether this error originates in pytables or whether I need to correct something in my numexpr patch. Please advise!

My ultimate goal is actually to add regex support for in-kernel searches in pytables, but I'd also like to implement a simple (boolean) substring search first, as it's both useful in many cases and significantly faster than a typical regex.



Gaëtan de Menten

Feb 11, 2014, 10:14:45 AM
to num...@googlegroups.com
On Tue, Feb 11, 2014 at 11:23 AM, <mkeb...@gmail.com> wrote:

> Using Adam Ryan's recent patch for the minimum and maximum functions as a model, I have created a similar patch that implements a "contains(str1, str2)" function for string columns, but I am having trouble making it work:
>
> https://gist.github.com/mrkafk/8932259
>
> The patch: https://github.com/mrkafk/numexpr/commit/c89f632e5007d82abfc344652cc611df60f89047

I have pointed out a few problems with your code as inline comments...

> I'm not even sure whether this error originates in pytables or whether I need to correct something in my numexpr patch. Please advise!

... however, since the error you have originates in pytables code (it does not even seem to reach numexpr), this is probably not enough to make it work in your use case. I strongly advise you to first test your new operator using numexpr directly, and only when that works try it via pytables.

> My ultimate goal is actually to add regex support for in-kernel searches in pytables, but I'd also like to implement a simple (boolean) substring search first, as it's both useful in many cases and significantly faster than a typical regex.

Sounds cool...

Hope it helps,
Gaëtan

mkeb...@gmail.com

Feb 12, 2014, 3:25:24 PM
to num...@googlegroups.com

Thanks a lot, Gaetan! Your comments helped me greatly with implementing this. Here's the first draft:

https://github.com/mrkafk/numexpr/compare/pydata:master...master

I've added several tests and they pass.

However, I think the "stringcontains" implementation in numexpr/interpreter.cpp is not yet optimal: it is a brute-force approach that copies the source string and the substring into additional areas of memory and then calls strstr on those copies. Since one of the important (main?) points of numexpr is keeping things in the CPU cache as much as possible, I guess malloc() and free() do not perform well here, do they? I think I'd have to add something like a Rabin-Karp implementation "inline" in that function. Would that be optimal for performance?
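
For readers unfamiliar with the idea: a Rabin-Karp search compares a rolling hash of each window of the source against the hash of the substring, and only falls back to a byte comparison when the hashes match. The sketch below is illustrative only and is not the code from the patch. The function name rabin_karp_contains, the explicit-length signature, and the multiplier 257 are all assumptions made for the example. It reads the buffers in place and allocates nothing.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the patch): true if needle[0..m) occurs
// somewhere in hay[0..n). Neither buffer needs to be NUL-terminated.
bool rabin_karp_contains(const char *hay, std::size_t n,
                         const char *needle, std::size_t m) {
    if (m == 0) return true;   // assumption: an empty needle always matches
    if (m > n) return false;

    const std::uint32_t base = 257;  // arbitrary multiplier; hashes wrap mod 2^32

    // high = base^(m-1), the weight of the byte that leaves the window.
    std::uint32_t high = 1;
    for (std::size_t i = 1; i < m; ++i) high *= base;

    // Hash the needle and the first window of the haystack.
    std::uint32_t hn = 0, hw = 0;
    for (std::size_t i = 0; i < m; ++i) {
        hn = hn * base + static_cast<unsigned char>(needle[i]);
        hw = hw * base + static_cast<unsigned char>(hay[i]);
    }

    for (std::size_t i = 0;; ++i) {
        // Confirm a hash match with memcmp to rule out collisions.
        if (hn == hw && std::memcmp(hay + i, needle, m) == 0) return true;
        if (i + m >= n) return false;  // no further window fits
        // Slide the window: drop hay[i], append hay[i + m].
        hw = (hw - static_cast<unsigned char>(hay[i]) * high) * base
             + static_cast<unsigned char>(hay[i + m]);
    }
}
```

Whether this beats a plain scan on the short fixed-width strings typical of table columns is something only a benchmark can answer.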

Gaëtan de Menten

Feb 13, 2014, 3:14:07 AM
to num...@googlegroups.com
On Wed, Feb 12, 2014 at 9:25 PM, <mkeb...@gmail.com> wrote:

> Thanks a lot, Gaetan!

You are welcome...

> Your comments helped me greatly with implementing this. Here's the first draft:
>
> https://github.com/mrkafk/numexpr/compare/pydata:master...master
>
> I've added several tests and they pass.

Nice!

> However, I think the "stringcontains" implementation in numexpr/interpreter.cpp is not yet optimal: it is a brute-force approach that copies the source string and the substring into additional areas of memory and then calls strstr on those copies. Since one of the important (main?) points of numexpr is keeping things in the CPU cache as much as possible, I guess malloc() and free() do not perform well here, do they?

Indeed.

> I think I'd have to add something like a Rabin-Karp implementation "inline" in that function. Would that be optimal for performance?

I don't know anything about substring algorithms, but avoiding the malloc and the copy of the data is a must-have.

-G.
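
One way to avoid both the malloc and the copy is to search the buffers in place with bounded calls such as memchr and memcmp, instead of strstr, which needs NUL-terminated input. The following is a minimal sketch under assumptions, not the implementation in the patch. The names field_len and fixed_width_contains are hypothetical. It assumes each string is a fixed-width field that is NUL-padded when shorter than its width and has no terminator when it fills the whole width, which is the NumPy convention for fixed-width byte strings.

```cpp
#include <cstddef>
#include <cstring>

// Effective length of a fixed-width field: the number of bytes before the
// first NUL, or the full width if the field contains no NUL padding.
static std::size_t field_len(const char *s, std::size_t width) {
    const void *nul = std::memchr(s, '\0', width);
    return nul ? static_cast<std::size_t>(static_cast<const char *>(nul) - s)
               : width;
}

// Hypothetical helper: true if the needle field occurs inside the haystack
// field. Works directly on the caller's buffers; no allocation, no copy.
bool fixed_width_contains(const char *hay, std::size_t hay_width,
                          const char *needle, std::size_t needle_width) {
    const std::size_t n = field_len(hay, hay_width);
    const std::size_t m = field_len(needle, needle_width);
    if (m == 0) return true;   // assumption: empty needle matches, like Python's "in"
    if (m > n) return false;

    const char *p = hay;
    const char *last = hay + (n - m);  // last position where a match can start
    while (p <= last) {
        // Jump to the next occurrence of the needle's first byte.
        p = static_cast<const char *>(
            std::memchr(p, needle[0], static_cast<std::size_t>(last - p) + 1));
        if (!p) return false;
        // Compare the full needle; p + m never passes hay + n.
        if (std::memcmp(p, needle, m) == 0) return true;
        ++p;
    }
    return false;
}
```

In an interpreter loop, a function like this would be called once per row with pointers into the operand blocks and the column item sizes as the widths, so nothing is copied out of the blocks being processed.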