Regex: Matching forward slashes in regular expressions

William Ste

Aug 5, 2014, 3:41:36 PM
to moloc...@googlegroups.com
Hi,
I am trying to use a regular expression to match URIs, and I run into trouble when the pattern has to include a slash and the regex also includes a character class. There are a few cases where this becomes relevant, but one very specific example is where a full URL is in the style of hxxp://<IP address>/<some string>/<some other string>. Usually the <some string> is not consistent, which is why I chose regex as the hammer instead of wildcards.
Just today I was looking for examples where the first <some string> was the string "dout".  I know that my dataset includes entries like this:


The cool part is, this regex will match all of them:
http.uri==/.*/dout.*/

And this one will match the first example:
http.uri==/.*4/dout.*/

If I want to match all of the examples, in most tools I would use a character class of \d or [0-9] instead of just "4" or "8" or "12" before the /dout.  Indeed, I tried these two search terms:
http.uri==/.*[0-9]/dout.*/
and
http.uri==/.*\d/dout.*/

In either case, I get this error:
Error: Parse error on line 1:
http.uri==/.*[0-9]/dout.*/
-------------------^
Expecting 'EOF', '&&', '||', ')', got 'ID'


The fun part is I can use character classes elsewhere in my same search, but not with the results I want because they don't guarantee that the numbers are immediately before the /dout:
http.uri==/.*[0-9]+.*/ && http.uri==/.*/dout.*/
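To illustrate why the two-clause search is looser than the single regex, here is a minimal sketch using Python's `re` module as a stand-in for the Lucene engine. The URIs are hypothetical, invented for illustration (they are not the entries from the dataset mentioned above), and `re.fullmatch` is used on the assumption that the regex has to match the whole field value, which is why the patterns begin and end with `.*`.

```python
import re

# Hypothetical URIs, invented for illustration only.
uris = [
    "/4/dout/a.php",    # digit directly before /dout
    "/x9y/dout/a.php",  # digit present, but not directly before /dout
    "/abc/dout/a.php",  # no digit at all
]

# The single regex that was intended: a digit immediately before /dout.
single = re.compile(r".*[0-9]/dout.*")

# The two separate clauses joined with && in the search above.
has_digit = re.compile(r".*[0-9]+.*")
has_dout = re.compile(r".*/dout.*")


def matches_single(uri):
    """True if a digit sits immediately before /dout."""
    return single.fullmatch(uri) is not None


def matches_both(uri):
    """True if the URI has a digit somewhere and /dout somewhere."""
    return (has_digit.fullmatch(uri) is not None
            and has_dout.fullmatch(uri) is not None)


for uri in uris:
    print(uri, matches_single(uri), matches_both(uri))
```

The second URI passes the two-clause check but not the single regex, which is the false positive described above.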

The above examples are a niche case to demonstrate the limitation, while the more general issue is I cannot use character classes and a forward slash in the same regex.  I did a little reading on the Lucene regex limitations but can't figure out what I'm doing wrong or if this is a Moloch/ES/Lucene limitation.

Any suggestions? 

Andy

Aug 5, 2014, 3:55:41 PM
to moloc...@googlegroups.com
Looks like there are a few bugs here. In general you should escape the forward slashes with a backslash. That will fix the http.uri==/.*[0-9]\/dout.*/ example. However, character classes are still broken.

William Ste

Aug 5, 2014, 5:09:56 PM
to moloc...@googlegroups.com
Andy, thanks for opening the issue. I actually get the same parse error when I try to escape the forward slash with a backslash. The error below shows a double backslash, but the search term only had a single one: http.uri==/.*[0-9]\/dout.*/

I meant to include in my original post, this is on v0.10.1-GIT.

Error: Parse error on line 1:
...ttp.uri==/.*[0-9]\\/dout.*/
-----------------------^
Expecting 'EOF', '&&', '||', ')', got 'ID'

Andy

Aug 6, 2014, 10:26:31 AM
to moloc...@googlegroups.com
Yes, 0.10.1 is especially bad with slashes; 0.11.1-GIT is a little better. I'm going to try to look at this today.

Andy

Aug 6, 2014, 3:01:00 PM
to moloc...@googlegroups.com
Ok, after rereading the Lucene docs I was reminded that it doesn't support PCRE-style character class shorthand. So \d won't work; you'll have to use [0-9]. If you upgrade to 0.11.1 the backslash issues should already be fixed. There is some strangeness where some regexps that should be treated as invalid are not (example: /foo/foo/ should be an error, but /foo\/foo/ should work), which I'll fix in an upcoming version.
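
As a rough sketch of the two rules above (write [0-9] instead of \d, and backslash-escape forward slashes inside the regex), the following Python helper rewrites a PCRE-style pattern into the /.../ form. The function name `to_moloch_regex` is hypothetical and not part of Moloch, and whether a given Moloch version accepts the result still depends on the parser fixes mentioned in this thread. It also does not handle \d inside a bracket class such as [\da-f].

```python
def to_moloch_regex(pattern):
    """Hypothetical helper: rewrite a PCRE-style pattern as a /.../ regex term.

    - \\d is replaced with [0-9], since the shorthand is not supported.
    - Unescaped forward slashes are backslash-escaped.
    - Any other backslash escape is copied through unchanged.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # Consume the escape pair in one step so an already-escaped
            # slash (\/) is not escaped a second time.
            out.append("[0-9]" if nxt == "d" else ch + nxt)
            i += 2
        elif ch == "/":
            out.append("\\/")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "/" + "".join(out) + "/"


# Build a search expression from the pattern discussed in this thread.
expression = "http.uri==" + to_moloch_regex(r".*\d/dout.*")
print(expression)
```

Scanning character by character, rather than using a blanket string replace, keeps a pattern that already contains \/ from ending up with a doubled backslash.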

Andy