Regex: Matching forward slashes in regular expressions

William Ste

Aug 5, 2014, 3:41:36 PM

to moloc...@googlegroups.com

Hi,

I am trying to use a regular expression to match URIs, and running into some trouble when I want to specify that the URI includes a slash and the regex includes a character class. There are a few cases where this becomes relevant, but one very specific example is where a full URL is in the style of hxxp://<IP address>/<some string>/<some other string>. Usually the <some string> is not consistent, which is why I chose regex as the hammer instead of wildcards.

Just today I was looking for examples where the first <some string> was the string "dout". I know that my dataset includes entries like this:

hxxp://1.2.3.4/dout.asp?foo=bar

hxxp://5.6.7.8/dout.php?blah=gah

hxxp://9.10.11.12/dout/zigzag.html

The cool part is, this regex will match all of them:

http.uri==/.*/dout.*/

And this one will match the first example:

http.uri==/.*4/dout.*/

If I want to match all of the examples, in most tools I would use a character class of \d or [0-9] instead of just "4" or "8" or "12" before the /dout. Indeed, I tried these two search terms:

http.uri==/.*[0-9]/dout.*/

and

http.uri==/.*\d/dout.*/

In either case, I get this error:

Error: Parse error on line 1:
http.uri==/.*[0-9]/dout.*/
-------------------^
Expecting 'EOF', '&&', '||', ')', got 'ID'

The fun part is I can use character classes elsewhere in my same search, but not with the results I want because they don't guarantee that the numbers are immediately before the /dout:

http.uri==/.*[0-9]+.*/ && http.uri==/.*/dout.*/
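
To make the gap concrete, here is a minimal sketch that uses Python's re module as a stand-in. It is not Lucene's regex engine and Moloch does not run Python; re.fullmatch is used only to mimic matching against the whole field value. The sample URI is made up for illustration and is not from the dataset.

```python
import re

# Hypothetical URI: digits are present, but not immediately before /dout.
uri = "hxxp://1.2.3.4/x/dout.asp"

# The two separate clauses from the combined search above.
has_digit = re.fullmatch(r".*[0-9]+.*", uri) is not None
has_dout = re.fullmatch(r".*/dout.*", uri) is not None
two_clause = has_digit and has_dout

# The single pattern that is actually wanted: a digit right before /dout.
digit_then_dout = re.fullmatch(r".*[0-9]/dout.*", uri) is not None

print("two-clause search matches:", two_clause)
print("single-pattern search matches:", digit_then_dout)
```

The two-clause search accepts this URI because each condition holds somewhere in the string, while the single pattern rejects it because no digit sits directly before /dout.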

The examples above are a niche case that demonstrates the limitation; the more general issue is that I cannot use a character class and a forward slash in the same regex. I did a little reading on the Lucene regex limitations, but I can't figure out whether I'm doing something wrong or whether this is a Moloch/ES/Lucene limitation.

Any suggestions?

Andy

Aug 5, 2014, 3:55:41 PM

to moloc...@googlegroups.com

Looks like there are a few bugs here. In general, you should escape forward slashes with a backslash. That will fix the http.uri==/.*[0-9]\/dout.*/ example. However, character classes are still broken.
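
For anyone generating search expressions from a script, the escaping advice can be sketched roughly as follows. This is only an illustration: escape_slashes and to_expression are hypothetical helper names, not part of Moloch, and the sketch assumes the expression has the field==/regex/ shape used in this thread.

```python
def escape_slashes(pattern: str) -> str:
    """Backslash-escape every '/' that is not already escaped."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Keep an existing escape pair (such as \/ or \.) untouched.
            out.append(pattern[i:i + 2])
            i += 2
        elif ch == "/":
            out.append("\\/")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def to_expression(field: str, pattern: str) -> str:
    """Wrap an escaped pattern in the field==/regex/ form (assumed syntax)."""
    return f"{field}==/{escape_slashes(pattern)}/"


print(to_expression("http.uri", ".*[0-9]/dout.*"))
```

Because already-escaped pairs are copied through unchanged, running the helper twice on the same pattern does not double the backslashes.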

https://github.com/aol/moloch/issues/281

William Ste

Aug 5, 2014, 5:09:56 PM

to moloc...@googlegroups.com

Andy, thanks for opening the issue. I actually get the same parse error when I escape the forward slash with a backslash. The error below shows a double backslash, but the search term had only a single one: http.uri==/.*[0-9]\/dout.*/

I meant to mention in my original post that this is on v0.10.1-GIT.

Error: Parse error on line 1:
...ttp.uri==/.*[0-9]\\/dout.*/
-----------------------^
Expecting 'EOF', '&&', '||', ')', got 'ID'

Andy

Aug 6, 2014, 10:26:31 AM

to moloc...@googlegroups.com

Yes, 0.10.1 is especially bad with slashes; 0.11.1-GIT is a little better. I'm going to try to look at this today.

Andy

Aug 6, 2014, 3:01:00 PM

to moloc...@googlegroups.com

OK, after rereading the Lucene docs I was reminded that it doesn't support PCRE-style character class shorthand. So \d won't work; you'll have to use [0-9]. If you upgrade to 0.11.1, the backslash issues should already be fixed. There is still some strangeness where some regexps that should be treated as invalid are not (example: /foo/foo/ should be an error, but /foo\/foo/ should work), which I'll fix in an upcoming version.
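
As a rough pre-flight check before pasting a pattern into the search box, the two rules above can be sketched as a small linter. This is a hypothetical helper, not anything shipped with Moloch or Lucene; the set of shorthand letters it flags (\d, \w, \s and their uppercase forms) is an assumption based on the statement that PCRE-style shorthand is unsupported.

```python
# Assumed set of PCRE-style shorthand letters to flag (e.g. \d, \w, \s).
SHORTHAND = set("dDwWsS")


def check_pattern(pattern: str) -> list[str]:
    """Return a list of problems found in the regex body (text between the outer slashes)."""
    problems = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in SHORTHAND:
                problems.append(f"position {i}: shorthand \\{nxt} is not supported, spell out the class")
            i += 2  # skip the escaped character, so \/ is treated as fine
            continue
        if ch == "/":
            problems.append(f"position {i}: unescaped '/', write it as \\/")
        i += 1
    return problems


for body in (r".*\d/dout.*", r".*[0-9]\/dout.*", "foo/foo", r"foo\/foo"):
    print(body, "->", check_pattern(body) or "ok")
```

The check is purely syntactic; it does not validate the rest of the regex or guarantee that a given Moloch version will accept the expression.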

Andy
