Hi,
I am trying to use a regular expression to match URIs, and I run into trouble whenever the regex contains both a forward slash and a character class. There are a few cases where this becomes relevant, but one very specific example is a full URL in the style of hxxp://<IP address>/<some string>/<some other string>. Usually the <some string> is not consistent, which is why I chose regex as the hammer instead of wildcards.
Just today I was looking for examples where the first <some string> was the string "dout". I know that my dataset includes entries like this:
The cool part is, this regex will match all of them:
http.uri==/.*/dout.*/
And this one will match the first example:
http.uri==/.*4/dout.*/
If I want to match all of the examples, in most tools I would use a character class of \d or [0-9] instead of just "4" or "8" or "12" before the /dout. Indeed, I tried these two search terms:
http.uri==/.*[0-9]/dout.*/
and
http.uri==/.*\d/dout.*/
In either case, I get this error:
Error: Parse error on line 1:
http.uri==/.*[0-9]/dout.*/
-------------------^
Expecting 'EOF', '&&', '||', ')', got 'ID'
The fun part is that I can use character classes elsewhere in the same search, but that does not give me the results I want, because nothing guarantees that the digits come immediately before the /dout:
http.uri==/.*[0-9]+.*/ && http.uri==/.*/dout.*/
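To show why the split search is not equivalent to the single regex I want, here is a small Python sketch of the matching logic only. The URIs are made up for illustration (they are not from my dataset), Python's re module is not the Lucene regex engine, and I use re.fullmatch on the assumption that the pattern has to match the whole field value.

```python
import re

# Hypothetical URIs for illustration only; not taken from my dataset.
uris = [
    "192.0.2.4/dout/x",      # digit immediately before /dout
    "192.0.2.4/abc/dout/x",  # digits present, but not immediately before /dout
    "example.test/dout/x",   # /dout present, no digit anywhere
]

# The single regex I would like to run (digit directly before /dout).
single = re.compile(r".*[0-9]/dout.*")

# The two regexes from the split search that parses without an error.
any_digit = re.compile(r".*[0-9]+.*")
any_dout = re.compile(r".*/dout.*")


def matches_single(uri: str) -> bool:
    # fullmatch anchors the pattern to the whole string (an assumption
    # about how the search backend treats the regex).
    return single.fullmatch(uri) is not None


def matches_split(uri: str) -> bool:
    # Equivalent of combining the two regexes with &&.
    return (any_digit.fullmatch(uri) is not None
            and any_dout.fullmatch(uri) is not None)


single_hits = [u for u in uris if matches_single(u)]
split_hits = [u for u in uris if matches_split(u)]

print("single regex:", single_hits)
print("split search:", split_hits)
```

In this sketch the split search also returns the second URI, which has digits and /dout but not next to each other, and that is exactly the kind of result I am trying to exclude.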
The examples above are a niche case that demonstrates the limitation. The more general issue is that I cannot use a character class and a forward slash in the same regex. I did a little reading on the Lucene regex limitations, but I can't figure out whether I'm doing something wrong or whether this is a Moloch/ES/Lucene limitation.
Any suggestions?