Re for Apache log file format

Sam Giraffe

Oct 8, 2013, 2:33:31 AM

to pytho...@python.org

Hi,

I am trying to split up the re pattern for the Apache log file format and seem to be having some trouble getting Python to understand a multi-line pattern:

#!/usr/bin/python

import re

#this is a single line

string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'

#trying to break up the pattern match for easy to read code

pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
                     r'(?P<ident>\-)\s+'
                     r'(?P<username>\-)\s+'
                     r'(?P<TZ>\[(.*?)\])\s+'
                     r'(?P<url>\"(.*?)\")\s+'
                     r'(?P<httpcode>\d{3})\s+'
                     r'(?P<size>\d+)\s+'
                     r'(?P<referrer>\"\")\s+'
                     r'(?P<agent>$(.*?)$)')

match = re.search(pattern, string)

if match:
    print match.group('ip')
else:
    print 'not found'

The Python interpreter is skipping to the 'match = re.search' line and then the 'if' statement right after it looks at the <ip>, instead of moving on to <ident> and so on.

mybox:~ user$ python -m pdb /Users/user/Documents/Python/apache.py
> /Users/user/Documents/Python/apache.py(3)<module>()
-> import re
(Pdb) n
> /Users/user/Documents/Python/apache.py(5)<module>()
-> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
(Pdb) n
> /Users/user/Documents/Python/apache.py(7)<module>()
-> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
(Pdb) n
> /Users/user/Documents/Python/apache.py(17)<module>()
-> match = re.search(pattern, string)
(Pdb)

Thank you.

Andreas Perstinger

Oct 8, 2013, 6:23:20 AM

to pytho...@python.org

On 08.10.2013 08:33, Sam Giraffe wrote:
> #this is a single line
> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0"
> 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
>
> #trying to break up the pattern match for easy to read code
> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
> r'(?P<ident>\-)\s+'
> r'(?P<username>\-)\s+'
> r'(?P<TZ>\[(.*?)\])\s+'
> r'(?P<url>\"(.*?)\")\s+'
> r'(?P<httpcode>\d{3})\s+'
> r'(?P<size>\d+)\s+'
> r'(?P<referrer>\"\")\s+'
> r'(?P<agent>$(.*?)$)')

[SNIP]

> The python interpreter is skipping to the 'match = re.search' and then the
> 'if' statement right after it looks at the <ip>, instead of moving onto
> <ident> and so on.

I'm not sure if I understand your problem, but your regex pattern only
matches up to the size. When you look for the referrer, the pattern
expects two quotes but in your string you have "-" (quote, dash, quote).
Thus there is no match (i.e. "match" is None) and the if-statement will
print "not found".
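One way to check where a long pattern stops matching is to test progressively longer prefixes of it. The following is only a sketch: the helper name `first_failing_piece` and the split into `PIECES` are made up for illustration, and each `\s+` has been moved to the front of the following piece so that every prefix is a complete pattern.

```python
import re

LINE = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')

# The pattern from the original post, one piece per named group.
PIECES = [
    r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})',
    r'\s+(?P<ident>\-)',
    r'\s+(?P<username>\-)',
    r'\s+(?P<TZ>\[(.*?)\])',
    r'\s+(?P<url>\"(.*?)\")',
    r'\s+(?P<httpcode>\d{3})',
    r'\s+(?P<size>\d+)',
    r'\s+(?P<referrer>\"\")',
    r'\s+(?P<agent>$(.*?)$)',
]

def first_failing_piece(pieces, text):
    """Return the index of the first piece that breaks the match, or None."""
    for count in range(1, len(pieces) + 1):
        # Join the first `count` pieces and try them against the text.
        if re.match(''.join(pieces[:count]), text) is None:
            return count - 1
    return None

failing = first_failing_piece(PIECES, LINE)  # 7, the referrer piece
```

With the sample line, everything up to and including the size matches, and the first piece that fails is the referrer.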

Bye, Andreas

Neil Cerutti

Oct 8, 2013, 8:50:22 AM

On 2013-10-08, Sam Giraffe <s...@giraffetech.biz> wrote:
>
> Hi,
>
> I am trying to split up the re pattern for Apache log file format and seem
> to be having some trouble in getting Python to understand multi-line
> pattern:
>
> #!/usr/bin/python
>
> import re
>
> #this is a single line
> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0"
> 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
>
> #trying to break up the pattern match for easy to read code
> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
> r'(?P<ident>\-)\s+'
> r'(?P<username>\-)\s+'
> r'(?P<TZ>\[(.*?)\])\s+'
> r'(?P<url>\"(.*?)\")\s+'
> r'(?P<httpcode>\d{3})\s+'
> r'(?P<size>\d+)\s+'
> r'(?P<referrer>\"\")\s+'
> r'(?P<agent>$(.*?)$)')

I recommend using the re.VERBOSE flag when explicating an re.
It'll make your life incrementally easier.

pattern = re.compile(
    r"""(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+
        (?P<ident>\-)\s+
        (?P<username>\-)\s+
        (?P<TZ>\[(.*?)\])\s+  # You can even insert comments.
        (?P<url>\"(.*?)\")\s+
        (?P<httpcode>\d{3})\s+
        (?P<size>\d+)\s+
        (?P<referrer>\"\")\s+
        (?P<agent>$(.*?)$)""", re.VERBOSE)

--
Neil Cerutti

Denis McMahon

Oct 8, 2013, 11:48:38 AM

On Mon, 07 Oct 2013 23:33:31 -0700, Sam Giraffe wrote:

> I am trying to split up the re pattern for Apache log file format and
> seem to be having some trouble in getting Python to understand
> multi-line pattern:

Aiui apache log format uses space as delimiter, encapsulates strings in
'"' characters, and uses '-' as an empty field.

So I think every element should match: (\S+|"[^"]+"|-) and there should
be \s+ between elements.
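A rough sketch of that idea as a field tokenizer follows. It departs from the pattern above in two ways, both of which are assumptions of this sketch: the quoted alternative is tried before `\S+` (otherwise `\S+` would win and split a quoted field at its first space), and a bracketed alternative is added because the time field contains a space. The names `FIELD` and `split_fields` are made up for illustration.

```python
import re

LINE = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')

# Quoted string, or bracketed time stamp, or a run of non-space characters.
# A lone '-' (empty field) is picked up by the \S+ alternative.
FIELD = re.compile(r'"[^"]*"|\[[^\]]*\]|\S+')

def split_fields(line):
    """Return the fields of one log line, quotes and brackets included."""
    return FIELD.findall(line)

fields = split_fields(LINE)
```

For the sample line this yields nine fields, with the time stamp and each quoted string kept in one piece. Quotes escaped inside a quoted field are not handled here.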

--
Denis McMahon, denismf...@gmail.com

Skip Montanaro

Oct 8, 2013, 11:59:51 AM

to Denis McMahon, Python

> Aiui apache log format uses space as delimiter, encapsulates strings in
> '"' characters, and uses '-' as an empty field.

Specifying the field delimiter as a space, you might be able to use
the csv module to read these. I haven't done any Apache log file work
since long before the csv module was available, but it just might
work.
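A minimal sketch of the csv approach, tried only against the sample line from this thread (the function name `csv_fields` is made up for illustration):

```python
import csv

LINE = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')

def csv_fields(line):
    """Split one log line on spaces, treating "..." as a single field."""
    # csv.reader accepts any iterable of strings, so wrap the line in a list.
    return next(csv.reader([line], delimiter=' '))

fields = csv_fields(LINE)
# The time stamp is not quoted, so it arrives as two fields to be rejoined.
timestamp = fields[3] + ' ' + fields[4]
```

The quoted fields come back with their quotes stripped, and the bracketed time stamp is split at its space. How this copes with quotes embedded inside a field has not been checked here.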

Skip

Cameron Simpson

Oct 8, 2013, 6:17:44 PM

to Skip Montanaro, Python, Denis McMahon

You can definitely do this. I pull things out of apache log files
using awk in exactly this fashion. It does rely on each of the
"real" fields having a fixed number of "words" in it. You just stick
the fields back together again.

And also in Python.

I've got a merge-apache-logs script to read multiple logs, presumed
in time order, and produce a single output stream for passing to
log analysis tools:

https://bitbucket.org/cameron_simpson/css/src/tip/bin/merge-apache-logs

It is a bit of a hack, but useful.

It has an "aptime" function to pull and parse the time field from
the line which starts like this:

from datetime import datetime   # needed by the excerpt below

def aptime(logline, zones, defaultZone):
    ''' Compute a datetime object from the supplied Apache log line.
        `defaultZone` is the timezone to use if it cannot be deduced.
    '''
    fields = logline.split()
    if len(fields) < 5:
        ##warning("bad log line: %s", logline)
        return None

    dt = None
    tzinfo = None

    # try for desired "[DD/Mon/YYYY:HH:MM:SS +hhmm]" format
    humantime, tzinfo = fields[3], fields[4]
    if len(humantime) == 21 \
       and humantime.startswith('[') \
       and tzinfo.endswith(']'):
        try:
            dt = datetime.strptime(humantime, "[%d/%b/%Y:%H:%M:%S")
        except ValueError:
            dt = None
        if dt is None:
            tzinfo = None
        else:
            # strip the trailing "]" from the "+hhmm]" field
            tzinfo = tzinfo[:-1]

and proceeds otherwise (we have a few different log formats in play, alas).
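For comparison, on Python 3.2 or later the "+hhmm" offset can be parsed in the same strptime call with %z. This is a sketch under that assumption and is not part of the merge-apache-logs script; the function name `parse_apache_time` is made up, and %b relies on the locale providing English month abbreviations.

```python
from datetime import datetime

LINE = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')

def parse_apache_time(fields):
    """Return an aware datetime from whitespace-split fields, or None."""
    if len(fields) < 5:
        return None
    # Stick the two "words" of the time field back together.
    stamp = fields[3] + ' ' + fields[4]
    try:
        return datetime.strptime(stamp, '[%d/%b/%Y:%H:%M:%S %z]')
    except ValueError:
        return None

when = parse_apache_time(LINE.split())
```

The result carries the UTC offset from the log line, so no separate default zone is needed when the field is well formed.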

So regexps are not your only choice here, and possibly not even the best choice.

Cheers,
--
Cameron Simpson <c...@zip.com.au>

This is not a bug. It's just the way it works, and makes perfect sense.
- Tom Christiansen <tch...@jhereg.perl.com>
I like that line. I hope my boss falls for it.
- Chaim Frenkel <cha...@cris.com>

Piet van Oostrum

Oct 9, 2013, 1:33:14 PM

Although you have written the regexp as a sequence of lines, in reality it is a single string, and therefore pdb will do only a single step, and not go into its "parts", which really are not parts.
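A small illustration of this, using only the first two pieces of the pattern (the variable names are made up): adjacent string literals are joined into one string before anything is run, so there is a single statement for pdb to step over.

```python
import re

# Written across two lines...
split_text = (r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
              r'(?P<ident>\-)\s+')

# ...and written on one line: the resulting strings are identical.
single_text = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+(?P<ident>\-)\s+'

same = (split_text == single_text)

# The compiled pattern stores that one joined string.
compiled = re.compile(split_text)
```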

>
> mybox:~ user$ python -m pdb /Users/user/Documents/Python/apache.py
>> /Users/user/Documents/Python/apache.py(3)<module>()
> -> import re
> (Pdb) n
>> /Users/user/Documents/Python/apache.py(5)<module>()
> -> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-"
> "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
> (Pdb) n
>> /Users/user/Documents/Python/apache.py(7)<module>()
> -> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
> (Pdb) n
>> /Users/user/Documents/Python/apache.py(17)<module>()
> -> match = re.search(pattern, string)
> (Pdb)

Also as Andreas has noted the r'(?P<referrer>\"\")\s+' part is wrong. It should probably be
r'(?P<referrer>\".*?\")\s+'

And the r'(?P<agent>$(.*?)$)') will also not match as there is text outside the (). Should probably also be
r'(?P<agent>\".*?\")') or something like it.
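Putting both suggested changes into the original pattern gives something like the sketch below. It has been tried only against the sample line from this thread; ident and username are still matched as a literal '-', as in the original, so lines with a real user name would need a looser pattern.

```python
import re

LINE = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')

pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
                     r'(?P<ident>\-)\s+'
                     r'(?P<username>\-)\s+'
                     r'(?P<TZ>\[(.*?)\])\s+'
                     r'(?P<url>\"(.*?)\")\s+'
                     r'(?P<httpcode>\d{3})\s+'
                     r'(?P<size>\d+)\s+'
                     r'(?P<referrer>\".*?\")\s+'   # was \"\"
                     r'(?P<agent>\".*?\")')        # was $(.*?)$

match = pattern.search(LINE)
if match:
    ip = match.group('ip')
    referrer = match.group('referrer')
    agent = match.group('agent')
```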
--
Piet van Oostrum <pi...@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]