
Re for Apache log file format

Sam Giraffe

Oct 8, 2013, 2:33:31 AM
to pytho...@python.org
Hi,

I am trying to split the re pattern for the Apache log file format across several lines, and I am having trouble getting Python to understand the multi-line pattern:

#!/usr/bin/python

import re

#this is a single line
string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'

#trying to break up the pattern match for easy to read code
pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
                     r'(?P<ident>\-)\s+'
                     r'(?P<username>\-)\s+'
                     r'(?P<TZ>\[(.*?)\])\s+'
                     r'(?P<url>\"(.*?)\")\s+'
                     r'(?P<httpcode>\d{3})\s+'
                     r'(?P<size>\d+)\s+'
                     r'(?P<referrer>\"\")\s+'
                     r'(?P<agent>\((.*?)\))')

match = re.search(pattern, string)

if match:
    print match.group('ip')
else:
    print 'not found'

The Python interpreter skips to the 'match = re.search' line and then to the 'if' statement right after it looks at <ip>, instead of moving on to <ident> and so on.

mybox:~ user$ python -m pdb /Users/user/Documents/Python/apache.py
> /Users/user/Documents/Python/apache.py(3)<module>()
-> import re
(Pdb) n
> /Users/user/Documents/Python/apache.py(5)<module>()
-> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
(Pdb) n
> /Users/user/Documents/Python/apache.py(7)<module>()
-> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
(Pdb) n
> /Users/user/Documents/Python/apache.py(17)<module>()
-> match = re.search(pattern, string)
(Pdb)

Thank you.

Andreas Perstinger

Oct 8, 2013, 6:23:20 AM
to pytho...@python.org
On 08.10.2013 08:33, Sam Giraffe wrote:
> #this is a single line
> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0"
> 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
>
> #trying to break up the pattern match for easy to read code
> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
> r'(?P<ident>\-)\s+'
> r'(?P<username>\-)\s+'
> r'(?P<TZ>\[(.*?)\])\s+'
> r'(?P<url>\"(.*?)\")\s+'
> r'(?P<httpcode>\d{3})\s+'
> r'(?P<size>\d+)\s+'
> r'(?P<referrer>\"\")\s+'
> r'(?P<agent>\((.*?)\))')

[SNIP]

> The python interpreter is skipping to the 'math = re.search' and then the
> 'if' statement right after it looks at the <ip>, instead of moving onto
> <ident> and so on.

I'm not sure if I understand your problem, but your regex pattern only
matches up to the size. When you look for the referrer, the pattern
expects two quotes but in your string you have "-" (quote, dash, quote).
Thus there is no match (i.e. "match" is None) and the if-statement will
print "not found".
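
One way to see where a long pattern stops matching is to try progressively longer prefixes of it. The following is only a sketch in Python 3 syntax: the list `parts` and the helper `first_failing_part` are names made up for this illustration, not anything from the original script.

```python
import re

line = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')

# The pieces of the original pattern, kept in a list so prefixes can be tried.
parts = [
    r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+',
    r'(?P<ident>\-)\s+',
    r'(?P<username>\-)\s+',
    r'(?P<TZ>\[(.*?)\])\s+',
    r'(?P<url>\"(.*?)\")\s+',
    r'(?P<httpcode>\d{3})\s+',
    r'(?P<size>\d+)\s+',
    r'(?P<referrer>\"\")\s+',
    r'(?P<agent>\((.*?)\))',
]


def first_failing_part(parts, text):
    """Return the index of the first part that breaks the match, or None."""
    for i in range(1, len(parts) + 1):
        # Join the first i pieces and see whether that prefix still matches.
        if re.search(''.join(parts[:i]), text) is None:
            return i - 1
    return None


failing = first_failing_part(parts, line)
print(failing, parts[failing])
```

For the sample line this points at the referrer piece, which agrees with the explanation above.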

Bye, Andreas

Neil Cerutti

Oct 8, 2013, 8:50:22 AM
On 2013-10-08, Sam Giraffe <s...@giraffetech.biz> wrote:
>
> Hi,
>
> I am trying to split up the re pattern for Apache log file format and seem
> to be having some trouble in getting Python to understand multi-line
> pattern:
>
> #!/usr/bin/python
>
> import re
>
> #this is a single line
> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0"
> 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
>
> #trying to break up the pattern match for easy to read code
> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
> r'(?P<ident>\-)\s+'
> r'(?P<username>\-)\s+'
> r'(?P<TZ>\[(.*?)\])\s+'
> r'(?P<url>\"(.*?)\")\s+'
> r'(?P<httpcode>\d{3})\s+'
> r'(?P<size>\d+)\s+'
> r'(?P<referrer>\"\")\s+'
> r'(?P<agent>\((.*?)\))')

I recommend using the re.VERBOSE flag when explicating an re.
It'll make your life incrementally easier.

pattern = re.compile(
    r"""(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+
        (?P<ident>\-)\s+
        (?P<username>\-)\s+
        (?P<TZ>\[(.*?)\])\s+ # You can even insert comments.
        (?P<url>\"(.*?)\")\s+
        (?P<httpcode>\d{3})\s+
        (?P<size>\d+)\s+
        (?P<referrer>\"\")\s+
        (?P<agent>\((.*?)\))""", re.VERBOSE)

--
Neil Cerutti

Denis McMahon

Oct 8, 2013, 11:48:38 AM
On Mon, 07 Oct 2013 23:33:31 -0700, Sam Giraffe wrote:

> I am trying to split up the re pattern for Apache log file format and
> seem to be having some trouble in getting Python to understand
> multi-line pattern:

Aiui apache log format uses space as delimiter, encapsulates strings in
'"' characters, and uses '-' as an empty field.

So I think every element should match: (\S+|"[^"]+"|-) and there should
be \s+ between elements.
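
A rough sketch of that idea in Python 3 syntax, using re.findall to pull the fields out one by one. Two things here are assumptions of this sketch rather than part of the suggestion above: the quoted alternative is listed before \S+ (alternation is tried left to right, so \S+ first would split a quoted field at its first space), and an extra alternative keeps the bracketed timestamp in one piece. The name `FIELD` is made up for the example.

```python
import re

line = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')

# Quoted strings and the bracketed timestamp are tried before bare words,
# so a field that contains spaces stays in one piece.
FIELD = re.compile(r'"[^"]*"|\[[^\]]*\]|\S+')

fields = FIELD.findall(line)
for number, field in enumerate(fields):
    print(number, field)
```

The quotes and brackets are still attached to the fields, so they need stripping afterwards if only the contents are wanted.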

--
Denis McMahon, denismf...@gmail.com

Skip Montanaro

Oct 8, 2013, 11:59:51 AM
to Denis McMahon, Python
> Aiui apache log format uses space as delimiter, encapsulates strings in
> '"' characters, and uses '-' as an empty field.

Specifying the field delimiter as a space, you might be able to use
the csv module to read these. I haven't done any Apache log file work
since long before the csv module was available, but it just might
work.
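
A minimal sketch of what that might look like, in Python 3 syntax and tried only against the sample line from this thread. It assumes the default quote character ('"') is good enough, which may not hold for log lines containing escaped quotes.

```python
import csv

line = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')

# csv.reader accepts any iterable of lines; a one-element list is enough here.
reader = csv.reader([line], delimiter=' ')
fields = next(reader)

# The timestamp is not quoted, so it arrives as two fields and is rejoined.
timestamp = fields[3] + ' ' + fields[4]

print(fields)
print(timestamp)
```

The quoted fields come back without their quotes, but the bracketed timestamp is split at its space and has to be stuck back together by hand.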

Skip

Cameron Simpson

Oct 8, 2013, 6:17:44 PM
to Skip Montanaro, Python, Denis McMahon
You can definitely do this. I pull things out of apache log files
using awk in exactly this fashion. It does rely on each of the
"real" fields having a fixed number of "words" in it. You just stick
the fields back together again.

And also in Python.

I've got a merge-apache-logs script to read multiple logs, presumed
in time order, and produce a single output stream for passing to
log analysis tools:

https://bitbucket.org/cameron_simpson/css/src/tip/bin/merge-apache-logs

It is a bit of a hack, but useful.

It has an "aptime" function to pull and parse the time field from
the line which starts like this:

def aptime(logline, zones, defaultZone):
    ''' Compute a datetime object from the supplied Apache log line.
        `defaultZone` is the timezone to use if it cannot be deduced.
    '''
    fields = logline.split()
    if len(fields) < 5:
        ##warning("bad log line: %s", logline)
        return None

    dt = None
    tzinfo = None

    # try for desired "[DD/Mon/YYYY:HH:MM:SS +hhmm]" format
    humantime, tzinfo = fields[3], fields[4]
    if len(humantime) == 21 \
       and humantime.startswith('[') \
       and tzinfo.endswith(']'):
        try:
            dt = datetime.strptime(humantime, "[%d/%b/%Y:%H:%M:%S")
        except ValueError:
            dt = None
    if dt is None:
        tzinfo = None
    else:
        tzinfo = tzinfo[:-1]

and proceeds otherwise (we have a few different log formats in play, alas).

So regexps are not your only choice here, and possibly not even the best choice.
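
For the single log format shown in this thread, the split-and-rejoin idea can be sketched much more briefly. This is not the aptime function from the script, just a simplified Python 3 illustration. The name `parse_timestamp` is made up, it assumes `%z` support in strptime (Python 3), and `%b` depends on the locale using English month abbreviations.

```python
from datetime import datetime


def parse_timestamp(logline):
    """Return an aware datetime for the log line, or None if it cannot be parsed."""
    fields = logline.split()
    if len(fields) < 5:
        return None
    # The timestamp spans two whitespace-separated words; stick them back together.
    stamp = fields[3] + ' ' + fields[4]
    try:
        return datetime.strptime(stamp, '[%d/%b/%Y:%H:%M:%S %z]')
    except ValueError:
        return None


line = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')
dt = parse_timestamp(line)
print(dt)
```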

Cheers,
--
Cameron Simpson <c...@zip.com.au>

This is not a bug. It's just the way it works, and makes perfect sense.
- Tom Christiansen <tch...@jhereg.perl.com>
I like that line. I hope my boss falls for it.
- Chaim Frenkel <cha...@cris.com>

Piet van Oostrum

Oct 9, 2013, 1:33:14 PM
Although you have written the regexp as a sequence of lines, in reality it is a single string, and therefore pdb will do only a single step, and not go into its "parts", which really are not parts.
>
> mybox:~ user$ python -m pdb /Users/user/Documents/Python/apache.py
> > /Users/user/Documents/Python/apache.py(3)<module>()
> -> import re
> (Pdb) n
> > /Users/user/Documents/Python/apache.py(5)<module>()
> -> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
> (Pdb) n
> > /Users/user/Documents/Python/apache.py(7)<module>()
> -> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
> (Pdb) n
> > /Users/user/Documents/Python/apache.py(17)<module>()
> -> match = re.search(pattern, string)
> (Pdb)
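
That the pieces form a single string can be checked directly. A small sketch (Python 3 syntax, with the made-up name `pattern_text`) using just the first two pieces of the pattern:

```python
# Adjacent string literals are joined by the compiler into one string,
# so there is nothing for the debugger to step through.
pattern_text = (r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
                r'(?P<ident>\-)\s+')

print(type(pattern_text).__name__)
print(pattern_text)
```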

Also as Andreas has noted the r'(?P<referrer>\"\")\s+' part is wrong. It should probably be
r'(?P<referrer>\".*?\")\s+'

And the r'(?P<agent>\((.*?)\))') will also not match as there is text outside the (). Should probably also be
r'(?P<agent>\".*?\")') or something like it.
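
Putting those two changes into the original pattern gives something like the following sketch (Python 3 syntax). Only the referrer and agent pieces differ from the original, and it has been tried only against the one sample line from this thread.

```python
import re

line = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')

pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
                     r'(?P<ident>\-)\s+'
                     r'(?P<username>\-)\s+'
                     r'(?P<TZ>\[(.*?)\])\s+'
                     r'(?P<url>\"(.*?)\")\s+'
                     r'(?P<httpcode>\d{3})\s+'
                     r'(?P<size>\d+)\s+'
                     r'(?P<referrer>\".*?\")\s+'  # was \"\"
                     r'(?P<agent>\".*?\")')       # was \((.*?)\)

match = pattern.search(line)

if match:
    print(match.group('ip'))
    print(match.group('referrer'))
    print(match.group('agent'))
else:
    print('not found')
```

Note that ident and username still only accept a literal dash, so lines with a real user name would need those pieces loosened as well.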
--
Piet van Oostrum <pi...@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]