Problem matching in log file


kio...@gmail.com

May 15, 2015, 1:50:03 PM
to modgr...@googlegroups.com
Hi,

I'm using modgrammar to parse some log files (log4j). My idea is to use the grammar to extract the parts I'm interested in and ignore the rest.

The log file lines look like:
2015-05-14 18:05:35,280 (key:value some nested info) DEBUG Category:371 - Information

The problem is that the last part (the "Information") can span more than one line. Normally it is some text plus XML.

For example:
2015-05-14 18:05:35,280 (key:value some nested info) DEBUG Category:371 - Only one line
2015-05-14 18:05:36,280 (key:value some nested info) DEBUG Category:371 - Multi line <node>
    <subnode>some text</subnode>
</node>
2015-05-14 18:05:37,280 (key:value some nested info) DEBUG Category:371 - Other one liner

My grammars are:
from modgrammar import *

grammar_whitespace_mode = 'explicit'

class YYYYMMDD(Grammar):
    #date
    grammar = (WORD('[0-9]', min=4, max=4), L('-'), WORD('[0-9]', min=2, max=2), L('-'),
               WORD('[0-9]', min=2, max=2))

class HHMMSSsss(Grammar): 
    #time
    grammar = (WORD('[0-9]', min=2, max=2), L(':'), WORD('[0-9]', min=2, max=2), L(':'),
               WORD('[0-9]', min=2, max=2), L(','), WORD('[0-9]', min=1, max=3))

class LogDate(Grammar):
    # YYYY-MM-DD ' ' HH:MM:SS,sss
    grammar = (YYYYMMDD, L(' '), HHMMSSsss)

class LogMDC(Grammar):
    #key:value
    grammar = (WORD('a-zA-Z0-9_\-'), L(':'), REPEAT(ANY, min=0, greedy=False))

class LogNDC(Grammar):
    grammar = (REPEAT(ANY_EXCEPT(' '), min=1))

class LogLevel(Grammar):
    grammar = (L('FATAL') | L('ERROR') | L('WARN') | L('INFO') | L('DEBUG') | L('TRACE'))

class LogCategory(Grammar):
    grammar = (WORD('[a-zA-Z0-9_\-\.]', min=1))

class LogLineNumber(Grammar):
    grammar = (WORD('[0-9]', min=1))

class Unknow(Grammar):
    grammar = (REPEAT(ANY, sep=EOL, min=1, greedy=False))

# there are some other grammars to match known information formats

class LogLine(Grammar):
    grammar = (BOL, LogDate, L(' ('), REPEAT(LogMDC, min=0, sep=" "), REPEAT(LogNDC, min=0, sep=" "), L(') '), LogLevel,
               L(' '), LogCategory, L(':'), LogLineNumber, L(' - '), Unknow, EOL)

class LogFile(Grammar):

    grammar = REPEAT(LogLine)

I'm using the LogFile parser.
The problem is that the Unknow grammar matches only one line.
With the previous example the result is:
 1) "Only one line": this is OK
 2) An error, because "    <subnode>some text</subnode>" doesn't match the grammar:
    modgrammar.ParseError: [line 3, column 1] Expected WORD('[0-9]') or beginning of line: Found '    <subnode>som'

How can I make the grammar match everything up to, but not including, the next valid LogLine?

The run code:
with open('/tmp/x', encoding='latin1') as f:
    parser = LogFile.parser()
    r = parser.parse_file(f, bol=True)
    for e1 in r:
        print(e1)


Thanks!

Alex Stewart

May 19, 2015, 4:17:59 PM
to modgr...@googlegroups.com
I had to think about this one for a little bit, but my best suggestion is that you might want to try taking advantage of the NOT_FOLLOWED_BY predicate, as follows:

(Note: I haven't tested any of this and it's all off the top of my head, so it might need some tweaking)

class ExtraLine (Grammar):
    grammar = (NOT_FOLLOWED_BY(LogLine), REST_OF_LINE, EOL)

class LogEntry (Grammar):
    grammar = (LogLine, ZERO_OR_MORE(ExtraLine))

class LogFile (Grammar):
    grammar = REPEAT(LogEntry)

(The name NOT_FOLLOWED_BY is perhaps a bit misleading for this application. What it does is check that the text starting at the current position (whatever that position is) does not match the specified grammar, and it only allows matching to continue if that's the case. So ExtraLine will first check that what's at the current position does not look like a LogLine, and if so, it will then return a successful match consisting of the whole line (REST_OF_LINE + EOL).)
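
For anyone more familiar with regular expressions, the same "check without consuming" idea can be sketched with a negative lookahead from Python's standard `re` module. This is only an analogy, not modgrammar code: DATE_PREFIX, EXTRA_LINE and collect_extra_lines are hypothetical names made up for the illustration, and the date pattern is an assumed stand-in for LogDate/LogLine.

```python
import re

# Assumed stand-in for "looks like the start of a LogLine": a date prefix.
DATE_PREFIX = r"\d{4}-\d{2}-\d{2} "

# (?!...) is a zero-width negative lookahead: it consumes no text and only
# succeeds when the text at the current position does NOT match the pattern.
# After that check, the rest of the line (including its newline) is matched.
EXTRA_LINE = re.compile(r"(?!" + DATE_PREFIX + r")[^\n]*\n")

def collect_extra_lines(text, pos=0):
    """Collect consecutive continuation lines starting at pos.

    Returns the matched lines and the position where matching stopped.
    """
    lines = []
    while True:
        m = EXTRA_LINE.match(text, pos)
        if m is None:
            break
        lines.append(m.group(0))
        pos = m.end()
    return lines, pos

sample = (
    "    <subnode>some text</subnode>\n"
    "</node>\n"
    "2015-05-14 18:05:37,280 (key:value some nested info) DEBUG Category:371 - Other one liner\n"
)
extra, next_pos = collect_extra_lines(sample)
```

Matching stops at the third line because the lookahead fails there, and since the lookahead itself consumed nothing, that line is still fully available for whatever parses the next entry.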

The one drawback with doing things this way is that it will slow down parsing, because it will essentially end up parsing each LogLine twice (once to tell that the previous LogEntry is done, and then a second time when it starts the next LogEntry).  You don't necessarily need to parse the whole line, though, if you can be sure that, for example, an ExtraLine will never start with a date-time stamp.  In that case, you could reduce the amount of matching overhead required by just making sure the line doesn't start with LogDate (instead of trying to parse out a whole LogLine):

class ExtraLine (Grammar):
    grammar = (NOT_FOLLOWED_BY(LogDate), REST_OF_LINE, EOL)

etc..

--Alex

--
You received this message because you are subscribed to the Google Groups "modgrammar" group.
To unsubscribe from this group and stop receiving emails from it, send an email to modgrammar+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gerardo Martín Chaves

May 20, 2015, 10:14:19 AM
to modgr...@googlegroups.com
Thanks!

I have made the following changes to the original grammar:

class ExtraLine(Grammar):
    grammar = (NOT_FOLLOWED_BY(BOL, YYYYMMDD), REST_OF_LINE, EOL)

class Unknow(Grammar):
    grammar = (REST_OF_LINE, EOL, ZERO_OR_MORE(ExtraLine))

#for the record only: added a leading space to WARN and INFO, because log4j always pads the level to 5 chars in our configuration
class LogLevel(Grammar):
    grammar = (L('FATAL') | L('ERROR') | L(' WARN') | L(' INFO') | L('DEBUG') | L('TRACE'))

#removed EOL at the end because the Unknow grammar takes care of it
class LogLine(Grammar):
    grammar = (BOL, LogDate, L(' ('), REPEAT(LogMDC, min=0, sep=" "), REPEAT(LogNDC, min=0, sep=" "), L(') '), LogLevel,
               L(' '), LogCategory, L(':'), LogLineNumber, L(' - '), Unknow)


The performance is pretty good. At least it is much faster than my workaround (matching the lines with a regexp for the date and sending only one block at a time to modgrammar).
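
For the record, a workaround of that kind (splitting the input into blocks with a date regexp before parsing) could look roughly like the sketch below. This is not the code I actually used: RECORD_START and iter_records are hypothetical names, and it assumes every record starts with a "YYYY-MM-DD HH:MM:SS,sss " timestamp and that continuation lines never do.

```python
import re

# Assumption: a new record always starts with "YYYY-MM-DD HH:MM:SS,sss ".
RECORD_START = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{1,3} ")

def iter_records(lines):
    """Yield each log record as one string, continuation lines included."""
    block = []
    for line in lines:
        # A timestamp at the start of a line closes the block collected so far.
        if RECORD_START.match(line) and block:
            yield "".join(block)
            block = []
        block.append(line)
    if block:
        yield "".join(block)

sample_lines = [
    "2015-05-14 18:05:35,280 (key:value some nested info) DEBUG Category:371 - Only one line\n",
    "2015-05-14 18:05:36,280 (key:value some nested info) DEBUG Category:371 - Multi line <node>\n",
    "    <subnode>some text</subnode>\n",
    "</node>\n",
    "2015-05-14 18:05:37,280 (key:value some nested info) DEBUG Category:371 - Other one liner\n",
]
records = list(iter_records(sample_lines))
```

Each yielded string can then be handed to a parser on its own. Note that any lines appearing before the first timestamp end up together in the first block.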

Thanks again for the help and this wonderful library!

--
Kiov - Gerardo Martín Chaves
Linux user #449707
Desktop: Debian Testing/Unstable/Experimental mix