ANTLR4 Python parsing big files

prthrokz

Mar 10, 2016, 7:55:22 PM
to antlr-di...@googlegroups.com
I am trying to write parsers for Juniper/SRX router access control lists. Below is the grammar I am using:
grammar SRXBackend;

acl:
    'security' '{' 'policies' '{' COMMENT* replaceStmt '{' policy* '}' '}' '}'
    applications addressBook
;

replaceStmt:
    'replace:' IDENT
    | 'replace:' 'from-zone' IDENT 'to-zone' IDENT
;

policy:
    'policy' IDENT '{' 'match' '{' fromStmt* '}' 'then' (action | '{' action+ '}') '}'
;

fromStmt:
    'source-address' addrBlock                         # sourceAddrStmt
    | 'destination-address' addrBlock                  # destinationAddrStmt
    | 'application' (srxName ';' | '[' srxName+ ']')   # applicationBlock
;

action:
    'permit' ';'
    | 'deny' ';'
    | 'log { session-close; }'
;

addrBlock:
    '[' srxName+ ']'
    | srxName ';'
;

applications:
    'applications' '{' application* '}'
    | 'applications' '{' 'apply-groups' IDENT ';' '}'
      'groups' '{' replaceStmt '{' 'applications' '{' application* '}' '}' '}'
;

addressBook:
    'security' '{' 'address-book' '{' replaceStmt '{' addrEntry* '}' '}' '}'
    | 'groups' '{' replaceStmt '{' 'security' '{' 'address-book' '{' IDENT '{' addrEntry* '}' '}' '}' '}' '}'
      'security' '{' 'apply-groups' IDENT ';' '}'
;

application:
    'replace:'? 'application' srxName '{' applicationStmt+ '}'
;

applicationStmt:
    'protocol' srxName ';'               #applicationProtocol
    | 'source-port' portRange ';'        #applicationSrcPort
    | 'destination-port' portRange ';'   #applicationDstPort
;

portRange:
    NUMBER                   #portRangeOne
    | NUMBER '-' NUMBER      #portRangeMinMax
;

addrEntry:
    'address-set' IDENT '{' addrEntryStmt+ '}'   #addrEntrySet
    | 'address' srxName cidr ';'                 #addrEntrySingle
;

addrEntryStmt:
    ('address-set' | 'address') srxName ';'
;

cidr:
    NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ('/' NUMBER)?
;

srxName:
    NUMBER
    | IDENT
    | cidr
;

COMMENT : '/*' .*? '*/' ;
NUMBER  : [0-9]+ ;
IDENT   : [a-zA-Z][a-zA-Z0-9,\-_:\./]* ;
WS      : [ \t\n]+ -> skip ;

When I try to use an ACL with ~80,000 lines, it takes up to ~10 minutes to generate the parse tree. I am using the following code to create the parse tree:

from antlr4 import *
from SRXBackendLexer import SRXBackendLexer
from SRXBackendParser import SRXBackendParser
import sys

def main(argv):
    # Load the whole input file into memory
    ipt = FileStream(argv[1])
    lexer = SRXBackendLexer(ipt)
    stream = CommonTokenStream(lexer)
    parser = SRXBackendParser(stream)
    # Parse starting from the top-level 'acl' rule
    parser.acl()

if __name__ == '__main__':
    main(sys.argv)

I am using Python 2.7 as the target language. I also ran cProfile to identify which code takes the most time. Below are the first few records, sorted by time:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   608448   62.699    0.000  272.359    0.000 LexerATNSimulator.py:152(execATN)
  5007036   41.253    0.000   71.458    0.000 LexerATNSimulator.py:570(consume)
  5615722   32.048    0.000   70.416    0.000 DFAState.py:131(__eq__)
 11230968   24.709    0.000   24.709    0.000 InputStream.py:73(LA)
  5006814   21.881    0.000   31.058    0.000 LexerATNSimulator.py:486(captureSimState)
  5007274   20.497    0.000   29.349    0.000 ATNConfigSet.py:160(__eq__)
 10191162   18.313    0.000   18.313    0.000 {isinstance}
 10019610   16.588    0.000   16.588    0.000 {ord}
  5615484   13.331    0.000   13.331    0.000 LexerATNSimulator.py:221(getExistingTargetState)
  6832160   12.651    0.000   12.651    0.000 InputStream.py:52(index)
  5007036   10.593    0.000   10.593    0.000 InputStream.py:67(consume)
   449433    9.442    0.000  319.463    0.001 Lexer.py:125(nextToken)
        1    8.834    8.834   16.930   16.930 InputStream.py:47(_loadString)
   608448    8.220    0.000  285.163    0.000 LexerATNSimulator.py:108(match)
  1510237    6.841    0.000   10.895    0.000 CommonTokenStream.py:84(LT)
   449432    6.044    0.000  363.766    0.001 Parser.py:344(consume)
   449433    5.801    0.000    9.933    0.000 Token.py:105(__init__)
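
For readers who want to produce a comparable report, here is a minimal sketch using the standard-library cProfile and pstats modules (written for Python 3). The `tokenize` function is a hypothetical stand-in workload, not the ANTLR lexer, and `profile_call` is a helper name invented for this illustration; in the setup above, one would presumably pass the `main` function and its arguments instead.

```python
import cProfile
import io
import pstats


def tokenize(text):
    """Hypothetical stand-in workload: split text on whitespace."""
    return text.split()


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "tottime" orders rows by time spent inside each function itself,
    # excluding time spent in the functions it calls.
    stats.sort_stats("tottime").print_stats(10)  # show at most 10 rows
    return result, buffer.getvalue()


result, report = profile_call(tokenize, "policy a { match { } }\n" * 1000)
print(report)
```

Sorting by "cumtime" instead of "tottime" ranks functions by the time spent in them including their callees, which is usually the quicker way to see which phase dominates.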

I cannot make much sense of it, except that InputStream.LA takes around half a minute. I guess this is because the entire input text is loaded into memory at once. Is there an alternative, lazier way of loading or parsing data with the Python target? Is there any improvement I can make to the grammar to make parsing faster?

Thank you


the_antlr_guy

Mar 11, 2016, 12:42:45 PM
to antlr-discussion
Use an UnbufferedTokenStream; I'm not sure whether the Python target has it. Are you using the latest release of the Python runtime? It got much faster.

prthrokz

Mar 11, 2016, 12:50:59 PM
to antlr-discussion
Thank you for the response. UnbufferedTokenStream is currently not supported for the Python target. I am using the latest release of the Python runtime (https://pypi.python.org/pypi/antlr4-python2-runtime). Can you please suggest a way to improve my grammar? Thank you very much for your time.

Terence Parr

Mar 11, 2016, 1:04:53 PM
to antlr-di...@googlegroups.com
Sorry, I don’t have time to investigate this for you.
T

Eric Vergnaud

Mar 15, 2016, 10:05:20 AM
to antlr-discussion
It seems that a lot of time is spent in the lexer.
I would suggest moving all lexer rules into a lexer grammar.
This might help ANTLR4 make faster decisions.
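
As a rough check of the observation that the lexer dominates, the cProfile figures quoted earlier in the thread can be compared directly. This is only a sketch: the numbers are copied from that output, the quoted table is partial so the total run time is not known, and it assumes that the cumulative time of Parser.consume includes the lexer's work because CommonTokenStream appears to pull tokens from the lexer on demand.

```python
# Cumulative seconds, copied from the cProfile output quoted in the thread.
cumtime = {
    "Parser.consume": 363.766,
    "Lexer.nextToken": 319.463,
}

# Call counts from the same output.
ncalls = {
    "Lexer.nextToken": 449433,
    "LexerATNSimulator.consume": 5007036,
}

# Fraction of Parser.consume's cumulative time spent producing tokens
# (assumes the lexer runs inside Parser.consume; see the note above).
lexer_share = cumtime["Lexer.nextToken"] / cumtime["Parser.consume"]

# Average number of input characters the lexer consumed per nextToken
# call; this includes characters of skipped whitespace tokens.
chars_per_token = (
    float(ncalls["LexerATNSimulator.consume"]) / ncalls["Lexer.nextToken"]
)

print("lexer share of Parser.consume time: {:.1%}".format(lexer_share))
print("characters consumed per nextToken call: {:.1f}".format(chars_per_token))
```

Under those assumptions, most of the measured time is spent producing tokens rather than matching parser rules, which is consistent with looking at the lexer rules first.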