programming language/file format recognition using multiple grammars


Shaul Zevin

unread,
Jul 31, 2015, 10:51:49 AM7/31/15
to antlr-discussion
Hi,

In my case, the programming language or file format is not known in advance.

My goal is to emit a feature (basically a specific string) whenever a particular syntax structure is successfully parsed.
For example, if the parser encounters f(a,b,c);, I want to issue a feature "FUNCTION_CALL".
Later, those features are fed into a machine learning algorithm, which decides which language or file format was used.

I do not need to validate the syntactic correctness of the input file. My assumption is that the file is valid for the programming language it was written in.

One approach is to execute all the grammars I have against the input, one by one.

My main concern is performance. I want to fail on the first error - in my case, an error means that the input does not match the grammar's language.

What do I need to do to get maximum performance? Any other tips besides using BailErrorStrategy?
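
For concreteness, here is roughly what I have in mind for each probe, assuming an ANTLR 4 generated CLexer/CParser with a compilationUnit start rule (all names hypothetical). Besides BailErrorStrategy, I am also considering SLL prediction; it is weaker than the default ALL(*) strategy, so in rare cases it can reject input that full LL would accept.

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.atn.PredictionMode;
import org.antlr.v4.runtime.misc.ParseCancellationException;

public class GrammarProbe {
    // Returns true if the input parses under this grammar, aborting on
    // the first syntax error instead of attempting recovery.
    static boolean matchesC(String input) {
        CLexer lexer = new CLexer(new ANTLRInputStream(input)); // hypothetical generated lexer
        lexer.removeErrorListeners();                           // silence console output
        CParser parser = new CParser(new CommonTokenStream(lexer));
        parser.removeErrorListeners();
        parser.setErrorHandler(new BailErrorStrategy());        // throw on first error
        // SLL prediction is much faster and usually sufficient for a
        // yes/no probe; note it may occasionally produce a false negative.
        parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
        try {
            parser.compilationUnit(); // hypothetical start rule
            return true;
        } catch (ParseCancellationException e) {
            return false; // input does not match this grammar
        }
    }
}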

Thanks,
   Shaul

Mike Lischke

unread,
Aug 2, 2015, 6:38:25 AM8/2/15
to antlr-di...@googlegroups.com
Shaul, that sounds like an interesting project. However, I think using one or more parsers to do the job will not work well. Actually, once your input can be recognized by a given parser, you very likely have identified the particular language, without even feeding the result to the machine learning part. Additionally, you will have to create a grammar for every language and format you want to identify, which probably gets expensive (depending on your needs).

Instead, you could use an approach similar to what is used for speech recognition. Take a lexer to tokenize your input and feed the token stream directly to your "learning machine" (a Markov chain or whatever you are using). This way you do not identify a higher-level construct like a function call (which is not only ambiguous but can have a totally different meaning in another context) but let your machine make sense of the given input stream directly.
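
Something along these lines, as a rough sketch (CLexer is a stand-in for whatever generated lexer you use):

import org.antlr.v4.runtime.*;
import java.util.ArrayList;
import java.util.List;

public class TokenFeatures {
    // Reduce raw source text to a sequence of token-type names, suitable
    // as input features for the learning component (e.g. as n-grams).
    static List<String> tokenFeatures(String input) {
        CLexer lexer = new CLexer(new ANTLRInputStream(input)); // hypothetical generated lexer
        lexer.removeErrorListeners(); // skip complaints about unmatched characters
        List<String> features = new ArrayList<>();
        for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
            // Use the symbolic type name ("Identifier", "LeftParen", ...) rather
            // than the token text, so identifier spellings don't leak into the model.
            features.add(lexer.getVocabulary().getSymbolicName(t.getType()));
        }
        return features;
    }
}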

Mike

Shaul Zevin

unread,
Aug 10, 2015, 1:12:58 AM8/10/15
to antlr-discussion
Thanks Mike. I am following your advice.
I have constructed an extended lexer for C by adding preprocessor keywords to the C grammar.

Now I want to add Java. I am not sure what would be the best approach:
1. I can extend the C lexer by adding Java keywords.
2. I can have a separate lexer for Java and run the C and Java lexers separately.

With approach 1, I may run into lexer errors, since C and Java are syntactically similar but still have quite a few unique keywords.
Approach 2 is expected to be slower, since the same identifier, for example, will be tokenized twice.

Of course I can try to implement both and check which works better, but before doing so I wanted to know your opinion.

Mike Lischke

unread,
Aug 13, 2015, 4:05:37 AM8/13/15
to antlr-di...@googlegroups.com
> Thanks Mike. I am following your advice.
> I have constructed an extended lexer for C by adding preprocessor keywords to the C grammar.
>
> Now I want to add Java. I am not sure what would be the best approach:
> 1. I can extend the C lexer by adding Java keywords.
> 2. I can have a separate lexer for Java and run the C and Java lexers separately.

The same issue applies here as with the parsers. If you have to write a specific lexer for each language, then you already have your "recognition": you simply iterate over all lexers until one "matches". I don't think this is a good approach, since you want to recognize the file content with a "learning machine"/neural network/whatever, not by simply trying a predefined set of patterns.


> With approach 1, I may run into lexer errors, since C and Java are syntactically similar but still have quite a few unique keywords.
> Approach 2 is expected to be slower, since the same identifier, for example, will be tokenized twice.

Tbh, I think this problem is not really related to ANTLR and hence doesn't fit here. You don't need a parser/lexer for file recognition, but a different approach. Since I haven't done anything similar myself, I cannot give you more advice. You first need a general strategy before you try to implement the details of your recognition chain.



Gerald Rosenberg

unread,
Aug 13, 2015, 10:53:14 PM8/13/15
to antlr-discussion
You may want to consider the simpler, more lightweight approach to automatic code-language detection used by the common code highlighters.

In particular, take a look at Google's code-prettify and Highlight.js.  Both provide well-maintained language signature sets and (at least) worked examples of how to apply them iteratively to 'guess' the source language.

Shaul Zevin

unread,
Aug 14, 2015, 2:17:50 AM8/14/15
to antlr-discussion
Thanks Gerald.

At first glance, both iterate over per-language lexers. Highlight.js seems to assign a weight to each lexer rule, so the language with the higher total weight wins.

I think Mike has a point: machine learning should take a different approach, one which effectively eliminates the need to write a lexer for every language and to iterate over the lexers.
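
For anyone curious, the weighting scheme amounts to something like this (the weights below are made up for illustration; the real ones live in each library's language definitions):

import java.util.Map;

public class WeightedGuess {
    // Each language assigns weights to distinctive tokens; the language
    // with the highest total score over the input wins.
    static final Map<String, Integer> C_WEIGHTS =
            Map.of("#include", 10, "printf", 5, "->", 2);
    static final Map<String, Integer> JAVA_WEIGHTS =
            Map.of("import", 5, "public", 3, "extends", 5, "System.out", 10);

    static int score(String source, Map<String, Integer> weights) {
        int total = 0;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            if (source.contains(e.getKey())) {
                total += e.getValue();
            }
        }
        return total;
    }

    static String guess(String source) {
        return score(source, C_WEIGHTS) >= score(source, JAVA_WEIGHTS) ? "C" : "Java";
    }
}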

Jim Idle

unread,
Aug 14, 2015, 3:35:53 AM8/14/15
to antlr-discussion
I would have thought a simple Bayesian filter could do this much better than trying to do it with ANTLR?





Gerald Rosenberg

unread,
Aug 14, 2015, 3:03:31 PM8/14/15
to antlr-discussion
Machine learning is certainly possible, though, depending on the particular ML technique chosen, there are distinct problems as well.

Training on language samples will necessarily make the developed rule base sensitive to everything from variable-naming choices to incidental code structure.  This can be overcome if the training sample set is large enough and covers all of the different code styles and fragmentary structures that the ML engine will be asked to evaluate.  Sample-set size will have an inverse relation to the false-positive rate, though a non-linear and possibly discontinuous one.

You may wish to consider the Apache OpenNLP tool set and, in particular, focus on using a named-entity detector to weight towards distinctive keyword and symbol sets.  (It is very hard to say whether, in the long run, this is a better approach than the simple heuristics approach suggested earlier.)

Jim Idle

unread,
Aug 14, 2015, 9:29:12 PM8/14/15
to antlr-discussion
That could well work too. It seems, though, that a Bayesian filter would be really easy to try, so not a lot of effort would be wasted. Nilsimsa hashing may also be beneficial, perhaps as an additional indicator.





Gerald Rosenberg

unread,
Aug 15, 2015, 3:47:25 PM8/15/15
to antlr-discussion
Agreed.  Though, ultimately, an interesting issue is whether standard 2-way Bayesian filters would be sufficient - in particular, how to handle the case where multiple positives (false or otherwise) are obtained for the same source sample.

Last I looked (quite some time ago), there were no standard/public implementations of an n-way Bayesian filter.  I built one -- the math was an interesting adventure -- and could probably dig out the code if it is of interest.
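
For the idea in miniature (a from-scratch sketch, not the code I mentioned): an n-way naive Bayes scores each class c as log P(c) plus the sum of log P(t|c) over the observed tokens t, with add-one smoothing, and picks the highest-scoring class.

import java.util.*;

public class NWayBayes {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>(); // class -> token -> count
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    public void train(String cls, List<String> tokens) {
        totalDocs++;
        docsPerClass.merge(cls, 1, Integer::sum);
        Map<String, Integer> byToken = counts.computeIfAbsent(cls, k -> new HashMap<>());
        for (String t : tokens) {
            byToken.merge(t, 1, Integer::sum);
            tokensPerClass.merge(cls, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    public String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cls : counts.keySet()) {
            // Prior: fraction of training documents belonging to this class.
            double score = Math.log(docsPerClass.get(cls) / (double) totalDocs);
            Map<String, Integer> byToken = counts.get(cls);
            int denom = tokensPerClass.get(cls) + vocabulary.size(); // add-one smoothing
            for (String t : tokens) {
                score += Math.log((byToken.getOrDefault(t, 0) + 1) / (double) denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = cls;
            }
        }
        return best;
    }
}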

dasc...@gmail.com

unread,
Aug 15, 2015, 4:07:43 PM8/15/15
to antlr-discussion
Just use the 4-16 cores of your CPU by concurrently executing M lexers on the first N bytes of the input, where M is the number of cores. Iterate until all lexers have executed. For all lexers yielding a match on the partial input, run full lexing (or even parsing). Bail out on errors; the one that bails last, or not at all, wins - after recovery attempts, of course.

You'll be surprised how easy this is to implement (assuming you have the lexers) and how efficient it is.
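
Roughly like this, where LanguageProbe is a hypothetical wrapper around one generated lexer:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

interface LanguageProbe {
    String name();
    boolean lexes(String prefix); // true if lexing the prefix raises no errors
}

class ParallelDetector {
    // Run every candidate lexer over the first n characters concurrently
    // and return the names of the languages that survive the prefix test.
    static List<String> survivors(List<LanguageProbe> probes, String input, int n)
            throws InterruptedException, ExecutionException {
        String prefix = input.substring(0, Math.min(n, input.length()));
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (LanguageProbe p : probes) {
                futures.add(pool.submit(() -> p.lexes(prefix) ? p.name() : null));
            }
            List<String> alive = new ArrayList<>();
            for (Future<String> f : futures) {
                String name = f.get();
                if (name != null) {
                    alive.add(name);
                }
            }
            return alive; // run full lexing/parsing on these candidates next
        } finally {
            pool.shutdown();
        }
    }
}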

Shaul Zevin

unread,
Aug 16, 2015, 1:29:25 AM8/16/15
to antlr-di...@googlegroups.com
Thanks Gerald,

The n-way Bayesian filter code would certainly be of interest.

I am using the Stanford CoreNLP toolkit (maximum entropy model).

It would be interesting to see which one does better.


Gerald Rosenberg

unread,
Aug 16, 2015, 4:44:11 PM8/16/15
to antlr-discussion
Just pushed nBayes to GitHub.  It was not intended as a stand-alone library, but rather as part of a much larger Eclipse plugin project.

To evaluate it, start with BayesPartitionClassifier.java and PersistantWordsDataSource.java.  The code was written around 2004, but I should be able to help if you have questions.

Shaul Zevin

unread,
Aug 17, 2015, 2:21:32 AM8/17/15
to antlr-discussion
Thanks Gerald, really appreciated!

Will let you know the comparison to maximum entropy results when done.

Gerald Rosenberg

unread,
Aug 17, 2015, 5:41:20 PM8/17/15
to antlr-discussion
Another independent implementation, in Go:

https://github.com/jbrukh/bayesian - performs naive Bayesian classification into an arbitrary number of classes on sets of strings.


