ANTLR4 as a Frontend for a Plagiarism Detection System

74 views
Skip to first unread message

Geoffrey Challen

unread,
Apr 26, 2017, 4:43:16 PM4/26/17
to antlr-discussion
I'm starting to work on a source code similarity detector, primarily to detect plagiarism in university programming assignments. MOSSSIM, and Algae are similar tools, but I have some different design goals.

I'm wondering if ANTLR4 is a good fit as a frontend for this system, and I like the fact that it seems like it would be easy to add support for multiple languages. But I have a few questions and concerns for those that are more well versed in using ANTLR4.

Note that I have never built one of these before. But as far as I understand, the usual approach is something like what follows:
  1. Parse the code and build a parse tree
  2. Normalize the parse tree by removing variable names, braces, and other punctuation, and replacing literals
  3. Serialize what remains by flattening identifiers: for example, replacing all tree nodes with small integers. (Step #2 should have left you with a small number of node types.)
  4. Compare the serialized output using a variety of approaches that attempt to be robust to reordering and refactoring. (At this point it becomes a non-ANTLR problem, although computationally complex and interesting in its own right.)
Steps #1 and #2 produce some questions about ANTLR:
  • Is ANTLR4's parsing robust enough to deal with adversarial behavior? For example, for my course we use a GCC-based C compiler toolchain. If I pick up the C grammar for ANTLR4, but students can defeat it by mixing in GCC-specific goo that ANTLR4 can't handle gracefully, then that's a problem. Importantly, note that the plan here is to distribute the sources for this tool, and so we have to assume a certain amount of adversarial behavior.
  • Does the ANTLR4 parse tree contain enough information to do #2?
And of course there is the obligatory question: do you know whether anyone has tried this before? I did search the forum a few times, but it's possible my keywords were badly chosen.

Mike Lischke

unread,
Apr 27, 2017, 5:40:18 AM4/27/17
to antlr-di...@googlegroups.com
Geoffrey,

I'm wondering if ANTLR4 is a good fit as a frontend for this system, and I like the fact that it seems like it would be easy to add support for multiple languages. But I have a few questions and concerns for those that are more well versed in using ANTLR4.

That's an interesting project, but it's clear that ANTLR related stuff is only the tip of the iceberg. All what a parser can do is to give you a structured view on your source code.


Note that I have never built one of these before. But as far as I understand, the usual approach is something like what follows:
  1. Parse the code and build a parse tree
  2. Normalize the parse tree by removing variable names, braces, and other punctuation, and replacing literals
  3. Serialize what remains by flattening identifiers: for example, replacing all tree nodes with small integers. (Step #2 should have left you with a small number of node types.)
  4. Compare the serialized output using a variety of approaches that attempt to be robust to reordering and refactoring. (At this point it becomes a non-ANTLR problem, although computationally complex and interesting in its own right.)
Steps #1 and #2 produce some questions about ANTLR:

In fact it's only #1 that is related to ANTLR and for this it's like for any other parsing task: the parser can only recognize a language that you exactly designed in your grammar. Parsers like those created by ANTLR are not made to do fuzzy matching. If you want flexibility in your input you have to build that into your grammar.

  • Is ANTLR4's parsing robust enough to deal with adversarial behavior? For example, for my course we use a GCC-based C compiler toolchain. If I pick up the C grammar for ANTLR4, but students can defeat it by mixing in GCC-specific goo that ANTLR4 can't handle gracefully, then that's a problem. Importantly, note that the plan here is to distribute the sources for this tool, and so we have to assume a certain amount of adversarial behavior.
Also here: it's not ANTLR's responsibility (or that of the generated parsers) to be robust against anything than parsing exactly what it has been told to. If your grammar doesn't include GCC specific input then it won't be recognized, period. A parser recognizes a language, which is defined by a grammar and only what has been defined upfront can be recognized (compiler build course, first lecture). But that's not only related to ANTLR, any parser which is based on a grammar acts like this.

  • Does the ANTLR4 parse tree contain enough information to do #2?
Depends on what you actually need. The parse tree is a syntactic view on the input. What you make out of it from a semantic point of view is up to the apres-parse step(s).

And of course there is the obligatory question: do you know whether anyone has tried this before? I did search the forum a few times, but it's possible my keywords were badly chosen.

The role of a parser generator (and parser) in this project is only small. It's like you wanna write a book and you go to an office tool forum to ask how to input letters and what to do next to write a book. You probably find a better audience by going to a book writer forum and try to find someone who's using the same office tool like what you are going to use.

Good luck with your project,

Mike

Reply all
Reply to author
Forward
0 new messages