I'm starting to work on a source code similarity detector, primarily to detect plagiarism in university programming assignments.
MOSS,
SIM, and
Algae are similar tools, but I have
some different design goals.
I'm wondering if ANTLR4 is a good fit as a frontend for this system, and I like the fact that it seems like it would be easy to add support for multiple languages. But I have a few questions and concerns for those that are more well versed in using ANTLR4.
Note that I have never built one of these before. But as far as I understand, the usual approach is something like what follows:
- Parse the code and build a parse tree
- Normalize the parse tree by removing variable names, braces, and other punctuation, and replacing literals
- Serialize what remains by flattening identifiers: for example, replacing all tree nodes with small integers. (Step #2 should have left you with a small number of node types.)
- Compare the serialized output using a variety of approaches that attempt to be robust to reordering and refactoring. (At this point it becomes a non-ANTLR problem, although computationally complex and interesting in its own right.)
Steps #1 and #2 produce some questions about ANTLR:
- Is ANTLR4's parsing robust enough to deal with adversarial behavior? For example, for my course we use a GCC-based C compiler toolchain. If I pick up the C grammar for ANTLR4, but students can defeat it by mixing in GCC-specific goo that ANTLR4 can't handle gracefully, then that's a problem. Importantly, note that the plan here is to distribute the sources for this tool, and so we have to assume a certain amount of adversarial behavior.
- Does the ANTLR4 parse tree contain enough information to do #2?
And of course there is the obligatory question: do you know whether anyone has tried this before? I did search the forum a few times, but it's possible my keywords were badly chosen.