Where to start? Need to build a C to C translator


Ryan W

Jan 4, 2013, 12:14:35 AM
to antlr-di...@googlegroups.com
Hi,

I needed to write a C to C translator a couple of years ago.  I have no formal background in parsers or compilers, so my self-taught solution was to write my own crude and limited lexer and grammar in flex and bison.  I invented a C++, object-oriented intermediate representation for the job.  I have since learned that the IR I created is what you compiler types would call an AST.  This parser only handles simple types and function declarations, and generates some wrapper code.  It is limited, and being my first effort, it is poorly done.

I now have need for more comprehensive language support, and doing it all from scratch doesn't seem like the best approach this time.  I tend to think in OOP, and yacc/bison doesn't. I tried and failed miserably to get the C++ flavor of bison working.  There just doesn't seem to be anyone using it.  So I went with a C grammar that connects to C++.  That experience with bison leaves me looking for something different.  I'm hoping ANTLR is that thing.

So where to start?  My understanding of this stuff is self-taught on the fly, so some book suggestions are in order this time.  Please don't suggest any texts that describe things using set algebra or lambda calculus.  My eyes will glaze over and I'll fall asleep.  I'm more of the Pragmatic Programmer's persuasion.  Good tutorials?

Lastly, I'm going to want to use this tool to instrument my C code.  I'll be inserting lines that add some tracing capabilities which write to log files.  I'd like to preserve preprocessor statements and comments in the output code.  The grammars I've seen listed on the ANTLR website don't seem to go that far.  Are there other C ANTLR grammars out there?  Commercial ones maybe?

-Regards,
Ryan

Ryan W

Jan 4, 2013, 12:22:38 AM
to antlr-di...@googlegroups.com
On the topic of books: does the new book that covers ANTLR 4 also sufficiently cover ANTLR 3?  The reason I ask is that posts indicate the C grammar isn't going to be ready for ANTLR 4 for a while.  So, if I want to get cracking, using ANTLR 3 with the existing grammar sounds like my best bet.  In that case, if I buy the new book, will I be missing something that I need to know from the old one?

-Ryan

Eric T

Jan 4, 2013, 9:21:08 AM
to antlr-di...@googlegroups.com
Hi Ryan,
 
I am no expert on this, so don't take this as the absolute truth.

What you are probably looking for is term rewriting. See: Rewriting (Wikipedia) and Abstract rewriting system (Wikipedia)

If you think of an imperative programming language such as C, in a simplified form, as a set of expressions augmented with statements, then by applying term rewriting you can transform the program. Remember that a program is built up from terms, in the same way that an algebraic expression, or most any expression, is built up from terms.
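To make the idea a bit more concrete, here is a minimal sketch in Python. It is not taken from any particular rewriting tool; the tuple encoding of terms and the two simplification rules are assumptions chosen purely for illustration. A term is a nested tuple such as ("+", "x", 0), and a rule returns the rewritten term, or None when it does not apply.

```python
# Terms are nested tuples: (operator, left, right); leaves are names or ints.
# This encoding and both rules are illustrative assumptions, not a real tool's API.

def add_zero(term):
    """x + 0  ->  x"""
    if isinstance(term, tuple) and term[0] == "+" and term[2] == 0:
        return term[1]
    return None

def mul_one(term):
    """x * 1  ->  x"""
    if isinstance(term, tuple) and term[0] == "*" and term[2] == 1:
        return term[1]
    return None

RULES = [add_zero, mul_one]

def rewrite(term):
    """Rewrite bottom-up: simplify the children first, then the node itself."""
    if isinstance(term, tuple):
        op, left, right = term
        term = (op, rewrite(left), rewrite(right))
    for rule in RULES:
        result = rule(term)
        if result is not None:
            return rewrite(result)  # the result may match another rule
    return term

# (a * 1) + 0 simplifies to a
simplified = rewrite(("+", ("*", "a", 1), 0))
```

Both rules only ever shrink the term, which is why this particular rule set is guaranteed to stop.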

You are correct in that you need to get to the AST to do the rewriting, but it doesn't matter how you get to the correct AST, e.g. Lex/Yacc, Bison, ANTLR, just so long as you get there. You can even create the AST using one tool or programming language, then use a different tool or programming language to do the rewrites.

While you want something that doesn't mention set algebra or lambda calculus, I really think you should take the time to at least understand why these keep coming up in your searches. I spent several months learning this on my own and it was worth the effort. Why spend years redoing the research that so many have done before you?
Among term rewriting tools, Maude is a popular one, but I haven't used it.

One of the problems you may run into later, if you don't plan for it at the start, is that what appears to be simple syntactic rewriting in fact needs to be semantic rewriting accomplished by syntactic rewriting. In other words, you may be able to move the symbols around so that they appear to do what you want, but in so doing you may have changed the semantics of what was meant, and not realize it until you hit a bug running the rewritten code.
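A small sketch of how that can happen, again in Python with a made-up tuple encoding (the rule, the evaluator and the next_id function are all hypothetical, for illustration only). The rule 2 * e -> e + e looks harmless on paper, but it duplicates e, so if e is a call with a side effect the rewritten program computes something different.

```python
# Illustrative assumption: terms are tuples, and ("call", name) invokes a
# function that may have side effects, like a C call to getchar().

def double_to_sum(term):
    """Syntactic rule: 2 * e  ->  e + e. Looks harmless, but duplicates e."""
    if isinstance(term, tuple) and term[0] == "*" and term[1] == 2:
        return ("+", term[2], term[2])
    return term

def evaluate(term, functions):
    """Evaluate a term; `functions` maps call names to Python callables."""
    if isinstance(term, int):
        return term
    if term[0] == "call":
        return functions[term[1]]()
    left = evaluate(term[1], functions)
    right = evaluate(term[2], functions)
    return left + right if term[0] == "+" else left * right

def make_counter():
    """Return a function with a side effect: each call yields the next integer."""
    state = {"n": 0}
    def next_id():
        state["n"] += 1
        return state["n"]
    return next_id

original = ("*", 2, ("call", "next_id"))
rewritten = double_to_sum(original)

before = evaluate(original, {"next_id": make_counter()})   # one call: 2 * 1
after = evaluate(rewritten, {"next_id": make_counter()})   # two calls: 1 + 2
```

Here before and after differ even though the rule is algebraically valid for plain numbers, which is the kind of bug that only shows up when the rewritten code runs.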

If you design the system with rules that do simple translations, be aware that you can get into patterns that never terminate: one rule changes the syntax from A to B, then another rule changes it from B back to A, and the two keep undoing each other forever.
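One simple defence is to remember every term you have already produced and to cap the number of steps. The following Python sketch assumes terms are plain strings and rules are a dictionary, which is far simpler than any real rewrite engine; the function names are hypothetical.

```python
# Two illustrative rules that undo each other: A -> B and B -> A.
RULES = {"A": "B", "B": "A"}

def rewrite_to_fixpoint(term, rules, max_steps=1000):
    """Apply rules until none matches; raise if a term repeats or steps run out."""
    seen = {term}
    for _ in range(max_steps):
        if term not in rules:
            return term            # normal form reached
        term = rules[term]
        if term in seen:
            raise RuntimeError(f"rewrite cycle detected at {term!r}")
        seen.add(term)
    raise RuntimeError("step limit exceeded")

def try_rewrite(term, rules):
    """Run the rewriter and report a cycle as text instead of crashing."""
    try:
        return rewrite_to_fixpoint(term, rules)
    except RuntimeError as err:
        return str(err)

looping = try_rewrite("A", RULES)
terminating = try_rewrite("A", {"A": "B", "B": "C"})
```

The seen set catches exact repeats; the step limit is a fallback for rule sets that keep producing new, ever larger terms.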

The book I started with is Term Rewriting and All That (Amazon, WorldCat).

Some free advice. If you have a finite amount of code to change, and changing it will take less time than learning and implementing a rewrite system, then do it by hand. If I had only a few hundred pages of code to translate from C to C, then I would definitely do it by hand rather than creating a rewrite system. If, on the other hand, you plan on making a company out of it, then take the time. For a commercial company that sells and supports such tools, see Semantic Designs and the DMS® Software Reengineering Toolkit.

mycircuit

Jan 7, 2013, 7:32:26 AM
to antlr-di...@googlegroups.com
Hi Ryan
I am currently evaluating lexers/parsers for C/C++. I started out with ANTLR3 and the C target (http://www.antlr.org/wiki/display/ANTLR3/FAQ+-+C+Target).

I succeeded in compiling it and could parse a simple Java file successfully. However, the parser failed on a more complex Java file that could be parsed correctly with the Java target.

(Note, since perhaps you are new to ANTLR: when I say 'target', I don't mean the language that is parsed, but the language in which ANTLR generates the lexer/parser code. The default target of ANTLR is Java, and there are targets for Python, C and others. So the C target can be used to parse Java, C, Python, etc., depending on which .g file you use.)

The problems I had with the C target, and the reasons why I gave up on it, were, amongst others:

- the C target lacks documentation (although the code is documented); in particular, error reporting and its customization seem to be very different from the Java target, and this particular function is important to me (I want to develop IDE support)
- the C target is, as far as I understand, a brilliant and heroic effort by one individual (Jim Idle, who did the port; at least he is the one who answers the related questions), but the port does not have full community support
- Terence Parr, the creator of ANTLR, is selling support for the Java target only
- the C/C++ family is not on the top priority list for ANTLR4, judging from some emails in the forum.

Please note that these are my personal impressions, and I hope the community will correct me if necessary.

So I turned to clang (http://clang.llvm.org/), a very interesting open source project, also warmly recommended by Terence (he has an example in the Wiki, though not related to C). It is production quality, as it ships with Apple's development tools (Xcode), and if that is not enough, it is also sponsored by Google, Intel and Qualcomm.

Clang offers a library, libclang, that provides an API to most of its lexing and parsing functions. It allows you to build tools for rewriting/translation, static analysis and code completion.

To get a feeling, watch the videos here: http://llvm.org/devmtg/2010-11/  , especially the one by Doug Gregor. There is a conference each year and some of the videos are interesting.

The libclang API is relatively fresh and there is a lack of examples, but I got it working with the tutorials here: https://github.com/loarabia/Clang-tutorial#readme

It's not easy (at least for me, as a sub-standard C/C++ guy), but once you master this, you have world class C/C++ preprocessing, lexing and parsing at your fingertips. And it is very fast (because Apple wants it fast).

I use ANTLR for all the other languages I want to experiment with, in particular Java, JavaScript and SPARQL, for which you can find grammars on the Wiki (all for use with the Java target, by the way).
But for C (where you might need preprocessing and macro expansion), and especially for C++, which is a really complicated beast, I feel clang is the better choice for me.

Hope that helps.

Ryan W

Jan 7, 2013, 11:27:52 AM
to antlr-di...@googlegroups.com
Thanks for the feedback.  Maybe the following line of inquiry will guide me towards where I need to go.

What is the difference between tools like ANTLR and Clang, versus the DMS® Software Reengineering Toolkit and the Rose Compiler Infrastructure?  For the purposes of building a C to C translator, is it a matter of where my starting point would be?

In other words, do DMS or Rose add an additional layer that gets me closer to the goal of building my tool?  What are they providing that just using an ANTLR or Clang based parser does not provide?

Does ANTLR map 1-to-1 with Clang?

Maybe I'm missing a mental map of the layers involved here.  A list of which tools provide which layers would greatly clarify what offers what.

Thanks,

Ryan

Eric

Jan 7, 2013, 12:26:09 PM
to antlr-di...@googlegroups.com
Hi Ryan,
 
Glad you are asking questions.
 
The process I see you needing is:

Lexer -> Parser -> AST rewriting -> AST to C code output
|--------------|  |-------------|   |-------------------|
ANTLR
Lex/Yacc
Bison
                   Maude
                   Stratego
                                     StringTemplate

Note: I added Stratego because they have good documentation that might help you understand the process better.

|--------------------------------------------------------|
DMS

Note: DMS does all of the steps, whereas the other tools each do a subset of them.
Note: DMS is a very expensive tool. I think it costs several thousand dollars, if not tens of thousands of dollars.
 
==============================================================
 
Note: If you stick with one family of tools, then it should be easier to pass from one step to the next. If you use different tools for different steps, then you will probably have to convert the output of one step (e.g. the AST) into the input format of the next step.
 
I only listed a very few tools to give you an idea; there are many hundreds that fill each of the three steps, or all three steps.
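As a toy illustration of the three steps end to end, here is a sketch in Python. It handles only statements of the form name(); which is nowhere near real C, and every function name in it (lex, parse, instrument, emit, and the inserted trace call) is a made-up assumption for illustration. In a real tool chain, the lexer and parser would come from something like ANTLR, and the emit step is where a template engine such as StringTemplate would fit.

```python
import re

def lex(source):
    """Lexer: split the text into identifier and punctuation tokens."""
    return re.findall(r"[A-Za-z_]\w*|[();]", source)

def parse(tokens):
    """Parser: accept a sequence of `name();` statements and build a tiny AST."""
    ast, pos = [], 0
    while pos < len(tokens):
        name, rest = tokens[pos], tokens[pos + 1:pos + 4]
        if rest != ["(", ")", ";"]:
            raise SyntaxError(f"expected '();' after {name!r}")
        ast.append(("call", name, []))
        pos += 4
    return ast

def instrument(ast):
    """AST rewriting: put a trace call in front of every original call."""
    out = []
    for node in ast:
        out.append(("call", "trace", [node[1]]))
        out.append(node)
    return out

def emit(ast):
    """Output: turn each AST node back into a line of C-like text."""
    lines = []
    for _, name, args in ast:
        arg_text = ", ".join(f'"{a}"' for a in args)
        lines.append(f"{name}({arg_text});")
    return "\n".join(lines)

result = emit(instrument(parse(lex("init(); run();"))))
```

Because each stage only consumes the previous stage's output, any one of them could be swapped for a different tool as long as the AST format in between is agreed on.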
 
Please keep asking questions.
 
 
      
                

mycircuit

Jan 7, 2013, 4:16:57 PM
to antlr-di...@googlegroups.com
No, Clang and ANTLR don't map to each other at all. These are different tools with different purposes: ANTLR is a lexer/parser generator, and clang is a lexer/parser specialized for the C family of languages.

So, these are two different "investments" in terms of time and learning effort.

Ryan W

Jan 7, 2013, 4:54:19 PM
to antlr-di...@googlegroups.com
Ok. Sure.  So if I had a C grammar to feed into ANTLR, I could produce something like Clang.  Or does Clang do even more than that?

If DMS spans the entire stack from a commercial perspective, what would be some good full stacks to explore from an open source perspective?  Again, I'm specifically looking at C to C.  I will also be generating some custom side files, but I suspect the C to C part has been done time and time again, and I'd rather not reinvent that.

Does Clang help me in the AST rewriting space?

DMS sounds like it supports many languages, and Rose sounds like maybe it is comparable to DMS with the limitation of being restricted to C/C++.

Thanks again.
Ryan

Eric

Jan 7, 2013, 5:31:15 PM
to antlr-di...@googlegroups.com
At this point, let's break your problem into two directions based on the thread I have with you. I don't know or use Clang, so I will leave that alone.
 
1. What would be some good full stacks to explore from an open source perspective?
 
I was not aware of ROSE compiler, but in doing some quick reading on it based on your needs, I myself would spend the time checking it out and possibly using it if I had your need.
 
 
The SageBuilder looked like something you would need to add the instrumentation code for tracing.
 
And ask questions on their new group.
 
2. Instrument my C code.
 
I take it you want an app that accepts C source and adds C source lines to fill up a log file with details.
 
Basically you have an age-old problem that I am sure has been solved and made freely public; you just need to find it.
 
As I don't know how much or how little detail you need, have you tried Googling "instrument C code"?
 
This turned up when I searched.
 
For mainstream languages such as C, I have always found such public and free apps when I search enough.
 
Again keep asking questions.
 
 
 


 

mycircuit

Jan 7, 2013, 6:45:03 PM
to antlr-di...@googlegroups.com
No, Clang doesn't produce a grammar (in ANTLR, the grammar is the .g file, which is written by you, not by ANTLR). Clang is a C/C++ compiler, so it only knows how to parse C/C++. What sets Clang apart from, say, gcc, for which it is a drop-in replacement, is that it makes its lexing/parsing power (the AST construction) available to you as a tool builder via an API (this is something that ANTLR also does, but not gcc, for example). So Clang helps you in the AST rewriting space, and this is illustrated by tutorial 3 (on the GitHub link I gave), where a simple tool is shown (some 50 lines of code) that reads a C file and then rewrites the same file with a comment inserted at the beginning of each function. So the API has the notion of a function (the annotated AST), and you can query the AST to retrieve almost every possible thing, like all declarations, all usages of data structure x in function y, the thing under the cursor (line, pos), etc.

Again, just to be clear: I am not saying at all that clang is a better tool than ANTLR. Those are tools with different purposes and strengths. Although ANTLR allows you to parse C/C++ based on an appropriate .g file, clang is specialized for the C/C++ family, and you can never parse, say, Java with it. As usual, for extremely complicated tasks, a special purpose tool might fit your purpose better than a more generic tool, and I believe that C++ is such a complicated language (with macro expansion and so on) that you might at least want to have some choice.

Ryan W

Jan 12, 2013, 11:32:59 PM
to antlr-di...@googlegroups.com
Thanks for all the information.  I am starting to play with ROSE.  For the purposes of doing rewriting, it certainly seems like a good place to start.



Eric

Jan 13, 2013, 10:49:25 AM
to antlr-di...@googlegroups.com
Glad to see you are moving forward. For me, once I learned how to
rewrite and manipulate ASTs a new world of possibilities opened up.

At present I only look upon lexing/parsing as a means to get to the
AST. I don't look at lexing/parsing as a means to creating a Domain
Specific Language (DSL) or even part of a compiler, but just some tool
in a tool chain. Also once I started manipulating ASTs, I saw the true
need and power of the tools like StringTemplate for converting ASTs
into text. Hopefully you will stick with it, even though it may take a
few weeks to months before you see something tangible. Once you get
your first problem of AST manipulation/rewrite done, the rest keep
coming easier and quicker, and soon you will just see the pattern
realizing how many times you could have used it before.

If you're like me and want to understand how some of the more advanced
systems work, you will find that they rely upon syntactic unification
(http://en.wikipedia.org/wiki/Unification_(computer_science)). Don't
try and learn unification by reading the Wikipedia article; knowing
that it is the heart of PROLOG makes it easier to find references.

Also, ROSE looks like a good choice, but if it doesn't suit your
fancy, find another tool or mechanism. Remember that the AST is a tree
and tree traversal and tree rewriting will accomplish the same goal of
AST rewriting, but still be aware that changing syntax can change
semantics and cause problems.