On 2012-05-08, Fritz Wuehler <fr...@spamexpire-201205.rodent.frell.theremailer.net> wrote:
> Hello guys I stumbled upon awk today and it's pretty neat. I am thinking
> about an application and I don't know if it's better to write it in C or
> awk. Basically I want to scan source for certain commands and do a
> substitution like the cpp on steroids. Doing the regexp and matching for
A regex replacement hack of this sort will invariably start out more like a cpp
on weed, and in all likelihood will stay stoned.
Firstly, CPP recognizes macro calls of this form:
stuff stuff stuff MACRO(
ARG,
ARG ) more stuff
so you need proper lexical analysis that spans across lines.
Secondly, macro calls can nest, because macro arguments can themselves
contain arbitrary material, including macro calls.
stuff stuff stuff MACRO(
stuff MAC2(arg, mac3,
mac4, xyz()) junk, // comment: xyz is a function
ARG ) more stuff
Sure, that *can* all be done by iterating with regular expressions,
but your program will no longer be very ... awky. More like awkward.
Still, perhaps better than the C program to do the same thing.
You have to treat the entire file as a string and look for macro calls
which are primary: an identifier followed by parentheses which do not
have another parenthesis between them. Expand those first and then iterate.
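To make "primary" concrete, here is a minimal sketch of that search. It is in Python only for brevity; the same regex should port to awk's match() and substr(). The names PRIMARY_CALL and primary_calls are made up for this illustration, and the sketch assumes comments and string literals containing parentheses have already been dealt with.

```python
import re

# An identifier, then "(", then any run of non-parenthesis characters
# (newlines included), then ")": a call with no other call nested inside.
PRIMARY_CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(([^()]*)\)")

def primary_calls(text):
    """Return (name, raw_argument_text) for each innermost call in text."""
    return [(m.group(1), m.group(2)) for m in PRIMARY_CALL.finditer(text)]

# The whole input is one string, so a call may span several lines.
source = """stuff stuff stuff MACRO(
stuff MAC2(arg, mac3,
mac4, xyz()) junk,
ARG ) more stuff"""

print(primary_calls(source))   # [('xyz', '')]
```

Only xyz() is reported on the first pass; MAC2 and MACRO become primary once the calls inside them have been replaced.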
This is not so simple because some things will not expand, like xyz()
above, which isn't a macro. What you can do is translate it anyway, to some
special encoding for unexpanded calls which hides the parentheses (so
you can regex over it), and which is then decoded in a final pass over the
buffer back to the original notation.
One possible encoding for unexpanded calls is some notation which replaces
xyz() with, say, '@123'. This 123 is just a key in a hash table which maps
to the string "xyz()". Of course if you see '@' in the original input, you
escape it to "@@" so it doesn't confuse your final pass.
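A rough sketch of that encoding, in Python for brevity (an awk associative array would play the role of the dict). The names escape_at, Freezer, freeze and thaw are invented for this illustration. It also assumes a key is never immediately followed by a digit in the buffer, since '@1' followed by '2' would read back as '@12'.

```python
import re

def escape_at(text):
    """Double every literal '@' so the final pass cannot mistake it for a key."""
    return text.replace("@", "@@")

class Freezer:
    """Hash table mapping integer keys to the text of unexpanded calls."""

    def __init__(self):
        self.table = {}

    def freeze(self, call_text):
        """Store call_text and return its parenthesis-free stand-in, e.g. '@1'."""
        key = len(self.table) + 1
        self.table[key] = call_text
        return "@%d" % key

    def thaw(self, text):
        """Final pass: '@@' becomes '@', '@N' becomes the stored call."""
        def restore(m):
            if m.group(0) == "@@":
                return "@"
            # A stored call may itself contain keys, so decode recursively.
            return self.thaw(self.table[int(m.group(1))])
        # '@@' is listed first so an escaped '@' is never read as a key.
        return re.sub(r"@@|@(\d+)", restore, text)

freezer = Freezer()
buf = escape_at("mail@host xyz()")             # mail@@host xyz()
buf = buf.replace("xyz()", freezer.freeze("xyz()"))
print(buf)                                     # mail@@host @1
print(freezer.thaw(buf))                       # mail@host xyz()
```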
So given MAC1(a,b,MAC2(xyz(MAC3(c)),d,MAC4())), we make a first pass
in which our regex finds
MAC1(a,b,MAC2(xyz(MAC3(c)),d,MAC4()))
                  ^^^^^^^    ^^^^^^
These are macros, and so they get expanded. (Oh, and by the way, the C
preprocessor automatically kills macro recursion. If a macro expansion
produces a macro call that was already expanded during that expansion, it
is not expanded again.)
So now we iterate, looking again for a macro call which contains no parentheses.
Say MAC3(c) expanded to MAC5(4),FOO and MAC4() expanded to MAC 4 REP:
MAC1(a,b,MAC2(xyz(MAC5(4),FOO),d,MAC 4 REP))
                  ^^^^^^^
Let's say MAC5(4) goes to the replacement text M54. Next:
MAC1(a,b,MAC2(xyz(M54,FOO),d,MAC 4 REP))
              ^^^^^^^^^^^^
Oops, this is not a macro, so we have to leave it alone. But of course
M54 and FOO could be object-like macros. We have to expand those first
to obtain a fully expanded version of xyz(M54,FOO), which could look like
something else, say xyz(1,2,3,4). We "freeze dry" that into @1 (the first
entry in our freeze dry hash table). This lets us continue:
MAC1(a,b,MAC2(@1,d,MAC 4 REP))
         ^^^^^^^^^^^^^^^^^^^^
We expand MAC2, etc.
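Putting the walk-through together, a rough sketch of the whole loop might look like the following. It is Python rather than awk, purely to keep it short. All the macro bodies in MACROS are invented for illustration, chosen so that the intermediate strings match the ones above. The sketch makes several simplifying assumptions: arguments are split on plain commas, one primary call is handled per iteration (leftmost first, so the order differs slightly from the walk-through), object-like macros get a single pass, and the recursion kill is not implemented; a step limit stands in for it.

```python
import re

CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(([^()]*)\)")   # innermost call
IDENT = re.compile(r"\b[A-Za-z_]\w*\b")

def expand_objects(text, macros):
    """One pass replacing object-like macros (entries whose params is None)."""
    def repl(m):
        entry = macros.get(m.group(0))
        return entry[1] if entry and entry[0] is None else m.group(0)
    return IDENT.sub(repl, text)

def substitute(params, args, body):
    """Replace each parameter name in body with its argument text."""
    bound = dict(zip(params, args))
    return IDENT.sub(lambda m: bound.get(m.group(0), m.group(0)), body)

def expand(text, macros, max_steps=1000):
    frozen = {}                          # freeze dry table: key -> call text
    text = text.replace("@", "@@")       # escape literal '@'
    for _ in range(max_steps):
        m = CALL.search(text)
        if m is None:
            break
        name, raw = m.group(1), m.group(2)
        entry = macros.get(name)
        if entry and entry[0] is not None:
            # Function-like macro: bind arguments and splice in the body.
            args = [a.strip() for a in raw.split(",")] if raw.strip() else []
            new = substitute(entry[0], args, entry[1])
        else:
            # Not a macro: expand its arguments, then hide the parentheses.
            key = len(frozen) + 1
            frozen[key] = text[m.start():m.start(2)] + expand_objects(raw, macros) + ")"
            new = "@%d" % key
        text = text[:m.start()] + new + text[m.end():]
    else:
        raise RuntimeError("gave up: possibly a recursive macro")
    text = expand_objects(text, macros)

    def thaw(s):
        # Final pass: '@@' back to '@', '@N' back to the stored call.
        def restore(m):
            return "@" if m.group(0) == "@@" else thaw(frozen[int(m.group(1))])
        return re.sub(r"@@|@(\d+)", restore, s)
    return thaw(text)

# Hypothetical definitions: name -> (parameter list or None, body).
MACROS = {
    "MAC1": (["x", "y", "z"], "[x y z]"),
    "MAC2": (["p", "q", "r"], "{p|q|r}"),
    "MAC3": (["x"], "MAC5(4),FOO"),
    "MAC4": ([], "MAC 4 REP"),
    "MAC5": (["n"], "M54"),
    "M54": (None, "1,2"),
    "FOO": (None, "3,4"),
}

result = expand("MAC1(a,b,MAC2(xyz(MAC3(c)),d,MAC4()))", MACROS)
print(result)   # [a b {xyz(1,2,3,4)|d|MAC 4 REP}]
```

Note that xyz(1,2,3,4) survives intact: it rode through the MAC2 and MAC1 expansions as @1 and was only restored at the end.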
Also, you have to keep in mind that there are macro defining constructs which
can appear anywhere. If you assume that they have to be set off by a #define at
the beginning of the line, this simplifies things.
You have to break the file into regions in between directives like #define,
and do the macro expansion within each of those regions.
awk record separation might be useful for this, giving you strings that either
begin with '#' and are preprocessing directives, or else giving you strings
that are multi-line text in between preprocessing directives (or at the
beginning or end of a file).
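For illustration, here is the same split done line by line in Python instead of with awk record separators. The function name split_regions is made up, and the sketch assumes a directive is any line whose first non-blank character is '#', continued onto the next line by a trailing backslash.

```python
def split_regions(source):
    """Split source into ("directive", text) and ("text", text) chunks, in order."""
    regions, text, directive = [], [], []
    for line in source.splitlines(keepends=True):
        if directive or line.lstrip().startswith("#"):
            # Entering (or continuing) a directive: flush pending plain text.
            if text:
                regions.append(("text", "".join(text)))
                text = []
            directive.append(line)
            # A trailing backslash continues the directive on the next line.
            if not line.rstrip("\r\n").endswith("\\"):
                regions.append(("directive", "".join(directive)))
                directive = []
        else:
            text.append(line)
    if directive:
        regions.append(("directive", "".join(directive)))
    if text:
        regions.append(("text", "".join(text)))
    return regions

sample = "int a;\n#define X 1\nX + X\nmore\n#define Y \\\n  2\nY\n"
for kind, chunk in split_regions(sample):
    print(kind, repr(chunk))
```

Each "text" chunk is a multi-line string that can be handed to the expansion loop as a whole, with the macro table updated whenever a "directive" chunk goes by.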
Good luck.