Can anyone help me write an awk script that completely strips the
comments out of a C file?
I have tried, but I am unable to strip all of the comments.
Regards,
Partha
To make that possible you need a C syntax parser; otherwise no
solution will work in every case. (One possibility -
off-topic in this newsgroup - is to use the C preprocessor cpp to
do that task, though that has other effects that you might find
undesirable in your application context.)
In awk I'd start this way...
/\/\*/,/\*\// { next } 1
...which removes every line that a block comment touches. Then, within
the line, add pattern matching to catch functional C code immediately
before and after the comment symbols on the same line, so that this
code is not suppressed.
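A minimal sketch of that second step, wrapped in a shell function so it can be pasted into a terminal. The names strip_block_comments, in_comment and touched are made up for illustration, and the sketch assumes that "/*" never occurs inside a string or character literal, which real C code does not guarantee:

```sh
# Hypothetical helper: remove /* ... */ comments but keep the code around them.
# Assumption: comment delimiters never appear inside string or char literals.
strip_block_comments() {
    awk '
    {
        line = $0
        out = ""
        touched = in_comment            # did this line involve any comment text?
        while (line != "") {
            if (in_comment) {
                p = index(line, "*/")
                if (p == 0) break       # comment runs on to the next line
                line = substr(line, p + 2)
                in_comment = 0
                out = out " "           # a comment is replaced by one space
            } else {
                p = index(line, "/*")
                if (p == 0) { out = out line; break }
                out = out substr(line, 1, p - 1)   # keep code before the comment
                line = substr(line, p + 2)
                in_comment = touched = 1
            }
        }
        # Print untouched lines as they are; print touched lines only if
        # something other than blanks is left after removing the comment.
        if (!touched || out ~ /[^ \t]/) print out
    }'
}

strip_block_comments <<'EOF'
int a; /* one */ int b;
/* start
   end */ int c;
EOF
```

The in_comment flag carries the "still inside a comment" state from one input line to the next, which is what the plain range pattern cannot do for code sharing a line with a comment.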
Janis
Why do you want to do this?
printf("\*This isn't a comment.\n\*");
First, you need a regular expression to match a comment. This isn't very
simple, since "/*" inside string quotes isn't one. So, you'll need
regexps to handle strings too.
Also, comments can (and often do) extend from one line to the next. So
can strings. AWK might not be your best tool here.
If you are going to be using AWK, you'll probably want variables like
open-string and open-comment which are non-zero whenever a string or
comment from the previous line is still open.
In a regexp for a string, you need to handle escaped quotes and escaped
newlines. A regexp for a terminated string on a single line will look
something like this (untested):
terminated-string-regexp = "\"(\\.|[^\\\"])*\""
, i.e. an open quote, followed by any number of [either (i) escaped
characters; or (ii) characters which aren't backslashes or quotes],
followed by a closing quote.
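As a hedged illustration, here is that single-line string regexp written as an awk regex constant and tried on one line. The function name first_string is invented for this example. The bracket expression excludes backslashes as well as quotes, as the description says, so that a backslash can only be consumed together with the character it escapes:

```sh
# Hypothetical helper: print the first terminated string literal on each line.
first_string() {
    awk '{
        # quote, then any run of (backslash + any char | neither quote
        # nor backslash), then the closing quote
        if (match($0, /"(\\.|[^"\\])*"/))
            print substr($0, RSTART, RLENGTH)
    }'
}

printf '%s\n' 'printf("say \"hi\"\n"); /* "x" */' | first_string
```

On that input the match is the whole first literal, escaped quotes included, and the quoted "x" inside the comment is never reached because awk reports only the leftmost match.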
You'll be needing regexps for an open string (one which continues to the
next line) and a string end (something which terminates a string opened
on a previous line). You'll have to decide what to do with unterminated
strings which don't end with an escaped newline. Such monstrosities are
illegal in standard C, but permitted in GCC (which caused me a great deal
of pain in the recent past).
Your regexp for a terminated comment on a line will then look something
like this (where different bits are concatenated and I've split the thing
onto two lines for clarity):
"^( | | )*/\*" comment-innards "\*/"
[^"/\] /[^*] (" terminated-string-regexp ")
, i.e. any number of "harmless" characters, escaped characters, slashes
not followed by "*", and strings, followed by "/*" the inside of a
comment and "*/".
You might also want to handle C++ style comments (which, I believe, are
now also legal in standard C). They're a lot easier, but also not
entirely trivial: For example, the following does NOT end in a comment:
a = b /* comment *// c ;
It can be done (in fact, I did something very similar to locate strings
straddling lines without escaped newlines in the aforementioned problem,
and needed to not find quote marks inside comments). But you'd probably
be better off following the suggestion of using the C preprocessor to
eliminate the comments. Or maybe you could use Emacs, searching for
blocks of text fontified with font-lock-comment-face, and deleting these.
Have fun!
> Partha
--
Alan Mackenzie (Munich, Germany)
Email: aa...@muuc.dee; to decode, wherever there is a repeated letter
(like "aa"), remove half of them (leaving, say, "a").
You CANNOT do this reliably without a C parser. See the discussions on
comp.unix.shell.
Ed.
Don't you mean
printf("/* This isn't a comment.\n */");
BEGIN { ORS = "" }
{ code = code $0 "\n" }
END {
while ( length(code) )
code = process( code )
}
function process( text )
{
if ( match( text, /"|'|\/\*|\/\// ) )
return span( text )
print text
return ""
}
function span( text , starter )
{
print substr( text, 1, RSTART - 1 )
starter = substr( text, RSTART, RLENGTH )
text = substr( text, RSTART + RLENGTH )
if ( "\"" == starter || "'" == starter )
return quoted( text, starter )
if ( "//" == starter )
return remove( text, "\n", "\n" )
return remove( text, "*/", " " )
}
function remove( text, ender, replacement )
{
print replacement
return substr( text, index(text, ender) + length( ender ) )
}
function quoted( text, starter )
{
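# Backslash is excluded from the bracket so that only \\. can consume it;
# otherwise the longest match can run past a literal ending in \\
if ( "'" == starter )
match( text, /^(\\.|[^'\\])*'/ )
else
match( text, /^(\\.|[^"\\])*"/ )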
print starter substr( text, 1, RLENGTH )
return substr( text, RSTART + RLENGTH )
}
A question for C gurus. Is this a valid C comment?
/*
/* Print "hello". */
puts( "hello" );
/* Print "*/". */
puts( "*/" );
*/
If it is, then does that mean that this one is invalid?
/* " */
There are two comments. After they are removed, you are left with:
puts( "hello" );
". */
puts( "*/" );
*/
...which leaves one valid statement and a mess.
> If it is, then does that mean that this one is invalid?
>
> /* " */
That's a valid comment.
--
Chris F.A. Johnson, author | <http://cfaj.freeshell.org>
Shell Scripting Recipes: | My code in this post, if any,
A Problem-Solution Approach | is released under the
2005, Apress | GNU General Public Licence
So nested comments are not allowed, and one cannot always
comment out code by wrapping it in /* ... */.
/*
a++;
b--;
s = "*/";
*/
The */ within the quotes is seen as the end of the comment.
Since that is so, it seems to me that one should as a rule
use // for comments.
The // comment became valid in the C99 standard, AFAIK,
but was not valid before C99 (only as a proprietary extension);
thus I wouldn't use it if I wanted to write portable programs.
In C you can also use the #if... preprocessor directives to comment
out blocks of code.
Janis
% So nested comments are not allowed, and one cannot always
% comment out code by wrapping it in /* ... */.
%
% /*
% a++;
% b--;
% s = "*/";
% */
%
% The */ within the quotes is seen as the end of the comment.
% Since that is so, it seems to me that one should as a rule
% use // for comments.
Or, one shouldn't use comments to temporarily remove blocks of
code. The preprocessor works well for this:
#if 0
...
#endif
--
Patrick TJ McPhee
North York Canada
pt...@interlog.com
> You CANNOT do this reliably without a C parser. See the discussions on
> comp.unix.shell.
I believe you are mistaken. Comments are (logically, but often not in
practice) handled entirely by the preprocessor, therefore you need
at most a preprocessor parser. Further, the processing done on comments is
sufficiently simple that it does not actually require even that.
This script seems to work.
{
#state
#0=normal, 1=in string , 2=in C comm, 3= in C++ comm
for(i=1; i <= length($0); i++)
{
c = substr($0, i, 1)
if(state == 0)
{
if(c == "/")
{
d = substr($0, i+1, 1)
if(d == "*")
state = 2
else if(d == "/")
state=3
else
printf("%s", c d)
i=i+2
continue
}
else if(c == "\"")
state = 1
}
else if(state == 1)
{
if(c == "\"" && substr($0, i, 1) != "\\")
state = 0
}
else if(state == 2 && i > 1 && substr($0, i-1,2) == "*/")
{
state = 0
c = " "
}
if(state < 2)
printf("%s", c)
}
if(state == 3 && !/\\$/)
state = 0
if(state < 2)
print ""
}
It gets the correct output on:
/* "not/here
*/"//"
// non "here /*
should/appear
// \
nothere
should/appear
"a \" string with embedded comment /* // " /*nothere*/
"multiline
/*string" /**/ shouldappear //*nothere*/
/*/ nothere*/ should appear
(assuming multiline literals are allowed)
-Ed
--
(You can't go wrong with psycho-rats.) (er258)(@)(eng.cam)(.ac.uk)
/d{def}def/f{/Times findfont s scalefont setfont}d/s{10}d/r{roll}d f 5/m
{moveto}d -1 r 230 350 m 0 1 179{1 index show 88 rotate 4 mul 0 rmoveto}
for /s 15 d f pop 240 420 m 0 1 3 { 4 2 1 r sub -1 r show } for showpage
I don't know how you write a preprocessor without parsing the
language, but maybe you're right as that's not really my area and I
don't think it makes a difference to the point I was making.
Further, the processing done on comments is
> sufficiently simple that it does not actually require that either.
Not true. This has been debated too many times, and so many "solutions"
have been posted, that it's incredible how much effort has been spent on
this subject. None of them work, especially given that there's more
than one C standard, so you need to control which variant you're parsing
to know what a given construct means. For example, in ANSI C (C89)
"//" is NOT a comment, but in C99 it is. Add to that the fact that it's
easily done with a C preprocessor, and it just seems so pointless to
keep rehashing this, attempting different solutions and finding the
holes in them.
> This script seems to work.
Yes, they all "seem" to work for some specific input set.
Fine, but what about this:
$ cat tst.c
#include "stdio.h"
#define GOOGLE(txt) printf("Google web page = " #txt "\n")
int main(void) {
GOOGLE(http://www.google.com);
}
If we run that through your script we get:
$ awk -f tst.awk tst.c
#include "stdio.h"
#define GOOGLE(txt) printf("Google web page = " #txt "\n")
int main(void) {
GOOGLE(http:
}
which is not compilable C code, but is that correct output? Well, it
depends on which version of C your program is written in.
If we instead take this shell script (sed stuff is to get rid of "#"s
before the call to the preprocessor to avoid macro expansion and
inclusion of header files, then replace the #s afterwards):
$ cat strip.gcc
[ $# -eq 2 ] && arg=$1 || arg=""
eval file="\$$#"
sed 's/a/aA/g;s/__/aB/g;s/b/bA/g;s/#/bB/g' "$file" |
gcc -P -E $arg - |
sed 's/bB/#/g;s/bA/b/g;s/aB/__/g;s/aA/a/g'
and we run it without arguments it produces the same output as your awk
script:
$ ./strip.gcc tst.c
#include "stdio.h"
#define GOOGLE(txt) printf("Google web page = " #txt "\n")
int main(void) {
GOOGLE(http:
}
because by default it thinks this is C99, but if I'm instead using ANSI
C, then I can just tell the preprocessor that by passing in a "-ansi"
argument:
$ ./strip.gcc -ansi tst.c
#include "stdio.h"
#define GOOGLE(txt) printf("Google web page = " #txt "\n")
int main(void) {
GOOGLE(http://www.google.com);
}
and I get the compilable ANSI C code I need.
So, the main points are:
a) the posted solutions are always fairly lengthy,
b) in my experience, there are always situations where the posted
solutions fail, and
c) it's trivial to just call the preprocessor which you know MUST work
in all situations.
Regards,
Ed.
I told myself I wasn't going to bother shooting holes in this stuff any
more, but I couldn't help myself:
$ cat tst.c
#include "stdio.h"
int main(void) {
printf("comments start with \"/*\"\n");
}
$ ./strip.gcc tst.c
#include "stdio.h"
int main(void) {
printf("comments start with \"/*\"\n");
}
$ awk -f tst.awk tst.c
#include "stdio.h"
int main(void) {
printf("comments start with \"$
Regards,
Ed.
> I don't know how you write a preprocessor without parsing the
> language, but maybe you're right as that's not really my area and I
> don't think it makes a difference to the point I was making.
The C preprocessor is a generic preprocessor (much like M4), and you can
preprocess anything with it (C++, fvwm2 config scripts and Xdefaults files
are common examples where it is used for things other than C). In these
cases, it certainly has no knowledge of the underlying language. The same
is true (in principle) for C.
> Further, the processing done on comments is
>> sufficiently simple that it does not actually require that either.
>
> Not true.
I do not believe that a full parser for the preprocessor language (let
alone the C language) is required to remove comments. I don't have my K&R2
handy, but I am completely sure of this.
> This has been debated to many times and so many "solutions"
> have been posted it's just incredible how much effort has been spent on
> this subject and none of them work, especially given that there's more
> than one C standard so you need to control which variants you're parsing
> so you know what a given structure means.
> For example, in ANSI C (C89)
> "//" is NOT a comment, but in C99 it is. Add that to the fact that it's
> easily done with a C preprocessor that it just seems so pointless to
> keep rehashing this, attempting different solutions and finding the
> holes in them.
<snip program and test>
> Fine, but what about this:
>
> $ cat tst.c
> #include "stdio.h"
>
> #define GOOGLE(txt) printf("Google web page = " #txt "\n")
>
> int main(void) {
> GOOGLE(http://www.google.com);
> }
>
> If we run that through your script we get:
>
> $ awk -f tst.awk tst.c
> #include "stdio.h"
>
> #define GOOGLE(txt) printf("Google web page = " #txt "\n")
>
> int main(void) {
> GOOGLE(http:
> }
>
> which is not compilable C code, but is that correct output?
Yes, for C99. The C preprocessor is not required to output valid C code.
> Well, it
> depends on which version of C you're program is.
correct.
> If we instead take this shell script (sed stuff is to get rid of "#"s
> before the call to the preprocessor to avoid macro expansion and
> inclusion of header files, then replace the #s afterwards):
>
> $ cat strip.gcc
> [ $# -eq 2 ] && arg=$1 || arg=""
> eval file="\$$#"
> sed 's/a/aA/g;s/__/aB/g;s/b/bA/g;s/#/bB/g' "$file" |
> gcc -P -E $arg - |
> sed 's/bB/#/g;s/bA/b/g;s/aB/__/g;s/aA/a/g'
>
> and we run it without arguments it produces the same output as your awk
> script:
That's a neat solution.
> $ ./strip.gcc tst.c
> #include "stdio.h"
>
> #define GOOGLE(txt) printf("Google web page = " #txt "\n")
>
> int main(void) {
> GOOGLE(http:
> }
My solution can be fixed by replacing
else if(d == "/")
with
else if(d == "/" && c99)
then awk -v c99=1 -f tst.awk in.c > out.c will work.
> because by default it thinks this is C99, but if I'm instead using ANSI
> C, then I can just tell the preprocessor that by passing in a "-ansi"
> argument:
The -ansi argument depends on the C compiler. Making that portable is hard.
> $ ./strip.gcc -ansi tst.c
> #include "stdio.h"
>
> #define GOOGLE(txt) printf("Google web page = " #txt "\n")
>
> int main(void) {
> GOOGLE(http://www.google.com);
> }
>
> and I get the compilable ANSI C code I need.
>
> So, the main points are:
>
> a) the posted solutions are always fairly lengthy,
ish, but OK, much longer than the shell script and cpp
> b) in my experience, there are always situations where the posted
> solutions fail, and
could be. I don't have a copy of K&R handy to read to check mine. But it
looks bug free to me (well, it would, wouldn't it :-)
> c) it's trivial to just call the preprocessor which you know MUST work
> in all situations.
It is true that this solution will work on most setups easily.
> I told myself I wasn't going to bother shooting holes in this stuff any
> more, but I couldn't help myself:
It's a fun exercise.
> $ cat tst.c
> #include "stdio.h"
>
> int main(void) {
> printf("comments start with \"/*\"\n");
> }
You're correct. There's a bug.
if(c == "\"" && substr($0, i, 1) != "\\")
should read
if(c == "\"" && substr($0, i-1, 1) != "\\")
now it works :-)
complete working program for anyone who's interested:
{
    # state: 0=normal, 1=in string, 2=in C comment, 3=in C++ comment
    for(i=1; i <= length($0); i++)
    {
        c = substr($0, i, 1)
        if(state == 0)
        {
            if(c == "/")
            {
                d = substr($0, i+1, 1)
                if(d == "*" || (d == "/" && c99))
                {
                    state = (d == "*") ? 2 : 3
                    # Skip the opener plus one more character (the loop
                    # adds the final +1), so that "/*/" is not read as "*/".
                    i = i+2
                    continue
                }
                # A lone "/" falls through and is printed below.
            }
            else if(c == "\"")
                state = 1
        }
        else if(state == 1)
        {
            if(c == "\"" && substr($0, i-1, 1) != "\\")
                state = 0
        }
        else if(state == 2 && i > 1 && substr($0, i-1,2) == "*/")
        {
            state = 0
            c = " "
        }
        if(state < 2)
            printf("%s", c)
    }
    if(state == 3 && !/\\$/)
        state = 0
    if(state < 2)
        print ""
}
> On Mon, 27 Mar 2006 10:52:23 -0600, Ed Morton wrote:
>
>
>>I told myself I wasn't going to bother shooting holes in this stuff any
>>more, but I couldn't help myself:
>
>
> It's a fun exercise.
>
>
>>$ cat tst.c
>>#include "stdio.h"
>>
>>int main(void) {
>> printf("comments start with \"/*\"\n");
>>}
>
>
>
> You're correct. There's a bug.
>
> if(c == "\"" && substr($0, i, 1) != "\\")
>
> should read
>
> if(c == "\"" && substr($0, i-1, 1) != "\\")
>
>
> now it works :-)
<snip>
'fraid not:
$ cat tst.c
#include "stdio.h"
int main(void) {
/* the next line uses a trigraph instead of a backslash character */
printf("comments start with ??/"/*\"\n");
}
$ ./strip.gcc -trigraphs tst.c
#include "stdio.h"
int main(void) {
printf("comments start with \"/*\"\n");
}
$ awk -f tst.awk tst.c
#include "stdio.h"
int main(void) {
printf("comments start with ??/"$
Regards,
Ed.
Or even:
$ cat tst1.c
#include "stdio.h"
int main(void) {
/* a comment wrapping onto the next line *\
/
return 0;
}
$ ./strip.gcc tst1.c
#include "stdio.h"
int main(void) {
return 0;
}
$ awk -f tst.awk tst1.c
#include "stdio.h"
int main(void) {
$
I have to stop playing now before I have to go back into C comment
rehab. It's entirely possible that if we kept playing you'd come up with
some script that I personally can't think of a way to break without
spending more than a few minutes on it, but I'll personally never trust
anything but a C parser to strip C comments for the reasons I mentioned
earlier.
The trigraph counter-example I posted did give me pause for thought
though since my gcc/sed script wouldn't work if someone used a trigraph
instead of the "#", but that's an absolutely trivial fix compared to the
intricacies of stripping C comments.
Regards,
Ed.
>>>You CANNOT do this reliably without a C parser. See the discussions on
>>>comp.unix.shell.
>> I believe you are mistaken. Comments are (logically, but often not in
>> practice) performed entirely by the preprocessor, therefore you only
>> need at most a preprocessor parser.
> I don't know how you write a preprocessor without parsing the
> language, but maybe you're right as that's not really my area and I
> don't think it makes a difference to the point I was making.
You can write a preprocessor that distinguishes only between comments,
strings and other code.
>> Further, the processing done on comments is sufficiently simple that
>> it does not actually require that either.
> Not true. This has been debated to many times and so many "solutions"
> have been posted it's just incredible how much effort has been spent on
> this subject and none of them work, especially given that there's more
> than one C standard so you need to control which variants you're parsing
> so you know what a given structure means. For example, in ANSI C (C89)
> "//" is NOT a comment, but in C99 it is. Add that to the fact that it's
> easily done with a C preprocessor that it just seems so pointless to
> keep rehashing this, attempting different solutions and finding the
> holes in them.
I can assure you, as half of the Emacs CC Mode team (which includes full
support for AWK, by the way ;-), that C Comments can be recognised
without parsing the language beyond the level of comments and strings. C
comments can be fully recognised by regular expressions (because,
containing no unbounded nesting, they are finite-state mechanical),
though those regexps are considerably more involved than one might at
first expect.
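To make the finite-state claim concrete, here is a rough sketch, not the CC Mode code, of a tokenising loop driven by one alternation: string literal, character literal, or block comment. The comment branch is the commonly cited pattern /\*([^*]|\*+[^*/])*\*+/. The name strip_comments_regex and the variables are invented for the example. It assumes C89 input (no // comments), no trigraphs and no backslash-newline splices, so it does not answer the counter-examples raised elsewhere in the thread:

```sh
# Hypothetical helper: strip /* */ comments using regexps only.
# Assumptions: C89 (no // comments), no trigraphs, no backslash-newline.
strip_comments_regex() {
    # q carries a single quote into awk without breaking the shell quoting.
    awk -v q="'" '
    BEGIN {
        str = "\"(\\\\.|[^\"\\\\])*\""          # "..." with escapes
        chr = q "(\\\\.|[^" q "\\\\])*" q       # character literal with escapes
        cmt = "/\\*([^*]|\\*+[^*/])*\\*+/"      # block comment
        re  = str "|" chr "|" cmt
    }
    { text = text $0 "\n" }                     # slurp the whole file
    END {
        # match() finds the leftmost token, so whichever construct opens
        # first wins and hides any delimiters inside it.
        while (match(text, re)) {
            tok = substr(text, RSTART, RLENGTH)
            if (substr(tok, 1, 1) == "/") tok = " "   # comment becomes a space
            printf "%s%s", substr(text, 1, RSTART - 1), tok
            text = substr(text, RSTART + RLENGTH)
        }
        printf "%s", text
    }'
}

strip_comments_regex <<'EOF'
puts("/* not a comment */"); /* real */
c = '"'; /* " */ d = 1;
EOF
```

Because only one of the three alternatives can start at any given character, taking the leftmost match is enough to decide whether a quote or a "/*" is live or sits inside another construct.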
The proof of the pudding is in the eating - Load a file.c into Emacs, and
its comments get correctly "fontified" (syntax highlighted), no matter
how cleverly you attempt to confuse it. Emacs does not contain a C
compiler.
> Regards,
> Ed.
Forgive my pessimism, but I've heard that so many times, including from
people saying they have a sed script that's used by N users in
such-and-such commercial applications... I never use Emacs, nor do I
have a version handy right now, so I'd be hard pressed to attempt this
even if I really wanted to, but out of curiosity:
a) How does it know which C version I'm editing so that, for example, it can
highlight C++-style comments for C99 but not for C89?
b) How does it know whether or not trigraphs are enabled?
Presumably you have a version at hand, so please try it on these few
small counter-examples and let us know the result:
1)
#include "stdio.h"
#define GOOGLE(txt) printf("Google web page = " #txt "\n")
int main(void) {
GOOGLE(http://www.google.com);
}
2)
#include "stdio.h"
int main(void) {
printf("comments start with \"/*\" and end with \"*/\"\n");
}
3)
#include "stdio.h"
int main(void) {
printf("comments start with ??/"/*??/" and end with ??/"*/??/"\n");
}
4)
#include "stdio.h"
int main(void) {
/* a comment wrapping onto the next line *\
/
return 0;
}
In the above, the only one I'd expect to see comments highlighted in is
"4" since I use ANSI C, but those who use C99 would also expect to see
some highlighting in 1.
Also, the main point here is to strip C comments, so let us know if
emacs can do that with a flag and without going into a GUI editing
mode. If so, and I knew where to get the tool, I'd probably use that one
tool rather than the chain of sed/gcc commands I posted elsethread.
Regards,
Ed.
> I have to stop playing now before I have to go back into C comment
> rehab. It's entirely possible that if we kept playing you'd come up with
> some script that I personally can't think of a way to break without
> spending more than a few minutes on it, but I'll personally never trust
> anything but a C parser to strip C comments for the reasons I mentioned
> earlier.
>
> The trigraph counter-example I posted did give me pause for thought
> though since my gcc/sed script wouldn't work if someone used a trigraph
> instead of the "#", but that's an absolutely trivial fix compared to the
> intricacies of stripping C comments.
A nice bunch of examples. My program can be trivially amended to work in
these examples (a pattern at the beginning to join lines ending with a \,
and then a gsub() for all trigraphs). However, that is not my preferred
solution, since it does more than just strip comments: it also joins lines
and replaces trigraphs. My preferred solution would be to write (a now
rather more complex) state machine to try to preserve broken lines and
trigraphs. After all, it's supposed to remove comments, not do half of the
preprocessing.
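For what it's worth, the simple amendment could look roughly like the sketch below, run as a filter in front of the comment stripper. The function name splice_and_detrigraph and the tri[] table are made up for illustration. It translates trigraphs first and joins continued lines second, which is the order the C translation phases use, and as said above it changes the text beyond removing comments:

```sh
# Hypothetical pre-filter: replace trigraphs, then join lines ending in "\".
splice_and_detrigraph() {
    # q carries a single quote into awk for the ??' trigraph.
    awk -v q="'" '
    BEGIN {
        # The nine trigraphs, keyed by their third character.
        tri["="] = "#"; tri["("] = "["; tri["/"] = "\\"; tri[")"] = "]"
        tri[q]   = "^"; tri["<"] = "{"; tri["!"] = "|";  tri[">"] = "}"
        tri["-"] = "~"
    }
    function detrigraph(s,    out, p, c) {
        out = ""
        while ((p = index(s, "??")) > 0) {
            c = substr(s, p + 2, 1)
            if (c in tri) {
                out = out substr(s, 1, p - 1) tri[c]
                s = substr(s, p + 3)
            } else {
                out = out substr(s, 1, p)   # keep one "?" and rescan from the next
                s = substr(s, p + 1)
            }
        }
        return out s
    }
    {
        line = pending detrigraph($0)
        if (line ~ /\\$/) {                 # backslash-newline: join with next line
            pending = substr(line, 1, length(line) - 1)
            next
        }
        pending = ""
        print line
    }
    END { if (pending != "") print pending }'
}

splice_and_detrigraph <<'EOF'
/* a comment wrapping onto the next line *\
/
s = "??/"";
EOF
```

After this filter the wrapped comment ends in an ordinary "*/" on one line and the trigraph has become a plain backslash, so a line-oriented stripper sees only the simple cases.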
It depends on what you want, really.
You are of course correct. Thanks for pointing out my typos.
I sometimes make similar typos when coding. Too many \/ sequences and
that sort of thing are hard to write as well as hard to read.
awk regular expressions can be terribly illegible if you are building a
complicated one with a lot of escaped characters. This has happened to
me a lot when parsing input that has a lot of characters like [|+*\.]
and friends.
The earlier examples given by Ed, one using a trigraph and another using
comment wrapping, already defeat syntax highlighting in GNU Emacs 21.4.2.
Did you try as well?
The output I get is
#include "stdio.h"
#define GOOGLE(txt) printf("Google web page = " #txt "\n")
int main(void) {
GOOGLE(http://www.google.com);
}
#include "stdio.h"
int main(void) {
printf("comments start with \"/*\" and end with \"*/\"\n");
}
#include "stdio.h"
int main(void) {
printf("comments start with ??/"/*??/" and end with ??/"*/??/"\n");
}
#include "stdio.h"
int main(void) {
return 0;
}
when using the command
awk -f strip-c-comments.awk sample1-b.c
where strip-c-comments.awk contains
BEGIN { ORS = "" }
{ code = code $0 "\n" }
END {
while ( length(code) )
code = process( code )
}
function process( text )
{
if ( C99 )
{ if ( match( text, /"|'|\/\*|\/\// ) )
return span( text )
}
else
if ( match( text, /"|'|\/\*/ ) )
return span( text )
print text
return ""
}
function span( text , starter )
{
print substr( text, 1, RSTART - 1 )
starter = substr( text, RSTART, RLENGTH )
text = substr( text, RSTART + RLENGTH )
if ( "\"" == starter || "'" == starter )
return quoted( text, starter )
if ( "//" == starter )
return remove( text, "\n", "\n" )
## Allow for
## /* foo *\
## /
return remove( text, "\\*(\\\\\n)?/", " " )
}
function remove( text, ender, replacement )
{
print replacement
return substr( text, match(text, ender) + RLENGTH )
}
function quoted( text, starter )
{
if ( "'" == starter )
match( text, /^(\\.|[^'\\])*'/ )
else
match( text, /^(\\.|\?\?\/.|[^"\\])*"/ )