Stripping multiline C comments without using Lex

Slash

31.01.2004, 11:11:19

How could I remove multiline C comments from a C source code file in
an easy way?

Lex can do it in a real short way, but is there any easy way I could
do it using something like sed or awk?

I know it's possible to write an awk script, but is there any
simpler way to do it, using sed for instance (or anything else for
that matter)?

Icarus Sparry

31.01.2004, 12:41:22

No, for any reasonable definition of 'easy', if you hope to be robust.

See http://www.lib.uchicago.edu/keith/software/uncomment/ as an example.

If you are not concerned about being robust, in particular you hope that
you do not have a /* inside a quoted string, then using perl like

perl -0777 -pe 's:/\*.*?\*/::g'

is OK. This question is in the Perl FAQ, which gives a better answer:

$/ = undef;
$_ = <>;
s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|\n+|.[^/"'\\]*)#$2#g;
print;
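
As a rough illustration of the difference between the two Perl approaches above, here is a Python sketch. It is a port made for comparison only, not code taken from the Perl FAQ; the names strip_naive and strip_faq_style are invented for this example, and the kept text is captured as group 1 rather than $2 because the inner group is non-capturing here.

```python
import re

# Naive approach: delete everything between /* and */ (non-greedy, across lines).
NAIVE = re.compile(r"/\*.*?\*/", re.DOTALL)

# String-aware approach, modelled on the FAQ regex: each match is either a
# comment (dropped) or a string, char constant, newline run or run of
# ordinary text (captured in group 1 and kept).
FAQ_STYLE = re.compile(
    r"""/\*[^*]*\*+(?:[^/*][^*]*\*+)*/"""
    r"""|("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\n+|.[^/"'\\]*)"""
)


def strip_naive(src):
    """Remove /* ... */ spans without looking at string literals."""
    return NAIVE.sub("", src)


def strip_faq_style(src):
    """Remove comments but copy strings and char constants through untouched."""
    return FAQ_STYLE.sub(lambda m: m.group(1) or "", src)


sample = 'puts("/* not a comment */"); /* a comment */\n'
print(repr(strip_naive(sample)))      # the string literal is damaged
print(repr(strip_faq_style(sample)))  # the string literal survives
```

The naive version empties the string literal because it cannot tell that the first /* sits inside quotes, which is exactly the robustness problem mentioned above.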

Stephane CHAZELAS

31.01.2004, 13:32:00

2004-01-31, 08:11(-08), Slash:

If your sed can handle very long lines:

sed 's/_/_n/g;s/</_o/g;s/>/_c/g;s,/\*,<,g;s,\*\/,>,g' | sed ':1
s/^\(\([^"'\''<]\|'\''\(\\.\|[^\\]\)'\''\|"\([^\\"]\|\\.\)*"\)*\)<[^>]*>/\1/
t1
/^\([^"'\''<]\|'\''\(\\.\|[^\\]\)'\''\|"\([^\\"]\|\\.\)*"\)*\(<.*\|"\([^\\"]\|\\.\)*\)$/{
N;b1
}
s,<,/*,g;s,>,*/,g;s,_o,<,g;s,_c,>,g;s,_n,_,g'

On this input:

printf("/* \"%c*//*", '"'); /* takes into account
comments within strings, and has to pay attention
to " inside '. Note that "/*" is allowed within
a comment, */ puts("multi-line strings are \"
/* supported */. That's all");

Gives:

printf("/* \"%c*//*", '"'); puts("multi-line strings are \"
/* supported */. That's all");

Detailed:

sed 's/_/_n/g;s/</_o/g;s/>/_c/g;s,/\*,<,g;s,\*\/,>,g'

We use _<n> escape sequence to have /* and */ changed into single
characters ('<' and '>'): _ (the escape character) is first
self-escaped if it occurs in the input (-> _n), then original
'<' are turned into '_o', '>' into '_c', and then '/*' into '<'
and '*/' into '>'.

| sed ':1
s/^\(\([^"'\''<]\|'\''\(\\.\|[^\\]\)'\''\|"\([^\\"]\|\\.\)*"\)*\)<[^>]*>/\1/
Remove a comment as long as the opening tag (<) doesn't occur
within a quote. To do so, we only accept a '<' that follows a
sequence of non-quote characters ([^'"<]), closed C char
constants (either 'x' or '\x'; these matter because of '"', where
we don't want the " to be taken as an opening quote), or strings
(a " followed by any sequence of non-backslash characters or \x
pairs, to take "..\".." into account, followed by a closing ").

t1
if that matched, try again

/^\([^"'\''<]\|'\''\(\\.\|[^\\]\)'\''\|"\([^\\"]\|\\.\)*"\)*\(<.*\|"\([^\\"]\|\\.\)*\)$/{
N;b1
}
Check for an unclosed quote or comment (if so, append the next line
to the pattern space and try again).

s,<,/*,g;s,>,*/,g;s,_o,<,g;s,_c,>,g;s,_n,_,g'

Undo the escaping: restore '<' to '/*' and '>' to '*/', then '_o' to
'<', '_c' to '>' and finally '_n' to '_'.

(not much tested).

--
Stéphane ["Stephane.Chazelas" at "free.fr"]

Stephane CHAZELAS

31.01.2004, 14:53:32

2004-01-31, 19:32(+01), Stephane CHAZELAS:

> 2004-01-31, 08:11(-08), Slash:
>> How could I remove multiline C comments from a C source code file in
>> an easy way?

[...]
> sed [...] | sed [...]

With one sed (works with GNU sed):

#! /usr/bin/sed -f
:1
s:^\(\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\]\)'\|/*"\([^\\"]\|\\.\)*"\)*\)/\*\([^*]\|\*\**[^/*]\)*\*/:\1:
t1
\:^\(\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\]\)'\|/*"\([^\\"]\|\\.\)*"\)*\)\(/\*.*\|"\([^\\"]\|\\.\)*\)$:{
N;b1
}

(not thoroughly tested).

Stephane CHAZELAS

31.01.2004, 15:28:15

2004-01-31, 19:32(+01), Stephane CHAZELAS:

> 2004-01-31, 08:11(-08), Slash:
>> How could I remove multiline C comments from a C source code file in
>> an easy way?

[...]
> sed [...] | sed [...]

With one sed (works with GNU sed):

#! /usr/bin/sed -f
:1
s:^\(\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\]\)'\|/*"\([^\\"]\|\\.\)*"\)*/*\)/\*\([^*]\|\*\**[^/*]\)*\*/:\1:
t1
\:^\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\]\)'\|/*"\([^\\"]\|\\.\)*"\)*\(/\*.*\|"\([^\\"]\|\\.\)*\)$:{
N;b1
}

(not thoroughly tested).

Carlos J. G. Duarte

31.01.2004, 17:23:35

Slash wrote:
>
> I know its possible to write an awk script but, but is there any
> simpler way to do it, using sed for instance (or anything else for
> that matter)?

The sed grab bag has (sed) scripts for that.
This is the best:
http://sed.sourceforge.net/grabbag/scripts/remccoms3.sed

There are also these two:
http://sed.sourceforge.net/grabbag/scripts/remccoms2.sh.txt
http://sed.sourceforge.net/grabbag/scripts/remccoms1.sed

The first is a variant of an ancient sed script by me:
http://cgd.sdf-eu.org/a/scripts/devel/del-c-cmnt.sed ('96)

--
carlos ** http://cgd.sdf-eu.org

Stephane CHAZELAS

31.01.2004, 18:35:10

2004-01-31, 17:41(+00), Icarus Sparry:
[...]

> is OK. This question is a PERL FAQ, which gives a better answer
>
> $/ = undef;
> $_ = <>;
> s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|\n+|.[^/"'\\]*)#$2#g;
> print;

s#/\*.*?\*/|("(\\.|.)*?"|'\\?.'|.)#$1#sg;

should be enough with perl regexps.

$ perl -0777 -pe 's#/\*.*?\*/|("(?:\\.|.)*?"|'\''\\?.'\''|.)#$1#sg' << EOF
heredoc> printf("/* \"%c*//*", '"'); /* takes into account
heredoc> comments within strings, and has to pay attention
heredoc> to " inside '. Note that "/*" is allowed within
heredoc> a comment, */ puts("multi-line strings are \"
heredoc> /* supported */. That's all");
heredoc> EOF

printf("/* \"%c*//*", '"'); puts("multi-line strings are \"

/* supported */. That's all");

--
Stéphane ["Stephane.Chazelas" at "free.fr"]

Stephane CHAZELAS

01.02.2004, 03:57:23

2004-01-31, 21:28(+01), Stephane CHAZELAS:
[...]

> #! /usr/bin/sed -f
> :1
> s:^\(\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\]\)'\|/*"\([^\\"]\|\\.\)*"\)*/*\)/\*\([^*]\|\*\**[^/*]\)*\*/:\1:
> t1
> \:^\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\]\)'\|/*"\([^\\"]\|\\.\)*"\)*\(/\*.*\|"\([^\\"]\|\\.\)*\)$:{
> N;b1
> }

[...]

Oops, that doesn't work for /* foo **/ and I forgot the '\033' case:

#! /usr/bin/sed -f
:1
s:^\(\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\']\)*'\|/*"\([^\\"]\|\\.\)*"\)*/*\)/\*\([^*]\|\*\**[^/*]\)*\**\*/:\1:
t1
\:^\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\']\)*'\|/*"\([^\\"]\|\\.\)*"\)*\(/\*.*\|"\([^\\"]\|\\.\)*\)$:{
N;b1
}

--
Stéphane ["Stephane.Chazelas" at "free.fr"]

Stephane CHAZELAS

01.02.2004, 05:12:28

2004-01-31, 22:23(+00), Carlos J. G. Duarte:

> Slash wrote:
>>
>> I know its possible to write an awk script but, but is there any
>> simpler way to do it, using sed for instance (or anything else for
>> that matter)?
>
> The sed grab bag has (sed) scripts for that.
> This is the best:
> http://sed.sourceforge.net/grabbag/scripts/remccoms3.sed

[...]

That one is neat as it's POSIX conformant and doesn't put
several lines in the pattern or hold space.

Some benchmarks:

on a file 2 MB large (the concatenation of every .h and .c file
in elinks source code) in a C locale

remccoms3.sed
140.75s user 0.29s system 96% cpu 2:25.41 total

#! /usr/bin/sed -f
:1
s:^\(\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\']\)*'\|/*"\([^\\"]\|\\.\)*"\)*/*\)/\*\([^*]\|\*\**[^/*]\)*\**\*/:\1:
t1
\:^\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\']\)*'\|/*"\([^\\"]\|\\.\)*"\)*\(/\*.*\|"\([^\\"]\|\\.\)*\)$:{
N;b1
}

56.08s user 0.24s system 99% cpu 56.860 total

#! /usr/bin/perl

$/ = undef;
$_ = <>;
s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|\n+|.[^/"'\\]*)#$2#g;
print;

5.81s user 0.36s system 96% cpu 6.378 total

#! /usr/bin/perl

$/ = undef;
$_ = <>;

s#/\*.*?\*/|("(?:\\.|.)*?"|'(?:\\.)?.*?'|.[^'"/]*)#$1#sg;
print;
4.53s user 0.42s system 100% cpu 4.906 total

Stephane CHAZELAS

01.02.2004, 05:22:31

2004-02-1, 00:35(+01), Stephane CHAZELAS:

> 2004-01-31, 17:41(+00), Icarus Sparry:
> [...]
>> is OK. This question is a PERL FAQ, which gives a better answer
>>
>> $/ = undef;
>> $_ = <>;
>> s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|\n+|.[^/"'\\]*)#$2#g;
>> print;
>
> s#/\*.*?\*/|("(\\.|.)*?"|'\\?.'|.)#$1#sg;

Oops, I forgot the '\0oo' case (and adding [^/'"]* speeds it up
greatly).

s#/\*.*?\*/|("(?:\\.|.)*?"|'(?:\\.)?.*?'|.[^'"/]*)#$1#sg;

--
Stéphane ["Stephane.Chazelas" at "free.fr"]

Ed Morton

01.02.2004, 09:20:17

Slash wrote:

> How could I remove multiline C comments from a C source code file in
> an easy way?

Just tell the preprocessor to do it. "gcc -E" will throw away comments.
Unfortunately it also expands included header files and macros so you
need to change all the "#" signs to something else before gcc then
change them back afterwards, e.g.:

sed 's/#/__HASH__REP__/g' file.c |
gcc -E - |
sed -e '1d' -e 's/__HASH__REP__/#/g'

Gcc also adds a first line to its output with some data on it, hence the
'1d' to get rid of that.

Regards,

Ed.

Stephane CHAZELAS

01.02.2004, 10:05:40

2004-02-01, 08:20(-06), Ed Morton:
[...]

> sed 's/#/__HASH__REP__/g' file.c |
> gcc -E - |
> sed -e '1d' -e 's/__HASH__REP__/#/g'
>
> Gcc also adds a first line to its output with some data on it, hence the
> '1d' to get rid of that.

[...]

YMMV

~$ gcc --version
gcc (GCC) 3.2.2
[...]
~$ echo foo | gcc -E -
# 1 "<stdin>"
# 1 "<built-in>"
# 1 "<command line>"
# 1 "<stdin>"
foo

And, there are not only problems with '#':

~$ gcc -E - << \EOF
heredoc> __FILE__ __LINE__ __GNUC__
heredoc> __DATE__
heredoc> foo\
heredoc> bar </* baz */>
heredoc> EOF
# 1 "<stdin>"
# 1 "<built-in>"
# 1 "<command line>"
# 1 "<stdin>"
"<stdin>" 1 3
"Feb 1 2004"
foobar < >

(note that comments are replaced with a " ").

Better like this:

sed 's/b/bB/g;s/#/bh/g;s/_/bu/g;s/\\$/bs/' |
gcc -E -P -no-gcc - |
sed 's/bs/\\/g;s/bu/_/g;s/bh/#/g;s/bB/b/g'

Note this trick, which can replace the "__HASH__REP__" one:

sed 's/e/eE/g;s/string-to-escape/eW/g' | ... |
sed 's/eW/string-to-escape/g;s/eE/e/g'

Where "e" acts as an escape character. You don't have the
problem of the "__HASH__REP__" that might already be present in
the input.
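
The escape-character trick above can be sketched in Python as a round trip. This is only an illustration under stated assumptions: the helper names escape and unescape and the tag letters A to D are invented here, and the special strings are assumed to contain neither the escape character nor any of the tag letters (as with "#" and "__" in this thread).

```python
TAGS = "ABCD"  # hypothetical tag letters; "E" is reserved for the escape itself


def escape(text, specials, esc="e"):
    """Self-escape `esc` first, then hide each special string as esc + tag."""
    out = text.replace(esc, esc + "E")
    for tag, special in zip(TAGS, specials):
        out = out.replace(special, esc + tag)
    return out


def unescape(text, specials, esc="e"):
    """Undo escape(): restore the special strings, then the escape character."""
    out = text
    for tag, special in zip(TAGS, specials):
        out = out.replace(esc + tag, special)
    return out.replace(esc + "E", esc)


specials = ["__", "#"]
source = "#define __e__ eA eE e"
hidden = escape(source, specials)
print(hidden)
print(unescape(hidden, specials) == source)
```

Because every original "e" becomes "eE" before anything else is substituted, a literal "eA" in the input can never be confused with the marker for "__", which is the clash the fixed "__HASH__REP__" string cannot rule out.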

Stephane CHAZELAS

01.02.2004, 10:34:19

2004-02-1, 09:57(+01), Stephane CHAZELAS:

[...]
> #! /usr/bin/sed -f
> :1

> s:^\(\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\']\)*'\|/*"\([^\\"]\|\\.\)*"\)*/*\)/\*\([^*]\|\*\**[^/*]\)*\**\*/:\1:
> t1
> \:^\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\']\)*'\|/*"\([^\\"]\|\\.\)*"\)*\(/\*.*\|"\([^\\"]\|\\.\)*\)$:{
> N;b1
> }

Well, that doesn't work with an input like:

a = '\
a'; /* comment */

(which the preprocessor is supposed to change into "a = 'a';"
AFAICS).

Ed Morton

01.02.2004, 11:05:44

Stephane CHAZELAS wrote:
<snip>

> (note that comments are replaced with a " ").

Comments MUST be replaced by a " " rather than just removed so that this:

#define X/*define X*/Y

becomes the equivalent:

#define X Y

rather than:

#define XY

<snip>

> Note that trick instead of the "__HASH__REP__" one:
>
> sed 's/e/eE/g;s/string-to-escape/eW/g' | ... |
> sed 's/eW/string-to-escape/g;s/eE/e/g'
>
>
> Where "e" acts as an escape character. You don't have the
> problem of the "__HASH__REP__" that might already be present in
> the input.

In reality, I use a script that includes "$$" in the hash replacement
pattern but thanks for that tip too.

Ed.

Stephane CHAZELAS

01.02.2004, 11:46:18

2004-02-01, 10:05(-06), Ed Morton:

> <snip>
>> (note that comments are replaced with a " ").
>
> Comments MUST be replaced by a " " rather than just removed so that this:
>
> #define X/*define X*/Y

[...]

You're right, so all the sed or perl based solutions provided so
far are wrong.

A new attempt:

#! /usr/bin/sed -f
:1
s:^\(\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\']\)*'\|/*"\([^\\"]\|\\.\)*"\)*/*\)/\*\([^*]\|\*\**[^/*]\)*\**\*/:\1 :
t1
\:^\([^"'/]\|//*[^"'/*]\|/*'\(\\.\|[^\\']\)*'\|/*"\([^\\"]\|\\.\)*"\)*\(/\*.*\|"\([^\\"]\|\\.\)*\)$:{
N;b1
}

For the perl one, that's not that easy because of the special
way it is done (every piece of text is replaced, when it's a
comment, by nothing, when not, by itself).

perl -0777 -pe'
s{
/\*.*?\*/ (?{$s=" "})

| (
"(?:\\.|.)*?"
| '\''(?:\\.)?.*?'\''
| .[^'\''"/]*

) (?{$s=""})
}{$s$1}xsg'

Or:

perl -0777 -pe'
s{

/\*.*?\*/
| (
"(?:\\.|.)*?"
| '\''(?:\\.)?.*?'\''
| .[^'\''"/]*
)

}{if (length$1){$1}else{" "}}gsex'
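
The same "replace each comment with one space" idea can be sketched in Python. This is an illustrative port rather than a tested replacement for the Perl above; the name strip_to_space is invented for the example, and trigraphs and // comments are deliberately ignored.

```python
import re

TOKEN = re.compile(
    r"""
      /\*.*?\*/               # a comment: replaced by a single space
    | (                       # group 1: text that is copied through
        "(?:\\.|[^"\\])*"     #   a string literal
      | '(?:\\.|[^'\\])*'     #   a character constant
      | .[^'"/]*              #   one character plus a run of ordinary text
      )
    """,
    re.DOTALL | re.VERBOSE,
)


def strip_to_space(src):
    """Replace every /* ... */ comment outside literals with one space."""
    return TOKEN.sub(
        lambda m: m.group(1) if m.group(1) is not None else " ", src
    )


print(strip_to_space("#define X/*define X*/Y"))
```

Returning a space instead of the empty string is what keeps X and Y from being glued into a single token.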

Ed Morton

02.02.2004, 10:03:41

Stephane CHAZELAS wrote:
> 2004-02-01, 10:05(-06), Ed Morton:
>
>><snip>
>>
>>>(note that comments are replaced with a " ").
>>
>>Comments MUST be replaced by a " " rather than just removed so that this:
>>
>> #define X/*define X*/Y
>
> [...]
>
> You're right, so all the sed or perl based solutions provided so
> far are wrong.

I just got a chance to look at your other comments:

> ~$ gcc --version
> gcc (GCC) 3.2.2
> [...]
> ~$ echo foo | gcc -E -
> # 1 "<stdin>"
> # 1 "<built-in>"
> # 1 "<command line>"
> # 1 "<stdin>"
> foo

Here's what I get:

PS1> gcc --version
2.95.2
PS1> echo foo | gcc -E -
# 1 ""
foo

so you're right, there's different results possible due to gcc version
and/or environment. I can't reproduce exactly what you got, but I can
get rid of the #1 "" from my output by using -P; does that get rid of
all of your # 1 ... lines too? The obvious alternative is to use sed to
delete all lines that start with # rather than just deleting the first
line, e.g.:

sed 's/#/__HASH__REP__/g' file.c |
gcc -E - |

sed -e '/^#/d' -e 's/__HASH__REP__/#/g'

> And, there are not only problems with '#':
>
> ~$ gcc -E - << \EOF
> heredoc> __FILE__ __LINE__ __GNUC__
> heredoc> __DATE__
> heredoc> foo\
> heredoc> bar </* baz */>
> heredoc> EOF
> # 1 "<stdin>"
> # 1 "<built-in>"
> # 1 "<command line>"
> # 1 "<stdin>"

Addressed above.

> "<stdin>" 1 3
> "Feb 1 2004"

You're right, I forgot about those pesky pre-defined macros. If there's
no gcc option to get rid of them (and I think there might be but I can't
be bothered to go through the man page) then the obvious solution for
this one is to change all double underscores to some symbol
too, e.g.:

sed -e 's/__/__2USS_REP__/g' -e 's/#/__HASH__REP__/g' file.c |
gcc -E - |
sed -e '/^#/d' -e 's/__HASH__REP__/#/g' -e 's/__2USS_REP__/__/g'

> foobar < >

Not a problem as this is the equivalent post-processed C.

So, if we then include your suggestion of an alternative way of
replacing the patterns that guarantees no clashes and get rid of the
"-e"s and if we assume that "-P" gets rid of all the information lines
that gcc outputs (add the sed '/^#/d' back if it doesn't), we get:

sed 's/a/aA/g;s/__/aB/g;s/b/bA/g;s/#/bB/g' file.c |
gcc -P -E - |
sed 's/bB/#/g;s/bA/b/g;s/aB/__/g;s/aA/a/g'

which I think is still much more readable and more likely to work for
all C programs than any straight sed or perl alternatives since you're
relying on the preprocessor identifying comments and handling their
replacement appropriately rather than just a script that someone wrote
based on what they thought was an exhaustive set of possibilities.

Regards,

Ed.

Stephane CHAZELAS

02.02.2004, 11:30:54

2004-02-02, 09:03(-06), Ed Morton:
[...]

> sed 's/a/aA/g;s/__/aB/g;s/b/bA/g;s/#/bB/g' file.c |

You don't need several escape characters.

sed 's/e/eE/g;s/__/eA/g;s/#/eB/g' file.c |

> gcc -P -E - |
> sed 's/bB/#/g;s/bA/b/g;s/aB/__/g;s/aA/a/g'
>
> which I think is still much more readable and more likely to work for
> all C programs than any straight sed or perl alternatives since you're
> relying on the preprocessor identifying comments and handling their
> replacement appropriately rather than just a script that someone wrote
> based on what they thought was an exhaustive set of possibilities.

not agreed. The "request" was "remove comments", a pre-processor
does much more than that, some do non standard things (OP didn't
speak of ANSI compliant C), and as you pre-process the input,
you may disturb cpp processing (and gcc -E is GNU specific)

"what one thinks of the exhaustive set of possibilities for C
comments" or "what one thinks of the exhaustive list of a
pre-processor processings" are equally likely not to be accurate
(did you know about trigraphs?, I don't agree it's correct to
turn "\\\n" into "" [WRT OP's request]).

The advantage of the "cpp" way is that it guarantees that, if
the code compiles before, it will after (unless the sed
preprocessing disturbed cpp), but it may not answer the OP's
question. Now, the OP can choose, and it's good that we've
discussed all those points. We could now wonder what would be a
portable solution (using cpp instead of "gcc -E", is "-P" or "-"
portable? In which conditions are trigraphs enabled by
default (or should be enabled)...)

$ echo '??=define A B\nA ??/* qsd */' | gcc -ansi -P -E -
B \* qsd */

(Note the "??'" where "'" is not to be taken as a "'"! (and
breaks the sed and perl solutions) and ??/* which is not a
comment).

Ed Morton

02.02.2004, 15:09:56

Stephane CHAZELAS wrote:
> 2004-02-02, 09:03(-06), Ed Morton:
> [...]
>
>>sed 's/a/aA/g;s/__/aB/g;s/b/bA/g;s/#/bB/g' file.c |
>
>
> You don't need several escape characters.
>
> sed 's/e/eE/g;s/__/eA/g;s/#/eB/g' file.c |

True. Thanks for catching it.

>
>> gcc -P -E - |
>> sed 's/bB/#/g;s/bA/b/g;s/aB/__/g;s/aA/a/g'
>>
>>which I think is still much more readable and more likely to work for
>>all C programs than any straight sed or perl alternatives since you're
>>relying on the preprocessor identifying comments and handling their
>>replacement appropriately rather than just a script that someone wrote
>>based on what they thought was an exhaustive set of possibilities.
>
>
> not agreed. The "request" was "remove comments", a pre-processor
> does much more than that, some do non standard things (OP didn't
> speak of ANSI compliant C), and as you pre-process the input,
> you may disturb cpp processing (and gcc -E is GNU specific)

I get where you're coming from, but I don't believe any sed, awk, or
perl solution will be foolproof, so I'd rather be subjected to and solve
quirks with the preprocessor than try to debug a fairly complex script.

> "what one thinks of the exhaustive set of possibilities for C
> comments" or "what one thinks of the exhaustive list of a
> pre-processor processings" are equally likely not to be accurate
> (did you know about trigraphs?, I don't agree it's correct to
> turn "\\\n" into "" [WRT OP's request]).

Yes I know about trigraphs, but I don't see why you're bringing them up
here since the preprocessor won't do anything at all with them unless
you pass it a "-trigraphs" (or "-ansi") option. Were you thinking rather
of escape sequences? Even then, I still don't really see a problem as,
for example, "\\\n" will not be changed at all by gcc -E in any context
I can come up with.

> The advantage of the "cpp" way is that it guarantees that, if
> the code compiles before, it will after (unless the sed
> preprocessing disturbed cpp), but it may not answer the OP's
> question. Now, the OP can choose, and it's good that we've
> discussed all those points.

I agree. There's also a third choice: Lucent has a free tool called
"ncsl" available at http://www.lucentssg.com/displayProduct.cfm?prodid=33

It strips all comments and indentation so just run an indenter (e.g.
"indent") or a C beautifier (e.g. "cb" - google for "cb download
beautifier" and take your pick) on the output to get it back in readable
format. Disclaimer - I've never used this specific download of "ncsl",
I've just used the version provided on UNIX boxes within Lucent.

We could now wonder what would be a
> portable solution (using cpp instead of "gcc -E", is "-P" or "-"
> portable? In which conditions are trigraphs enabled by
> default (or should be enabled)...)

-ansi is the only gcc option I know of that enables trigraphs by
default. "-P -" is common to both "cpp" and "gcc -E". I don't know about
other preprocessors.

> $ echo '??=define A B\nA ??/* qsd */' | gcc -ansi -P -E -
> B \* qsd */
>
> (Note the "??'" where "'" is not to be taken as a "'"! (and
> breaks the sed and perl solutions) and ??/* which is not a
> comment).

Just don't specify "-ansi" or "-trigraphs" and your output will be a bit
more reasonable:

PS1> echo '??=define A B\nA ??/* qsd */' | gcc -P -E -
??=define A B
A ??

but you're right that doesn't match the effect of the original. Then
again, the original wouldn't compile so as long as we stick to running
this only on compilable code, I think it's a reasonable solution and I
don't think it'd be worth adding yet another sed replacement sequence
for "??"s to handle trigraphs.

FWIW, if you run the Lucent "ncsl" tool on the above, you get the same
output as "gcc -E" without trigraphs.

Regards,

Ed.

Stephane CHAZELAS

02.02.2004, 17:21:01

2004-02-02, 14:09(-06), Ed Morton:
[...]

> Yes I know about trigraphs, but I don't see why you're bringing them up
> here since the preprocessor won't do anything at all with them unless
> you pass it a "-trigraph" (or "-ansi") option.

With _your_ version of GNU cpp. OP didn't specify which system
he was running, nor the version of its preprocessor. As
trigraphs seem to be ansi, we can expect that some preprocessors
(and possibly future versions of GNU cpp) will handle them. But
that was more related to my "how much portable is it" or what
"general solution would be put in a FAQ".

> Were you thinking rather
> of escape sequences? Even then, I still don't really see a problem as,
> for example, "\\\n" will not be changed at all by gcc -E in any context
> I can come up with.

No, I meant a backslash character followed by a newline
character, which cpp removes.

~$ printf %b 'foo\\\nbar\n' | cpp -P -
foobar

The problem with trigraphs is that we have to know if they are
to be supported or not.

If the C file contains:

a = 3 ??' 5;

gcc -E - (assuming it doesn't process trigraphs)
will fail (because of the unmatched quote).

A solution is to handle them with the "sed" pre-processor.

So:

sed '# self-escape
s/e/eE/g

# escape ^ trigraph
s/??'\''/eQ/g

# escape underscores (for __LINE__...)
s/_/eU/g

# escape sharps (for #include...)
s/#/eH/g

# escape \ at end of line to prevent line-joining
s/\\$/eB/' |
cpp -P | sed 's/eB/\\/g;s/eH/#/g;s/eU/_/g;s/eQ/'\''/g;s/eE/e/g'

Then again, are you sure there's not another pitfall?

A perl solution that handles "??'":

perl -0777 -pe'
s{/\*.*?\*/
| (
"(?:\\.|.)*?"
|'\''(?:\\.)?.*?'\''

|\?\?'\''

|.[^'\''"/]*
)

}{if ($1eq""){" "}else{$1}}exsg'

Ed Morton

02.02.2004, 18:27:10

Stephane CHAZELAS wrote:
> 2004-02-02, 14:09(-06), Ed Morton:

<snip>

> that was more related to my "how much portable is it" or what
> "general solution would be put in a FAQ".

This has been addressed several times in comp.lang.c but we've never
reached a consensus so it isn't in that FAQ either.

<snip>

> The problem with trigraphs is that we have to know if they are
> to be supported or not.
>
> If the C file contains:
>
> a = 3 ??' 5;
>
> gcc -E - (assuming it doesn't process trigraphs)
> will fail (because of the unmatched quote).

True. I guess we do have to call the preprocessor for stripping comments
with whichever options we'd eventually pass to the preprocessor for
compilation, e.g. "-trigraphs".

> A solution is to handle them with the "sed" pre-processor.

<snip>

> A perl solution that handles "??'":

<snip>

The more of this that's done in sed/perl/awk the less I'd trust it to be
complete and the less I'd want to have to debug it.

How about just stating that you can use the preprocessor solution
provided that:

a) The program is compilable, and
b) The preprocessor gets called with all applicable compilation options, and
c) whichever preprocessor you use supports the assumed functionality.

We could post both this and a straight sed/perl/whatever solution in the
FAQ (and possibly have the comp.lang.c FAQ reference it).

Ed.

Stephane CHAZELAS

03.02.2004, 04:23:40

2004-02-02, 17:27(-06), Ed Morton:
[...]

>> If the C file contains:
>>
>> a = 3 ??' 5;
>>
>> gcc -E - (assuming it doesn't process trigraphs)
>> will fail (because of the unmatched quote).
>
> True. I guess we do have to call the preprocessor for stripping comments
> with whichever options we'd eventually pass to the preprocessor for
> compilation, e.g. "-trigraph".

Except that if you pass -trigraphs, trigraphs will get expanded.
That's not "removing comments".

On the other hand, if trigraphs are not meant to be expanded,
I can see no way where the "'" in a "??'" can be taken as a quote
character, so it's also safe to temporarily replace it with a safe
string in that case.

[...]

> How about just stating that you can use the preprocessor solution
> provided that:
>
> a) The program is compilable, and
> b) The preprocessor gets called with all applicable compilation options, and

Some of those options may cause the preprocessor to do other
things than comment stripping.

With a perl/awk/sed/ruby/C/python/intercal solution, you can
build a solution that does its best in most situations.
(Note that the one I provided doesn't take care of C++ comments,
some C compilers don't support C++ comments, but then, there's
no way // appears in the C file, so it's safe to strip //[^\n]*)

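perl -0777 -pe'
s{
/\*.*?\*/
| //[^\n]*
| (
"(?:\\.|.)*?"
| '\''(?:\\.)?.*?'\''
| \?\?'\''
| .[^'\''"/]*
)
}{if ($1eq""){" "}else{$1}}exsg'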

Ed Morton

03.02.2004, 15:28:38

Stephane CHAZELAS wrote:
<snip>

> perl -0777 -pe'
> s{
> /\*.*?\*/
> | //[^\n]*
> | (
> "(?:\\.|.)*?"
> | '\''(?:\\.)?.*?'\''
> | \?\?'\''
> | .[^'\''"/]*
> )
> }{if ($1eq""){" "}else{$1}}exsg'
>

Try the above on this code:

#include "stdio.h"

#define GOOGLE(txt) printf("Google web page = " #txt "\n")

int main(void) {
GOOGLE(http://www.google.com);
}

and it'll produce:

#include "stdio.h"

#define GOOGLE(txt) printf("Google web page = " #txt "\n")

int main(void) {
GOOGLE(http:
}

Using "gcc -E -ansi" handles it OK.

Ed.

Stephane CHAZELAS

04.02.2004, 05:33:08

2004-02-03, 14:28(-06), Ed Morton:
[...discussing about the best way to strip comments from a C file...]

> Try the above on this code:
>
> #include "stdio.h"
>
> #define GOOGLE(txt) printf("Google web page = " #txt "\n")
>
> int main(void) {
> GOOGLE(http://www.google.com);
> }

[...]

> Using "gcc -E -ansi" handles it OK.

[... while // could be taken as a comment otherwise]

I didn't expect that to work. Are you sure it is valid ANSI C
code? For me, stringizing only makes sense for valid C
expressions (or at least parts of valid C expressions) for
logging/debugging purpose or the like. When the argument of a
macro is intended to be used only as a string, it's more
sensible to write it as

#define GOOGLE(txt) printf("Google web page = " txt "\n")
...
GOOGLE("http://www.google.com");

I'd use stringizing for example for:

~$ cpp -P << EOF
heredoc> #define check(cond) { if (!(cond)) { fprintf(stderr, \
heredoc> "condition \"" #cond "\" not met\n."); exit(2); } }
heredoc> ...
heredoc> check(length < sizeof(buffer))
heredoc> EOF
...
{ if (!(length < sizeof(buffer))) { fprintf(stderr, "condition \"" "length < sizeof(buffer)" "\" not met\n."); exit(2); } }

(i.e. where "cond" is a syntactically valid C expression).

[x-post, no fu2 (feel free to add one)]

Chris Torek

04.02.2004, 20:01:18

In article <slrnc21ij4.40.s...@spam.is.invalid>

Stephane CHAZELAS <this.a...@is.invalid> writes:
>2004-02-03, 14:28(-06), Ed Morton:
>[...discussing about the best way to strip comments from a C file...]
>> Try the above on this code:
>>
>> #include "stdio.h"
>>
>> #define GOOGLE(txt) printf("Google web page = " #txt "\n")
>>
>> int main(void) {
>> GOOGLE(http://www.google.com);
>> }
>[...]
>> Using "gcc -E -ansi" handles it OK.
>[... while // could be taken as a comment otherwise]
>
>I didn't expect that to work. Are you sure it is valid ANSI C
>code?

The "stringize" operator, and indeed the entire preprocessor, works
on tokens, or more precisely, a sequence of "preprocessing-token"s.

Preprocessing tokens are defined as:

preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
operator
punctuator
each non-white-space character that cannot be one of the above

(from a C99 draft, but should be close enough).

The C89 and C99 standards differ in an important way here: in C99,
// is a comment. In C89, // is simply two slashes. Translation
proceeds in "phases" and comments are replaced with a single space
character in phase 3, while preprocessing directives and macro
invocations are handled in phase 4.

Thus, in C99, before any macro processing (including stringizing)
can occur, the sequence "GOOGLE(http://www.google.com);" turns into
"GOOGLE(http: ". The closing parenthesis is missing and you must
get a diagnostic. (Double quotes here are simply to allow for
whitespace.)

In C89, on the other hand, the text survives phase 3, and the
pp-token sequence is:

GOOGLE
(
http
:
/
/
www
.
google
.
com
)
;

The stringizing operator "#" allows a complete token sequence
and should produce the string-literal "http://www.google.com"
in this case.

Thus, whether this works depends on whether your compiler
implements the new 1999 standard ("doesn't work") or the
old 1989 one ("does work"), perhaps with the 1995 updates
(no change to whether this works).
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
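
The C89/C99 difference described in the post above can be sketched in Python with a stripper that takes the dialect as a parameter. This is an assumption-laden illustration, not a conforming phase-3 implementation: make_stripper and its line_comments flag are invented names, and trigraphs and line splicing are not handled.

```python
import re


def make_stripper(line_comments):
    """Build a comment stripper; line_comments=True treats // as a comment (C99)."""
    comment = r"/\*.*?\*/"
    if line_comments:
        comment += r"|//[^\n]*"
    token = re.compile(
        "(?:" + comment + ")"
        + r"""|("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|.[^'"/]*)""",
        re.DOTALL,
    )
    # Comments become one space; strings, char constants and code are kept.
    return lambda src: token.sub(lambda m: m.group(1) or " ", src)


strip_c89 = make_stripper(line_comments=False)
strip_c99 = make_stripper(line_comments=True)

call = "GOOGLE(http://www.google.com);\n"
print(repr(strip_c89(call)))  # // is just two slashes
print(repr(strip_c99(call)))  # everything after // is a comment
```

With the flag off the macro call passes through unchanged; with it on, the closing parenthesis disappears into the comment, which is the breakage reported for the Perl version that strips //.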

Stephane CHAZELAS

05.02.2004, 04:32:54

2004-02-5, 01:01(+00), Chris Torek:
[...]

> The "stringize" operator, and indeed the entire preprocessor, works
> on tokens, or more precisely, a sequence of "preprocessing-token"s.
>
> Preprocessing tokens are defined as:
>
> preprocessing-token:
> header-name
> identifier
> pp-number
> character-constant
> string-literal
> operator
> punctuator
> each non-white-space character that cannot be one of the above
>
> (from a C99 draft, but should be close enough).

[...]
> In C89, on the other hand, the text [GOOGLE(http://www.google.com)]

> survives phase 3, and the pp-token sequence is:
>
> GOOGLE
> (
> http
> :
> /
> /
> www
> .
> google
> .
> com
> )
> ;
>
> The stringizing operator "#" allows a complete token sequence
> and should produce the string-literal "http://www.google.com"
> in this case.

[...]

Thanks for that very detailed answer. But there are still
points unclear to me. Blanks are not tokens, so I guess they are
just ignored. But how does the stringizing operator join the
tokens from a pp-tokens list? From what you say, it seems that
they are stuck together, but in

#define s(t) #t
s(//)
s(1 + 1)
s(1    +    1)
s(1+1)

I get, with GNU cpp -P -ansi
"//"
"1 + 1"
"1 + 1"
"1+1"

(spaces seem to have an influence somehow).

And, I guess that when calling a macro, there are things you
can't do that restrict the range of possible strings that can be
stringized.

For instance, it seems impossible to stringize "foo)", or
"foo," (or "/*", or 'a, or "aer...), that's why I thought in
the first place that there had to be rules on what is allowed
for either a macro argument or for the stringizing operator, and
that http://www.google.com might break those rules (but I can
see now that it's very likely that it breaks no rule [except in
C99]).

Jens Schweikhardt

05.02.2004, 08:42:25

In comp.unix.shell Stephane CHAZELAS <this.a...@is.invalid> wrote:
...
# Thanks for that very detailed answer. But, there are still
# points unclear to me. blanks are not tokens, so I guess they are
# just ignored. But how do the stringizing operator join the
# tokens from a pp-tokens list. From what you say, it seems that
# they are stuck together, but in
#
# #define s(t) #t
# s(//)
# s(1 + 1)
# s(1 + 1)
# s(1+1)
#
# I get, with GNU cpp -P -ansi
# "//"
# "1 + 1"
# "1 + 1"
# "1+1"
#
# (spaces seem to have an influence somehow).

The C (99) Standard requires in 6.10.3.2#2 that "... Each occurrence of
white space between the argument's preprocessing tokens becomes a single
space character in the string literal. White space before the first pp
token and after the last pp token composing the argument is deleted."

Regards,

Jens
--
Jens Schweikhardt http://www.schweikhardt.net/
SIGSIG -- signature too long (core dumped)
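
The whitespace rule quoted above can be approximated in a few lines of Python. This is a simplified sketch: the function name stringize is invented, and it assumes the macro argument contains no string literals or character constants (the real operator must preserve white space inside those and escape their quotes and backslashes).

```python
def stringize(arg):
    """Approximate '#arg' for an argument without string or char literals."""
    # split() drops leading/trailing white space and splits on runs of it,
    # so joining with one space collapses every run to a single space.
    body = " ".join(arg.split())
    return '"' + body + '"'


for text in ("//", "1 + 1", "1    +    1", "1+1"):
    print(stringize(text))
```

White space between tokens is kept as exactly one space, and where the source had none (as in 1+1) none is added, which matches the cpp output shown earlier.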

pyxchi...@gmail.com

16.07.2012, 04:53:40

On Sunday, February 1, 2004 12:11:19 AM UTC+8, Slash wrote:
> How could I remove multiline C comments from a C source code file in
> an easy way?
>

> Lex can do it in a real short way, but is there any easy way I could
> do it using something like sed or awk?
>

> I know its possible to write an awk script but, but is there any
> simpler way to do it, using sed for instance (or anything else for
> that matter)?

Is this problem related to something like 'context-free grammar' ?

If so, from my point of view, a push-down-automaton-style program will be sufficient.

I think one way is to maintain a 'context', which can be (1) nothing, (2) in quotes, or (3) in '/**/'. The program begins with the context set to 'nothing' and then proceeds according to the input and the current context.

Refer to http://en.wikipedia.org/wiki/Context-free_grammar for more info on 'context-free grammar'.
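
A minimal sketch of that context-tracking idea in Python follows. It is one possible reading of the suggestion, not a complete C lexer: the function name strip_comments and the four state names are invented, the input is scanned character by character rather than line by line, and trigraphs, line splicing and // comments are ignored.

```python
def strip_comments(src):
    """Replace each /* ... */ comment with one space, leaving literals alone."""
    out = []
    state = "code"  # one of: code, string, char, comment
    i = 0
    while i < len(src):
        ch = src[i]
        nxt = src[i + 1] if i + 1 < len(src) else ""
        if state == "code":
            if ch == "/" and nxt == "*":
                state = "comment"
                out.append(" ")  # a comment counts as one space
                i += 2
                continue
            if ch == '"':
                state = "string"
            elif ch == "'":
                state = "char"
            out.append(ch)
        elif state == "comment":
            if ch == "*" and nxt == "/":
                state = "code"
                i += 2
                continue
        else:  # inside a string literal or character constant
            out.append(ch)
            if ch == "\\" and nxt:
                out.append(nxt)  # copy the escaped character verbatim
                i += 2
                continue
            if (state == "string" and ch == '"') or (state == "char" and ch == "'"):
                state = "code"
        i += 1
    return "".join(out)


print(strip_comments("#define X/*define X*/Y"))
```

Only a fixed number of states is needed and no stack, because C comments do not nest.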

Ben Bacarisse

16.07.2012, 07:44:53

pyxchi...@gmail.com writes:

> On Sunday, February 1, 2004 12:11:19 AM UTC+8, Slash wrote:
>> How could I remove multiline C comments from a C source code file in
>> an easy way?
>>
>> Lex can do it in a real short way, but is there any easy way I could
>> do it using something like sed or awk?
>>
>> I know its possible to write an awk script but, but is there any
>> simpler way to do it, using sed for instance (or anything else for
>> that matter)?
>
> Is this problem related to something like 'context-free grammar' ?

Did you see that you are replying to a question that is more than 8
years old?

<snip>
--
Ben.

Janis Papanagnou

16.07.2012, 08:30:28

On 16.07.2012 13:44, Ben Bacarisse wrote:
> pyxchi...@gmail.com writes:
>
>> On Sunday, February 1, 2004 12:11:19 AM UTC+8, Slash wrote:

>>> [...]
>>
>> [...]

>
> Did you see that you are replying to a question that is more than 8
> years old?

For an *archive* like Google, 8 years is (sort of) "news".

It's always amazing how those "googlers" perceive Usenet.

Janis

>
> <snip>
>