Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Programming exercise/challenge

776 views
Skip to first unread message

Tim Rentsch

unread,
Dec 5, 2020, 11:25:36 AM12/5/20
to
Prompted by some recent discussion regarding 'goto' statements and
state machines, I would like to propose a programming exercise.
(It is perhaps a bit too large to be called an exercise, but not so
difficult that it deserves the label of challenge. On the other
hand there are some constraints so maybe challenge is apropos. In
any case somewhere in between those two bounds.)

Short problem statement: a C program to remove comments from C
source input.

Specifics: Remove both /*...*/ and //... style comments. Don't
worry about trigraphs. Read from stdin, write to stdout, and
diagnostics (if any) go to stderr. If EOF is seen inside a
comment, do something sensible but it doesn't matter what as long
as it's sensible. Use no 'goto' statements. Limit function
bodies to no more than 25 lines.

Other: feel free to handle corner cases as you see fit, as long
as there is some description of what choice was made.

Hopefully it will be a fun exercise. It isn't trivial but it
shouldn't take too long either.

Sjouke Burry

unread,
Dec 5, 2020, 11:33:40 AM12/5/20
to
If you show yours , maybe some of us will show theirs........

Kaz Kylheku

unread,
Dec 5, 2020, 11:40:18 AM12/5/20
to
On 2020-12-05, Tim Rentsch <tr.1...@z991.linuxsc.com> wrote:
> as it's sensible. Use no 'goto' statements. Limit function
> bodies to no more than 25 lines.

How about versions, A and B, of this exercise:

A - Use no goto statements.

B - Use nothing but goto statements (and essential function calls).

--
TXR Programming Language: http://nongnu.org/txr

Kaz Kylheku

unread,
Dec 5, 2020, 11:58:32 AM12/5/20
to
On 2020-12-05, Kaz Kylheku <563-36...@kylheku.com> wrote:
> On 2020-12-05, Tim Rentsch <tr.1...@z991.linuxsc.com> wrote:
>> as it's sensible. Use no 'goto' statements. Limit function
>> bodies to no more than 25 lines.
>
> How about versions, A and B, of this exercise:
>
> A - Use no goto statements.
>
> B - Use nothing but goto statements (and essential function calls).

B attempt. Obviously, this flouts the 25 line rule which makes more
sense for A solutions.

Notes: I cranked this out almost as fast as I can type.

All the time I was thinking, wow this is easy! I'm no wasting any brain
cycles on 'how do I shoehorn this obvious flowchart into control
structures?'"

Furthermore, my code compiled the first time with no diagnostics under
gcc -W -Wall -ansi -pedantic and had only one obvious flaw at that time:
it completely deleted /*..*/ comments. I fixed that by adding the code
to output a space character. I will read and test it more carefully to
find any other problems.

I have to say, if you're under extreme time pressure to get something
working with minimal debugging effort, consider using nothing but goto.

Whether the resulting code is easy to understand may be another story.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
int ch;

start:
ch = getchar();
if (ch == EOF)
goto end;
if (ch == '/')
goto slash;
outstart:
putchar(ch);
goto start;

slash:
ch = getchar();
if (ch == EOF)
goto end;
if (ch == '*')
goto slashstar;
if (ch == '/')
goto slashslash;
goto outstart;

slashstar:
ch = getchar();
if (ch == EOF)
goto badend;
if (ch == '*')
goto maybeend;
goto slashstar;

maybeend:
ch = getchar();
if (ch == EOF)
goto badend;
if (ch == '/') {
putchar(' ');
goto start;
}
goto slashstar;

slashslash:
ch = getchar();
if (ch == EOF)
goto badend;
if (ch == '\n')
goto outstart;
goto slashslash;

end:
return EXIT_SUCCESS;

badend:
return EXIT_FAILURE;

Kaz Kylheku

unread,
Dec 5, 2020, 12:08:41 PM12/5/20
to
On 2020-12-05, Kaz Kylheku <563-36...@kylheku.com> wrote:
> On 2020-12-05, Kaz Kylheku <563-36...@kylheku.com> wrote:
>> On 2020-12-05, Tim Rentsch <tr.1...@z991.linuxsc.com> wrote:
>>> as it's sensible. Use no 'goto' statements. Limit function
>>> bodies to no more than 25 lines.
>>
>> How about versions, A and B, of this exercise:
>>
>> A - Use no goto statements.
>>
>> B - Use nothing but goto statements (and essential function calls).
>
> B attempt. Obviously, this flouts the 25 line rule which makes more
> sense for A solutions.
>
> Notes: I cranked this out almost as fast as I can type.
>
> All the time I was thinking, wow this is easy! I'm no wasting any brain
> cycles on 'how do I shoehorn this obvious flowchart into control
> structures?'"
>
> Furthermore, my code compiled the first time with no diagnostics under
> gcc -W -Wall -ansi -pedantic and had only one obvious flaw at that time:
> it completely deleted /*..*/ comments. I fixed that by adding the code
> to output a space character. I will read and test it more carefully to
> find any other problems.

Next less obvious flaw: there is a case where the machine has to output a slash
which turned out not to be the start of a comment. to be a comment. I had it
in the back of my mind when I was coding, but in the end forgot to add the
logic.

The thought occured while I was coding the goto which jumps to the slash label.
"When I code the recognizer for the comment start sequence, either /* or
//, if that doesn't match on the second character, it has to output the slash
and then the mismatching character." But when it came time to code the slash
block, I neglected to do that, being so carried away with how easy it is
to code with nothing but goto that I could hardly pause to think any more.

Patch:

diff --git a/comments.c b/comments.c
index ad648b4..07374f5 100644
--- a/comments.c
+++ b/comments.c
@@ -23,6 +23,7 @@ int main(void)
goto slashstar;
if (ch == '/')
goto slashslash;
+ putchar('/');
goto outstart;

slashstar:

Complete code:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
int ch;

start:
ch = getchar();
if (ch == EOF)
goto end;
if (ch == '/')
goto slash;
outstart:
putchar(ch);
goto start;

slash:
ch = getchar();
if (ch == EOF)
goto end;
if (ch == '*')
goto slashstar;
if (ch == '/')
goto slashslash;
putchar('/');

Bart

unread,
Dec 5, 2020, 12:12:15 PM12/5/20
to
Did you test it? (I assume the result needs to compile as the same program.)

This turns a/b into ab.

Kaz Kylheku

unread,
Dec 5, 2020, 12:24:43 PM12/5/20
to
I caught that one; see my other post.

I have a question: do we need to recognize backslash line continuations
inside // comments, so that such a comment line can be continued past
a newline?

I didn't do it; I will check the standard either. No time now.

I will check the standard later.

> This turns a/b into ab.

Now did you know that from reading the code, or empirically from running it?

Because if you know that from the code, that's a point in favor of the
readability of goto. :)

Ben Bacarisse

unread,
Dec 5, 2020, 12:49:24 PM12/5/20
to
Kaz Kylheku <563-36...@kylheku.com> writes:

> On 2020-12-05, Kaz Kylheku <563-36...@kylheku.com> wrote:
>> On 2020-12-05, Tim Rentsch <tr.1...@z991.linuxsc.com> wrote:
>>> as it's sensible. Use no 'goto' statements. Limit function
>>> bodies to no more than 25 lines.
>>
>> How about versions, A and B, of this exercise:
>>
>> A - Use no goto statements.
>>
>> B - Use nothing but goto statements (and essential function calls).
>
> B attempt. Obviously, this flouts the 25 line rule which makes more
> sense for A solutions.
>
> Notes: I cranked this out almost as fast as I can type.

Quite a lot of bugs, though!

<cut code>
--
Ben.

Bart

unread,
Dec 5, 2020, 12:53:24 PM12/5/20
to
On 05/12/2020 17:24, Kaz Kylheku wrote:

>> Did you test it? (I assume the result needs to compile as the same program.)
>
> I caught that one; see my other post.
>
> I have a question: do we need to recognize backslash line continuations
> inside // comments, so that such a comment line can be continued past
> a newline?

Presumably the answer is yes, otherwise the new program will
compile/behave differently.

> I didn't do it; I will check the standard either. No time now.
>
> I will check the standard later.
>
>> This turns a/b into ab.
>
> Now did you know that from reading the code, or empirically from running it?
>
> Because if you know that from the code, that's a point in favor of the
> readability of goto. :)
>

I just ran the code. That's how I usually debug. After all, the program
might work first time, so that saves some trouble!

As to readability, your goto version I actually found somewhat more
readability than my clunky attempt that used state variables.

That was written in a script language, and worked string to string (and
was about 48 non-blank lines; I ignored the 25-line limit).

Then when I was going to port it to C using stdin/stdout, I found my
code had to overwrite the first "/" of "//" or "/*" that has already
been output, which I decided was to fiddly to bother with in C [my C
spec would not have stipulated stdin/stdout].

Anyway my attempts in either language would not have used goto; wasn't
that the point of the challenge? Not sure why that was proposed, but I
suspect Tim will post a compact solution that relies on gotos.

Kaz Kylheku

unread,
Dec 5, 2020, 1:30:30 PM12/5/20
to
On 2020-12-05, Bart <b...@freeuk.com> wrote:
> On 05/12/2020 17:24, Kaz Kylheku wrote:
>
>>> Did you test it? (I assume the result needs to compile as the same program.)
>>
>> I caught that one; see my other post.
>>
>> I have a question: do we need to recognize backslash line continuations
>> inside // comments, so that such a comment line can be continued past
>> a newline?
>
> Presumably the answer is yes, otherwise the new program will
> compile/behave differently.

The answer is yes, because backslash continuations (folding logical
lines into physical lines) is done in an earlier translation
phase relative to processing comments.

That's how I thought I remembered it; I just checked.

Kaz Kylheku

unread,
Dec 5, 2020, 1:34:38 PM12/5/20
to
Ah yes, slapforehead. It's completely broken.

We have to properly tokenize C code when parsing for comments, of
course, doh. Or at least semi-properly.

Otherwise we will do stupid things, like recognize /* in the middle
of string and character constants.

Bonita Montero

unread,
Dec 5, 2020, 1:41:12 PM12/5/20
to
/*, */ and // might be part of a string, and string-delimiters might
be part of escape-sequences, so this isn't as easy as you think.

Kaz Kylheku

unread,
Dec 5, 2020, 1:48:01 PM12/5/20
to
On 2020-12-05, Bonita Montero <Bonita....@gmail.com> wrote:
> /*, */ and // might be part of a string, and string-delimiters might
> be part of escape-sequences, so this isn't as easy as you think.

Fortunately, I beat you to acknowledging this essential requirement
in my reply to Ben. :)

Yes, it is completely broken.

Comments are recognized in the context of tokenization. They are a kind
of token pattern, effectively, whose semantic action is to be deleted.

We have to at least recognize string and character constants (which
could be multi-byte: 'ab\xff'.

For the purpose of the task, we might be able to get away without
classifying valid versus invalid escape sequences.

So that is to say, for instance, it could be an acceptable requirement
if the program passes through a broken literal like "abc\zdef",
rather than misbehaving or bailing.

Sometimes implementations have extensions in this area, so it's not
even such a good idea to be too rigid in a comment-removing program.

dfs

unread,
Dec 5, 2020, 1:59:31 PM12/5/20
to
Shortest code "wins"?

I'm sure you know this challenge can't provide a reliable result if
comment lines beginning with /* or // contain source code.

Do you have a 'tricky' block of code and comments to work from?



Bart

unread,
Dec 5, 2020, 2:56:40 PM12/5/20
to
FWIW I completed it and the whole program is given below.

It doesn't bother with possible comment delimiters inside '...'
literals; those are only possible with multi-character constants and
those I understand are implementation-defined. In any case it would just
be one more state on top of the 4 already handled.

Comments are entirely converted to whitespace where necessary, but
preserving newlines.

(Tested on the 219Kloc sqlite3.c plus, since the latter mostly uses
/*...*/ comments, a couple of smaller programs that use //. The new .c
files compile to the same .exe or near-identical .obj, since .obj files
include the file name which is different.)


-------------------------------------------------
#include <stdio.h>

int lastoutc=0;

void outchar(char c) {
if (lastoutc) putchar(lastoutc);
lastoutc=c;
}

int main(void) {
int inlinecomment=0;
int inblockcomment=0;
int instring=0;
int c,lastc=0;

while ((c=getchar())!=EOF) {
if (inlinecomment) {
if (c==13 || c==10) {
if (lastc!='\\') {
inlinecomment=0;
}
outchar(c);
} else {
outchar(' ');
}
} else if (inblockcomment) {
if (lastc=='*' && c=='/') {
inblockcomment=0;
}
outchar((c==13 || c==10?c:' '));
} else if (instring) {
if (c=='"') instring=0;
outchar(c);
} else {
if (lastc=='/' && c=='*') {
inblockcomment=1;
lastoutc=' ';
outchar(' ');
} else if (lastc=='/' && c=='/') {
inlinecomment=1;
lastoutc=' ';
outchar(' ');
} else if (c=='"') {
instring=1;
outchar(c);
} else {
outchar(c);
}
}
lastc=c;
}
outchar(0);
}
-------------------------------------------------


Jorgen Grahn

unread,
Dec 5, 2020, 4:37:42 PM12/5/20
to
On Sat, 2020-12-05, dfs wrote:
> On 12/05/2020 11:25 AM, Tim Rentsch wrote:
>> Prompted by some recent discussion regarding 'goto' statements and
>> state machines, I would like to propose a programming exercise.
>> (It is perhaps a bit too large to be called an exercise, but not so
>> difficult that it deserves the label of challenge. On the other
>> hand there are some constraints so maybe challenge is apropos. In
>> any case somewhere in between those two bounds.)
>>
>> Short problem statement: a C program to remove comments from C
>> source input.
>>
>> Specifics: Remove both /*...*/ and //... style comments. Don't
>> worry about trigraphs. Read from stdin, write to stdout, and
>> diagnostics (if any) go to stderr. If EOF is seen inside a
>> comment, do something sensible but it doesn't matter what as long
>> as it's sensible. Use no 'goto' statements. Limit function
>> bodies to no more than 25 lines.
>>
>> Other: feel free to handle corner cases as you see fit, as long
>> as there is some description of what choice was made.
>>
>> Hopefully it will be a fun exercise. It isn't trivial but it
>> shouldn't take too long either.
>
>
> Shortest code "wins"?

He didn't say anything about that, and code size isn't a good
indicator of anything, so I'd say "no".

It's a nice problem. (Unfortunately I'm participating in Advent of
Code already, and together with hobby projects that takes up all my
programming attention.)

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

jacobnavia

unread,
Dec 5, 2020, 5:32:25 PM12/5/20
to
Here is a version full of bugs (not well tested)
---------------------------------------------------cut here
#include <stdio.h>
#include <ctype.h>
int main(int argc,char *argv[])
{
if (argc < 2) { // print usage
fprintf(stderr,"Usage: nocomments <file>\n");
return 3;
}
FILE *f=fopen(argv[1],"r");
if (f == NULL) { // Error can't open
fprintf(stderr,"Impossible to open %s\n",argv[1]);
return 4;
}
int c = fgetc(f);
while (c != EOF) { // Until the end of the file
// There are 3 characters that have a special action required:
// 1) The / character that could introduce a comment
// 2) The " character that introduces a string
// 3) the ' character that introduces a character constant
if (c == '/') { // test for a comment
c = fgetc(f);
if (c == EOF) return 2;
if (c == '/') { // this is a comment until the end of the line
while (c != '\n' && c != EOF) {
c = fgetc(f); // ignore everything
}
if (c == EOF) return 2;
putchar(c); // Put the new line char
}
else if (c == '*') { // Multi line comment
while (c != EOF) {
c = fgetc(f);
if (c == EOF) return 2;
else if (c == '*') {
c = fgetc(f);
if (c == '/') break;
}
}
if (c == EOF) return 2;
}
else { // That was a lone /
putchar('/');
putchar(c);
}
}
else if (c == '"') { // String
putchar(c);
c = fgetc(f);
if (c == EOF) return 2;
while (c != EOF && c != '"') {
putchar(c);
if (c == '\\') {
c = fgetc(f);
if (c == EOF) return 2;
putchar(c);
if (c == 'x' || c == 'X' || c == 'u' || c == 'U') {
c = fgetc(f);
while (isxdigit(c) && c != EOF) {
putchar(c);
c = fgetc(f);
}
if (c == EOF)
return 2;
}
}
c = fgetc(f);
}
putchar(c); // put the closing double quote
}
else if (c == '\'') { // Character constant
putchar(c);
c = fgetc(f);
while (c != '\'' && c != EOF) {
if (c == '\\') { // escaped char
putchar(c);
c = fgetc(f);
if (c == EOF) return 2;
putchar(c);
if (c == 'x' || c == 'X' || c == 'u' || c == 'U') {
// Not all posible character escapes are tested
here
c = fgetc(f);
while (isxdigit(c) && c != EOF) {
putchar(c);
c = fgetc(f);
}
if (c == EOF)
return 2;
}
}
else putchar(c);
c = fgetc(f);
}
if (c == EOF) return 2;
else putchar(c);
}
else putchar(c);
c = fgetc(f);
}
return 0;
}

Kaz Kylheku

unread,
Dec 5, 2020, 6:19:52 PM12/5/20
to
On 2020-12-05, Kaz Kylheku <563-36...@kylheku.com> wrote:
> On 2020-12-05, Bonita Montero <Bonita....@gmail.com> wrote:
>> /*, */ and // might be part of a string, and string-delimiters might
>> be part of escape-sequences, so this isn't as easy as you think.
>
> Fortunately, I beat you to acknowledging this essential requirement
> in my reply to Ben. :)
>
> Yes, it is completely broken.
>
> Comments are recognized in the context of tokenization. They are a kind
> of token pattern, effectively, whose semantic action is to be deleted.

Improved version which has hacky quote recognition and also recognition
of backslash-newline continuations.

It handles cases like:

int x = /\
/ this is a comment
42;

where a continuation occurs in the middle of he // sequence.

Because the continuation could occur anywhere, it would complicate the
transition graph to try to handle it in a single stage. So I broke away
from the pure goto model.

Instead of getchar being called directly, a function called s2getc
is called "stage 2 getchar". This routine eats the continuations and
returns a logical character. A call to ungetc is required when it
gets a backslash not followed by a newline, in order to push back that
second character and return the backslash.

#include <stdio.h>
#include <stdlib.h>

/* "get char from translation stage 2",
* which recognizes physical lines and folds them into
* logical lines
*/
int s2getc(void)
{
int ch;

entry:
ch = getchar();
if (ch != '\\')
goto end;

ch = getchar();
if (ch == '\n')
goto entry;

if (ch == EOF) {
ch = '\\';
goto end;
}

ungetc(ch, stdin);
ch = '\\';

end:
return ch;
}

int main(void)
{
int ch;
int qc;

start:
ch = getchar();
if (ch == EOF)
goto end;
if (ch == '/')
goto slash;
if (ch == '"' || ch == '\'') {
qc = ch;
goto quote;
}
outstart:
putchar(ch);
goto start;

slash:
ch = s2getc();
if (ch == EOF)
goto end;
if (ch == '*')
goto slashstar;
if (ch == '/')
goto slashslash;
putchar('/');
goto outstart;

slashstar:
ch = s2getc();
if (ch == EOF)
goto badend;
if (ch == '*')
goto maybeend;
goto slashstar;

maybeend:
ch = s2getc();
if (ch == EOF)
goto badend;
if (ch == '/') {
putchar(' ');
goto start;
}
goto slashstar;

slashslash:
ch = s2getc();
if (ch == EOF)
goto badend;
if (ch == '\n')
goto outstart;
goto slashslash;

quote:
putchar(ch);
ch = s2getc();
if (ch == EOF)
goto badend;
if (ch == qc)
goto outstart;
if (ch == '\\')
goto bsinquote;
goto quote;

bsinquote:
putchar(ch);
ch = s2getc();
if (ch == EOF)
goto badend;
goto quote;

Bart

unread,
Dec 5, 2020, 6:56:55 PM12/5/20
to
On 05/12/2020 23:19, Kaz Kylheku wrote:
> On 2020-12-05, Kaz Kylheku <563-36...@kylheku.com> wrote:
>> On 2020-12-05, Bonita Montero <Bonita....@gmail.com> wrote:
>>> /*, */ and // might be part of a string, and string-delimiters might
>>> be part of escape-sequences, so this isn't as easy as you think.
>>
>> Fortunately, I beat you to acknowledging this essential requirement
>> in my reply to Ben. :)
>>
>> Yes, it is completely broken.
>>
>> Comments are recognized in the context of tokenization. They are a kind
>> of token pattern, effectively, whose semantic action is to be deleted.
>
> Improved version which has hacky quote recognition and also recognition
> of backslash-newline continuations.
>
> It handles cases like:
>
> int x = /\
> / this is a comment
> 42;

When you talked about backslash in the middle of a // comment, I thought
you meant at the end of the comment, not literally in the middle of //!

Such a break /is/ allowed in C, and can actually be in the middle of
anything:

i\
n\
t m\
a\
i\
n void() {}

It's something I only recently learnt about but decided not to implement
[in my compiler].


A \ at the end of a // comment like this:

puts("A"); // one two\
puts("B");
puts("C");

continues the comment onto the next line, but not if it is in the middle
of the comment:

puts("A"); // one two\three


>
> /* "get char from translation stage 2",
> * which recognizes physical lines and folds them into
> * logical lines
> */
> int s2getc(void)
> {
> int ch;

<snip>

This has a problem with this input (line breaks may be inserted by
newsreader, but each /*..*/ comment was on one line):

-------------------------------------------
#define TKFLG_DONTFOLD 0x100 /* Omit constant folding
optimizations */

/************** End of parse.h
***********************************************/
/************** Continuing where we left off in sqliteInt.h
******************/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <stddef.h>

-------------------------------------------

The output is:

-------------------------------------------
#define TKFLG_DONTFOLD 0x100


-------------------------------------------

The #includes are elided.

(The original code is in the 219000-line version of the amalgamated
sqlite3.c, lines 13472 to 13481. In the full version, translation
continues after this comment:

/*
** Use a macro to replace memcpy() if compiled with SQLITE_INLINE_MEMCPY.
** This allows better measurements of where memcpy() is used when running
** cachegrind. But this macro version of memcpy() is very slow so it
** should not be used in production. This is a performance measurement
** hack only.
*/
#ifdef SQLITE_INLINE_MEMCPY

)

Tim Rentsch

unread,
Dec 6, 2020, 9:04:17 AM12/6/20
to
Kaz Kylheku <563-36...@kylheku.com> writes:

> On 2020-12-05, Tim Rentsch <tr.1...@z991.linuxsc.com> wrote:
>
>> as it's sensible. Use no 'goto' statements. Limit function
>> bodies to no more than 25 lines.
>
> How about versions, A and B, of this exercise:
>
> A - Use no goto statements.

As long as version A is done, also doing a version B
is certainly welcome.

> B - Use nothing but goto statements (and essential function calls).

I take this to mean the code does not use any for's, switch's,
while's (including do/while), break's, or continue's. A program
all of whose statements are 'goto' statements isn't very
interesting.

Tim Rentsch

unread,
Dec 6, 2020, 9:13:46 AM12/6/20
to
Jorgen Grahn <grahn...@snipabacken.se> writes:

> On Sat, 2020-12-05, dfs wrote:
>
>> On 12/05/2020 11:25 AM, Tim Rentsch wrote:
>>
>>> Prompted by some recent discussion regarding 'goto' statements and
>>> state machines, I would like to propose a programming exercise.
>>> (It is perhaps a bit too large to be called an exercise, but not so
>>> difficult that it deserves the label of challenge. On the other
>>> hand there are some constraints so maybe challenge is apropos. In
>>> any case somewhere in between those two bounds.)
>>>
>>> Short problem statement: a C program to remove comments from C
>>> source input.
>>>
>>> Specifics: Remove both /*...*/ and //... style comments. Don't
>>> worry about trigraphs. Read from stdin, write to stdout, and
>>> diagnostics (if any) go to stderr. If EOF is seen inside a
>>> comment, do something sensible but it doesn't matter what as long
>>> as it's sensible. Use no 'goto' statements. Limit function
>>> bodies to no more than 25 lines.
>>>
>>> Other: feel free to handle corner cases as you see fit, as long
>>> as there is some description of what choice was made.
>>>
>>> Hopefully it will be a fun exercise. It isn't trivial but it
>>> shouldn't take too long either.
>>
>> Shortest code "wins"?
>
> He didn't say anything about that, and code size isn't a good
> indicator of anything, so I'd say "no".

Right.

> It's a nice problem.

Thank you sir. Always good to hear.

> (Unfortunately I'm participating in Advent of Code already, and
> together with hobby projects that takes up all my programming
> attention.)

Sorry to hear you won't have time for this one, but good luck
with your other efforts.

Tim Rentsch

unread,
Dec 6, 2020, 9:27:05 AM12/6/20
to
dfs <nos...@dfs.com> writes:

> On 12/05/2020 11:25 AM, Tim Rentsch wrote:
>
>> Prompted by some recent discussion regarding 'goto' statements and
>> state machines, I would like to propose a programming exercise.
>> (It is perhaps a bit too large to be called an exercise, but not so
>> difficult that it deserves the label of challenge. On the other
>> hand there are some constraints so maybe challenge is apropos. In
>> any case somewhere in between those two bounds.)
>>
>> Short problem statement: a C program to remove comments from C
>> source input.
>>
>> Specifics: Remove both /*...*/ and //... style comments. Don't
>> worry about trigraphs. Read from stdin, write to stdout, and
>> diagnostics (if any) go to stderr. If EOF is seen inside a
>> comment, do something sensible but it doesn't matter what as long
>> as it's sensible. Use no 'goto' statements. Limit function
>> bodies to no more than 25 lines.
>>
>> Other: feel free to handle corner cases as you see fit, as long
>> as there is some description of what choice was made.
>>
>> Hopefully it will be a fun exercise. It isn't trivial but it
>> shouldn't take too long either.
>
> Shortest code "wins"?

I really don't intend grading them at all. My interest is to see
how different people approach the problem, considering the stated
limitations. As far as my own reactions go, there are several
ways I might judge the quality of a program, but program length
is not high on the list of any of those.

> I'm sure you know this challenge can't provide a reliable result if
> comment lines beginning with /* or // contain source code.

It surprises me that you say that. To me it seems obvious that
comments can be removed unambiguously, because C compilers do it.

> Do you have a 'tricky' block of code and comments to work from?

Part of the problem is to figure out what needs doing, so I feel
like showing a bunch of test cases would be, at least to an
extent, giving the game away. Assuming of course that is what
you meant.. or did you mean something else?

Bart

unread,
Dec 6, 2020, 9:51:55 AM12/6/20
to
On 05/12/2020 19:56, Bart wrote:

> It doesn't bother with possible comment delimiters inside '...'
> literals; those are only possible with multi-character constants and
> those I understand are implementation-defined. In any case it would just
> be one more state on top of the 4 already handled.

>         } else if (instring) {
>             if (c=='"') instring=0;
>             outchar(c);


I didn't allow for embedded " characters inside strings, as I'd
forgotten they were allowed.

That needs to be fixed, I believe by changing that middle line to:

if (c=='"' && lastc!='\\') instring=0;

As it is, it will go wrong with strings such as:

"abc\"/*comment*/\"def"

and probably worse with only one embedded ".

jacobnavia

unread,
Dec 6, 2020, 11:18:25 AM12/6/20
to
Here is an impoved version, with more comments and comments generated by
the line splicing feature.

#include <stdio.h>
#include <ctype.h>
/* This program will eliminate comments from a C source file given in
the command
line. It returns 5 if an abnormal eof was found in a single line
comment,
6 if EOF was found in a multi-line comment, 7 if EOF appeared within
a string,
8 if EOF occurred within a character qquote sequence.
If all is OK the result is zero.
*/
int main(int argc,char *argv[])
{
if (argc < 2) { // print usage
fprintf(stderr,"Usage: nocomments <file>\n");
return 3;
}
FILE *f=fopen(argv[1],"r");
// Note that this file is never closed. We rely on the operating
system to
// close it for us.
if (f == NULL) { // Error can't open
fprintf(stderr,"Impossible to open %s\n",argv[1]);
return 4;
}
int c = fgetc(f);
while (c != EOF) { // Until the end of the file
// There are 3 characters that have a special action required:
// 1) The / character that could introduce a comment
// 2) The " character that introduces a string
// 3) the ' character that introduces a character constant
if (c == '/') { // test for a comment
c = fgetc(f);
if (c == EOF) return 5;
if (c == '/') { // this is a comment until the end of the line
while (c != '\n' && c != EOF) {
c = fgetc(f); // ignore everything
}
if (c == EOF) return 5;
putchar(c); // Put the new line char
}
else if (c == '*') { // Multi line comment
while (c != EOF) {
c = fgetc(f);
if (c == EOF) return 6;
else if (c == '*') {
c = fgetc(f);
if (c == '/') break;
}
}
if (c == EOF) return 6;
// Note that we do NOT echo the '/' since it belongs to
the comment
}
else if (c == '\\') {
c = fgetc(f);
if (c == '\n') {
c = fgetc(f);
if (c== '/') {
while (c != EOF && c != '\n') {
c = fgetc(f);
}
if (c == EOF) return 5;
}
else { putchar('\\'); }
}
putchar(c);
}
else { // That was a lone /
putchar('/');
putchar(c);
}
}
else if (c == '"') { // String
putchar(c);
c = fgetc(f);
if (c == EOF) return 2;
while (c != EOF && c != '"') {
putchar(c);
if (c == '\\') {
c = fgetc(f);
if (c == EOF) return 7;
putchar(c);
}
c = fgetc(f); // Read the next char of the string
}
if (c == EOF) return 7;
putchar(c); // put the closing double quote
}
else if (c == '\'') { // Character constant
putchar(c);
c = fgetc(f);
while (c != '\'' && c != EOF) {
if (c == '\\') { // escaped char
putchar(c);
c = fgetc(f);
if (c == EOF) return 8;
}
putchar(c); // Not an escaped char, just a normal one
c = fgetc(f);
}
if (c == EOF) return 8;
else putchar(c); // Put the closing quote
}
else putchar(c); // All other characters besides '/', '"', and
' are echoed back
c = fgetc(f); // Read next character of the source file
}
return 0; // No errors
}
#include <stdio.h>
#include <ctype.h>
/* This program will eliminate comments from a C source file given in
the command
line. It returns 5 if an abnormal eof was found in a single line
comment,
6 if EOF was found in a multi-line comment, 7 if EOF appeared within
a string,
8 if EOF occurred within a character qquote sequence.
If all is OK the result is zero.
*/
int main(int argc,char *argv[])
{
if (argc < 2) { // print usage
fprintf(stderr,"Usage: nocomments <file>\n");
return 3;
}
FILE *f=fopen(argv[1],"r");
// Note that this file is never closed. We rely on the operating
system to
// close it for us.
if (f == NULL) { // Error can't open
fprintf(stderr,"Impossible to open %s\n",argv[1]);
return 4;
}
int c = fgetc(f);
while (c != EOF) { // Until the end of the file
// There are 3 characters that have a special action required:
// 1) The / character that could introduce a comment
// 2) The " character that introduces a string
// 3) the ' character that introduces a character constant
if (c == '/') { // test for a comment
c = fgetc(f);
if (c == EOF) return 5;
if (c == '/') { // this is a comment until the end of the line
while (c != '\n' && c != EOF) {
c = fgetc(f); // ignore everything
}
if (c == EOF) return 5;
putchar(c); // Put the new line char
}
else if (c == '*') { // Multi line comment
while (c != EOF) {
c = fgetc(f);
if (c == EOF) return 6;
else if (c == '*') {
c = fgetc(f);
if (c == '/') break;
}
}
if (c == EOF) return 6;
// Note that we do NOT echo the '/' since it belongs to
the comment
}
else if (c == '\\') {
c = fgetc(f);
if (c == '\n') {
c = fgetc(f);
if (c== '/') {
while (c != EOF && c != '\n') {
c = fgetc(f);
}
if (c == EOF) return 5;
}
else { putchar('\\'); }
}
putchar(c);
}
else { // That was a lone /
putchar('/');
putchar(c);
}
}
else if (c == '"') { // String
putchar(c);
c = fgetc(f);
if (c == EOF) return 2;
while (c != EOF && c != '"') {
putchar(c);
if (c == '\\') {
c = fgetc(f);
if (c == EOF) return 7;
putchar(c);
}
c = fgetc(f); // Read the next char of the string
}
if (c == EOF) return 7;
putchar(c); // put the closing double quote
}
else if (c == '\'') { // Character constant
putchar(c);
c = fgetc(f);
while (c != '\'' && c != EOF) {
if (c == '\\') { // escaped char
putchar(c);
c = fgetc(f);
if (c == EOF) return 8;
}
putchar(c); // Not an escaped char, just a normal one
c = fgetc(f);
}
if (c == EOF) return 8;
else putchar(c); // Put the closing quote
}
else putchar(c); // All other characters besides '/', '"', and
' are echoed back
c = fgetc(f); // Read next character of the source file
}
return 0; // No errors
}

jacobnavia

unread,
Dec 6, 2020, 11:52:05 AM12/6/20
to
Sorry, and error in cut/paste and the text appears twice!

jacobnavia

unread,
Dec 6, 2020, 12:00:53 PM12/6/20
to
Le 06/12/2020 à 15:13, Tim Rentsch a écrit :
>You haven't seen my contribution I suppose?

Ben Bacarisse

unread,
Dec 6, 2020, 12:51:34 PM12/6/20
to
Tim Rentsch <tr.1...@z991.linuxsc.com> writes:

> Prompted by some recent discussion regarding 'goto' statements and
> state machines, I would like to propose a programming exercise.
> (It is perhaps a bit too large to be called an exercise, but not so
> difficult that it deserves the label of challenge. On the other
> hand there are some constraints so maybe challenge is apropos. In
> any case somewhere in between those two bounds.)
>
> Short problem statement: a C program to remove comments from C
> source input.

Apologies for non-C code but I could not resist...

In all the goto/no goto discussion it's easy to lose track of the fact
the C does not necessarily have the best primitives for constructing
this sort of program. I don't mean things like, say, complex regular
expressions, or built-in grammar parsing (al la Perl6). I mean basic
language constructs.

Here is a solution in Haskell. It is deliberately not very "Haskelly"
because I want to keep the code really simple for programmers unfamiliar
with the language.

main = do input <- getContents
putStr (stripComments (logicalLines input))

stripComments [] = []
stripComments ('/' : '/' : rest) = stripComments (dropWhile (/= '\n') rest)
stripComments ('/' : '*' : rest) = ' ' : stripComments (nestedComment rest)
stripComments (c : rest) = c : if c == '\'' || c == '"'
then skipQuote c rest
else stripComments rest

nestedComment ('*' : '/' : rest) = rest
nestedComment (c : rest) = nestedComment rest
nestedComment [] = error "Unterminated comment"

skipQuote q ('\\' : c : rest) = '\\' : c : skipQuote q rest
skipQuote q (c : rest) = c : if c == q
then stripComments rest
else skipQuote q rest
skipQuote q [] = error "Unterminated literal"

logicalLines ('\\' : '\n' : rest) = logicalLines rest
logicalLines (c : rest) = c : logicalLines rest
logicalLines [] = []

(Notes: string are lists and : is the list "cons" operation. [] is the
empty list. Function application is by juxtaposition: f x and is the
highest priority operator. Functions are defined by cases that are
written using simple pattern matching. For example, stripComments [] =
[] means return nothing when passed nothing.)

Obviously for some there will be a high language barrier, but I think
it's worth having a go at following what's happening.

Haskell has two big advantages for this exercise: pattern matched cases
and lazy evaluation. The case-by-case definition lets you say, very
clearly, how to handle each situation, and the lazy evaluation means you
can chain function calls to make processing pipelines like this:

putStr (strip (logicalLines input))

This means output the string of characters that result from stripping
comments from the logical lines, but at no time is the whole input in
memory. Characters are fetched as needed to satisfy each function call.

Originally, I just wrote stripComments, but when I decided to include
logical line processing (C's \<newline> "continuation" mechanism) it was
trivial to put it in. If I want to do trigraphs, that, too, would be a
simple addition to the pipeline. (There is even an operator to make
this simpler, but I did not want to add too many peculiar symbols.)

I am now tempted to try to write a C version that preserves as much of
what I see as the clarity and simplicity of this solution. I am not as
hopeful as I'd like to be about that!

--
Ben.

dfs

unread,
Dec 6, 2020, 1:03:44 PM12/6/20
to
Thanks for that Haskell lesson!

I'm working on a C program that lists .csv files, then imports and
summarizes them, lets you search for values in fields, etc. How well
does Haskell lend itself to that kind of thing?




Bart

unread,
Dec 6, 2020, 2:53:58 PM12/6/20
to
I followed it enough to try and recreate the approach in my script language.

However the first real input I gave it, I got a stack overflow after 500
lines.

----------------------------------
https://github.com/sal55/langs/blob/master/comments.q

(Not lazily evaluated. tail/drop are like the Haskell versions. Tested -
after I increased the stack size - on a 900-line module, which appeared
to have the comments elided, and which compiled. But the executable
showed some differences from the original.

But I don't know well your version would have worked as I can't test it.

I haven't bothered with C's intra-token line continuation, as it's rare
a program actually makes use of it.)
----------------------------------

>
> Haskell has two big advantages for this exercise: pattern matched cases
> and lazy evaluation. The case-by-case definition lets you say, very
> clearly, how to handle each situation, and the lazy evaluation means you
> can chain function calls to make processing pipelines like this:
>
> putStr (strip (logicalLines input))
>
> This means output the string of characters that result from stripping
> comments from the logical lines, but at no time is the whole input in
> memory. Characters are fetched as needed to satisfy each function call.

It also encourages recursive algorithms which as I found can cause problems.

The pattern matching is rather nice. But if it it doesn't work, I'd
imagine it's harder to debug.

Tim Rentsch

unread,
Dec 6, 2020, 2:59:06 PM12/6/20
to
Sjouke Burry <burrynu...@ppllaanneett.nnll> writes:

> On 05.12.20 17:25, Tim Rentsch wrote:
>
>> [statement of problem]
>
> If you show yours , maybe some of us will show theirs........

I'm interested to see what people come up with on their own, and
posting some of my code early would interfere with that. So I
would like to wait until there are two or more submissions that
at least make an effort to follow the guidelines. At some point
I expect I will time out and post something anyway even if there
are no submissions, but I don't have any fixed time in mind for
when that would be.

By the way, the guidelines are not arbitrary. I think everyone
understands that the problem can be solved with a very large
function with lots of goto's, or a very large function with a big
switch() statement in a while() loop. Both goto's and large
functions have been argued against in the earlier thread about
goto's and state machines. Part of my reason for suggesting this
problem (and with the guidelines as stated) is to see how trying
to follow that advice might affect a solution to this kind of
problem.

Tim Rentsch

unread,
Dec 6, 2020, 3:32:06 PM12/6/20
to
jacobnavia <ja...@jacob.remcomp.fr> writes:

> You haven't seen my contribution I suppose?

I see that you have posted two different versions, and have
dutifully made copies of both, but have not yet looked at either
in detail. The same holds for other peoples' postings also.

Bart

unread,
Dec 6, 2020, 5:27:26 PM12/6/20
to
On 06/12/2020 16:18, jacobnavia wrote:
> Here is an impoved version, with more comments and comments generated by
> the line splicing feature.
>
> #include <stdio.h>
> #include <ctype.h>
> /* This program will eliminate comments from a C source file given in
> the command

<snip>

Strangely this has the same issue as Kaz's version. Given this input:

----------------------------------------------------------
#define TKFLG_DONTFOLD 0x100 /* Omit constant folding
optimizations */

/************** End of parse.h
***********************************************/
/************** Continuing where we left off in sqliteInt.h
******************/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <stddef.h>

/*
** Use a macro to replace memcpy() if compiled with SQLITE_INLINE_MEMCPY.
** This allows better measurements of where memcpy() is used when running
** cachegrind. But this macro version of memcpy() is very slow so it
** should not be used in production. This is a performance measurement
** hack only.
*/
#ifdef SQLITE_INLINE_MEMCPY
----------------------------------------------------------

The output is:

----------------------------------------------------------
#define TKFLG_DONTFOLD 0x100



#ifdef SQLITE_INLINE_MEMCPY
----------------------------------------------------------

(Full file: https://www.sqlite.org/download.html, first source code link.)

Ben Bacarisse

unread,
Dec 6, 2020, 6:39:12 PM12/6/20
to
Bart <b...@freeuk.com> writes:
<cut>
>> Haskell has two big advantages for this exercise: pattern matched cases
>> and lazy evaluation. The case-by-case definition lets you say, very
>> clearly, how to handle each situation, and the lazy evaluation means you
>> can chain function calls to make processing pipelines like this:
>>
>> putStr (strip (logicalLines input))
>>
>> This means output the string of characters that result from stripping
>> comments from the logical lines, but at no time is the whole input in
>> memory. Characters are fetched as needed to satisfy each function call.
>
> It also encourages recursive algorithms which as I found can cause
> problems.

That's a bit muddled. You reported a stack overflow with a small input.
Recursive /algorithms/ don't cause such problems, language
implementations do.

> The pattern matching is rather nice. But if it it doesn't work, I'd
> imagine it's harder to debug.

Harder than what? I haven't seen an example posted so far that I'd
rather debug than this one.

--
Ben.

Ben Bacarisse

unread,
Dec 6, 2020, 6:53:38 PM12/6/20
to
dfs <nos...@dfs.com> writes:

> I'm working on a C program that lists .csv files, then imports and
> summarizes them, lets you search for values in fields, etc. How well
> does Haskell lend itself to that kind of thing?

Not sure how to answer that sort of question and it's going to lead to
more off-topic discussion. There is comp.lang.haskell. It's a bit
quiet, but maybe it could be revived by a question just like this one!

--
Ben.

Bart

unread,
Dec 6, 2020, 7:17:52 PM12/6/20
to
On 06/12/2020 23:38, Ben Bacarisse wrote:
> Bart <b...@freeuk.com> writes:
> <cut>
>>> Haskell has two big advantages for this exercise: pattern matched cases
>>> and lazy evaluation. The case-by-case definition lets you say, very
>>> clearly, how to handle each situation, and the lazy evaluation means you
>>> can chain function calls to make processing pipelines like this:
>>>
>>> putStr (strip (logicalLines input))
>>>
>>> This means output the string of characters that result from stripping
>>> comments from the logical lines, but at no time is the whole input in
>>> memory. Characters are fetched as needed to satisfy each function call.
>>
>> It also encourages recursive algorithms which as I found can cause
>> problems.
>
> That's a bit muddled. You reported a stack overflow with a small input.
> Recursive /algorithms/ don't cause such problems, language
> implementations do.

You need an implementation to run the code.

My version of your algorithm appears to use a maximum call depth
(counting only calls to stripcomments, skipquote, nestedcomment) equal
to the number of generated characters in the output.

With the default stack size of my language, which suffices for most
things including running the non-recursive version on inputs up to 8MB,
it could only run this on inputs of 15KB (output of 7KB as it collapses
some comments to a space).

Practicality matters!

>> The pattern matching is rather nice. But if it it doesn't work, I'd
>> imagine it's harder to debug.
>
> Harder than what? I haven't seen an example posted so far that I'd
> rather debug than this one.

I wouldn't know where to put my print statements. I find recursive code
difficult anyway.


Ben Bacarisse

unread,
Dec 6, 2020, 9:10:05 PM12/6/20
to
Bart <b...@freeuk.com> writes:

> On 06/12/2020 23:38, Ben Bacarisse wrote:
>> Bart <b...@freeuk.com> writes:
>> <cut>
>>>> Haskell has two big advantages for this exercise: pattern matched cases
>>>> and lazy evaluation. The case-by-case definition lets you say, very
>>>> clearly, how to handle each situation, and the lazy evaluation means you
>>>> can chain function calls to make processing pipelines like this:
>>>>
>>>> putStr (strip (logicalLines input))
>>>>
>>>> This means output the string of characters that result from stripping
>>>> comments from the logical lines, but at no time is the whole input in
>>>> memory. Characters are fetched as needed to satisfy each function call.
>>>
>>> It also encourages recursive algorithms which as I found can cause
>>> problems.
>>
>> That's a bit muddled. You reported a stack overflow with a small input.
>> Recursive /algorithms/ don't cause such problems, language
>> implementations do.
>
> You need an implementation to run the code.

Yes. Yours appears to be unsuitable. That's not a criticism -- it's
not designed for this sort of code. Haskell implementations are (or
should be).

> My version of your algorithm appears to use a maximum call depth
> (counting only calls to stripcomments, skipquote, nestedcomment) equal
> to the number of generated characters in the output.

It's almost certainly not the same algorithm since your language is
probably not lazy. There's a grey area about what makes two bits of
code be "the same algorithm" so this is probably a rabbit hole we don't
need to go down.

> With the default stack size of my language, which suffices for most
> things including running the non-recursive version on inputs up to
> 8MB, it could only run this on inputs of 15KB (output of 7KB as it
> collapses some comments to a space).
>
> Practicality matters!

Yes. Apparently you should not write the sort of code you wrote in your
language (at least as it is currently implemented).

--
Ben.

jacobnavia

unread,
Dec 7, 2020, 3:37:54 AM12/7/20
to
I corrected that. The problem is that if I read a '*', I read the next
character to see if it is a '/' that would close the multi-line comment.
If it is NOT, I ignore the character and just go on reading the next
character. If the next character is a '/' I ignore it too since it
wasn't after a '*', that I discarded!

Solution: Just make an ungetc. Corrected version is posted soon.

Tim Rentsch

unread,
Dec 7, 2020, 4:04:17 AM12/7/20
to
Ben Bacarisse <ben.u...@bsb.me.uk> writes:

> Tim Rentsch <tr.1...@z991.linuxsc.com> writes:
>
>> Prompted by some recent discussion regarding 'goto' statements and
>> state machines, I would like to propose a programming exercise.
>> (It is perhaps a bit too large to be called an exercise, but not so
>> difficult that it deserves the label of challenge. On the other
>> hand there are some constraints so maybe challenge is apropos. In
>> any case somewhere in between those two bounds.)
>>
>> Short problem statement: a C program to remove comments from C
>> source input.
>
> Apologies for non-C code but I could not resist...
>
> In all the goto/no goto discussion its easy to lose track of the fact
> [..some explanation of Haskell language..]

It's a nice program, but it doesn't quite do what was asked for.
The problem statement says to remove comments. This program does
remove comments, but it also takes out line splices to replace
multiple physical lines with their corresponding logical line.
Line continuation sequences outside of comments should be
preserved in the output.

> Haskell has two big advantages for this exercise: pattern matched
> cases and lazy evaluation. [..explanation..]

Also one not mentioned: 'input <- getContents'. An analogous
functionality in C would read the contents of a file into a
character array, which would make the C program easier.

> Originally, I just wrote stripComments, but when I decided to
> include logical line processing (C's \<newline> "continuation"
> mechanism) it was trivial to put it in. [...]

Line continuation sequences have to be processed if the original
problem statement is to be met. What was done here, with the
function 'logicalLines', is easy, but it changes the problem
being solved.

> I am now tempted to try to write a C version that preserves as much
> of what I see as the clarity and simplicity of this solution. I am
> not as hopeful as I'd like to be about that!

I am a fan of Haskell as you know, and I enjoyed reading your
program here. However, what I think would be more in keeping
with the spirit of the requested problem is to write in Haskell
something that is more like what might be done in a C program.
Output might be done functionally, producing a list as a result,
but input should be one character at a time, like what would
happen in a normal C program running a state machine. Doing that
would make the program more accessible to those people who are
experienced C programmers but lack experience in Haskell or
functional programming. That strikes me as a better fit for a
comp.lang.c posting.

Of course, independent of that, I also would like to see your try
at a C program as the problem was originally asked. So I hope we
will see one or both of those suggestions from you sometime soon.

Ben Bacarisse

unread,
Dec 7, 2020, 7:05:36 AM12/7/20
to
Tim Rentsch <tr.1...@z991.linuxsc.com> writes:

> It's a nice program, but it doesn't quite do what was asked for.
> The problem statement says to remove comments. This program does
> remove comments, but it also takes out line splices to replace
> multiple physical lines with their corresponding logical line.
> Line continuation sequences outside of comments should be
> preserved in the output.

Ah, I didn't see that in the specification. Given the phases described
in the standard, I implemented what I guessed was the expected
behaviour.

> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
<cut>
>> Haskell has two big advantages for this exercise: pattern matched
>> cases and lazy evaluation. [..explanation..]
>
> Also one not mentioned: 'input <- getContents'. An analogous
> functionality in C would read the contents of a file into a
> character array, which would make the C program easier.

I don't think it's analogous at all! 'input' <- getContents does no IO.
Input is fetched only as needed to perform computations on 'input' and
processed data will be considered garbage.

> Of course, independent of that, I also would like to see your try
> at a C program as the problem was originally asked. So I hope we
> will see one or both of those suggestions from you sometime soon.

Preserving uncommented line continuations makes the problem seem a
little contrived to me. I'm not tempted to have a go.

--
Ben.

Bart

unread,
Dec 7, 2020, 7:25:51 AM12/7/20
to
The approach I used was to turn everything normally part of a comment,
into a space, except for new lines inside /*...*/ comments (which I
implemented as being un-nested; your program seems to assume that they
nest).

That should mean the output being the same size as the input.

It doesn't touch \ line continuations, other than '// abc \' comments
needing to continue to a subsequent line. Here the \ is turned into a
space too, with the rest of the comment.

It doesn't deal with \ line continuation inside //:

... /\
/ comment

And it assumes single-character 'x' constants (otherwise it would have
to treat them like strings in case comment delimiters are inside).

The spec did say to make your own decisions on corner cases.

Richard Damon

unread,
Dec 7, 2020, 7:36:49 AM12/7/20
to
Been thinking a bit of the problem, and this shows one of the traps in
it. We are asked to remove the comments but the lines ending in \ being
connected together aren't in general comments, so unless they are inside
comments need to be retained, so we can't just remove them at an inner
level or we are doing an transformation that wasn't in the spec.

Malcolm McLean

unread,
Dec 7, 2020, 8:15:35 AM12/7/20
to
On Monday, 7 December 2020 at 12:05:36 UTC, Ben Bacarisse wrote:
>
> Preserving uncommented line continuations makes the problem seem a
> little contrived to me. I'm not tempted to have a go.
>
That's a classic gotcha that occurs all the time in industrial programming.

The client asks for something, maybe tries to specify it formally or maybe
just asks for it informally, then finds that whilst the result does most of
what he wants, there are a few cases which it doesn't handle properly.
It might be technically what he asked for, or it might be something that
wasn't considered until the first candidate program was delivered. Either
way, he's paying the bills.

The program structure often deteriorates as you try to accommodate
these requests without starting from scratch.

Ben Bacarisse

unread,
Dec 7, 2020, 8:33:30 AM12/7/20
to
Bart <b...@freeuk.com> writes:

> The approach I used was to turn everything normally part of a comment,
> into a space, except for new lines inside /*...*/ comments (which I
> implemented as being un-nested; your program seems to assume that they
> nest).

No, it does what C expects (unless I've got bugs in there I've not
spotted).

> It doesn't touch \ line continuations, other than '// abc \' comments
> needing to continue to a subsequent line. Here the \ is turned into a
> space too, with the rest of the comment.
>
> It doesn't deal with \ line continuation inside //:
>
> ... /\
> / comment

There are other issues with them as well of course. Dealing with them
whilst leaving them in (when uncommented) seems unnatural to me given
then description of the compiler's phases in the standard.

--
Ben.

Ben Bacarisse

unread,
Dec 7, 2020, 8:42:23 AM12/7/20
to
Yes. This is a big part of my complaint to Bart that using a goto is
often the simplest solution. As accumulated "simple" changes go, adding
in a goto is one of the most risky.

--
Ben.

Bart

unread,
Dec 7, 2020, 9:19:14 AM12/7/20
to
On 07/12/2020 13:33, Ben Bacarisse wrote:
> Bart <b...@freeuk.com> writes:
>
>> The approach I used was to turn everything normally part of a comment,
>> into a space, except for new lines inside /*...*/ comments (which I
>> implemented as being un-nested; your program seems to assume that they
>> nest).
>
> No, it does what C expects (unless I've got bugs in there I've not
> spotted).

Well, you did call the function nestedComment!

>
>> It doesn't touch \ line continuations, other than '// abc \' comments
>> needing to continue to a subsequent line. Here the \ is turned into a
>> space too, with the rest of the comment.
>>
>> It doesn't deal with \ line continuation inside //:
>>
>> ... /\
>> / comment
>
> There are other issues with them as well of course. Dealing with them
> whilst leaving them in (when uncommented) seems unnatural to me given
> then description of the compiler's phases in the standard.
>

If you mean needing to deal with /\/ (with a line break after \) in
order to recognise the // comment, but leaving ma\in unchanged, that
does seem at odds with how C is normally processed.

But this is isn't normal processing, it's removing comments while
preserving as much as possible of non-comments.

So I think it is reasonable.

(Except I don't personally agree with splitting tokens across lines, so
will not implement it. Anyone who splits "//" across two, or possibly N
lines, should be fired.

It's not even a feature that would allow arbitrary insertion of
\<newline> anywhere (eg. inside string literals, or just before an
existing \<newline>) so I can't see the point.

I guess it was a side-effect of some early implementation, that for some
reason has been perpetuated.)

Bart

unread,
Dec 7, 2020, 9:31:53 AM12/7/20
to
On 07/12/2020 14:18, Bart wrote:

> (Except I don't personally agree with splitting tokens across lines, so
> will not implement it. Anyone who splits "//" across two, or possibly N
> lines, should be fired.
>
> It's not even a feature that would allow arbitrary insertion of
> \<newline> anywhere (eg. inside string literals, or just before an
> existing \<newline>) so I can't see the point.

I keep forgetting that \ would be followed by <newline>, so maybe it
/is/ possible to allow it everywhere, even put every character of a
source file on its own line.

I still think it is not necessary to allow tokens to be split across lines.

Actually it's funny that C allows so much freedom with \ given that,
unlike many languages, C hardly ever needs to use \ to continue code
into a following line. (I think only for macro bodies; the //abc\ case
is a misfeature.)

james...@alumni.caltech.edu

unread,
Dec 7, 2020, 10:03:43 AM12/7/20
to
On Monday, December 7, 2020 at 4:04:17 AM UTC-5, Tim Rentsch wrote:
> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>
> > Tim Rentsch <tr.1...@z991.linuxsc.com> writes:
> >
> >> Prompted by some recent discussion regarding 'goto' statements and
> >> state machines, I would like to propose a programming exercise.
> >> (It is perhaps a bit too large to be called an exercise, but not so
> >> difficult that it deserves the label of challenge. On the other
> >> hand there are some constraints so maybe challenge is apropos. In
> >> any case somewhere in between those two bounds.)
> >>
> >> Short problem statement: a C program to remove comments from C
> >> source input.
...
> It's a nice program, but it doesn't quite do what was asked for.
> The problem statement says to remove comments. This program does
> remove comments, but it also takes out line splices to replace
> multiple physical lines with their corresponding logical line.
> Line continuation sequences outside of comments should be
> preserved in the output.

The standard never uses the term "line continuation". The standard does refer
to "a backslash character (\) immediately followed by a new-line character",
but doesn't name that construct. They are often referred to outside the
standard as "escaped newlines". It does talk about "splicing" and "logical
lines" in connection with escaped newlines, so I would presume that this is
what you're referring to?

C comments exist only in the context described by the standard. If it isn't in
that context, it isn't properly a C comment, it's a C-like comment. That
context includes the translation phases. Translation phase 2 is removal of line
continuations. Translation phase 3 includes removal of comments. The
translation phases are not required to actually occur in the order specified,
but it is required that the end result be the same as if they did occur in that
order. Therefore, it doesn't make much sense to me to describe a program as
removing C comments, if it's also supposed to preserve some, but not all
escaped newlines.

I could understand a program that implements translation phase 2 and part of
translation phase 3, removing first the escaped newlines, and then removing the
comments. It also makes sense to me to define a program that implements
only part of translation phase 3, with the built-in assumption that the input
has already had all escaped newlines removed (with the corresponding
possibility of malfunction if that assumption is incorrect). I don't see a
good motivation for a program that does what you describe.

> > Haskell has two big advantages for this exercise: pattern matched
> > cases and lazy evaluation. [..explanation..]
>
> Also one not mentioned: 'input <- getContents'. An analogous
> functionality in C would read the contents of a file into a
> character array, which would make the C program easier.
> > Originally, I just wrote stripComments, but when I decided to
> > include logical line processing (C's \<newline> "continuation"
> > mechanism) it was trivial to put it in. [...]
>
> Line continuation sequences have to be processed if the original
> problem statement is to be met.

Line continuations were not mentioned in the original problem description.
Handling of line continuations can be considered to have been implied only if
you want them to be handled as described by the standard. What you want is
different from what the standard specifies; as such, you should have said so
explicitly.

Malcolm McLean

unread,
Dec 7, 2020, 10:13:14 AM12/7/20
to
Yes don't think anyone ever sat down and said "shall we allow comment
delimiter tokens to be split across logical line continuations ?". Rather it was
decided that logical line continuation would be a low-level, first pass operation.

This has the effect of creating a difficult special case for programs which
strip comments but preserve line continuation tokens. One option is to
simply ignore the split token case. But this means shipping a deliberate bug,
and something which could potentially be exploited. That might or might not
matter, it depends who is using the program. It's unlikely that a non-malicious
human user would split a comment token. On the other hand, if code has been
put through a fairly primitive "max line length" automatic filter, there might be
a few odd non-contrived cases.

Richard Damon

unread,
Dec 7, 2020, 12:58:23 PM12/7/20
to
I think it came about as C was (is?) a common 'output' language for many
code generators, and coupled with fixed 80 column screens, having a
fixed output width of the code was important, so being able to just
split the code anywhere was useful.

Combined with the simplistic multi-phase processing, it wasn't that hard
to do, so it became the 'standard', and thus was put into the Standard.

Anton Shepelev

unread,
Dec 7, 2020, 4:27:20 PM12/7/20
to
Tim Rentsch:

> If EOF is seen inside a comment, do something sensible but
> it doesn't matter what as long as it's sensible.

Is just stopping the program not sensible, e.g.:

void main( void ) {}
/* commenging...

may output

void main( void ) {}

and call it run?

--
() ascii ribbon campaign -- against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

luser droog

unread,
Dec 7, 2020, 4:44:53 PM12/7/20
to
On Saturday, December 5, 2020 at 10:25:36 AM UTC-6, Tim Rentsch wrote:
> Prompted by some recent discussion regarding 'goto' statements and
> state machines, I would like to propose a programming exercise.
> (It is perhaps a bit too large to be called an exercise, but not so
> difficult that it deserves the label of challenge. On the other
> hand there are some constraints so maybe challenge is apropos. In
> any case somewhere in between those two bounds.)
>
> Short problem statement: a C program to remove comments from C
> source input.
>
> Specifics: Remove both /*...*/ and //... style comments. Don't
> worry about trigraphs. Read from stdin, write to stdout, and
> diagnostics (if any) go to stderr. If EOF is seen inside a
> comment, do something sensible but it doesn't matter what as long
> as it's sensible. Use no 'goto' statements. Limit function
> bodies to no more than 25 lines.
>
> Other: feel free to handle corner cases as you see fit, as long
> as there is some description of what choice was made.
>
> Hopefully it will be a fun exercise. It isn't trivial but it
> shouldn't take too long either.

Rough translation of Ben Bacarisse's Haskell program to C. The logic
could probably be made nicer with a pattern matching construct.
But the way I would do that would require another type for Symbols.

strpcom.c:

//#define DEBUG(...) __VA_ARGS__
#define DEBUG(...)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef union uobject *object;
typedef object fSuspension( object );
typedef object list;
typedef enum tag { INVALID, INTEGER, LIST, SUSPENSION, VOID } tag;

union uobject { tag t;
struct { tag t; int i; } Int;
struct { tag t; object a, b; } List;
struct { tag t; object v; fSuspension *f; } Suspension;
struct { tag t; void *v; } Void;
};


list skip_quote( list o );
list nested_comment( list o );


#define OBJECT(...) new_( (union uobject[]){{ __VA_ARGS__ }} )
object new_( object a ){
object p = calloc( 1, sizeof *p );
return p ? *p = *a, p : 0;
}


object Int( int i ){ return OBJECT( .Int = { INTEGER, i } ); }

list cons( object a, object b ){ return OBJECT( .List = { LIST, a, b } ); }
list one( object a ){ return cons( a, NULL ); }
object car( list o ){ return o && o->t == LIST ? o->List.a : NULL; }
object cdr( list o ){ return o && o->t == LIST ? o->List.b : NULL; }

object Void( void *v ){ return OBJECT( .Void = { VOID, v } ); }

object Suspension( void *v, fSuspension *f ){
return OBJECT( .Suspension = { SUSPENSION, v, f } );
}

int eq( object a, object b ){
return !a && !b ? 1 :
!a || !b ? 0 :
a->t != b->t ? 0 :
!memcmp( a, b, sizeof *a ) ? 1 : 0;
}

object force( object o ){
return !o || o->t != SUSPENSION ? o :
force( o->Suspension.f( o->Suspension.v ) );
}

list take( int i, list o ){
return i == 0 ? NULL :
!o ? NULL :
( *o = *force( o ),
cons( car( o ), take( i-1, cdr( o ) ) ) );
}
list drop( int i, list o ){
return i == 0 ? o :
!o ? NULL :
( *o = *force( o ),
drop( i-1, cdr( o ) ) );
}

static list
force_chars_from_file( object file ){
FILE *f = file->Void.v;
int c = fgetc( f );
return c != EOF ? cons( Int( c ), Suspension( file, force_chars_from_file ) )
: cons( Int( EOF ), NULL );
}

list chars_from_file( FILE *f ){
return f ? Suspension( Void( f ), force_chars_from_file ) : NULL;
}

list logical_lines( list o ){
object a = take( 1, o );
if( eq( car( a ), Int( '\\' ) ) ){
object as = drop( 1, o );
object b = take( 1, as );
if( eq( car( b ), Int( '\n' ) ) ){
return Suspension( drop( 1, as ), logical_lines );
} else {
return cons( a, Suspension( as, logical_lines ) );
}
} else {
return !eq( car( a ), Int( EOF ) )
? cons( car( a ), Suspension( drop( 1, o ), logical_lines ) )
: one( car( a ) );
}
}

list strip_comments( list o ){
object a = take( 1, o );
DEBUG( fprintf( stderr, "a='%c'\n", car( a )->Int.i ); )
if( eq( car( a ), Int( '/' ) ) ){
object as = drop( 1, o );
object b = take( 1, as );
DEBUG( fprintf( stderr, "b='%c'\n", car( b )->Int.i ); )
if( eq( car( b ), Int( '/' ) ) ) {
do {
as = drop( 1, as );
b = take( 1, as );
DEBUG( fprintf( stderr, "b='%c'\n", car( b )->Int.i ); )
} while( !eq( car( b ), Int( '\n' ) ) );
return cons( car( b ), Suspension( drop( 1, as ), strip_comments ) );
} else if( eq( car( b ), Int( '*' ) ) ){
return cons( Int(' '), nested_comment( drop( 1, as ) ) );
} else {
return cons( car( a ), Suspension( as, strip_comments ) );
}
} else {
if( eq( car( a ), Int( '\'' ) ) || eq( car( a ), Int( '"' ) ) ){
return skip_quote( o );
}
return !eq( car( a ), Int( EOF ) )
? cons( car( a ), Suspension( drop( 1, o ), strip_comments ) )
: one( car( a ) );
}
}

list nested_comment( list o ){
object a = take( 1, o );
if( eq( car( a ), Int( '*' ) ) ){
object as = drop( 1, o );
object b = take( 1, as );
if( eq( car( b ), Int( '/' ) ) ){
return drop( 1, as );
} else {
return nested_comment( as );
}
} else {
if( eq( car( a ), Int( EOF ) ) ){
fprintf( stderr, "Unterminated literal\n" );
exit( 1 );
}
return nested_comment( drop( 1, o ) );
}
}

list skip_quote( list o ){
object q = car( o );
o = cdr( o );
object a = take( 1, o );
if( eq( car( a ), Int( '\\' ) ) ){
return cons( car( a ),
cons( car( drop( 1, o ) ),
skip_quote( cons( q, drop( 2, o ) ) ) ) );
} else if( eq( car( a ), q ) ){
return cons( car( a ),
strip_comments( drop( 1, o ) ) );
}
return cons( car( a ),
skip_quote( cons( q, drop( 1, o ) ) ) );
}

void print( list o ){
switch( o ? o->t : 0 ){
default: break;
case INTEGER: if( o->Int.i != EOF ) fputc( o->Int.i, stdout ); break;
case SUSPENSION:
case LIST: print( car( take( 1, o ) ) );
print( drop( 1, o ) ); break;
}
}


int main(){
list input = chars_from_file( stdin );
print( strip_comments( logical_lines( input ) ) );
}


$ cat test
strip //comments
line \
continue
// comment \
continue
/* nested /* comment
*/
"/*quoted*/"
eof
$ ./strpcom < test
strip
line continue


"/*quoted*/"
eof

Ben Bacarisse

unread,
Dec 7, 2020, 4:56:12 PM12/7/20
to
Bart <b...@freeuk.com> writes:

> On 07/12/2020 13:33, Ben Bacarisse wrote:
>> Bart <b...@freeuk.com> writes:
>>
>>> The approach I used was to turn everything normally part of a comment,
>>> into a space, except for new lines inside /*...*/ comments (which I
>>> implemented as being un-nested; your program seems to assume that they
>>> nest).
>>
>> No, it does what C expects (unless I've got bugs in there I've not
>> spotted).
>
> Well, you did call the function nestedComment!

Yes, that's a bad name. The "nested" had nothing to do with the
comments nesting (I just meant "inside") but it gives the wrong
impression altogether.

--
Ben.

luser droog

unread,
Dec 7, 2020, 5:02:02 PM12/7/20
to
On Monday, December 7, 2020 at 3:44:53 PM UTC-6, luser droog wrote:
> On Saturday, December 5, 2020 at 10:25:36 AM UTC-6, Tim Rentsch wrote:
> > Short problem statement: a C program to remove comments from C
> > source input.

> Rough translation of Ben Bacarisse's Haskell program to C. The logic
> could probably be made nicer with a pattern matching construct.
> But the way I would do that would require another type for Symbols.
>
> strpcom.c:
>
<snip>

Probably needs to be used with the Boehm gc. The code leaks memory
like a firehose.

Ben Bacarisse

unread,
Dec 7, 2020, 5:16:59 PM12/7/20
to
luser droog <luser...@gmail.com> writes:

> On Saturday, December 5, 2020 at 10:25:36 AM UTC-6, Tim Rentsch wrote:
<cut>
>> Short problem statement: a C program to remove comments from C
>> source input.
<cut>
> Rough translation of Ben Bacarisse's Haskell program to C.

Ingenious. Sorry I led you astray by misinterpreting the specification!

Before I learned about that I was considering something a bit like this,
but I was hoping for less "machinery". Not a criticism -- I'm not at
all sure one can get away with less.

One thing pops into my head though. Do you need to separate car/cdr
from take? At first glance, I'd be tempted to make car and cdr force
any suspension node to a list node. I think it would tidy up some of
code and I don't think you lose any laziness. Thoughts?

<cut code>
--
Ben.

Anton Shepelev

unread,
Dec 7, 2020, 5:59:46 PM12/7/20
to
Tim Rentsch:

> Short problem statement: a C program to remove comments
> from C source input.
>
> Specifics: Remove both /*...*/ and //... style comments.
> Don't worry about trigraphs. Read from stdin, write to
> stdout, and diagnostics (if any) go to stderr. If EOF is
> seen inside a comment, do something sensible but it
> doesn't matter what as long as it's sensible. Use no
> 'goto' statements. Limit function bodies to no more than
> 25 lines.

I started writing Wirth-like tokeniser, became tired of it
and swithced to an explicite state machine, producing the
dumbest entry in this public exercise. It is too late, I
have to work in the morning, so am posting my program right
now, without extensive testing. Come, I do not even know C
well enough to construct the tests.

By the way, am I the only one to be hard-pressed to keep the
functions within the 25-line limit, for botjh of my
functions have exactly 25 lines. I hope you have an
insecticide handy. Here it goes:

/* ------------------------------ No comments ------------------------------- *
/* A program to strip comments from a C file, by Anton Shepelev */


/* Machine states: */
/* TODO: devise a consistent, annotated, but still very brief naming: */
typedef enum t_state
{ S_CD, S_BC, S_CR, S_SR, S_LC,
S_CM1, S_BC0, S_CE, S_SE, } t_state;

/* A state transition: */
/* TODO: buf is used in one entry only => hardcode: */
typedef struct
{ t_state state; /* state */
int c; /* input */
t_state next; /* next state */
char buf; /* buffer input? */
char print; /* print input? */
} t_trans;

/* Transition table: */
/* EOF input denotes the fall-back: */
t_trans xtab[] = {
/* state char next buffer print */
{ S_CD, '/' , S_CM1, 1, 0 },
{ S_CD, '"' , S_SR, 0, 1 },
{ S_CD, '\'', S_CR, 0, 1 },
{ S_CD, EOF , S_CD, 0, 1 },
{ S_CM1, '/' , S_LC, 0, 0 },
{ S_CM1, '*' , S_BC, 0, 0 },
{ S_CM1, EOF , S_CD, 0, 1 },
{ S_LC, '\n', S_CD, 0, 1 },
{ S_LC, EOF , S_LC, 0, 0 },
{ S_BC, '*' , S_BC0, 0, 0 },
{ S_BC, EOF , S_BC, 0, 0 },
{ S_BC0, '/' , S_CD, 0, 0 },
{ S_BC0, '*' , S_BC0, 0, 0 }, /* <== thanks to Bart's test case */
{ S_BC0, EOF , S_BC, 0, 0 },
{ S_SR, '\\', S_SE, 0, 0 },
{ S_SR, '"' , S_CD, 0, 1 },
{ S_SR, EOF , S_SR, 0, 1 },
{ S_SE, EOF, S_SR, 0, 1 },
{ S_CR, '\\', S_CE, 0, 0 },
{ S_CR, '\'', S_CD, 0, 1 },
{ S_CR, EOF, S_CR, 0, 1 },
{ S_CE, EOF, S_CR, 0, 0 }
};

/* Perform a step on the state machine: */
/* OPT: nested array, indexed by 1) .state and 2) .char to avoid the loop: */
/* i.e.: trans = state_tab[state][c] */
void step( t_state *state, int c, int *buf )
{ t_trans *trans;
int i;

i = 0;
while( 1 ) /* transition look up */
{ trans = &xtab[i];
i = i + 1;

if( trans->state != *state ) continue;
if( trans->c != c )
if( trans->c != EOF ) continue;

break;
}

if( trans->print )
{ if( *buf != EOF ) putchar( *buf );
putchar( c ); }

if( trans->buf ) *buf = c;
else *buf = EOF;

*state = trans->next;
}

void main( void )
{ int c, p, buf, stop;
char cont;
t_state state;

stop = 0;
state = S_CD;
buf = EOF;
c = EOF;
while( 1 )
{ p = c; c = getc( stdin );

if( p == '\\' && c == '\n' )
{ cont = 1; continue; }

if( cont )
{ cont = 0; continue; }

if( p != EOF )
{ step( &state, p, &buf ); }

if( c == EOF )
{ break; }

luser droog

unread,
Dec 7, 2020, 6:11:06 PM12/7/20
to
Yes. Good idea. It looks like car and take are always used together, or
rather, whenever there's a take() there's inevitably a car() to make use
of the result. So I could factor out a lot of car()s by making a combined
first() function (better name?).

I'm thinking about how I might add pattern matching as simply as possible.
In the parser combinator code, I have a `list match( list pats, list a )`
function but it needs two kinds of Symbols (a "matching" symbol and
a "variable capturing" symbol) and an Operator type to pack actions
compactly into the patterns.
(https://github.com/luser-dr00g/pcomb/blob/master/pc9fp.c#L246)
Plus, it multplies the naming problem because each little clause needs its
function since there aren't a lambdas.

And then there's the garbage leaking problem... Maybe a tiny region
allocator algo, and a copying collector to get a clean list in a new self
contained region? A mark/sweep collector would need 80-100 lines I think.

Anton Shepelev

unread,
Dec 7, 2020, 6:16:37 PM12/7/20
to
Tim Rentsch:

> It's a nice program, but it doesn't quite do what was
> asked for. The problem statement says to remove comments.
> This program does remove comments, but it also takes out
> line splices to replace multiple physical lines with their
> corresponding logical line. Line continuation sequences
> outside of comments should be preserved in the output.

I had not realised that, and agree with James Kuyper:

> C comments exist only in the context described by the
> standard. If it isn't in that context, it isn't properly a
> C comment, it's a C-like comment. That context includes
> the translation phases. Translation phase 2 is removal of
> line continuations. Translation phase 3 includes removal
> of comments. The translation phases are not required to
> actually occur in the order specified, but it is required
> that the end result be the same as if they did occur in
> that order. Therefore, it doesn't make much sense to me to
> describe a program as removing C comments, if it's also
> supposed to preserve some, but not all escaped newlines.

Such details as line continuations are more low-level than
comments. They belong to the physical, "transport" level of
the file, where the the distinction between code and
comments is not yet available. Removing comments while
keeping line continuations is turning the natural order of
processing upside down. If you are not convinced, I will
consider amending my program tomorrow evening.

Richard Damon

unread,
Dec 7, 2020, 7:02:34 PM12/7/20
to
On 12/5/20 11:25 AM, Tim Rentsch wrote:
> Prompted by some recent discussion regarding 'goto' statements and
> state machines, I would like to propose a programming exercise.
> (It is perhaps a bit too large to be called an exercise, but not so
> difficult that it deserves the label of challenge. On the other
> hand there are some constraints so maybe challenge is apropos. In
> any case somewhere in between those two bounds.)
>
> Short problem statement: a C program to remove comments from C
> source input.
>
> Specifics: Remove both /*...*/ and //... style comments. Don't
> worry about trigraphs. Read from stdin, write to stdout, and
> diagnostics (if any) go to stderr. If EOF is seen inside a
> comment, do something sensible but it doesn't matter what as long
> as it's sensible. Use no 'goto' statements. Limit function
> bodies to no more than 25 lines.
>
> Other: feel free to handle corner cases as you see fit, as long
> as there is some description of what choice was made.
>
> Hopefully it will be a fun exercise. It isn't trivial but it
> shouldn't take too long either.
>

A few thoughts on some things that I see that many entries aren't handling:

Strings (and character constants):

The following are some valid strings:
""

"\""

"\\"

"\\\""

"foo\
bar"


An arbitrary number of \<nl> combinations can occur between the two
characters that define the start/end of the comment so the following is
a valid block comment


/\
\
\
* comment *\
\
\
/

Note also, that the spec doesn't say to remove \<nl> combinations that
are not in a comment so something like this could be tricky to get right:


a = b /\
\
\
\
\
c;


Note also (and I think most get this:

// This is a multi-line \
single line comment



Bart

unread,
Dec 7, 2020, 7:35:14 PM12/7/20
to
This is my C version of your Haskell code (via a couple of my langages).
It's a bit simpler than that Lispy one, sorry.

(I have a habit of simplying things which means the result may not be so
interesting. However both static versions worked pretty much first time.)

This one managed to convert a couple of small programs that then compile
to the same executable. But being highly recursive, it still has a
problem with nesting depth (luser's version did too).

AFAICT this version is 'lazy'.

(Code to deal with //abc\ comments has been commented out, which seems
appropriate, as it had a bug.)

-------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>

int peekchar;
void stripcomments(void);
void skipquote(int);

int nextchar(void) {
int c=peekchar;
peekchar=getchar();
if (peekchar==EOF) peekchar=0;

return c;
}

void dropline(void) {
int lastc,c;
do {
// lastc=c;
c=nextchar();
// } while (lastc!='\\' && c!=13 && c!=10);
} while (c!=13 && c!=10);
putchar('\n');
}

void skipblockcomment(void) {
int c=nextchar();
if (c=='*' && peekchar=='/') {
nextchar();
return;
}
if (c==0) {puts("Bad comment"); exit(1);}
skipblockcomment();
}

void skipquote(int q) {
int c=nextchar();
putchar(c);
if (c=='\\') {
putchar(nextchar());
skipquote(q);
} else if (c==q) {
stripcomments();
} else {
skipquote(q);
}
}

void stripcomments(void) {
int c=nextchar();
if (c==0) return;

if (c=='/' && peekchar=='/' ) {
nextchar();
putchar('\n');
dropline();
} else if (c=='/' && peekchar=='*' ) {
nextchar();
putchar(' ');
skipblockcomment();
} else {
putchar(c);
if (c=='"' || c=='\'') {
skipquote(c);
return;
}
}
stripcomments();
}

int main(void) {
nextchar();
stripcomments();
}
-------------------------------------------------------

Ben Bacarisse

unread,
Dec 7, 2020, 8:08:03 PM12/7/20
to
luser droog <luser...@gmail.com> writes:

> On Monday, December 7, 2020 at 4:16:59 PM UTC-6, Ben Bacarisse wrote:
>> luser droog <luser...@gmail.com> writes:
>>
>> > On Saturday, December 5, 2020 at 10:25:36 AM UTC-6, Tim Rentsch wrote:
>> <cut>
>> >> Short problem statement: a C program to remove comments from C
>> >> source input.
>> <cut>
>> > Rough translation of Ben Bacarisse's Haskell program to C.
>> Ingenious. Sorry I led you astray by misinterpreting the specification!
>>
>> Before I learned about that I was considering something a bit like this,
>> but I was hoping for less "machinery". Not a criticism -- I'm not at
>> all sure one can get away with less.
>>
>> One thing pops into my head though. Do you need to separate car/cdr
>> from take? At first glance, I'd be tempted to make car and cdr force
>> any suspension node to a list node. I think it would tidy up some of
>> code and I don't think you lose any laziness. Thoughts?
>>
>
> Yes. Good idea. It looks like car and take are always used together, or
> rather, whenever there's a take() there's inevitably a car() to make use
> of the result. So I could factor out a lot of car()s by making a combined
> first() function (better name?).

Haskell uses head and tail (but the tail is always a list, never an
atom).

The point I was making is the neither car nor cdr make sense if the
argument has not been forced to a list already, so you might as well
ensure that's happened in the functions themselves.

> I'm thinking about how I might add pattern matching as simply as possible.
> In the parser combinator code, I have a `list match( list pats, list a )`
> function but it needs two kinds of Symbols (a "matching" symbol and
> a "variable capturing" symbol) and an Operator type to pack actions
> compactly into the patterns.
> (https://github.com/luser-dr00g/pcomb/blob/master/pc9fp.c#L246)
> Plus, it multplies the naming problem because each little clause needs its
> function since there aren't a lambdas.
>
> And then there's the garbage leaking problem... Maybe a tiny region
> allocator algo, and a copying collector to get a clean list in a new self
> contained region? A mark/sweep collector would need 80-100 lines I
> think.

To paraphrase a well-know remark: you always end up implementing Lisp.

--
Ben.

Ben Bacarisse

unread,
Dec 7, 2020, 8:15:30 PM12/7/20
to
Richard Damon <Ric...@Damon-Family.org> writes:

> A few thoughts on some things that I see that many entries aren't handling:
>
> Strings (and character constants):
>
> The following are some valid strings:
> ""
>
> "\""
>
> "\\"
>
> "\\\""
>
> "foo\
> bar"

and my favourite:

"abc\\
""

--
Ben.

Richard Damon

unread,
Dec 7, 2020, 8:38:18 PM12/7/20
to
Yes. That is a tricky one to get, got to do the phases right.

Tim Rentsch

unread,
Dec 7, 2020, 8:49:12 PM12/7/20
to
Tim Rentsch <tr.1...@z991.linuxsc.com> writes:

> Prompted by some recent discussion regarding 'goto' statements and
> state machines, I would like to propose a programming exercise.
> (It is perhaps a bit too large to be called an exercise, but not so
> difficult that it deserves the label of challenge. On the other
> hand there are some constraints so maybe challenge is apropos. In
> any case somewhere in between those two bounds.)
>
> Short problem statement: a C program to remove comments from C
> source input.
>
> Specifics: Remove both /*...*/ and //... style comments. Don't
> worry about trigraphs. Read from stdin, write to stdout, and
> diagnostics (if any) go to stderr. If EOF is seen inside a
> comment, do something sensible but it doesn't matter what as long
> as it's sensible. Use no 'goto' statements. Limit function
> bodies to no more than 25 lines.
>
> Other: feel free to handle corner cases as you see fit, as long
> as there is some description of what choice was made.
>
> Hopefully it will be a fun exercise. It isn't trivial but it
> shouldn't take too long either.

I see there has been a fair amount of activity. Also some
questions and some confusions, so I am prompted to give some
clarifications.

The program is to remove (see below) C comments from a C source
file input, and nothing else. An input with no C comments in it
should be transmitted unchanged (provided its compile-time
behavior is defined, see below). To give an obvious example,
a multi-line macro definition that uses \ at the end of lines
to continue the definition (but has no comments) should appear
in the output exactly as in the input.

The remark about "something sensible" for EOF might be phrased as
anything that doesn't violate The Law of Least Astonishment.
Simply stopping output is okay, either with or without an error
return or diagnostic.

The statement about corner cases is meant to apply to compile-time
undefined behavior (meaning, in the input), and nothing else. Any
input whose compile-time specification has defined behavior,
including implementation-defined behavior, should be processed
correctly. If an input has compile-time undefined behavior, do
something reasonable (like in the previous paragraph), but it
doesn't matter so much what exactly as long as the user is given
some idea of what that will be.

An important property is that the program should transform working
programs into the same working programs. In other word compiling
an input before it is transformed should give the same semantics
as compiling it after it is transformed. There is an exception to
this rule in cases where the C pre-processor stringize operator is
used. In some cases removing a comment and doing nothing else
cannot be done because doing so will change the meaning of the
program. To deal with that problem, it is allowed to put in a
space for a removed comment. Note, putting in a space is allowed
but not required, as long as the input program semantics (not
counting stringize) is unchanged. Hence it is permissible for
programs that use the stringize operator to exhibit different
behavior before and after being transformed. Except for that
though the output semantics should match the input semantics.

Re: removing comments. This might be done by removing comments
entirely (and putting in a single space in some cases), or it
might be done by replacing comments with some "filler" white
space to, for example, preserve line numbers. Undoubtedly it
is easier to simply take the comments out and put in a single
space in their place; more elaborate replacements are allowed
as long as the output has no comments and preserves the program
semantics of the original input.

Part of my motivation in offering the exercise is to see how
people handle a non-trivial "state machiney" kind of program
without using goto statements and without using "big" functions.
The choice of 25 lines is meant to codify that, but it isn't meant
as an exact hard limit - maybe five lines over (to accommodate
style differences) is okay, ten lines over is pushing it.

Incidentally, the problem statement isn't something I just made up
for the newsgroup, but is a simplified version of a utility
program that is used as part of a larger toolkit. Of course I
knew this when I first posted the problem, so I had a more fixed
idea than the original problem statement presented clearly, giving
rise to the resulting confusions etc. I'm sorry the original
problem wasn't stated more clearly, and I hope the statements here
clear up any remaining uncertainties. Please, please, ask about
something if there is any doubt about what is meant.

Richard Damon

unread,
Dec 7, 2020, 9:04:01 PM12/7/20
to
One piece of implemention defined behavior that you need to allow not
handling is the case of \<sp><nl> (where <sp> is a space character and
<nl> is a new line character)

The issue is that whether the following line is a continuation of this
line replacing the \) or not is implementaiton defined, but the program
is going to need to make a firm decision on this.

This case is a special case in the standard because of the flexibilty of
handling speces at the end of lines in text files in C. It is allowed
for implementations to add or remove trailing spaces at the end of
lines, due to how some file systems deal with lines.

Basically the Standard allows the character sequence /\<sp><nl>* to
either be the beginning of a block comment or not based on the
definition of the implementation.

Ben Bacarisse

unread,
Dec 7, 2020, 9:19:24 PM12/7/20
to
Actually the underlying logic of the phases is simple. I think the
Haskell code showed that.

--
Ben.

Ben Bacarisse

unread,
Dec 7, 2020, 9:22:27 PM12/7/20
to
Tim Rentsch <tr.1...@z991.linuxsc.com> writes:

> Incidentally, the problem statement isn't something I just made up
> for the newsgroup, but is a simplified version of a utility
> program that is used as part of a larger toolkit.

Which utility is that?

--
Ben.

Kaz Kylheku

unread,
Dec 7, 2020, 9:26:47 PM12/7/20
to
On 2020-12-05, Kaz Kylheku <563-36...@kylheku.com> wrote:
> int main(void)
> {
> int ch;
> int qc;
>
> start:
> ch = getchar();

This is a bug; this getchar should have been replaced by s2getc, like
the others.

Bart

unread,
Dec 8, 2020, 6:44:35 AM12/8/20
to
I've said before, this is a crazy thing to allow in a language (how do
you even specify it in a grammar?).

All you need are // and /**/ comments and be done with it, without
allowing that whole extra dimension.

But today I thought I'd allow it in my C compiler anyway, as a special
option. Except that I found I'd already done it!

It was available as a long-forgotten built-in flag. So I just made it a
regular compiler option (-splicing). It involves copying and translating
one string to another, which 99.9999% of the time will be a complete
waste of time.

The 0.0001% is for threads like these.

Richard Damon

unread,
Dec 8, 2020, 7:33:16 AM12/8/20
to
C does it by specifying a sequence of stages that ar done. In C comments
aren't part of the main language 'grammer', but get processed out before
we get to the normal grammer phase.

Note, that C also really has two fairly distince grammers, one for the
preprocessor and seperate one for the main language constructs.
>
> All you need are // and /**/ comments and be done with it, without
> allowing that whole extra dimension.

It actually comes out very simply by the way the langague is defined,
and it provides a very usefull capability for machine generated code
with line length constraints.

Anton Shepelev

unread,
Dec 8, 2020, 8:04:55 AM12/8/20
to
Kaz Kylheku:

> This is a bug;

For the benefit of whoever may be collecting the entries, I
ask you to post full updated versions of your solutions, in
addition to, indicating bugs. It is much easier to copy-
paste a program form the newsreader than manually to apply
the indicated changes the previous copy, which is also
error-prone.

--
Anton Shepelev

james...@alumni.caltech.edu

unread,
Dec 8, 2020, 9:39:35 AM12/8/20
to
You don't. It is specified it in the translation phases. The purpose of the
translation phases is to establish "The precedence among the syntax rules of
translation ..." The relevant rules are:

Translation phase 2:
"Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source
lines. Only the last backslash on any physical source line shall be eligible
for being part of such a splice. A source file that is not empty shall end in a
new-line character, which shall not be immediately preceded by a backslash
character before any such splicing takes place."

Translation phase 3:
"... A source file shall not end in a partial preprocessing token or in a
partial comment. Each comment is replaced by one space character. ..."

The C standard actually contains two different grammars, the lexical grammar,
which is summarized in A.1, which applies at the start translation phase 4, and
the phrase structure grammar which is summarized in A.2, which applies at the
end of translation phase 7.

> All you need are // and /**/ comments and be done with it, without
> allowing that whole extra dimension.
>
> But today I thought I'd allow it in my C compiler anyway, as a special
> option. Except that I found I'd already done it!
>
> It was available as a long-forgotten built-in flag. So I just made it a
> regular compiler option (-splicing). It involves copying and translating
> one string to another, which 99.9999% of the time will be a complete
> waste of time.
>
> The 0.0001% is for threads like these.

Special cases like the ones shown above are not of any importance to anyone.
They are just a natural side-effect of correctly implementing translation phase
2. The main use of line splicing is to cope with the fact that pre-processing
directives all are terminated by a new-line characters. Since pre-processing
occurs during translation phase 4, the escaped new-lines don't count. That
allows you to create multi-line pre-preprocessing directives, which is often
necessary.

If you implement translation phase 2 correctly, in order to enable multi-line
preprocessing directives, it's interaction with comments will be handled
correctly, without having to do anything special about that interaction.

Bart

unread,
Dec 8, 2020, 9:40:54 AM12/8/20
to
On 08/12/2020 12:32, Richard Damon wrote:
> On 12/8/20 6:44 AM, Bart wrote:

>> I've said before, this is a crazy thing to allow in a language (how do
>> you even specify it in a grammar?).
>
> C does it by specifying a sequence of stages that ar done. In C comments
> aren't part of the main language 'grammer', but get processed out before
> we get to the normal grammer phase.
>
> Note, that C also really has two fairly distince grammers, one for the
> preprocessor and seperate one for the main language constructs.
>>
>> All you need are // and /**/ comments and be done with it, without
>> allowing that whole extra dimension.
>
> It actually comes out very simply by the way the langague is defined,
> and it provides a very usefull capability for machine generated code
> with line length constraints.

Yet, I've only heard of it in connection with C. No other language that
I know of does this. How did /they/ (and do they) manage without such a
feature?!

james...@alumni.caltech.edu

unread,
Dec 8, 2020, 9:52:22 AM12/8/20
to
On Tuesday, December 8, 2020 at 9:40:54 AM UTC-5, Bart wrote:
> On 08/12/2020 12:32, Richard Damon wrote:
...
> > It actually comes out very simply by the way the langague is defined,
> > and it provides a very usefull capability for machine generated code
> > with line length constraints.
> Yet, I've only heard of it in connection with C. No other language that
> I know of does this. How did /they/ (and do they) manage without such a
> feature?!

I've heard of several languages that provide one mechanism or another for
splicing multiple lines together into one logical line, all of them klunky. In
older versions of Fortran, for instance, a '1' in a particular position near the
start of a line marked it as a continuation of the previous line. Such a
feature is needed only for languages for which new-line is an explicit part
of the syntax for some language constructs that might need to be much
longer than can fit on a single physical line. In C, that applies to most pre-
processor directives.

Richard Damon

unread,
Dec 8, 2020, 10:45:48 AM12/8/20
to
The key is that C is one of the few languages that was used as an output
target in environments with strict line lengths.

(I forget if the card format based FORTRAN allowed the continuation card
(with the character in column 6) would allow breaking tokens that went
to column 72 and continued in column 7)

Several parts of C are to support this sort of usage.

Ben Bacarisse

unread,
Dec 8, 2020, 12:17:10 PM12/8/20
to
Bart <b...@freeuk.com> writes:

> On 08/12/2020 12:32, Richard Damon wrote:
>> On 12/8/20 6:44 AM, Bart wrote:
>
>>> I've said before, this is a crazy thing to allow in a language (how do
>>> you even specify it in a grammar?).
>>
>> C does it by specifying a sequence of stages that ar done. In C comments
>> aren't part of the main language 'grammer', but get processed out before
>> we get to the normal grammer phase.
>>
>> Note, that C also really has two fairly distince grammers, one for the
>> preprocessor and seperate one for the main language constructs.
>>>
>>> All you need are // and /**/ comments and be done with it, without
>>> allowing that whole extra dimension.
>>
>> It actually comes out very simply by the way the langague is defined,
>> and it provides a very usefull capability for machine generated code
>> with line length constraints.

In the old days, it was also useful for sending C in emails. Some
systems had line length restrictions, so a simple method for breaking lines was a
handy feature.

> Yet, I've only heard of it in connection with C. No other language
> that I know of does this. How did /they/ (and do they) manage without
> such a feature?!

Fortran has it's own somewhat fussier version. But no one is saying
it's essential.

Strange file formats and line length limitations are rare these days so
newer languages are not likely to even consider such a feature.

--
Ben.

Bart

unread,
Dec 8, 2020, 12:32:11 PM12/8/20
to
On 08/12/2020 14:52, james...@alumni.caltech.edu wrote:
> On Tuesday, December 8, 2020 at 9:40:54 AM UTC-5, Bart wrote:
>> On 08/12/2020 12:32, Richard Damon wrote:
> ...
>>> It actually comes out very simply by the way the langague is defined,
>>> and it provides a very usefull capability for machine generated code
>>> with line length constraints.
>> Yet, I've only heard of it in connection with C. No other language that
>> I know of does this. How did /they/ (and do they) manage without such a
>> feature?!
>
> I've heard of several languages that provide one mechanism or another for
> splicing multiple lines together into one logical line, all of them klunky.

Yes, that is common.

What isn't so common is allowing that to happen between any two
characters, even in the middle of a token, rather than between two tokens.


In
> older versions of Fortran, for instance, a '1' in a particular position near the
> start of a line marked it as a continuation of the previous line.

Fortan was rather peculiar (probably still is) in ignoring spaces where
they are needed to separate tokens as in conventional syntax.

Having a continuation logic be applied before tokenisation wouldn't
raise eyebrows as much; it would be different anyway as it's marked at
the start of the next line (eg. next punched card) rather than at the
end of the line being continued.

jfbod...@gmail.com

unread,
Dec 8, 2020, 2:31:06 PM12/8/20
to
On Saturday, December 5, 2020 at 10:25:36 AM UTC-6, Tim Rentsch wrote:
> Prompted by some recent discussion regarding 'goto' statements and
> state machines, I would like to propose a programming exercise.
> (It is perhaps a bit too large to be called an exercise, but not so
> difficult that it deserves the label of challenge. On the other
> hand there are some constraints so maybe challenge is apropos. In
> any case somewhere in between those two bounds.)
>
> Short problem statement: a C program to remove comments from C
> source input.
>
> Specifics: Remove both /*...*/ and //... style comments. Don't
> worry about trigraphs. Read from stdin, write to stdout, and
> diagnostics (if any) go to stderr. If EOF is seen inside a
> comment, do something sensible but it doesn't matter what as long
> as it's sensible. Use no 'goto' statements. Limit function
> bodies to no more than 25 lines.
>
> Other: feel free to handle corner cases as you see fit, as long
> as there is some description of what choice was made.
>
> Hopefully it will be a fun exercise. It isn't trivial but it
> shouldn't take too long either.

Here's my pathetic contribution. Table-driven state machine, trades memory for nested switch logic.
Knocked it out in about 45 minutes or so - given time I could probably think of a way to minimize
the memory footprint (probably with a sparse matrix), but I'm not willing to spend that much time on
it.

----
#include <stdio.h>

/**
* Strip comments from an input stream, write the result to an output stream.
* Leading/trailing whitespace is not compressed.
*
* Uses a table-driven state machine. States are:
*
* TEXT - not currently in a comment, output the character just read
* SLASH - saw what may be the beginning of an open comment token, do not output
* SLASH_END - saw something other than a * or / following a slash, output a / followed by the current character
* MULTI_STAR - saw what may be the beginning of a close multi-line comment token, do not output
* MULTI - currently in a multi-line comment, do not output the last character read
* MULTI_END - saw a complete close multi-line comment token, do not output
* SINGLE - currently in a single-line comment, do not output
*
* Table is indexed by state and character, giving us the new state.
*/
void stripComments( FILE *in, FILE *out )
{
int c;

enum state { TEXT, SLASH, SLASH_END, MULTI_STAR, MULTI, MULTI_END, SINGLE, NUM_STATES } curState = TEXT;
/**
* Initialize the state table so that everything transitions to TEXT - this is going to be
* true for almost all combinations.
*/
enum state transitions[NUM_STATES][256] = { { TEXT } };

/**
* If we're in a MULTI or SINGLE comment state, only a small
* set of characters bring us out of that state, so we initialize
* those rows to keep us in the comment state for all characters.
*/
for ( size_t i = 0; i < 256; i++ )
{
transitions[MULTI][i] = MULTI;
transitions[MULTI_STAR][i] = MULTI;
transitions[SINGLE][i] = SINGLE;
transitions[SLASH][i] = SLASH_END;
}

/**
* Set up the transitions to new states.
*/
transitions[MULTI]['*'] = MULTI_STAR;
transitions[MULTI_STAR]['/'] = MULTI_END;
transitions[TEXT]['/'] = SLASH;
transitions[SLASH]['/'] = SINGLE;
transitions[SLASH]['*'] = MULTI;
transitions[SINGLE]['\n'] = TEXT;

/**
* Read from the input stream until we see EOF. Output
* anything that's not in a comment.
*/
while ( ( c = fgetc( in ) ) != EOF )
{
curState = transitions[curState][c];

switch( curState )
{
/**
* This state is a bit of a hack. Since a single /
* may be the beginning of an open comment token, we
* don't want to write it to the output stream until
* we're sure it isn't part of a comment, but we don't
* know that until we read the next character. So we
* have this special state to tell us to write a /
* before writing the character we last read.
*/
case SLASH_END:
fputc( '/', out );
fputc( c, out );
break;

case TEXT:
fputc( c, out );
break;

default:
break;
}
}
}

int main( int argc, char **argv )
{
stripComments( stdin, stdout );

return 0;
}
----

Kaz Kylheku

unread,
Dec 8, 2020, 3:16:48 PM12/8/20
to
On 2020-12-08, Bart <b...@freeuk.com> wrote:
> On 08/12/2020 14:52, james...@alumni.caltech.edu wrote:
>> On Tuesday, December 8, 2020 at 9:40:54 AM UTC-5, Bart wrote:
>>> On 08/12/2020 12:32, Richard Damon wrote:
>> ...
>>>> It actually comes out very simply by the way the langague is defined,
>>>> and it provides a very usefull capability for machine generated code
>>>> with line length constraints.
>>> Yet, I've only heard of it in connection with C. No other language that
>>> I know of does this. How did /they/ (and do they) manage without such a
>>> feature?!
>>
>> I've heard of several languages that provide one mechanism or another for
>> splicing multiple lines together into one logical line, all of them klunky.
>
> Yes, that is common.
>
> What isn't so common is allowing that to happen between any two
> characters, even in the middle of a token, rather than between two tokens.

In TXR Lisp, I only allow continuation in he middle of a token!
Specifically, a string-literal token. There is no general line
continuation feature.

This is the TXR Lisp interactive listener of TXR 244.
Quit with :quit or Ctrl-D on an empty line. Ctrl-X ? for cheatsheet.
TXR is made with equipment not contaminated by peanuts ... r-r-right?
1> "foo bar \
baz"
"foo barbaz"

Note how all he whitespace surrounding the \ is deleted!

2> "foo bar \
baz"
"foo barbaz"

This allows continued literals to obey the indentation.

A space can be escaped to prevent its deletion, using one of these
patterns:

2> "foo bar \
\ baz"
"foo bar baz"
3> "foo bar\ \
baz"
"foo bar baz"

I could have imposed the requirement that literals can simply continue
across lines without a backslash.

But that doesn't sit well with me.

The first reason is this:

Most literals are closed in the same line, and so in most cases
when the closing delimiter is not seen in the same line, it is an
unintentional error situation. We would like to diagnose it promptly.

Furthermore, there is a risk that we will not diagnose it at all:

(blah sym "forgotten close
sym 123 stray quote follows" blah)

Naively allowing multi-line literals is a poorly considered misfeature.
(Which is why it finds an excellent home in the POSIX shell language).

The second reason is that I would like to be able to continue a literal
without the newline becoming part of the data. If you allow this:

"abcdef
ghi"

such that it simply deletes all he whitespace and produces "abcdefghi",
that will surprise users coming from other languages.

Combining adjacent literals like in C is a complete nonstarter in a
Lisp. ("abc" "def") must be a different object from ("abcdef").

Kaz Kylheku

unread,
Dec 8, 2020, 3:31:56 PM12/8/20
to
On 2020-12-08, jfbod...@gmail.com <jfbod...@gmail.com> wrote:
> Here's my pathetic contribution. Table-driven state machine, trades
> memory for nested switch logic.

Take heart; a table-driven machine could easily use *less* memory than a
representation of the same state machine in code. That code
representation requires additional instructions in proportion to the
size of the state machine, and those could be using an inefficient encoding
compared to a compact table.

This could matter. There could be situations where your table
dispatcher fits entirely into the instruction cache, and the associated
table data into the L1 cache, whereas solutions based on code blow the
instruction cache.

Bart

unread,
Dec 8, 2020, 3:48:44 PM12/8/20
to
On 08/12/2020 20:16, Kaz Kylheku wrote:
> On 2020-12-08, Bart <b...@freeuk.com> wrote:

>> What isn't so common is allowing that to happen between any two
>> characters, even in the middle of a token, rather than between two tokens.
>
> In TXR Lisp, I only allow continuation in he middle of a token!
> Specifically, a string-literal token. There is no general line
> continuation feature.
>
> This is the TXR Lisp interactive listener of TXR 244.
> Quit with :quit or Ctrl-D on an empty line. Ctrl-X ? for cheatsheet.
> TXR is made with equipment not contaminated by peanuts ... r-r-right?
> 1> "foo bar \
> baz"
> "foo barbaz"

Python does something similar, allows \ in the middle of a string, but
the white space at the start of the next line is part of the string, so
can't be used for indentation.

(Otherwise, \ is only allowed between tokens.)


> Note how all he whitespace surrounding the \ is deleted!
>
> 2> "foo bar \
> baz"
> "foo barbaz"
>
> This allows continued literals to obey the indentation.

> A space can be escaped to prevent its deletion, using one of these
> patterns:
>
> 2> "foo bar \
> \ baz"
> "foo bar baz"
> 3> "foo bar\ \
> baz"
> "foo bar baz"

What about requiring the continuation of the string to start with "?
Although that would be tidier if both parts had ":

"foo bar" \
"baz"

Then \ has the effect also of splicing those two tokens.


> I could have imposed the requirement that literals can simply continue
> across lines without a backslash.
>
> But that doesn't sit well with me.
>
> The first reason is this:
>
> Most literals are closed in the same line, and so in most cases
> when the closing delimiter is not seen in the same line, it is an
> unintentional error situation. We would like to diagnose it promptly.
>
> Furthermore, there is a risk that we will not diagnose it at all:
>
> (blah sym "forgotten close
> sym 123 stray quote follows" blah)
>
> Naively allowing multi-line literals is a poorly considered misfeature.
> (Which is why it finds an excellent home in the POSIX shell language).
>
> The second reason is that I would like to be able to continue a literal
> without the newline becoming part of the data. If you allow this:
>
> "abcdef
> ghi"
>
> such that it simply deletes all he whitespace and produces "abcdefghi",
> that will surprise users coming from other languages.

Allowing real newlines belonging to the source file into the string is
not good.

My current scheme if literals have to be combined is one of:

"abc" "def" # forms one token

"abc" \ # without \, it's "abc" ; "def"
"def"

"abc" + "def"

"abc" + # binary op means line continues
"def"

The "+" seems obvious but I never got round to it until just now. It's a
more grown-up way of doing it, and allows either operand to be a named
constant, or any expression that reduces to a constant string.


> Combining adjacent literals like in C is a complete nonstarter in a
> Lisp. ("abc" "def") must be a different object from ("abcdef").

You're not tempted with an infix operator like my + and \ examples?

(+ "abc" "def") will work too.


jacobnavia

unread,
Dec 8, 2020, 4:15:57 PM12/8/20
to
Le 08/12/2020 à 20:30, jfbod...@gmail.com a écrit :

Hi
You have to eliminate text in character strings and character constants
first. Your program takes char *p = "/*"; as the start of a comment

jacobnavia

unread,
Dec 8, 2020, 4:17:23 PM12/8/20
to
In the case of my solution, the total size of my code is 1K, small
enough to fit in any instruction cache of modern machines.

jfbod...@gmail.com

unread,
Dec 8, 2020, 4:28:53 PM12/8/20
to
Bummer. I may fix that at some point, I may not.

Lew Pitcher

unread,
Dec 8, 2020, 4:38:38 PM12/8/20
to
Same issue, but with multibyte character constants, as in
int p = '/*';




--
Lew Pitcher
"In Skills, We Trust"

Malcolm McLean

unread,
Dec 8, 2020, 6:34:16 PM12/8/20
to
Here's my offering.

#include <stdio.h>
#include <stdlib.h>

void skipblockcomment(FILE *fpin, FILE *fpout)
{
int prevch = -1;
int ch;

fputc(' ', fpout);

while( (ch = fgetc(fpin)) != EOF)
{
if (ch == '/' && prevch == '*')
return;
if (ch == '\\')
if (fgetc(fpin) == '\n')
continue;

prevch = ch;
}

fprintf(stderr, "EOF in block comment\n");
exit(EXIT_FAILURE);
}

void skiplinecomment(FILE *fpin, FILE *fpout)
{
int prevch = -1;
int ch;

fputc('\n', fpout);

while ((ch = fgetc(fpin)) != EOF)
{
if (ch == '\n' && prevch != '\\')
return;
prevch = ch;
}
}

void outputstring(FILE *fpin, FILE *fpout)
{
int prevch = -1;
int ch;

fputc('\"', fpout);

while ((ch = fgetc(fpin)) != EOF)
{
fputc(ch, fpout);
if (ch == '\"' && prevch != '\\')
return;
if (ch == '\\' && prevch == '\\')
prevch = -1;
else
prevch = ch;
}
fprintf(stderr, "Unterminated string literal\n");
exit(EXIT_FAILURE);
}

void flushbuffer(int *buff, int N, FILE *fp)
{
int i;
for (i= 0; i < N; i++)
{
if (buff[i] != -1)
fputc(buff[i], fp);
buff[i] = -1;
}
}

void shiftbuffer(int *buff, int N, int ch, FILE *fp)
{
int i;

if (buff[0] != -1)
fputc(buff[0], fp);
for (i =0; i < N-1; i++)
buff[i] = buff[i+1];
buff[N-1] = ch;
}

void clearbuffer(int *buff, int N)
{
int i;

for (i =0; i < N; i++)
buff[i] = -1;
}

void stripcomments(FILE *fpin, FILE *fpout)
{
int ch;
int buff[3] = {-1, -1, -1};

while ((ch = fgetc(fpin)) != EOF)
{
if (ch == '*' && buff[2] == '/')
skipblockcomment(fpin, fpout);
else if (ch == '/' && buff[2] == '/')
skiplinecomment(fpin, fpout);
else if (ch == '"' && buff[2] != '\\')
{
flushbuffer(buff, 3, fpout);
outputstring(fpin, fpout);
}
else if(ch == '*' && buff[2] == '\n' && buff[1] == '\\' && buff[0] == '/')
{
skipblockcomment(fpin, fpout);
}
else if(ch == '/' && buff[2] == '\n' && buff[1] == '\\' && buff[0] == '/')
{
skiplinecomment(fpin, fpout);
}
else
{
shiftbuffer(buff, 3, ch, fpout);
continue;
}
clearbuffer(buff, 3);
}

flushbuffer(buff, 3, fpout);
}


void usage(void)
{
printf("strips comments form a C file\n");
printf("Usage: stripcomments <source.c>\n");
exit(EXIT_FAILURE);
}

int main(int argc, char **argv)
{
FILE *fp;

if (argc != 2)
usage();
fp = fopen(argv[1], "r");
if (!fp)
{
fprintf(stderr, "Error opening file %s\n", argv[1]);
exit(EXIT_FAILURE);
}

stripcomments(fp, stdout);
fclose(fp);

return 0;
}

You see that the code is dominated by the line continuation / split
tokens problem. That's the reason for the use of a three character buffer,
the buffer routines,a nd the complexity in the high-level function.

Anton Shepelev

unread,
Dec 8, 2020, 6:39:54 PM12/8/20
to
I wrote to Tim:

> If you are not convinced, I will consider amending my
> program tomorrow evening.

Since he has kept the rules, I have modified my program.
It is still inefficient, but hopefully more correct:

/* ------------------------------ No comments -----------------------------V2-*/
/* A program to strip comments from a C file, by Anton Shepelev */

#include<stdio.h>
#include<string.h>

/* Machine states: */
/* TODO: devise a consistent, annotated, but still very brief naming: */
typedef enum t_state
{ S_CD, S_BC, S_CR, S_SR, S_LC,
S_CM1, S_BC0, S_CE, S_SE, } t_state;

/* A state transition: */
/* TODO: buf is used in one entry only => hardcode: */
typedef struct
{ t_state state; /* state */
int c; /* input */
t_state next; /* next state */
char buf; /* buffer input? */
char print; /* print input? */
} t_trans;

/* Transition table: */
/* EOF input denotes the fall-back: */
t_trans xtab[] = {
/* state char next buffer print */
{ S_CD, '/' , S_CM1, 1, 0 },
{ S_CD, '"' , S_SR, 0, 1 },
{ S_CD, '\'', S_CR, 0, 1 },
{ S_CD, EOF , S_CD, 0, 1 },
{ S_CM1, '/' , S_LC, 0, 0 },
{ S_CM1, '*' , S_BC, 0, 0 },
{ S_CM1, EOF , S_CD, 0, 1 },
{ S_LC, '\n', S_CD, 0, 1 },
{ S_LC, EOF , S_LC, 0, 0 },
{ S_BC, '*' , S_BC0, 0, 0 },
{ S_BC, EOF , S_BC, 0, 0 },
{ S_BC0, '/' , S_CD, 0, -1 },
{ S_BC0, '*' , S_BC0, 0, 0 }, /* <== thanks to Bart's test case */
{ S_BC0, EOF , S_BC, 0, 0 },
{ S_SR, '\\', S_SE, 0, 1 },
{ S_SR, '"' , S_CD, 0, 1 },
{ S_SR, EOF , S_SR, 0, 1 },
{ S_SE, EOF, S_SR, 0, 1 },
{ S_CR, '\\', S_CE, 0, 0 },
{ S_CR, '\'', S_CD, 0, 1 },
{ S_CR, EOF, S_CR, 0, 1 },
{ S_CE, EOF, S_CR, 0, 0 }
};

static void flushbuf( char * buf, char print )
{ if( print ) printf( "%s", buf );
buf[0] = '\0';
}

/* Perform a step on the state machine: */
/* OPT: nested array, indexed by 1) .state and 2) .char to avoid the loop: */
/* i.e.: trans = state_tab[state][c] */
void step( t_state *state, int c, char *buf )
{ t_trans *trans;
int i;

i = 0;
while( 1 ) /* transition look up */
{ trans = &xtab[i];
i = i + 1;

if( trans->state != *state ) continue;
if( trans->c != c )
if( trans->c != EOF ) continue;

break;
}

flushbuf( buf, trans->print == 1 );
if( trans->print != 0 )
{ if( trans->print == -1 ) putchar( ' ' );
else putchar( c ); }

if( trans->buf ) buf[1] = c;

*state = trans->next;
}

void main( void )
{ int c, p, stop;
char buf[80]; /* either keep it large enough or use dynamic memory */
char cont;
t_state state;

stop = 0;
cont = 0;
state = S_CD;
buf[0]='\0';
c = EOF;
while( 1 )
{ p = c; c = getc( stdin );

if( p == '\\' && c == '\n' )
{ cont = 1; continue; }

if( cont )
{ cont = 0;
strcat(buf, "\\\n");
continue;
}

if( p != EOF )
{ step( &state, p, buf ); }

if( c == EOF )
{ break; }
}
}

Anton Shepelev

unread,
Dec 8, 2020, 6:54:43 PM12/8/20
to
Malcolm,

you program is not even trying :-)
Test it with:

a/*b*/c

--
() ascii ribbon campaign -- against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

Anton Shepelev

unread,
Dec 8, 2020, 7:03:36 PM12/8/20
to
Tim Rentsch:

> The program is to remove (see below) C comments from a C
> source file input, and nothing else. An input with no C
> comments in it should be transmitted unchanged (provided
> its compile-time behavior is defined, see below). To give
> an obvious example, a multi-line macro definition that
> uses at the end of lines to continue the definition (but
> has no comments) should appear in the output exactly as in
> the input.

Are you certain that the requirement to keep escape line
endings in the code is satisifiable within O(1) memory
requirements? What about the following test case:

1: void main( void )
2: {a/\
3: \
4: b
5: }

where line 3 is repeated 1 000 000 times? Does anybody's
program process this input correctly, no? What about the
same with a modified line 4:

4: /b

Here are two small versions of this test:

test A: https://pastebin.com/raw/GcfBZNGY
test B: https://pastebin.com/raw/xZ1qAKgk

This the complications that we get because of unnatural
processing order required to keep escaped line endings in
the code!

Ben Bacarisse

unread,
Dec 8, 2020, 8:00:27 PM12/8/20
to
Tim Rentsch <tr.1...@z991.linuxsc.com> writes:

> Short problem statement: a C program to remove comments from C
> source input.

The details of the specification seemed a little off to me but I didn't
feel anyone had really mailed it yet, so I had a go myself.

I don't think I've really nailed it either. The requirement to preserve
uncommented line continuations breaks the natural layering of the input
process. I've done done that with a hack: line continuations are echoed
as they are read unless this is suppressed by passing a null output file
pointer to get_logical. This passes "state" information ("are we in a
comment?") down into the lowest level input function which I'm not keen
about. However, I think it works.

Nothing else is really interesting about it. Characters from logical
lines are read, and a function is called to handle each of the different
cases.

I view of Bart's comments about functions for the sake of them, I
suppose it's worth mentioning that once I'd found myself writing a test

if ((c = get_logical(in, out)) != X && x != EOF ...

a couple of times, I decided to write a one-line function so I could
write

if (next_not_eof_or(X, &c, in, out) ...

instead. I think it just about pays its way.


#include <stdio.h>
#include <stdbool.h>

int get_logical(FILE *in, FILE *out);
bool next_not_eof_or(int end_ch, int *cp, FILE *in, FILE *out);
void skip_quoted(FILE *in, FILE *out, int quote_char);
void possible_comment(FILE *in, FILE *out);
void strip_comments(FILE *in, FILE *out);


int main(int argc, char **argv)
{
FILE *in = argc > 1 ? fopen(argv[1], "r") : stdin;
if (in) strip_comments(in, stdout);
if (in != stdin) fclose(in);
}

void strip_comments(FILE *in, FILE *out)
{
int c;
while ((c = get_logical(in, out)) != EOF)
if (c == '"' || c == '\'')
skip_quoted(in, out, c);
else if (c == '/')
possible_comment(in, out);
else fputc(c, out);
}

void skip_quoted(FILE *in, FILE *out, int quote_char)
{
int c;
fputc(quote_char, out);
while (next_not_eof_or(quote_char, &c, in, out)) {
fputc(c, out);
if (c == '\\')
fputc(get_logical(in, out), out);
}
if (c == EOF)
fputs("Unterminated literal.\n", stderr);
else fputc(c, out);
}

void possible_comment(FILE *in, FILE *out)
{
int c = get_logical(in, NULL);
if (c == '/')
while (next_not_eof_or('\n', &c, in, NULL));
else if (c == '*')
while (next_not_eof_or('*', &c, in, NULL) ||
next_not_eof_or('/', &c, in, NULL));
else {
fputc('/', out);
if (c != EOF)
fputc(c, out);
return;
}
if (c == EOF)
fputs("Unterminated comment.\n", stderr);
else fputc(' ', out);
}

bool next_not_eof_or(int end_ch, int *cp, FILE *in, FILE *out)
{
return (*cp = get_logical(in, out)) != end_ch && *cp != EOF;
}

int get_logical(FILE *in, FILE *out)
{
int c = fgetc(in), next;
while (c == '\\' && (next = fgetc(in)) == '\n') {
if (out)
fputs("\\\n", out);
c = fgetc(in);
}
if (c == '\\')
ungetc(next, in);
return c;
}

--
Ben.

luser droog

unread,
Dec 8, 2020, 9:18:11 PM12/8/20
to
On Monday, December 7, 2020 at 3:44:53 PM UTC-6, luser droog wrote:
> On Saturday, December 5, 2020 at 10:25:36 AM UTC-6, Tim Rentsch wrote:
> > Prompted by some recent discussion regarding 'goto' statements and
> > state machines, I would like to propose a programming exercise.
> > (It is perhaps a bit too large to be called an exercise, but not so
> > difficult that it deserves the label of challenge. On the other
> > hand there are some constraints so maybe challenge is apropos. In
> > any case somewhere in between those two bounds.)
> >
> > Short problem statement: a C program to remove comments from C
> > source input.
> >
> > Specifics: Remove both /*...*/ and //... style comments. Don't
> > worry about trigraphs. Read from stdin, write to stdout, and
> > diagnostics (if any) go to stderr. If EOF is seen inside a
> > comment, do something sensible but it doesn't matter what as long
> > as it's sensible. Use no 'goto' statements. Limit function
> > bodies to no more than 25 lines.
> >
> > Other: feel free to handle corner cases as you see fit, as long
> > as there is some description of what choice was made.
> >
> > Hopefully it will be a fun exercise. It isn't trivial but it
> > shouldn't take too long either.
>
> Rough translation of Ben Bacarisse's Haskell program to C. The logic
> could probably be made nicer with a pattern matching construct.
> But the way I would do that would require another type for Symbols.
>

Version 2. I add a match() function that helped simplify the big functions,
but also added to the number of big functions.

This version ignores line continuations except while slurping a single
line comment. But that means it can't see any comment markers if they
are hidden behind line continuations.

strpcom-v2.c:

//#define DEBUG(...) __VA_ARGS__
#define DEBUG(...)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef union uobject *object;
typedef object fSuspension( object );
typedef object list;
typedef enum tag { INVALID, INTEGER, LIST, SUSPENSION, VOID } tag;

union uobject { tag t;
struct { tag t; int i; } Int;
struct { tag t; object a, b; } List;
struct { tag t; object v; fSuspension *f; } Suspension;
struct { tag t; void *v; } Void;
};


list skip_quote( list o );
list nested_comment( list o );
void print( list o );


object add_global_root( object o ){ return o; }
object alloc(){
return calloc( 1, sizeof(object) );
}

#define OBJECT(...) new_( (union uobject[]){{ __VA_ARGS__ }} )
object new_( object a ){
object p = alloc();
return p ? *p = *a, p : 0;
}


object Int( int i ){ return OBJECT( .Int = { INTEGER, i } ); }

list cons( object a, object b ){ return OBJECT( .List = { LIST, a, b } ); }
list one( object a ){ return cons( a, NULL ); }
object car( list o ){ return o && o->t == LIST ? o->List.a : NULL; }
object cdr( list o ){ return o && o->t == LIST ? o->List.b : NULL; }

object Void( void *v ){ return OBJECT( .Void = { VOID, v } ); }

object Suspension( void *v, fSuspension *f ){
return OBJECT( .Suspension = { SUSPENSION, v, f } );
}

int eq( object a, object b ){
return !a && !b ? 1 :
!a || !b ? 0 :
a->t != b->t ? 0 :
!memcmp( a, b, sizeof *a );
}

int eqint( object a, int i ){
union uobject b = { .Int = { INTEGER, i } };
return eq( a, &b );
}

object force( object o ){
return !o || o->t != SUSPENSION ? o :
force( o->Suspension.f( o->Suspension.v ) );
}

list take( int i, list o ){
return i == 0 ? NULL :
!o ? NULL :
( *o = *force( o ),
cons( car( o ), take( i-1, cdr( o ) ) ) );
}
list drop( int i, list o ){
return i == 0 ? o :
!o ? NULL :
( *o = *force( o ),
drop( i-1, cdr( o ) ) );
}

object first( list o ){ return car( take( 1, o ) ); }
object rest( list o ){ return drop( 1, o ); }

int match( object pat, object it, object *matched, object *tail ){
if( !pat ){
*tail = it;
return 1;
}
if( (it ? it->t : 0) == SUSPENSION ) *it = *force( it );
if( pat->t != (it ? it->t : 0) ) return 0;
switch( pat->t ){
default: return 0;
case LIST: {
object sink;
if( match( first( pat ), first( it ), & sink, tail ) ){
*matched = it;
return match( rest( pat ), rest( it ), & sink, tail );
}
} break;
case INTEGER:
if( eq( pat, it ) ){
*matched = it;
return 1;
}
}
return 0;
}

static list
force_chars_from_file( object file ){
FILE *f = file->Void.v;
int c = fgetc( f );
return c != EOF ? cons( Int( c ), Suspension( file, force_chars_from_file ) )
: cons( Int( EOF ), NULL );
}

list chars_from_file( FILE *f ){
return f ? Suspension( Void( f ), force_chars_from_file ) : NULL;
}

list logical_lines( list o ){
DEBUG( fprintf( stderr, "[%c%c]", first( o )->Int.i, first( rest( o ) )->Int.i ); )
static list pat;
if( !pat ) pat = add_global_root( cons( Int( '\\' ), one( Int( '\n' ) ) ) );

object matched, tail;
if( match( pat, o, &matched, &tail ) ){
return Suspension( tail, logical_lines );
} else {
object a = first( o );
return eqint( a, EOF ) ? o : cons( a, Suspension( rest( o ), logical_lines ) );
}
}

list strip_comments( list o ){
DEBUG( fprintf( stderr, "<%c%c>", first( o )->Int.i, first( rest( o ) )->Int.i ); )
static list single, multi;
if( !single ) single = add_global_root( cons( Int( '/' ), one( Int( '/' ) ) ) );
if( !multi ) multi = add_global_root( cons( Int( '/' ), one( Int( '*' ) ) ) );

object matched, tail;
if( match( single, o, &matched, &tail ) ){
do {
tail = rest( tail );
matched = first( tail );
if( eqint( matched, '\\' ) ) tail = rest( tail ); // eat \NL
} while( !eqint( matched, '\n' ) );
return Suspension( tail, strip_comments );
} else if( match( multi, o, &matched, &tail ) ){
return cons( Int( ' ' ), strip_comments( nested_comment( rest( tail ) ) ) );
}

object a = first( o );
if( eqint( a, '\'' ) || eqint( a, '"' ) ) return cons( a, skip_quote( o ) );
return eqint( a, EOF ) ? o : cons( a, Suspension( rest( o ), strip_comments ) );
}

list nested_comment( list o ){
DEBUG( fprintf( stderr, "(%c%c)", first( o )->Int.i, first( rest( o ) )->Int.i ); )
static list end;
if( !end ) end = add_global_root( cons( Int( '*' ), one( Int( '/' ) ) ) );

object matched, tail;
if( match( end, o, &matched, &tail ) ) return tail;
if( eqint( car( o ), EOF ) ) fprintf( stderr, "Unterminated comment\n"), exit( 1 );
return nested_comment( rest( o ) );
}

// quote character q is "curried" onto the front of the list
list skip_quote( list o ){
object q = car( o );
o = cdr( o );
object a = first( o );
if( eqint( a, '\\' ) ){
return cons( a, cons( first( rest( o ) ), skip_quote( cons( q, drop( 2, o ) ) ) ) );
} else if( eq( a, q ) ){
return cons( a, strip_comments( rest( o ) ) );
}
if( eqint( a, EOF ) ) fprintf( stderr, "Unterminated literal\n"), exit( 1 );
return cons( a, skip_quote( cons( q, rest( o ) ) ) );
}

void print( list o ){
switch( o ? o->t : 0 ){
default: break;
case INTEGER: if( o->Int.i != EOF ) fputc( o->Int.i, stdout ); break;
case SUSPENSION:
case LIST: print( first( o ) );
print( rest( o ) ); break;
}
}


int main(){
list input = chars_from_file( stdin );
print( strip_comments( input ) );
//print( strip_comments( logical_lines( input ) ) );
}


$ make strpcom-v2
cc -std=c99 -g -Wall -Wpedantic -Wextra -Wno-unused-function -Wno-unused-parameter -Wno-switch -Wreturn-type -Wunused-variable strpcom-v2.c -o strpcom-v2
$ cat test
strip //comments
line \
continue
// comment \
continue
/* nested /* comment
*/
"/*quoted*/"

""

"\""

"\\"

"\\\""

"foo\
bar"
/\
\
\
* comment *\
\
\
/

a = b /\
\
\
\
\
c;

// This is a multi-line \
single line comment

"abc\\
""

eof
$ ./strpcom-v2 < test
strip
line \
continue


"/*quoted*/"

""

"\""

"\\"

"\\\""

"foo\
bar"
/\
\
\
* comment *\
\
\
/

a = b /\
\
\
\
\
c;



Unterminated literal

Richard Damon

unread,
Dec 8, 2020, 9:21:52 PM12/8/20
to
There was no requirement that the answer be O(1), and in fact the
requirement to keep the line pasting when not in comments seems to say
that the processing needs to keep a count of how many of these have been
seen betwen the initial / and the next character so that if it isn't a *
or another / the / and the \<nl>s can be sent out. while if it is one of
those they can be omitted.

The other option would be to save the output location of stdout before
the / is output and rewind back to there, IF we had a promise that
stdout was actually a file (which in general we aren't)).

I think it is safe to assume that an unsigned long will be able to count
the number of escaped line ends we have seen, as if the file exceeds
that it will exceed reasonable implementation limits.

Ben Bacarisse

unread,
Dec 8, 2020, 10:09:36 PM12/8/20
to
Ben Bacarisse <ben.u...@bsb.me.uk> writes:

> Tim Rentsch <tr.1...@z991.linuxsc.com> writes:
>
>> Short problem statement: a C program to remove comments from C
>> source input.
>
> The details of the specification seemed a little off to me but I didn't
> feel anyone had really mailed it yet, so I had a go myself.
>
> I don't think I've really nailed it either. The requirement to preserve
> uncommented line continuations breaks the natural layering of the input
> process. I've done done that with a hack: line continuations are echoed
> as they are read unless this is suppressed by passing a null output file
> pointer to get_logical. This passes "state" information ("are we in a
> comment?") down into the lowest level input function which I'm not keen
> about. However, I think it works.

Nope. That version did not preserve continuations that follow a /
which, in the end, is not the start of a comment. They need to be
counted (in this program organisation). Annoyingly, I started by just
counting them but that complicated the code the called get_logical. So
then thought I could make things simpler by just conditionally echoing
them instead, whereas what I needed to do is to always either echo them
/or/ counts them. With this organisation there is only one place where
they have to be put back.

Version 2:

#include <stdio.h>
#include <stdbool.h>

int get_logical(FILE *in, FILE *out, unsigned *count_cont);
bool next_not_eof_or(int end_ch, int *cp, FILE *in, FILE *out);
void skip_quoted(FILE *in, FILE *out, int quote_char);
void possible_comment(FILE *in, FILE *out);
void strip_comments(FILE *in, FILE *out);


int main(int argc, char **argv)
{
FILE *in = argc > 1 ? fopen(argv[1], "r") : stdin;
if (in) strip_comments(in, stdout);
if (in != stdin) fclose(in);
}

void strip_comments(FILE *in, FILE *out)
{
int c;
while ((c = get_logical(in, out, 0)) != EOF)
if (c == '"' || c == '\'')
skip_quoted(in, out, c);
else if (c == '/')
possible_comment(in, out);
else fputc(c, out);
}

void skip_quoted(FILE *in, FILE *out, int quote_char)
{
int c;
fputc(quote_char, out);
while (next_not_eof_or(quote_char, &c, in, out)) {
fputc(c, out);
if (c == '\\')
fputc(get_logical(in, out, 0), out);
}
if (c == EOF)
fputs("Unterminated literal.\n", stderr);
else fputc(c, out);
}

void possible_comment(FILE *in, FILE *out)
{
unsigned n_cont = 0;
int c = get_logical(in, out, &n_cont);
if (c == '/')
while (next_not_eof_or('\n', &c, in, out));
else if (c == '*')
while (next_not_eof_or('*', &c, in, out) ||
next_not_eof_or('/', &c, in, out));
else {
fputc('/', out);
while (n_cont--)
fputs("\\\n", out);
if (c != EOF)
fputc(c, out);
return;
}
if (c == EOF)
fputs("Unterminated comment.\n", stderr);
else fputc(' ', out);
}

bool next_not_eof_or(int end_ch, int *cp, FILE *in, FILE *out)
{
unsigned dummy;
return (*cp = get_logical(in, out, &dummy)) != end_ch && *cp != EOF;
}

int get_logical(FILE *in, FILE *out, unsigned *count_cont)
{
int c = fgetc(in), next;
while (c == '\\' && (next = fgetc(in)) == '\n') {
if (count_cont)
++*count_cont;
else fputs("\\\n", out);

Tim Rentsch

unread,
Dec 9, 2020, 1:49:31 AM12/9/20
to
Richard Damon <Ric...@Damon-Family.org> writes:

> On 12/7/20 3:37 AM, jacobnavia wrote:
>
>> I corrected that. The problem is that if I read a '*', I read the next
>> character to see if it is a '/' that would close the multi-line comment.
>> If it is NOT, I ignore the character and just go on reading the next
>> character. If the next character is a '/' I ignore it too since it
>> wasn't after a '*', that I discarded!
>>
>> Solution: Just make an ungetc. Corrected version is posted soon.
>
> Been thinking a bit of the problem, and this shows one of the traps in
> it. We are asked to remove the comments but the lines ending in \ being
> connected together aren't in general comments, so unless they are inside
> comments need to be retained, so we can't just remove them at an inner
> level or we are doing an transformation that wasn't in the spec.

Right.

Tim Rentsch

unread,
Dec 9, 2020, 1:53:39 AM12/9/20
to
Ben Bacarisse <ben.u...@bsb.me.uk> writes:

> Tim Rentsch <tr.1...@z991.linuxsc.com> writes:
>
>> It's a nice program, but it doesn't quite do what was asked for.
>> The problem statement says to remove comments. This program does
>> remove comments, but it also takes out line splices to replace
>> multiple physical lines with their corresponding logical line.
>> Line continuation sequences outside of comments should be
>> preserved in the output.
>
> Ah, I didn't see that in the specification. Given the phases described
> in the standard, I implemented what I guessed was the expected
> behaviour.

Understood. In hindsight I realize the problem statement
wasn't clear enough.

More on your other comments which I hope to get to soon.

Tim Rentsch

unread,
Dec 9, 2020, 1:55:36 AM12/9/20
to
Bart <b...@freeuk.com> writes:

[...]

> The spec did say to make your own decisions on corner cases.

Corner cases are meant to be only for input that the C
standard specifies as undefined behavior.

Tim Rentsch

unread,
Dec 9, 2020, 1:59:33 AM12/9/20
to
Bart <b...@freeuk.com> writes:

> On 07/12/2020 13:33, Ben Bacarisse wrote:
>
>> Bart <b...@freeuk.com> writes:
[...]

>>> It doesn't deal with \ line continuation inside //:
>>>
>>> ... /\
>>> / comment
>>
>> There are other issues with them as well of course. Dealing with them
>> while leaving them in (when uncommented) seems unnatural to me given
>> then description of the compiler's phases in the standard.
>
> If you mean needing to deal with /\/ (with a line break after \) in
> order to recognise the // comment, but leaving ma\in unchanged, that
> does seem at odds with how C is normally processed.
>
> But this is isn't normal processing, it's removing comments while
> preserving as much as possible of non-comments.

Yes, that description matches the behavior intended.

Tim Rentsch

unread,
Dec 9, 2020, 2:03:23 AM12/9/20
to
Richard Damon <Ric...@Damon-Family.org> writes:

> On 12/7/20 9:31 AM, Bart wrote:
>
>> On 07/12/2020 14:18, Bart wrote:
>>
>>> (Except I don't personally agree with splitting tokens across lines,
>>> so will not implement it. Anyone who splits "//" across two, or
>>> possibly N lines, should be fired.
>>>
>>> It's not even a feature that would allow arbitrary insertion of
>>> \<newline> anywhere (eg. inside string literals, or just before an
>>> existing \<newline>) so I can't see the point.
>>
>> I keep forgetting that \ would be followed by <newline>, so maybe it
>> /is/ possible to allow it everywhere, even put every character of a
>> source file on its own line.
>>
>> I still think it is not necessary to allow tokens to be split across lines.
>>
>> Actually it's funny that C allows so much freedom with \ given that,
>> unlike many languages, C hardly ever needs to use \ to continue code
>> into a following line. (I think only for macro bodies; the //abc\ case
>> is a misfeature.)
>
> I think it came about as C was (is?) a common 'output' language for many
> code generators, and coupled with fixed 80 column screens, having a
> fixed output width of the code was important, so being able to just
> split the code anywhere was useful.

I expect it came about in the very early days of C when the C
preprocessor was a completely separate program and knew nothing
of C syntax, in which case having it be completely independent
of the language syntax is about the only thing that makes sense.

Tim Rentsch

unread,
Dec 9, 2020, 2:17:25 AM12/9/20
to
Anton Shepelev <anto...@gmail.com> writes:

Sorry for the delay in responding. For some unknown reason my
news service wasn't allowing messages to be posted for a day
or two.

> Tim Rentsch:
>
>> It's a nice program, but it doesn't quite do what was
>> asked for. The problem statement says to remove comments.
>> This program does remove comments, but it also takes out
>> line splices to replace multiple physical lines with their
>> corresponding logical line. Line continuation sequences
>> outside of comments should be preserved in the output.
>
> I had not realised that, and agree with James Kuyper:
>
>> C comments exist only in the context described by the
>> standard. If it isn't in that context, it isn't properly a
>> C comment, it's a C-like comment. That context includes
>> the translation phases. Translation phase 2 is removal of
>> line continuations. Translation phase 3 includes removal
>> of comments. The translation phases are not required to
>> actually occur in the order specified, but it is required
>> that the end result be the same as if they did occur in
>> that order. Therefore, it doesn't make much sense to me to
>> describe a program as removing C comments, if it's also
>> supposed to preserve some, but not all escaped newlines.
>
> Such details as line continuations are more low-level than
> comments. They belong to the physical, "transport" level of
> the file, where the the distinction between code and
> comments is not yet available. Removing comments while
> keeping line continuations is turning the natural order of
> processing upside down. If you are not convinced, I will
> consider amending my program tomorrow evening.

The program being asked for is not a language processing tool.
Rather it is a source-to-source translation tool where both the
input /and the output/ are expected to be read by people. If
looked at from that perspective I think the most sensible choice
is to leave in the backslant-newline sequences. Multi-line
macros, for example, are easier to read if left as multiple lines
rather than as single very long lines.

In any case my intention for the exercise was (and is) for
comments to be removed but everything outside of comments to be
left as is. And, besides being what was intended, dealing with
the backslant-newline sequences is what makes the problem
interesting; simply stripping out the backslant-newline pairs
makes the problem almost trivial.

Tim Rentsch

unread,
Dec 9, 2020, 2:19:00 AM12/9/20
to
Anton Shepelev <anto...@gmail.com> writes:

> I wrote to Tim:
>
>> If you are not convinced, I will consider amending my
>> program tomorrow evening.
>
> Since he has kept the rules, I have modified my program.
> It is still inefficient, but hopefully more correct:
> [...]

I got this and have added it to the collection of posted
responses.

Tim Rentsch

unread,
Dec 9, 2020, 2:20:53 AM12/9/20
to
Anton Shepelev <anto...@gmail.com> writes:

> Tim Rentsch:
>
>> If EOF is seen inside a comment, do something sensible but
>> it doesn't matter what as long as it's sensible.
>
> Is just stopping the program not sensible, e.g.:
>
> void main( void ) {}
> /* commenging...
>
> may output
>
> void main( void ) {}
>
> and call it run?

Yes that's fine. You might want to issue an error message,
and/or give an error return status, but either way is
perfectly okay.
It is loading more messages.
0 new messages