SPECIAL_TOKEN is a rather odd term. Would a neophyte ever guess what
that means? What is "special" about it?
Well, it turns out that what is special about it is that a "special
token" is ignored by the parser machinery. OTOH, from the lexical
matching viewpoint, there is absolutely nothing special about a
"special" token. Now, as best I can figure, the only reason that
so-called special tokens exist is for comments. I cannot think of any
other use of the feature. Even if there is some other usage for them
that I am missing now, I think it's safe to say that, in 99% of cases
where a grammar contains SPECIAL_TOKEN specifications, they are used
for comments.
So, initially, I thought about just renaming them to COMMENT.
However, that would interfere with a very large number of grammars,
since COMMENT would become a freecc keyword and you could no longer
give a Token or production that name. So, finally, I think that
COMMENT_TOKEN is better. Of course, it's trivial to continue supporting
the use of "SPECIAL_TOKEN" as a synonym.
If anybody has a better name for this, I am open to proposals.
A related topic that I am thinking about is SKIP. Just as
SPECIAL_TOKEN basically is used for comments, AFAICS, SKIP, in
practice, is used for whitespace. (For reasons similar to the
foregoing, one could consider renaming it to WHITESPACE, except that,
unlike SPECIAL_TOKEN, SKIP actually does describe pretty well what the
thing is, so it is not crying out to be renamed, so I would tend to
leave it be naming-wise.) Now, from the parser machinery viewpoint,
there is no real difference between SKIP and SPECIAL_TOKEN. Both are
ignored. The difference is that the SPECIAL_TOKENs are retained
(basically, they dangle off of the regular tokens, potentially in a
chain, in the specialToken field, which really should be renamed as
well), whereas the SKIP tokens really are thrown away.
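To make the "dangling chain" concrete, here is a minimal sketch of walking it. The Tok class is a hypothetical stand-in for the generated token class, reduced to the two fields needed here, and it assumes that each specialToken link points backward to the nearest preceding special token (which, as far as I know, is how JavaCC's Token class does it).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a generated token class; only the fields used here.
class Tok {
    final String image;
    Tok specialToken; // assumed: nearest preceding "special" token, or null

    Tok(String image) {
        this.image = image;
    }
}

class SpecialTokenChain {
    // Returns the images of the special tokens preceding `regular`, in source order.
    static List<String> precedingSpecials(Tok regular) {
        Deque<String> stack = new ArrayDeque<>();
        // The chain is walked backward, so pushing onto a stack restores source order.
        for (Tok t = regular.specialToken; t != null; t = t.specialToken) {
            stack.push(t.image);
        }
        return new ArrayList<>(stack);
    }

    public static void main(String[] args) {
        Tok header = new Tok("/* header */");
        Tok note = new Tok("// note");
        note.specialToken = header;
        Tok ident = new Tok("foo");
        ident.specialToken = note;
        System.out.println(precedingSpecials(ident));
    }
}
```

A SKIP match never shows up in such a chain at all, which is the whole difference under discussion.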
So, basically, the typical usage pattern that is implicit here is that
you keep the comments and you throw away the whitespace. Of course, if
you want to keep everything, you make your whitespace a SPECIAL_TOKEN
rather than a SKIP. So, I guess, on that level, it's not a very big
deal -- at least, once somebody understands what these things are. But
all of this leads me to the other question of the day:
Is the distinction between SKIP and SPECIAL_TOKEN worth the candle?
AFAICS, the only advantage to having SKIP in addition to SPECIAL_TOKEN
is that you can have the extra efficiency of not storing extra tokens
that contain the whitespace. Note also that, even if you throw away
the whitespace characters, the whitespace info can basically be
reconstructed from the tokens' location info. You know, if a token
ends at line/column 12:43, say, and the very next token (potentially a
"special" token) begins at line/col 15:32, you know that there were
three newlines followed by 31 columns' worth of horizontal whitespace
(assuming 1-based columns) between the two tokens, right? The caveat is
that, having thrown away the whitespace chars, you do not know the
following:
(a) Was there trailing end-of-line whitespace (spaces and/or tabs)
after the first token?
(b) Were the 3 newlines DOS style or UNIX style (\r\n vs. \n)? Or
possibly the old Mac style (\r)?
(c) What mix of tabs and spaces was the leading horizontal space
before the second token?
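The reconstruction from location info can be sketched as follows. This is only an illustration under stated assumptions: lines and columns are 1-based, the "end" position names the last character of the first token, and the WhitespaceGap class and its method names are made up for this example. Note that render() can only emit "\n" and plain spaces, which is exactly the information loss described in (a), (b) and (c).

```java
// Hypothetical helper: the whitespace gap between two adjacent tokens,
// derived purely from their line/column positions.
class WhitespaceGap {
    final int newlines;       // line breaks between the two tokens
    final int leadingColumns; // horizontal columns right before the second token

    WhitespaceGap(int newlines, int leadingColumns) {
        this.newlines = newlines;
        this.leadingColumns = leadingColumns;
    }

    // Assumes 1-based positions; (endLine, endColumn) is the last character
    // of the first token, (beginLine, beginColumn) the first of the second.
    static WhitespaceGap between(int endLine, int endColumn,
                                 int beginLine, int beginColumn) {
        int newlines = beginLine - endLine;
        int columns = (newlines == 0)
                ? beginColumn - endColumn - 1 // same line: columns strictly between
                : beginColumn - 1;            // new line: everything before the token
        return new WhitespaceGap(newlines, columns);
    }

    // Best-effort rendering: line-ending style, trailing whitespace and the
    // tab/space mix are unknown, so "\n" and single spaces are substituted.
    String render() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < newlines; i++) sb.append('\n');
        for (int i = 0; i < leadingColumns; i++) sb.append(' ');
        return sb.toString();
    }

    public static void main(String[] args) {
        WhitespaceGap gap = WhitespaceGap.between(12, 43, 15, 32);
        System.out.println(gap.newlines + " newlines, "
                + gap.leadingColumns + " leading columns");
    }
}
```

How a tab contributes to the column count depends on how the lexer assigns columns, so leadingColumns is a count of columns, not necessarily of characters.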
Well, as for (a), I think we can say that, in general, nobody gives a
****. (Side issue: OTOH, this does matter to some people some of the
time, since programmers' editors typically have a configuration option
for whether to automatically remove or keep any trailing tabs/spaces at
the end of a line. Whenever I see that, my immediate reaction is to
think to myself: "Why on earth should I give a ****?" I guess the issue
here is really with version control systems, where you can create
spurious diffs by having changes that consist only of adding/removing
trailing spaces, for example. Aside from that, it is surely the most
irrelevant thing imaginable.) As for (b), one usually does not care
about conserving this information, since your code knows on a higher
level what the default end-of-line sequence is on the platform it's
running on, so in *most* cases it will use that to do appropriate
things. As for (c), you could well want to conserve the mix of
tabs/spaces in the leading whitespace on a line.
Anyway, the basic question here is on the pros and cons of simply
merging SPECIAL_TOKEN/SKIP into a single construct. Thus, all the
comments and whitespace would be conserved by default, including the
a, b, and c info above. The pros are:
(a) It makes things incrementally simpler, since rather than
TOKEN/SPECIAL_TOKEN/SKIP/MORE you now only have
TOKEN/SPECIAL_TOKEN/MORE -- which I think would become
TOKEN/COMMENT_TOKEN/MORE. Incrementally less to explain/document -- or
implement/maintain in code...
(b) If somebody really wants to throw away lexical information like
tabs vs. spaces, end-of-line whitespace as above, they can do it but
they have to do it deliberately. The default out-of-the-box behavior
will be to keep all lexical info (even the super-picky stuff like
trailing end-of-line whitespace) so that the input file can be
reconstructed precisely from the generated AST (or just the stream of
tokens if you are not generating a tree).
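Here is a rough sketch of what "reconstructed precisely" could look like if every comment and whitespace run were kept as an ignored token. SrcToken is a hypothetical stand-in for the generated token class, and the sketch assumes that `next` links the regular tokens forward while `specialToken` links backward through the ignored tokens that precede a regular token.

```java
// Hypothetical stand-in for a generated token class; only the fields used here.
class SrcToken {
    final String image;
    SrcToken specialToken; // assumed: preceding ignored token (comment/whitespace), or null
    SrcToken next;         // assumed: next regular token, or null

    SrcToken(String image) {
        this.image = image;
    }
}

class SourceRebuilder {
    // Concatenates every token image, ignored ones included, in source order.
    static String rebuild(SrcToken firstRegular) {
        StringBuilder out = new StringBuilder();
        for (SrcToken t = firstRegular; t != null; t = t.next) {
            appendIgnored(t.specialToken, out);
            out.append(t.image);
        }
        return out.toString();
    }

    // The chain points backward, so recurse first to emit the earliest token first.
    private static void appendIgnored(SrcToken ignored, StringBuilder out) {
        if (ignored == null) {
            return;
        }
        appendIgnored(ignored.specialToken, out);
        out.append(ignored.image);
    }

    public static void main(String[] args) {
        SrcToken kw = new SrcToken("int");
        SrcToken space = new SrcToken(" ");
        SrcToken comment = new SrcToken("/*c*/");
        comment.specialToken = space;
        SrcToken name = new SrcToken("x");
        name.specialToken = comment;
        SrcToken semi = new SrcToken(";");
        kw.next = name;
        name.next = semi;
        System.out.println(rebuild(kw));
    }
}
```

Anything that follows the last regular token would have to hang off an end-of-file token for this to be lossless, which the sketch simply takes for granted.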
The cons are all about efficiency. You create and store all these
extra tokens that correspond to the whitespace. This has some
space/time cost (mostly space, I think) if you're not interested in
whitespace.
Anyway, the proposal in question is whether to treat SPECIAL_TOKEN
and SKIP completely symmetrically. So, basically, I am considering
just introducing the new terminology of COMMENT_TOKEN (or maaaaybe
IGNORED_TOKEN? I'm open to other proposals, as I said...) and then
SPECIAL_TOKEN and SKIP can be used as well for backward compatibility,
but then, under this proposal, they all mean the same thing: they are
just aliases for COMMENT_TOKEN.
Any thoughts?
Season's Greetings,
JR