Revised, condensed proposal: rename TOKEN/SPECIAL_TOKEN/SKIP/MORE to REGULAR_TOKEN/IGNORED_TOKEN/INCOMPLETE_TOKEN


Jonathan Revusky

Dec 29, 2008, 8:08:58 PM
to freecc...@googlegroups.com
Okay, maybe someone will tell me what they think of this idea:

Rename TOKEN to REGULAR_TOKEN; treat SPECIAL_TOKEN and SKIP
symmetrically and call both IGNORED_TOKEN; and rename MORE to
INCOMPLETE_TOKEN.

Note that the older names continue to work, but this is the preferred
terminology to be used in docs etcetera. If anybody has any objections
or what they think is a better proposal, please tell me. (The thinking
behind this is in the much longer message quoted below.)
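
To make the mapping concrete, here is a rough sketch of how the old
keywords could be resolved to the proposed names. The class and method
names are made up for illustration; this is not actual freecc code.

```java
import java.util.Map;

// Hypothetical helper (not part of freecc): resolves a legacy
// lexical-section keyword to the name proposed in this message.
final class LexicalKeywordAliases {
    private static final Map<String, String> PREFERRED = Map.of(
        "TOKEN", "REGULAR_TOKEN",
        "SPECIAL_TOKEN", "IGNORED_TOKEN",
        "SKIP", "IGNORED_TOKEN",
        "MORE", "INCOMPLETE_TOKEN");

    // Returns the preferred name; keywords that are already in the
    // new terminology (or unknown) are returned unchanged.
    static String preferredName(String keyword) {
        return PREFERRED.getOrDefault(keyword, keyword);
    }
}
```

Since the old names keep working, a lookup like this would only affect
what gets shown in docs and messages, not which grammars are accepted.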

Season's Greetings,

JR (getting that "shouting into the void" feeling.... :-))

On Sat, Dec 27, 2008 at 9:26 PM, Jonathan Revusky <rev...@gmail.com> wrote:
> Hi all,
>
> SPECIAL_TOKEN is a rather odd term. Would a neophyte ever guess what
> that means? What is "special" about it?
>
> Well, it turns out that what is special about it is that a "special
> token" is ignored by the parser machinery. OTOH, from the lexical
> matching viewpoint, there is absolutely nothing special about a
> "special" token. Now, as best I can figure, the only reason that
> so-called special tokens exist is for comments. I cannot think of any
> other use of the feature. Even if there is some other usage for them
> that I am missing now, I think it's safe to say that, in 99% of cases,
> where a grammar contains SPECIAL_TOKEN specifications, it is used for
> comments.
>
> So, initially, I thought about just renaming them to be COMMENT.
> However, that would interfere with a very large number of grammars,
> since COMMENT would become a freecc keyword and you could not call a
> Token or production that any more. So, finally, I think that
> COMMENT_TOKEN is better. Of course, it's trivial to continue supporting
> the use of "SPECIAL_TOKEN" as a synonym.
>
> If anybody has a better name for this, I am open to proposals.
>
> A related topic that I am thinking about is SKIP. Just as
> SPECIAL_TOKEN basically is used for comments, AFAICS, SKIP, in
> practice, is used for whitespace. (For reasons similar to the
> foregoing, one could consider renaming it to WHITESPACE, except that,
> unlike SPECIAL_TOKEN, SKIP actually does describe pretty well what the
> thing is, so it is not crying out to be renamed, so I would tend to
> leave it be naming-wise.) Now, from the parser machinery viewpoint,
> there is no real difference between SKIP and SPECIAL_TOKEN. Both are
> ignored. The difference is that the SPECIAL_TOKENs are retained
> (basically, they dangle off of the regular tokens, potentially in a
> chain, in the specialToken field, which really should be renamed as
> well), whereas the SKIP tokens really are thrown away.
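
The retained-versus-discarded distinction can be sketched roughly as
follows. This is a simplified illustration, not the generated lexer
code: the Tok class, its Kind enum and toParserStream are invented
names, and the backward-linked chain is an assumption modeled on how
the specialToken field behaves in JavaCC-style Token classes.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a generated Token class (hypothetical).
final class Tok {
    enum Kind { REGULAR, SPECIAL, SKIP }

    final Kind kind;
    final String image;
    Tok specialToken; // nearest preceding special token, or null

    Tok(Kind kind, String image) {
        this.kind = kind;
        this.image = image;
    }

    // Turns raw lexical matches into the stream the parser sees.
    static List<Tok> toParserStream(List<Tok> matches) {
        List<Tok> regular = new ArrayList<>();
        Tok lastSpecial = null;
        for (Tok m : matches) {
            switch (m.kind) {
                case SKIP:
                    break; // thrown away: no trace is left
                case SPECIAL:
                    m.specialToken = lastSpecial; // extend the chain
                    lastSpecial = m;
                    break;
                case REGULAR:
                    m.specialToken = lastSpecial; // chain dangles here
                    lastSpecial = null;
                    regular.add(m);
                    break;
            }
        }
        return regular;
    }
}
```

Either way the parser only ever sees the REGULAR tokens; the two
ignored kinds differ only in whether anything is left hanging off them.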
>
> So, basically, the typical usage pattern that is implicit here is that
> you keep the comments and you throw away the whitespace. Of course, if
> you want to keep everything, you make your whitespace a SPECIAL_TOKEN
> rather than a SKIP. So, I guess, on that level, it's not a very big
> deal -- at least, once somebody understands what these things are. But
> all of this leads me to the other question of the day:
>
> Is the distinction between SKIP and SPECIAL_TOKEN worth the candle?
>
> AFAICS, the only advantage to having SKIP in addition to SPECIAL_TOKEN
> is that you can have the extra efficiency of not storing extra tokens
> that contain the whitespace. Note also that, even if you throw away
> the whitespace characters, the whitespace info can basically be
> reconstructed from the tokens' location info. You know, if a token
> ends at line/column 12:43, say, and the very next token (potentially a
> "special" token) begins at line/col 15:32, you know that there were
> three newlines followed by 31 columns' worth of horizontal
> whitespace between the two tokens, right? The caveat is that, having
> thrown away the whitespace chars, you do not know the following:
>
> (a) Was there trailing end-of-line whitespace (spaces and/or tabs)
> after the first token?
> (b) Were the 3 newlines DOS style or UNIX style? (\r\n vs. \n) Or
> possibly the old mac-style.. (\r)
> (c) What mix of tabs and spaces was the leading horizontal space
> before the second token?
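
The location arithmetic described above can be sketched like this. It
assumes 1-based line and column numbers, an end column that points at
the last character of the first token, and nothing but whitespace
between the two tokens. WhitespaceGap is a made-up name, not an
existing class.

```java
// Hypothetical helper: recovers the shape of the whitespace between
// two tokens from their positions alone.
final class WhitespaceGap {
    final int newlines; // line breaks between the two tokens
    final int columns;  // horizontal columns right before the 2nd token

    private WhitespaceGap(int newlines, int columns) {
        this.newlines = newlines;
        this.columns = columns;
    }

    static WhitespaceGap between(int prevEndLine, int prevEndColumn,
                                 int nextBeginLine, int nextBeginColumn) {
        int newlines = nextBeginLine - prevEndLine;
        // Same line: the gap is what lies strictly between the tokens.
        // Later line: the gap is everything left of the begin column.
        int columns = (newlines == 0)
            ? nextBeginColumn - prevEndColumn - 1
            : nextBeginColumn - 1;
        return new WhitespaceGap(newlines, columns);
    }
}
```

For instance, WhitespaceGap.between(12, 43, 15, 32) gives 3 newlines
and 31 columns. Note that this yields columns, not characters, which
is exactly why points (a) through (c) cannot be recovered this way.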
>
> Well, as for (a), I think we can say that, in general, nobody gives a
> ****. (Side issue: OTOH, this does matter to some people some of the
> time, since programmers' editors typically have a configuration option
> of whether you want to automatically remove (or keep) any trailing
> tabs/spaces at the end of a line. (Whenever I see that, my immediate
> reaction is to think to myself: "Why on earth should I give a ****?")
> I guess the issue here is really with version control systems, where
> you can create spurious diffs by having changes that consist only of
> adding/removing trailing spaces, for example. Aside from that, it is
> surely the most irrelevant thing imaginable.) As for (b) one usually
> does not care about conserving this information, since your code knows
> on a higher level what the default eol character is on the platform
> it's on, so in *most* cases will use that to do appropriate things. As
> for (c) you could well want to conserve the mix of tabs/spaces in the
> leading whitespace on a line.
>
> Anyway, the basic question here is on the pros and cons of simply
> merging SPECIAL_TOKEN/SKIP into a single construct. Thus, all the
> comments and whitespace would be conserved by default, including the
> (a), (b), and (c) info above. The pros are:
>
> (a) It makes things incrementally simpler, since rather than
> TOKEN/SPECIAL_TOKEN/SKIP/MORE you now only have
> TOKEN/SPECIAL_TOKEN/MORE -- which I think would become
> TOKEN/COMMENT_TOKEN/MORE. Incrementally less to explain/document -- or
> implement/maintain in code...
>
> (b) If somebody really wants to throw away lexical information like
> tabs vs. spaces, end-of-line whitespace as above, they can do it but
> they have to do it deliberately. The default out-of-the-box behavior
> will be to keep all lexical info (even the super-picky stuff like
> trailing end-of-line whitespace) so that the input file can be
> reconstructed precisely from the generated AST (or just the stream of
> tokens if you are not generating a tree.)
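
As a rough illustration of point (b): if every ignored token is kept,
the original text falls out of a simple walk over the token stream.
The sketch below uses an invented SourceToken class with a
backward-linked chain of ignored tokens, and it assumes that trailing
whitespace/comments hang off a final end-of-file token with an empty
image. None of these names come from the actual generated code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical token type: each regular token points at the last
// ignored token before it, which points at the one before that.
final class SourceToken {
    final String image;
    final SourceToken precedingIgnored; // may be null

    SourceToken(String image, SourceToken precedingIgnored) {
        this.image = image;
        this.precedingIgnored = precedingIgnored;
    }

    // Rebuilds the input text from the regular tokens (including EOF).
    static String reconstruct(List<SourceToken> regularTokens) {
        StringBuilder out = new StringBuilder();
        for (SourceToken t : regularTokens) {
            // The chain runs backwards, so push it onto a stack and
            // then emit it oldest-first.
            Deque<String> ignored = new ArrayDeque<>();
            for (SourceToken s = t.precedingIgnored; s != null;
                 s = s.precedingIgnored) {
                ignored.push(s.image);
            }
            for (String image : ignored) {
                out.append(image);
            }
            out.append(t.image);
        }
        return out.toString();
    }
}
```

With SKIP-style discarding, the same walk would have to guess at the
whitespace, and the round trip would no longer be exact.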
>
> The cons are all about efficiency. You create and store all these
> extra tokens that correspond to the whitespace. This has some
> space/time cost (mostly space, I think) if you're not interested in
> whitespace.
>
> Anyway, the proposal under question is whether to treat SPECIAL_TOKEN
> and SKIP completely symmetrically. So, basically, I am considering
> just introducing the new terminology of COMMENT_TOKEN (or maaaaybe
> IGNORED_TOKEN? I'm open to other proposals, as I said...) and then
> SPECIAL_TOKEN and SKIP can be used as well for backward compatibility,
> but then, under this proposal, they all mean the same thing, they just
> are aliases for COMMENT_TOKEN.
>
> Any thoughts?
>
> Season's Greetings,
>
> JR
>
