
Revisiting split-sequence and patterns


Erik Naggum

Nov 27, 2001, 11:03:04 PM
This problem occurred to me while trying to explain why comma-delimited
data formats should not use something as simple as split-sequence (or a
similar inline loop), so bear with me while I rehash an old issue and
walk through some background.

If there be patterns of syntax design, one of them would be escaping or
quoting a character so it reverts from (potentially) special to normal
interpretation using one special character solely for escaping purposes,
backslash being the canonical choice for this. (Another pattern of
syntax design would be to use the backslash to _make_ characters special
and thus mean something else entirely. It is thus important to recognize
which pattern the backslash is used to implement in a particular syntax.)
The escaping character is discarded from the extracted token. Call this
the single-escape mechanism.

Since Common Lisp adheres to this pattern of syntax design, I would have
expected that a function that split a string into a list of tokens that
were delimited by a special character would allow a token to contain that
delimiter if it were escaped.

If there be more patterns of syntax design, another one would involve
escaping a whole sequence of characters so they revert from (potentially)
special to normal interpretation, using a single special character at
both ends of the sequence, and escaping that character and the escaping
character inside the sequence with the same character that does this for
a single character. All the escaping characters are discarded from the
extracted token. Call this the multiple-escape mechanism.

Since Common Lisp adheres to this pattern of syntax design, too, with
both its string syntax and the multiple-escape syntax for symbols, I
would also have expected that a function that split a string into a list
of tokens that had been delimited on either end by a special character
would allow a token to be so delimited and thus to avoid being split in
the middle of such a token if it contained the delimiter on which to
split the string.

There are a whole lot of other patterns of syntax design: whether it be
context-free or context-sensitive; how many characters (or tokens) of
read-ahead it require; whether the start and end markers of individual
elements be explicit or deduced. Common Lisp has a context-sensitive
grammar (strictly speaking, since every change to the readtable and even
the package changes the syntax, one cannot read a Common Lisp source file
correctly without evaluating certain top-level forms and performing any
evaluation required by the #. reader macro), described by the readtable,
which implements support for recursive-descent parsers that may change
the readtable while processing "their" portion of the input stream. (It
is easy to describe the (static) syntax of the standard readtable that
implements the standard language, but that is not sufficient for a real
Common Lisp system.) Common Lisp's syntax is in this particular way very
peculiar and breaks the patterns that have later been established for
context-free grammars and LALR parsers. (Most interesting programming
languages fail to adhere to these formal little things, anyway, but they
are sometimes helpful in classifying things.) Common Lisp's standard
function to read delimited lists expects to terminate on a character and
cannot terminate on the end of input.

Yet another possible pattern is that of considering whitespace completely
ignorable, but also necessary to keep tokens apart, because almost no
other character breaks tokens apart. This means that whitespace differs
from other delimiters in that repeated instances should be collapsed, but
this is not really a feature of the delimiter, but of whitespace.

Since Common Lisp requires explicit start and end markers for almost all
tokens (except symbols), and the end marker is already an argument to
read-delimited-list, I think we have precedent for this argument to a
function that parses a string, too.

Briefly, a "split-sequence" that adheres to these patterns would accept:

1 a single-escape character, which defaults to #\\, but the argument may be
nil to prevent this functionality

2 a designator for a bag (i.e., a character or sequence of characters) of
multiple-escape characters, which defaults to (#\"), but which can
usefully be (#\" #\|) or (#\" #\'). If the bag is empty, no characters
will be treated as multiple-escape characters. If two multiple-escape
characters are adjacent and thus complete a "hard empty" field, it is
returned as an empty string.

3 a designator for a bag of characters that are to be considered whitespace
and are thus to be ignored apart from their effect on terminating a token
unless escaped. E.g., (#\space #\tab).

4 an internal delimiter character that separates tokens, where tokens are
considered "soft empty" if delimiters are adjacent (ignoring whitespace,
if any) and if so, are returned as nil.

5 a designator for a bag of terminating delimiters that cause the parsing
to stop and return the tokens collected. If the bag is nil as a whole,
or the symbol :eof appears in the sequence, only the end of the stream or
string will terminate the
parsed list of tokens, provided that it is not escaped.

I think the name parse-delimited-list would be descriptive and fitting.
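
For concreteness, here is a rough, untested sketch of how arguments 1
through 4 might fit together. The bag of terminating delimiters is left
out for brevity, designators are assumed already normalized into bags,
and unescaped whitespace is simply dropped, which glosses over its
token-terminating role; none of this is a finished interface:

  (defun parse-delimited-list (string &key (escape #\\)
                                           (multiple-escape '(#\"))
                                           (whitespace '(#\space #\tab))
                                           (delimiter #\,))
    (let ((tokens '())
          (token (make-array 0 :element-type 'character
                               :adjustable t :fill-pointer t))
          (escaped nil) (quoted nil) (hard nil))
      (flet ((finish ()
               (push (cond ((plusp (fill-pointer token)) (copy-seq token))
                           (hard "")    ; "hard empty" field
                           (t nil))     ; "soft empty" field
                     tokens)
               (setf (fill-pointer token) 0
                     hard nil)))
        (loop for char across string
              do (cond (escaped             ; single-escaped character
                        (vector-push-extend char token)
                        (setf escaped nil))
                       ((and escape (char= char escape))
                        (setf escaped t))
                       ((find char multiple-escape)
                        (setf quoted (not quoted) hard t))
                       (quoted              ; inside multiple-escapes
                        (vector-push-extend char token))
                       ((char= char delimiter) (finish))
                       ((find char whitespace)) ; ignore if unescaped
                       (t (vector-push-extend char token))))
        (finish))
      (nreverse tokens)))

  (parse-delimited-list "a, \"b,c\", \"\",, d")
  => ("a" "b,c" "" nil "d")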

A note on whitespace. If some character that is usually a whitespace
character is to be a delimiter in its own right, such as #\tab, the bag
of whitespace characters should be empty and it should be the internal
-- or it will prevent adjacent delimiters from creating an empty field.

A note on bags. The function string-trim and friends are the only
functions I can find that accept a bag of anything in Common Lisp. A
bag can be any sequence type, but if non-characters are required, such as
:eof, it must be a list or a vector.

A note on designators for bags of characters. Like make-array, I think
it is useful to supply either a character or :eof without having to put
it in a sequence just for the sake of this argument, so each designates
a singleton sequence containing itself. Note that both nil and a string
of one character are already bags.
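
A tiny, untested sketch of such designator normalization (the name is
invented):

  (defun designated-bag (designator)
    ;; A character or the symbol :eof designates a singleton list
    ;; containing itself; nil, strings, lists, and vectors are
    ;; already bags and pass through unchanged.
    (if (or (characterp designator) (eq designator :eof))
        (list designator)
        designator))

  (designated-bag #\")        => (#\")
  (designated-bag :eof)       => (:EOF)
  (designated-bag "\"|")      => "\"|"
  (designated-bag '(#\" #\|)) => (#\" #\|)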

A note on "hard" and "soft" empty fields. There may already be good
reason to supply a "default list" for fields that are empty, modeled on
make-pathname's defaults argument, but if so, we need a way to distinguish
a field that is supplied as empty from one that is defaulted. In the
presence of multiple-escape characters, it is possible to create the
empty string explicitly ("hard"), as opposed to implicitly by omitting
characters between delimiters ("soft"). Only a "soft empty" field would
be eligible for defaulting from the defaults list. Note that supplying a
short defaults list (which includes not supplying one) will default
elements to nil, which is what a "soft empty" field would be returned as
in the absence of a default list, so it need not be a special case.
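
A rough sketch of such defaulting, with all names invented:

  (defun default-soft-empty-fields (fields defaults)
    ;; nil marks a "soft empty" field and is defaulted; the empty
    ;; string marks a "hard empty" field and is left alone.  Running
    ;; off the end of a short defaults list yields nil, as described.
    (loop for field in fields
          for d = defaults then (rest d)
          collect (or field (first d))))

  (default-soft-empty-fields '("a" nil "" nil) '("x" "y" "z"))
  => ("a" "y" "" nil)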

I hope it is evident how I have picked elements from several functions
already in Common Lisp. I hope the result has a "Common Lisp feel",
which is also what patterns are about to me. I have been a little queasy
about the split-sequence "feel", which to me has a more C- or Perl-like
feel to it. (I know, I posted an early version and I am partly to blame.
But then we age. Or something.)

///
--
The past is not more important than the future, despite what your culture
has taught you. Your future observations, conclusions, and beliefs are
more important to you than those in your past ever will be. The world is
changing so fast the balance between the past and the future has shifted.

Jeff Greif

Nov 28, 2001, 12:37:59 AM
This basically seems like a good idea. However, it might be worthwhile
to allow separate bags for the start and end delimiters in multiple
escapes. Sometimes the tokens might be enclosed like this `some token'
or [other token] or {there exists this token}. It may also be useful to
specify the start and end delimiters in matching pairs, so that
"SELECT * FROM LISPERS WHERE NAME = 'Erik Naggum'" could be treated as a
single token in some circumstances where both single and double-quote
were allowed as multiple-escape delimiters, but starting with
double-quote meant you have to end with double-quote, and other
delimiting characters appearing inside the double quotes would not be
escapes.

Jeff

Erik Naggum

Nov 28, 2001, 2:48:30 AM
* Jeff Greif

| Sometimes the tokens might be enclosed like this `some token' or [other
| token] or {there exists this token}.

The reason this is out of place is that these things naturally nest.

| It may also be useful to specify the start and end delimiters in matching
| pairs, so that "SELECT * FROM LISPERS WHERE NAME = 'Erik Naggum'" could
| be treated as a single token in some circumstances where both single and
| double-quote were allowed as multiple-escape delimiters, but starting
| with double-quote meant you have to end with double-quote, and other
| delimiting characters appearing inside the double quotes would not be
| escapes.

I already covered that. Please have another look:

If there be more patterns of syntax design, another one would involve
escaping a whole sequence of characters so they revert from (potentially)
special to normal interpretation, using a single special character at

--> BOTH ENDS of the sequence, and escaping that character and the escaping
character inside the sequence with the same character that does this for
a single character. All the escaping characters are discarded from the
extracted token. Call this the multiple-escape mechanism.

///

Thomas F. Burdick

Nov 28, 2001, 5:33:36 PM
Erik Naggum <er...@naggum.net> writes:

> This problem occurred to me while trying to explain why comma-delimited
> data formats should not use something as simple as split-sequence (or a
> similar inline loop), so bear with me while I rehash an old issue and
> walk through some background.

Oh, I'm going to have to rewrite my vector splitters with this in
mind. I feel much better about them now. They've been useful for
some things, particularly lines coming in from some Unix utility, but
I've been doing my own parsing when I need escaping. Now that you
pointed this out, I feel kind of foolish for doing this.

This would leave something like split-sequence as a special case of a
more general utility. I still think that a function returning the
entire parsed output should be a special case of a more general
call-while-splitting type of function.

If I have some spare time this evening, I'll try to do this, rewrite
some code to use it, and post any insights I get from that.
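
Roughly the shape I have in mind, as a quick, untested sketch with no
escape handling yet:

  (defun map-delimited-tokens (function string &key (delimiter #\,))
    ;; Call FUNCTION on each delimited token as it is found, instead
    ;; of building the whole list up front.
    (loop for start = 0 then (1+ end)
          for end = (position delimiter string :start start)
          do (funcall function (subseq string start end))
          while end))

  ;; Returning the entire parsed output is then just one caller:
  (defun collect-delimited-tokens (string &key (delimiter #\,))
    (let ((tokens '()))
      (map-delimited-tokens (lambda (token) (push token tokens))
                            string :delimiter delimiter)
      (nreverse tokens)))

  (collect-delimited-tokens "a,b,,c")
  => ("a" "b" "" "c")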

--
/|_ .-----------------------.
,' .\ / | No to Imperialist war |
,--' _,' | Wage class war! |
/ / `-----------------------'
( -. |
| ) |
(`-. '--.)
`. )----'

Jeff Greif

Nov 30, 2001, 10:50:09 PM
I originally sent this on the morning of 28 November, but it never made
it to the newsgroup.

Jeff

"Erik Naggum" <er...@naggum.net> wrote in message
news:32159225...@naggum.net...


> * Jeff Greif
> | Sometimes the tokens might be enclosed like this `some token' or [other
> | token] or {there exists this token}.
>
> The reason this is out of place is that these things naturally nest.
>

According to the Apache webserver docs, this is a line from a webserver
error log (broken to fit newsgroup linewidth):

[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by
server configuration: /export/home/live/ap/htdocs/test


I'm not sure your reason is strong enough to allow such a line to fall
outside the usage of this new split-sequence functionality. A
compromise might be to allow delimiter pairs, but not nesting. Or if
they are used, only the outermost ones will be treated as escapes, and
they must balance inside. This is somewhat ugly.

It might be useful to add one more syntax pattern -- characters that are
punctuation (delimit tokens) but also *are* tokens (unless escaped or
appearing inside the multiple escaped sequence). Using that pattern,
the line above could be tokenized as ( "[" "Wed" "Oct" "11" ... "]" ...)
and a higher level of parsing could handle the rectangular brackets.
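
A quick, untested sketch of that pattern, with invented names and no
escape handling:

  (defun tokenize-with-punctuation (string &key (punctuation '(#\[ #\]))
                                                (whitespace '(#\space)))
    (let ((tokens '())
          (token (make-array 0 :element-type 'character
                               :adjustable t :fill-pointer t)))
      (flet ((finish ()
               (when (plusp (fill-pointer token))
                 (push (copy-seq token) tokens)
                 (setf (fill-pointer token) 0))))
        (loop for char across string
              do (cond ((find char punctuation) ; delimits, and *is* a token
                        (finish)
                        (push (string char) tokens))
                       ((find char whitespace) (finish))
                       (t (vector-push-extend char token))))
        (finish))
      (nreverse tokens)))

  (tokenize-with-punctuation "[Wed Oct 11] [error]")
  => ("[" "Wed" "Oct" "11" "]" "[" "error" "]")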

Jeff


Erik Naggum

Dec 1, 2001, 8:30:57 AM
* Jeff Greif

| I'm not sure your reason is strong enough to allow such a line to fall
| outside the usage of this new split-sequence functionality.

I am. You need a real parser, not just a "split" function to get this
kind of functionality. OK, I called my function "parse-delimited-list"
and that was probably misleading, but if we do this, it will grow into a
full-fledged programmable parser like the Common Lisp reader, and then it
is a better project to make the reader more powerful. (I am working on
that, too, in particular to get a handle on the function that decides
when something is a symbol.)

| It might be useful to add one more syntax pattern -- characters that are
| punctuation (delimit tokens) but also *are* tokens (unless escaped or
| appearing inside the multiple escaped sequence). Using that pattern,
| the line above could be tokenized as ( "[" "Wed" "Oct" "11" ... "]" ...)
| and a higher level of parsing could handle the rectangular brackets.

I think this is a much better design. There is but one drawback, which
may be a feature, and that is that you no longer have any special rules
inside the delimiter pairs, which I would kind of expect for a language
that uses matching delimiters.

Jeff Greif

Dec 1, 2001, 12:19:26 PM

"Erik Naggum" <er...@naggum.net> wrote in message
news:32162022...@naggum.net...

> * Jeff Greif
> | I'm not sure your reason is strong enough to allow such a line to
> | fall outside the usage of this new split-sequence functionality.
>
> I am. You need a real parser, not just a "split" function to get this
> kind of functionality. OK, I called my function "parse-delimited-list"
> and that was probably misleading, but if we do this, it will grow into
> a full-fledged programmable parser like the Common Lisp reader, and
> then it is a better project to make the reader more powerful. (I am
> working on that, too, in particular to get a handle on the function
> that decides when something is a symbol.)
>

Pardon me for ruminating further. I'm not trying to insult your
intelligence by telling you what you already know.

I think the upper bound for the functionality requirement of this tool
is lexical analysis -- quite a bit less than general lexical analysis,
and definitely not full parsing. The needed output is a sequence of
tokens that represent the 'atoms' in the grammar of the little language
that is being read. Parsing can be left to other code. In concrete
terms, when the Apache log file line:

[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by
server configuration: /export/home/live/ap/htdocs/test

is cracked, you want to get either the token

"Wed Oct 11 14:32:52 2000" or "[Wed Oct 11 14:32:52 2000]" or
a sequence of tokens "[" "Wed" "Oct" ... "]" for the first part. You
don't want to get "[Wed" "Oct" ... and have to look inside the token
"[Wed" to see if it contains the grouping construct '[' and then inside
all subsequent tokens until you find the one containing the matching
"]".

Unfortunately, to do lexical analysis of this sort for arbitrary little
languages, you do need something like a finite-state automaton
processor, or the Lisp reader, to support nesting, context-sensitivity,
etc. This new function you're proposing should not attempt to handle
all of these constructs -- for only a restricted subset of little
languages will it be suitable.

One way to decide where to draw the line in the functionality of this
operation is to survey the types of repetitive file formats that people
would like to be able to process using a tool like this. Clearly, what
you've proposed is fine for most numeric tabular data (or exported
spreadsheets), and for textual tables like address books, which might
have entries containing a lot of empty tokens, such as (with comma as
separator):

,,,Ronnie Button,,ron...@phenome.zit.edu,,,,,,,,,,,,,,,,,,,,
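
The parse-delimited-list sketch posted upthread would return such a
line (shortened here) mostly as "soft empty" nil fields, with the
whitespace bag emptied so the space in the name survives:

  (parse-delimited-list ",,,Ronnie Button,,ron...@phenome.zit.edu,,,"
                        :whitespace '())
  => (NIL NIL NIL "Ronnie Button" NIL "ron...@phenome.zit.edu"
      NIL NIL NIL)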

Web server logs are another group. The Apache ones can be handled using
the 'punctuation as delimiter and token' pattern. Presumably other
formats, such as standard configuration files for applications, should
be looked at as well. Probably other tools would be better for XML or
other hierarchically-structured configuration files with arbitrary
nesting. But is there a point in handling whatever sendmail uses these
days, or the windows.ini style of configuration file? Tabular data
produced by Lisp might also be of interest -- perhaps a control string
for cl:format could be used as the specifier of the data format for
reading back. (Only certain kinds could be handled.)

In the end, a formal description of the languages that can be handled
would be helpful to users considering the tool.

Jeff

