Determining whitespace

Matthew X. Economou

Oct 12, 2002, 11:22:36 AM

Is there an implementation-independent way to determine if a character
is considered whitespace? I'm looking for the equivalent of the
isspace() function in the standard C library, but the permuted symbol
index in the CLHS only lists predicates for alphabetic, digit, and
graphics characters. There also doesn't seem to be an implementation-
independent way to query the reader for this information.

Am I missing anything obvious? Will I just have to roll my own
predicate?

--
Matthew X. Economou <xeno...@irtnog.org> - Unsafe at any clock speed!
I'm proud of my Northern Tibetian heritage! (http://www.subgenius.com)
Max told his friend that he'd just as soon not go hiking in the hills.
Said he, "I'm an anti-climb Max." [So is that punchline.]

Erik Naggum

Oct 12, 2002, 1:22:43 PM

* Matthew X. Economou

| Is there an implementation-independent way to determine if a character is
| considered whitespace?

What are you going to do with the result?

--
Erik Naggum, Oslo, Norway

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.

Matthew X. Economou

Oct 12, 2002, 3:25:49 PM

>>>>> "Erik" == Erik Naggum <er...@naggum.no> writes:

Erik> What are you going to do with the result?

I'm writing a library function that parses an IP address embedded in a
string. I'm using PARSE-INTEGER as a model for the function's
behavior. In addition to being able to operate on sub-strings and
ignoring junk, PARSE-INTEGER ignores leading and trailing whitespace,
and I'd like to do the same, using the same definition of whitespace
as the hosting Lisp implementation if possible.

Is this a reasonable thing to do?

--
Matthew X. Economou <xeno...@irtnog.org> - Unsafe at any clock speed!
I'm proud of my Northern Tibetian heritage! (http://www.subgenius.com)

"If it's not on fire, it's a software problem." --Carrie Fish

Johannes Grødem

Oct 12, 2002, 5:38:38 PM

* "Matthew X. Economou" <xenopho...@irtnog.org>:

> Am I missing anything obvious? Will I just have to roll my own
> predicate?

I've tried to find this as well, but with no luck. I use the
following to mean white-space, though:

(#\Tab #\Newline #\Linefeed #\Page #\Return #\Space)

I guess there might be cases where you want some of these not to count
as whitespace.

(I got these from the table in section 2.1.4 of the HyperSpec. Those
are the characters listed as having whitespace-syntax type.)
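
A minimal sketch of a roll-your-own predicate built on the list above. The names *WHITESPACE-CHARACTERS* and WHITESPACE-CHAR-P are made up for illustration and are not part of the standard. It also assumes the implementation recognizes the semi-standard names #\Tab, #\Linefeed, #\Page and #\Return.

```lisp
;; Hypothetical names; nothing here is defined by the standard.
(defparameter *whitespace-characters*
  '(#\Tab #\Newline #\Linefeed #\Page #\Return #\Space)
  "Characters treated as whitespace by WHITESPACE-CHAR-P.")

(defun whitespace-char-p (char)
  "Return true if CHAR is one of *WHITESPACE-CHARACTERS*."
  (and (member char *whitespace-characters* :test #'char=) t))
```

Keeping the list in a special variable makes it easy to rebind for the cases where some of these characters should not count.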

--
Johannes Grødem <OpenPGP: 5055654C>

Erik Naggum

Oct 12, 2002, 6:32:11 PM

* Matthew X. Economou

| I'm writing a library function that parses an IP address embedded in a
| string.

Since an IP address may be several different things, I think the function
should be separated into two parts: one that searches for an IP address
(however defined: IPv4, IPv6, abbreviated or full), and several functions
that accept whatever passes for IP addresses and return the appropriate
address structure. I have found that I need CIDR coding with both /n and
/mask, but in other cases, /port is used. Sometimes, even .port is used
(which does not work with abbreviated IP addresses), although I consider
the smartest choice to be :port with IPv4 and /port with IPv6. When you
make this separation of functionality, there should be no need to know
what the whitespace characters are. Actually processing everything that
people do with IP addresses is fascinatingly complex. Many losers have
no concern for parsability of the output from their programs. *sigh*

Surprisingly often, wanting to know if you look at a whitespace character
means that you have chosen a less-than-ideal approach to the solution.
If you parse using a stream, `peek-char´ has a skip-whitespace option.

Matthew X. Economou

Oct 12, 2002, 6:14:18 PM

>>>>> "Johannes" == Johannes Grødem <joh...@ifi.uio.no> writes:

Johannes> (I got these from the table in section 2.1.4 of the
Johannes> HyperSpec. These are the characters listed as having
Johannes> whitespace-syntax type.)

This gave me an idea. Since I'm consciously trying to mimic the
behavior of PARSE-INTEGER, especially its ability to parse substrings
via the START and END arguments, I have to manually track my position
within the string. It would be a lot easier to treat the string as a
string stream via WITH-INPUT-FROM-STRING, as I get both substrings and
bounds checking for free with streams.

The other nice thing this gives me is PEEK-CHAR, which with a peek
type of T, peeks ahead to the first non-whitespace character in the
stream.
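
Roughly, the idea looks like this. It is only a sketch: FIRST-NON-WHITESPACE is a hypothetical helper name, while WITH-INPUT-FROM-STRING and PEEK-CHAR are the standard operators. What counts as whitespace here is decided by the implementation and the current readtable, not by the caller.

```lisp
;; Sketch only: FIRST-NON-WHITESPACE is a hypothetical helper name.
(defun first-non-whitespace (string &key (start 0) end)
  "Return the first non-whitespace character of STRING between START and END,
or NIL if there is none."
  (with-input-from-string (in string :start start :end end)
    ;; Peek type T skips whitespace; the character found is not consumed.
    ;; EOF-ERROR-P is NIL, so an all-whitespace region yields NIL.
    (peek-char t in nil nil)))
```

For example, (first-non-whitespace "   10.0.0.1") should return #\1, and an all-blank string should return NIL.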

I definitely need this behavior at the start of parsing, and I think I
can make it work to end parsing.

Thanks for the help! I'll be sure to post the code when I'm done.

Matthew X. Economou

Oct 13, 2002, 12:59:25 PM

>>>>> "Erik" == Erik Naggum <er...@naggum.no> writes:

Erik> Since an IP address may be several different things, I think
Erik> the function should be separated into two parts: one that
Erik> searches for an IP address (however defined: IPv4, IPv6,
Erik> abbreviated or full), and several functions that accept
Erik> whatever passes for IP addresses and return the appropriate
Erik> address structure.

I think I'm on the right track. The code I'm writing now (the
PARSE-ADDRESS function) handles only IPv4 dotted-quads. I thought it
would be a lower-level function suitable for use in a reader macro (or
other user-input routine), just as PARSE-INTEGER seems to be used by
READ.
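
To illustrate the "parse only" half of that split, here is a rough sketch of a dotted-quad parser. It is not the PARSE-ADDRESS code mentioned above; PARSE-DOTTED-QUAD is a hypothetical name, and the choice to return a 32-bit integer (or NIL on failure) is an assumption. It expects the caller to have already located the address, so START and END should bound exactly the dotted quad.

```lisp
;; Hypothetical helper, not the author's PARSE-ADDRESS.
(defun parse-dotted-quad (string &key (start 0) (end (length string)))
  "Parse \"a.b.c.d\" in STRING between START and END.
Return the address as a 32-bit integer, or NIL if the text is not a
well-formed dotted quad."
  (let ((address 0)
        (pos start))
    ;; After four octets, succeed only if the whole region was consumed.
    (dotimes (i 4 (and (= pos end) address))
      (multiple-value-bind (octet next)
          (parse-integer string :start pos :end end :junk-allowed t)
        ;; OCTET is NIL when no digits were found at POS.
        (unless (and octet (<= 0 octet 255))
          (return nil))
        (setf address (+ (* address 256) octet))
        (cond ((= i 3)
               (setf pos next))
              ;; The first three octets must each be followed by a dot.
              ((and (< next end) (char= (char string next) #\.))
               (setf pos (1+ next)))
              (t (return nil)))))))
```

Because it leans on PARSE-INTEGER, this sketch inherits that function's leniency (a leading sign or leading whitespace on an octet, for instance), which a stricter parser would want to reject.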

Erik> Actually processing everything that people do with IP
Erik> addresses is fascinatingly complex.

I didn't realize how complicated it could be until I took a look at
the source code to the IP address parsing routines in several
different operating systems and resolver libraries.

Erik> Surprisingly often, wanting to know if you look at a
Erik> whitespace character means that you have chosen a
Erik> less-than-ideal approach to the solution. If you parse
Erik> using a stream, `peek-char´ has a skip-whitespace option.

I was processing the input string character by character, instead of
converting it to a string-stream.

--
Matthew X. Economou <xeno...@irtnog.org> - Unsafe at any clock speed!
I'm proud of my Northern Tibetian heritage! (http://www.subgenius.com)

Erik Naggum

Oct 13, 2002, 7:28:21 PM

* Matthew X. Economou

| I thought it would be a lower-level function suitable for use in a reader
| macro (or other user-input routine), just as PARSE-INTEGER seems to be
| used by READ.

`parse-integer´ is not used by `read´.

| I didn't realize how complicated it could be until I took a look at the
| source code to the IP address parsing routines in several different
| operating systems and resolver libraries.

No kidding. People do so many horrible things you could cry.

| I was processing the input string character by character, instead of
| converting it to a string-stream.

Ideally, a string-stream should be better from all perspectives, but is
often much more expensive than need be.

Tim Bradshaw

Oct 13, 2002, 7:42:04 PM

* Matthew X Economou wrote:

String streams are COOL and should be used for almost everything.
Real Lisp Programmers use string streams instead of lists. If your
implementor makes string streams expensive, complain vigorously.

(half serious)

--tim

Pekka P. Pirinen

Oct 14, 2002, 1:18:00 PM

"Matthew X. Economou" <xenopho...@irtnog.org> writes:
> Is there an implementation-independent way to determine if a character
> is considered whitespace? I'm looking for the equivalent of the
> isspace() function in the standard C library,

Considered by whom? The thing is, it depends. In the standard,
there's the whitespace syntax type, and then there's whitespace(1),
which is independent of the readtable (and there's no standard way to
determine if a character is either of those). But if you're parsing
some non-CL syntax, that's the wrong place to look; you should look at
the definition of that syntax. And then you should spare a moment to
think about possible extension to character sets other than ASCII: Do
you want, e.g., U+00A0 No-Break Space or U+3000 Ideographic Space?
--
Pekka P. Pirinen
In cyberspace, everybody can hear you scream. - Gary Lewandowski