how to tokenize a string in Lisp

cartercc

Feb 1, 2009, 11:09:57 PM

Sorry, guys, but I'm just a Lisp newbie. My main language is Perl and
in my day job I do a lot of data transformation. In Perl, I can do
this -

$string = '"Mr.","John","J.","Jones","Jr."';
@array = split /","/, $string;
print "ARRAY => @array\n";
($title, $first, $middle, $last, $suffix) = split /","/, $string;
print "SCALARS => ($title, $first, $middle, $last, $suffix)\n";

which creates a comma quoted string and tokenizes it twice, once as an
array and once as a series of five scalar variables. This outputs the
following:

ARRAY => "Mr. John J. Jones Jr."
SCALARS => ("Mr., John, J., Jones, Jr.")

How would you tokenize a comma quoted string in Lisp?

TIA, CC

mil...@gmail.com

Feb 1, 2009, 11:44:27 PM

http://www.cliki.net/SPLIT-SEQUENCE
By the way, sometimes Google is a better and faster way to find answers
to trivial questions like this one; you just need Google, or maybe
cliki.net.
I don't believe that someone who wants to learn basic things needs to
ask on the group.

William James

Feb 2, 2009, 7:54:01 AM

Madhu wrote:

> * cartercc
> Wrote on Sun, 1 Feb 2009 20:09:57 -0800 (PST):
>
> > Sorry, guys, but I'm just a Lisp newbie. My main language is Perl
> > and in my day job I do a lot of data transformation. In Perl, I
> > can do this -
> >
> > $string = '"Mr.","John","J.","Jones","Jr."';
>
> Typically one would use one of the many CSV file parsers (you can
> search CLL for these) in Lisp to handle such data. However, in this
> case the strings can be readily parsed by the Common Lisp reader,
> which can be trivially modified to handle the commas.
>
> This is perhaps a more advanced technique not suitable for the rank
> newbie, nevertheless....
>
> I assume the actual data comes from a file like this:
>
> # cat <<_EOF > file.text
> "Mr.","John","J.","Jones","Jr."
> "Mrs.","Williams","H.","Penguin","Sr."
> _EOF
>
> The following code illustrates how you can copy a readtable
> (*COMMA-RT*), change the syntax for the #\, character, and wrap up
> calls to CL:READ after binding CL:*READTABLE* so as to parse the file
> into a list of lists.
>
> (in-package :cl-user)
>
> (defparameter *comma-rt* (copy-readtable nil))
> (set-syntax-from-char #\, #\Space *comma-rt*)
>
> (defun read-comma-separated-file (file)
>   (let ((*read-eval* nil)
>         (*readtable* *comma-rt*))
>     (with-open-file (stream file)
>       (loop for line = (read-line stream nil)
>             while line
>             when (with-input-from-string (stream line)
>                    (loop for elem = (read stream nil)
>                          while elem
>                          collect elem))
>               collect it))))
>
> This function would then be used as:
>
> * (read-comma-separated-file "file.text")
>
> => (("Mr." "John" "J." "Jones" "Jr.")
>     ("Mrs." "Williams" "H." "Penguin" "Sr."))
>
> If you are serious about learning, you should look up the
> specification of each new function in the standard (ANS, Common
> Lisp HyperSpec) to see exactly what it is defined to do.

p IO.readlines("file.text").map{|line|
line.gsub(/^"|"\n/,"").split /","/ }

--- output ---
[["Mr.", "John", "J.", "Jones", "Jr."],
["Mrs.", "Williams", "H.", "Penguin", "Sr."]]

cartercc

Feb 2, 2009, 9:25:19 AM

On Feb 2, 12:15 am, Madhu <enom...@meer.net> wrote:
> Typically one would use one of the many CSV file parsers (you can search
> CLL for these) in Lisp to handle such data. However, in this case the
> strings can be readily parsed by the Common Lisp reader, which can be
> trivially modified to handle the commas.
>
> This is perhaps a more advanced technique not suitable for the rank
> newbie, nevertheless....
>
> I assume the actual data comes from a file like this:

You are absolutely correct! I have very little discretionary time, and
am attempting to learn Lisp in about ten minutes a day, but if I can
start using Lisp to do real work, then I have an excuse to use it
during work time. At work my typical practice is to open an infile and
an outfile, read each row from the infile, manipulate each row, and
write it to the outfile. This is a typical Perl routine for doing
this:
--------------typical Perl routine----------------
open INFILE, "<", "infile.txt";
open OUTFILE, ">", "outfile.txt";
while (<INFILE>) #reads each row until EOF
{
    chomp; #removes the newline from the row
    $_ =~ s/^"//; #removes the leading double quote char
    $_ =~ s/"$//; #removes the trailing double quote char
    @row = split /","/; #reads each datum into an array element
    #manipulate data as necessary to accomplish task
    #when finished with data transformation, then
    $row = '"' . join('","', @row) . '"'; #creates a comma quoted string from @row
    print OUTFILE $row, "\n"; #writes $row to OUTFILE with newline
}
close OUTFILE;
close INFILE;

My motive was to break this process down into bite sized chunks, and
you hosed me! That's okay, but it's really too advanced for what I
need now.

Here's the deal: I'm used to dealing with individual items of data,
like ID numbers, email addresses, names, dates, all kinds of numbers
and identifiers, etc. I assign these to variables so I can manipulate
them in an automated fashion, as in a loop. Your solution does
tokenize the input string, which is what I asked, but I also need to
assign the individual elements to variables so that I can learn how to
manipulate the elements. IOW, I don't need a list, but an array or a
set of scalars that I can manipulate.

I guess that I should have asked how to tokenize a string AND ASSIGN
EACH TOKEN TO ITS OWN VARIABLE. Anyway, when I have time, I'm going to
try your solution and milanj's, and report back. It will probably be
several days. In the meantime, thanks for your response.

CC

Tamas K Papp

Feb 2, 2009, 9:39:40 AM

On Mon, 02 Feb 2009 06:25:19 -0800, cartercc wrote:

> attempting to learn Lisp in about ten minutes a day, but if I can start

That ain't gonna work; you are wasting your time. There is a minimal
fixed cost (in time) if you want to learn anything. If you don't have
that, it is futile and you will just be frustrated.

Tamas

TomSW

Feb 2, 2009, 9:44:26 AM

On Feb 2, 3:25 pm, cartercc <carte...@gmail.com> wrote:

> during work time. At work my typical practice is to open an infile and
> an outfile, read each row from the infile, manipulate each row, and
> write it to the outfile. This is a typical Perl routine for doing
> this:

Another typical Perl technique is to have a look in CPAN to see if
anyone's already created a more robust version of the wheel you're
considering inventing :)

There's no real equivalent of CPAN for Lisp; Google and cliki.net come
closest. Google turned up something called "csv-parser" that (what
d'you know) includes a macro to iterate through CSV records and bind
the fields to variables.

Also check out destructuring-bind:

Perl:
($name, $rank, $number) = @array;

Lisp:
(destructuring-bind (name rank number)
    list
  ...)

destructuring-bind can destructure extended lambda lists, which can be
a lot more complex than simple lists.
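
To make that concrete, here is a small sketch that applies
destructuring-bind to the five name fields from the original question.
The function name FORMAT-NAME and the output layout are made up for
illustration; only DESTRUCTURING-BIND and FORMAT are standard.

```lisp
;; Hypothetical helper: bind the five fields of one record to variables.
(defun format-name (fields)
  "FIELDS is a list of exactly five strings:
title, first name, middle initial, last name, suffix."
  (destructuring-bind (title first-name middle last-name suffix) fields
    ;; Each field is now an ordinary lexical variable.
    (format nil "~A, ~A ~A (~A ~A)" last-name first-name middle title suffix)))

(format-name '("Mr." "John" "J." "Jones" "Jr."))
;; => "Jones, John J. (Mr. Jr.)"
```

If the list does not have exactly five elements, the pattern does not
match and an error is normally signaled, which is a handy sanity check
on malformed rows.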

Espen Vestre

Feb 2, 2009, 9:55:02 AM

But I think 70 minutes once a week would be way better than ten minutes
a day. From trying to do difficult things while commuting, I'd say that
20 minutes is the absolutely minimal time unit. Below that, and you
waste most of the time trying to figure out where you were.
--
(espen)

cartercc

Feb 2, 2009, 10:20:46 AM

The expression 'ten minutes' was not meant to be taken literally, but
figuratively. I'm working my way through Wilensky, and my goal is to
read a chunk of the text and WRITE(!) some Lisp code daily.

There's no way you can learn a skill without practice, whether it be
cooking, golf, playing a musical instrument, or programming.
Practicing (at least) ten minutes a day is much better than practicing
zero minutes a day, and daily practice means you can build on
yesterday's progress. It's obviously better to spend more time
developing a skill than less time, but the way to do that is to reach
a point where you are no longer practicing but performing, and the way
to do that is to turn practice into performance.

CC.

gtod

Feb 2, 2009, 12:32:58 PM

On Feb 2, 2:25 pm, cartercc <carte...@gmail.com> wrote:
> I guess that I should have asked how to tokenize a string AND ASSIGN
> EACH TOKEN TO ITS OWN VARIABLE.

I assume you have installed cl-ppcre. It provides a Perl like regular
expression library for Common Lisp (CL).

Lines starting with semicolons (;) are comments in CL. I show the line
in Perl as a comment first, then the similar CL line.

Unfortunately there is no built in way in CL to avoid escaping the
double quote characters in the string. However, usually we would read
such a line from a file so this ugliness would not be apparent. Also,
we could define a read macro to make this neater if we needed to do it
often...

;; $string = '"Mr.","John","J.","Jones","Jr."';
(let ((string "\"Mr.\",\"John\",\"J.\",\"Jones\",\"Jr.\""))
  ;; @array = split /","/, $string;
  (let ((row (cl-ppcre:split "\",\"" string)))
    ;; print "ARRAY => @array\n";
    (format t "ARRAY => ~A~%" row)
    ;; ($title, $first, $middle, $last, $suffix) = split /","/, $string;
    (destructuring-bind (title first middle last suffix)
        (cl-ppcre:split "\",\"" string)
      ;; print "SCALARS => ($title, $first, $middle, $last, $suffix)\n";
      (format t "SCALARS => (~A, ~A, ~A, ~A, ~A)~%"
              title first middle last suffix))))

This is not an ideal solution to your larger problem, but it is a
direct answer to your question which I hope you find useful.
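
For the larger problem (read each row, transform it, write it back
out), a rough sketch using only standard functions, without cl-ppcre,
might look like the following. All four function names are hypothetical,
and the code assumes every line starts and ends with a double quote and
that no field contains the sequence "," itself.

```lisp
;; Hypothetical helpers; only standard Common Lisp functions are used.
(defun split-row (line)
  "Split a line such as \"a\",\"b\" into the list (\"a\" \"b\")."
  (let ((limit (1- (length line))))         ; index of the trailing quote
    (loop for left = 1 then (+ right 3)     ; skip leading quote, then each \",\"
          for right = (search "\",\"" line :start2 left :end2 limit)
          collect (subseq line left (or right limit))
          while right)))

(defun join-row (fields)
  "Inverse of SPLIT-ROW: quote each field and separate fields with commas."
  (format nil "~{\"~A\"~^,~}" fields))

(defun transform-rows (in out function)
  "Read rows from stream IN, apply FUNCTION to each list of fields,
and write the resulting row to stream OUT."
  (loop for line = (read-line in nil)
        while line
        when (>= (length line) 2)           ; ignore blank or too-short lines
          do (write-line (join-row (funcall function (split-row line))) out)))

(defun transform-file (infile outfile function)
  "Apply TRANSFORM-ROWS to the file INFILE, writing the result to OUTFILE."
  (with-open-file (in infile)
    (with-open-file (out outfile :direction :output :if-exists :supersede)
      (transform-rows in out function))))
```

A call such as (transform-file "infile.txt" "outfile.txt" (lambda
(fields) (mapcar #'string-upcase fields))) would then play the role of
the Perl while loop, with the lambda standing in for the "manipulate
data" step.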

I programmed in Perl for many years and have been learning Lisp these
last few years. Perl was built (in part) to work easily with files
and strings and regular expressions. If we think of the domain of
text munging as a dirt track then Perl is like a BMX bicycle, designed
to work well off road. Maybe Ruby is a smaller, flashier sort of
bicycle that people make 'ooh' noises at as it zips past.

So because of the need to mandate cl-ppcre, and to qualify the
ugliness of escaping lots of double quotes it might appear that Lisp
is not as suitable as Perl for text munging. But that is only true
for as long as it takes you to work out that Lisp is not yet another
type of bicycle but rather a 50 year old man who has been making
bicycles with his bare hands since before you were born.

It took me a long time to work this out and there is no short cut.
But I think it is worth it.

cartercc

Feb 2, 2009, 1:08:32 PM

gtod,

Thanks, this was exactly what I needed to see. I understand that it's
not idiomatic Lisp, but it does show me something about Lisp (much as
an interlinear translation of Cicero isn't idiomatic English, but it
does show how Latin relates to English).

Two things:

(1) Perl was created for exactly this kind of work while Lisp wasn't,
and it makes sense that a tool made for a particular job will do that
job better than a superior tool NOT made for a particular job. I like
your comparison of data munging (which is exactly what I do) to a dirt
track. Lisp may help me manufacture data mungers, which is one reason
why I am learning it.

(2) I don't have cl-ppcre. If I have trouble getting it, I'll
certainly ask for help.

Thanks, CC.

Thomas A. Russ

Feb 3, 2009, 9:26:08 PM

cartercc <cart...@gmail.com> writes:

> I guess that I should have asked how to tokenize a string AND ASSIGN
> EACH TOKEN TO ITS OWN VARIABLE. Anyway, when I have time, I'm going to
> try your solution and milanj's, and report back. It will probably be
> several days. In the meantime, thanks for your response.

Well, unless there is a fixed format, or else a very limited number of
potential tokens in the string, you really don't want to assign each
token to its own variable.

I would expect that having a list or vector would be a lot more useful.
The list if you expect to process items serially and the vector if you
want to have random or indexed access to the elements.

It certainly seems a lot clearer to me to have vector (array) accessors
for indexed items rather than variables named arg1, arg2, arg3, arg4,
etc. Variables would only make sense if you had fixed meanings for the
positions in the vector. Even then, it might make more sense to still
use a vector and write your own accessor functions or macros:

(defmacro get-title (data) `(aref ,data 0))
(defmacro get-first-name (data) `(aref ,data 1))
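
The accessor-function variant could be sketched like this. The record
layout and the RECORD- names are invented for illustration; a SETF
function is added so that the accessor can also be used to store a new
value.

```lisp
;; Hypothetical record layout: #(title first-name middle last-name suffix)
(defun record-title (record) (aref record 0))
(defun record-last-name (record) (aref record 3))

;; A SETF function makes the accessor writable as well as readable.
(defun (setf record-last-name) (new-value record)
  (setf (aref record 3) new-value))

(let ((record (vector "Mr." "John" "J." "Jones" "Jr.")))
  (setf (record-last-name record)
        (string-upcase (record-last-name record)))
  (list (record-title record) (record-last-name record)))
;; => ("Mr." "JONES")
```

Functions can be passed to MAPCAR and friends, which macros cannot, so
they are often the more convenient choice unless the macro form is
needed for some other reason.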

--
Thomas A. Russ, USC/Information Sciences Institute

Rob Warnock

Feb 3, 2009, 11:46:26 PM

cartercc <cart...@gmail.com> wrote:
+---------------

| (2) I don't have cl-ppcre. If I have trouble getting it, I'll
| certainly ask for help.

+---------------

See <http://www.cliki.net/CL-PPCRE>.

Also look at the following CL functions, which you *do* already have:

POSITION
SEARCH
MISMATCH
PARSE-INTEGER
SUBSEQ
REPLACE
CONCATENATE

To get the most from these, you will need to read & understand
the sections in the CLHS about "bounding index designators"
and the :START and :END [and sometimes :START2 and :END2]
keyword arguments which nearly all sequence functions take.
Also learn about the :KEY and :TEST keyword arguments, again,
which nearly all sequence functions take. [Oh, and :FROM-END, too.]
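
As a small illustration of those keyword arguments, here is a sketch
that picks a date string apart with POSITION and PARSE-INTEGER and never
calls SUBSEQ. The function name PARSE-ISO-DATE is made up, and the code
assumes a well-formed "YYYY-MM-DD" string with exactly two dashes.

```lisp
;; Hypothetical helper: parse "YYYY-MM-DD" without consing substrings.
(defun parse-iso-date (string)
  "Return the year, month and day of a string such as \"2009-02-01\" as a list."
  (let ((dash1 (position #\- string))               ; first dash
        (dash2 (position #\- string :from-end t)))  ; last dash
    (list (parse-integer string :end dash1)
          (parse-integer string :start (1+ dash1) :end dash2)
          (parse-integer string :start (1+ dash2)))))

(parse-iso-date "2009-02-01")
;; => (2009 2 1)

;; :TEST swaps the comparison function; here a case-insensitive match.
(position "JONES" '("Mr." "John" "J." "Jones" "Jr.") :test #'string-equal)
;; => 3

;; :KEY extracts the part to compare; :FROM-END searches from the right.
(position #\J '("Mr." "John" "J." "Jones" "Jr.")
          :key (lambda (field) (char field 0))
          :from-end t)
;; => 4
```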

Additional hints:

- Coming from a C or Perl world, you may find the following bits
of syntactic sugar helpful:

    (defun strcat (&rest strings)
      (apply #'concatenate 'string strings))

    (define-compiler-macro strcat (&rest strings)
      `(concatenate 'string ,@strings))

    (defun join (delimiter &rest strings)
      (apply #'concatenate 'string
             (if (zerop (length delimiter))   ; If explicit "" or NIL.
                 strings                      ; do short-circuit optimization.
                 (loop for s on strings       ; Long way.
                       collect (car s)
                       when (cdr s)
                         collect delimiter))))

- MISMATCH is one of the more underappreciated string-bashing functions
  in CL, since it actually tells you how much *was* matched. ;-}
  Very useful [especially with the :START2/:END2 options] to
  tell whether a (possibly-abbreviated) fixed substring exists
  at some specific location in a string, *without* having to do
  a SUBSEQ first to extract the portion to be tested. [Avoids
  unnecessary consing.]
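
A minimal sketch of both uses follows. The names SUBSTRING-AT-P and
MATCHED-LENGTH are invented for this example; MISMATCH itself and its
:START2/:END2 arguments are standard.

```lisp
;; Hypothetical helper: is SUBSTRING present in STRING at index START?
(defun substring-at-p (substring string start)
  "True if SUBSTRING occurs in STRING beginning exactly at index START."
  (let ((end (+ start (length substring))))
    (and (<= end (length string))
         ;; With equal-length regions, MISMATCH returns NIL on a full match.
         (null (mismatch substring string :start2 start :end2 end)))))

;; Hypothetical helper: how many leading characters agree?
(defun matched-length (candidate string &key (start 0))
  "Number of leading characters of CANDIDATE that match STRING from START."
  ;; MISMATCH returns an index into CANDIDATE, or NIL if nothing differs.
  (or (mismatch candidate string :start2 start)
      (length candidate)))

(substring-at-p "Jones" "Mr. John Jones" 9)   ; => T
(substring-at-p "Jones" "Mr. John Jones" 4)   ; => NIL
(matched-length "Mr." "Mrs. Penguin")         ; => 2
(matched-length "Mrs." "Mrs. Penguin")        ; => 4
```

Neither function copies any part of STRING; the bounding indices do all
the work.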

-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

webmasterATflymagnetic.com

Feb 4, 2009, 8:44:43 PM

Sorry but I can't agree. It is precisely answers to questions like
this, and the Lisp code examples that seem so difficult to find, that
help to sell the Lisp idea to newbies. Reading code is a vital part of
learning a language, and this group is good at dishing out useful
snippets.

A Google search for Lisp code repositories is, frankly, useless. The
links that appear are 15-20 years old, and don't exist anymore. So
keep up the good work guys. Answers to 'trivial' questions like this
are what will help bring new blood into the Lisp fold. Being sniffy
and saying it's beneath this group is just silly.

If there is a positive value to be got from the replies, then the OP
was correct.

jos...@corporate-world.lisp.de

Feb 4, 2009, 9:33:55 PM

On Feb 5, 2:44 am, "webmasterATflymagnetic.com"

http://common-lisp.net/projects.shtml

adw-charting
ait
alexandria
anaphora
ansi-test
araneida
armedbear
armish
asdf
asdf-addons
asdf-binary-locations
asdf-install
asdf-packaging
asdf-system-connections
aspectl
babel
bayescl
bdb
beep
beirc
berkeley-db
bese
bknr
bordeaux-threads
boston-lisp
bpm
bytemap
c2ffi
caleb
cdr
cello
cells
cells-gtk
cells-ode
cffi
chemboy
cinline
cl-applescript
cl-berkeley-db
cl-bio
cl-blog
cl-buchberger
cl-bzip2
cl-cactus-kev
cl-cairo2
cl-captcha
cl-carbon
cl-cli-parser
cl-clickatell
cl-colors
cl-component
cl-cont
cl-containers
cl-couch
cl-cracklib
cl-curl
cl-darcs
cl-darx
cl-date-calc
cl-def
cl-dises
cl-dwim
cl-emb
cl-enumeration
cl-facebook
cl-fltk
cl-ftp
cl-fuse
cl-gd
cl-gdbm
cl-glpk
cl-godb
cl-graph
cl-graphviz
cl-gsl
cl-interpol
cl-ipc
cl-irc
cl-irregsexp
cl-jpeg
cl-json
cl-kanren-trs
cl-l10n
cl-lazy-list
cl-lexer
cl-librarian
cl-libtai
cl-magick
cl-markdown
cl-match
cl-mathstats
cl-memcached
cl-menusystem
cl-migrations
cl-mp3-parse
cl-mpd
cl-muproc
cl-ncurses
cl-net-snmp
cl-objc
cl-octave
cl-ode
cl-opengl
cl-openid
cl-pdf
cl-peg
cl-perec
cl-plplot
cl-plus-ssl
cl-pop
cl-ppcre
cl-prevalence
cl-quasi-quote
cl-rdbms
cl-rope
cl-sbml
cl-screen
cl-selenium
cl-semantic
cl-serializer
cl-smogames
cl-smtp
cl-snmp
cl-soap
cl-sockets
cl-sqlite
cl-stm
cl-store
cl-syntax-sugar
cl-syslog
cl-taint
cl-telnetd
cl-trane
cl-typesetting
cl-unification
cl-uri
cl-utilities
cl-variates
cl-walker
cl-wav-synth
cl-wdim
cl-weblocks
cl-who
cl-wiki
cl-x86-asm
cl-xml
cl-xmlspam
cl-xmms
cl-xmpp
clTcl
clappa
clarity
claw
clazy
clbuild
cldoc
clfswm
clget
clhp
clim-desktop
climacs
climplayer
clkd
clnuplot
clo
closer
closure
clotnet
clouchdb
cloud
clpython
clsql-fluid
clsql-mysql-introspect
cmucl
common-math
computed-class
core-services
corman-sdl
cparse
crypticl
css-sexp
cxml
decl
defclass-star
defdoc
defeditor
definer
defmud
defplayer
defwm
docudown
docutrack
drakma
dynamic-classes
dyslexia
ecl
ecl-readline
eclipse
editor-hints
elephant
encline
erlang-in-lisp
erlisp
external-program
fetter
flexi-streams
flexichain
fomus
fret
fset
ftd
fucc
funds
gamelib
ganelon
geco
gecol
geohash
geometry
glouton
gnucard
grand-prix
graphic-forms
gsharp
gsll
gzip-stream
ht-ajax
html-template
hyperdoc
hyperspec-lookup
iaxphone
ieee-floats
ieeefp-tests
imago
innen
iolib
isidorus
iso8601-date
iterate
iterate-clsql
jess-parse
jnil
kpax
lambda-gtk
lgtk
liards
lifp
lift
limp
lineal
linedit
lisp-on-lines
lisp-res-kit
lispbox
lispfaq
lisppaste
lispy
lmud
local-time
log4cl
log5
lost+found
lw-vim-mode
macho
macondoolap
mcclim
mel-base
meta-cvs
metabang-bind
metacopy
metatilities
metatilities-base
misc-extensions
misrouted
mod-lisp
modisc
moptilities
morphologie
movies
movitz
names-and-paths
nio
nixies
noctool
nrw-xmcl
nxtlisp
objective-cl
oct
openair
osicat
pal
parenscript
parse-declarations
patg
patty
pg
pg-introspect
phemlock
phorplay
plain-odbc
plexippus-xpath
ply
portage-overlay
postmodern
pretty-function
progintellect
protobuf
py-configparser
python-on-lisp
qitab
quiz
rcl
rclg
rdnzl
rfc2109
rfc2388
rfc2822
rjain-utils
rucksack
s-xml
s-xml-rpc
same
sapaclisp
sb-simd
sexpc
simple-http
slime
snmp1
sparklines
spray
sqlisp
ssc
stamp
stdutil
steeldump
stefil
suave
submarine
tbnl
the-feebs-war
tinaa
tioga
trivial-backtrace
trivial-freeimage
trivial-http
trivial-iconv
trivial-shell
trivial-timeout
trivial-utf-8
ubf
ucs-sort
ucw
ucw-extras
umpa-lumpa
unetwork
usocket
vial
wispylisp
xcvb
xml-psychiatrist
xmls
xuriella
yason
zip
zlib

webmasterATflymagnetic.com

Feb 5, 2009, 12:54:59 PM

On Feb 5, 2:33 am, "jos...@corporate-world.lisp.de" <jos...@corporate-

Cool -- that worked. ;-)

Joel J. Adamson <adamsonj>

Feb 5, 2009, 11:41:08 AM

On 2009-02-05, webmasterATflymagnetic.com <webm...@flymagnetic.com> wrote:
> On Feb 2, 4:44 am, "mil...@gmail.com" <mil...@gmail.com> wrote:
>> On Feb 2, 5:09 am, cartercc <carte...@gmail.com> wrote:
>>
>> > How would you tokenize a comma quoted string in Lisp?
>>

> Sorry but I can't agree. It is precisely answers to questions like
> this, and the Lisp code examples that seem so difficult to find, that
> help to sell the Lisp idea to newbies. Reading code is a vital part of
> learning a language, and this group is good at dishing out useful
> snippets.

I agree with you but for a different reason: look at some books on
Lisp and you'll find examples of parsing strings. This is a basic
part of ELIZA (only one classic Lisp program). There is also a great
example of this in doctor.el that comes with GNU Emacs. The reason it
may not be posted on anybody's site is that people who've learned Lisp
pedagogically (from books or by direct instruction) think this is
really basic.

Here's one way to do it (and yes "there's more than one way to do it")

(defun get-file-as-strings (file)
  "Collect the lines of a file as strings."
  (with-open-file (infile file)
    (loop for line = (read-line infile nil nil)
          while line
          collect line)))

(defun string-to-read-list (strng &key (comment-char #\;))
  "Take a string and read it as a list."
  (if (with-input-from-string (st strng)
        ;; EQL rather than EQ: EQ is not guaranteed to work on characters.
        (eql (peek-char t st) comment-char))
      nil
      (read-from-string (concatenate 'string "(" strng ")"))))

Now your string is a list of tokens, but a slight modification would
tokenize it into individual strings. From there on you can use
`read' to deal with the tokens.

@webmaster: Now, as to my agreement: it's examples like this that
should show beginners that Lisp deals with things fundamentally
differently from other languages. It's taken me a long time to get used
to it. Perl is already set up for text-processing: you can make Lisp
do that, as much as you can make it do anything else, but you are the
one who decides how to do it. What "newbies" need to understand is
"there is no spoon." As Paul Graham has said, Lisp is not so much a
programming language as it is an algorithmic abstraction.
"Quick-and-dirty" is not a good way to learn.

@OP: Read On Lisp, Practical Common Lisp, Paradigms of Artificial
Intelligence Programming and keep a copy of CLTL2 on hand, and after a
while things will come to you. Spending a little time with Scheme
might also be a good idea.

Joel

--
Joel J. Adamson -- http://www.unc.edu/~adamsonj
University of North Carolina at Chapel Hill
CB #3280, Coker Hall
Chapel Hill, NC 27599-3280

WJ

Feb 17, 2011, 7:24:38 AM

Rob Warnock wrote:

> (defun join (delimiter &rest strings)
>   (apply #'concatenate 'string
>          (if (zerop (length delimiter))   ; If explicit "" or NIL.
>              strings                      ; do short-circuit optimization.
>              (loop for s on strings       ; Long way.
>                    collect (car s)
>                    when (cdr s)
>                      collect delimiter))))

guile> (string-join '("a" "b") "--")
"a--b"
guile> (string-join '("a" "b") "--" 'suffix)
"a--b--"
guile> (string-join '("a" "b") "--" 'prefix)
"--a--b"

WJ

Feb 17, 2011, 7:50:56 AM

cartercc wrote:

Using Guile, and assuming that no field contains a comma:

guile> (string-tokenize s (char-set-complement (->char-set ",\"")))
("Mr." "John" "J." "Jones" "Jr.")

WJ

May 3, 2011, 12:50:00 PM

cartercc wrote:

Arc:

arc> (tokens "foo,bar,baz" #\,)
("foo" "bar" "baz")