[Sbcl-devel] on an unsettling Lisp versus Python 'competition' or why isn't Lisp doing better here...?


Gary King

Nov 21, 2007, 12:33:53 PM
to sbcl-...@lists.sourceforge.net
I needed to process a text file of about 220-MB in size where each
line is of the form:

ID~Attribute~lots of text...

I want to read in the file, split the lines at the #\~ and then make
hash-tables containing the data. If I use the straight-forward Lisp
approach (code is included below), LispWorks runs repeatedly in about
12-seconds; SBCL runs in about 26-seconds but cannot run a second
time because it enters the LDB during GC. Python (code below), on the
other hand, runs in 8-seconds and I can run it over and over again.

Note that I can easily rewrite the code so that string-consing isn't
such a bear (this also increases speed). It is also possible to muck
around with memory settings, tune the code, tune the GC, etc.

What is hard is to understand:

* why Python runs so well "out of the box" where Lisp runs so poorly;

* how to explain to people who want to get work done and want to use
Lisp that they have to write more convoluted code;

* how to explain why Lisp lets itself get into a place where it
cannot recover memory. All of the memory allocated during the call to
parse-text is garbage; it can all be freed. But both SBCL and Allegro
are unable to free the memory without running out of memory in
the process.

In other words, it's easy to fix this particular problem but it
leaves a bad taste in the mouth and doesn't explain why vanilla-Lisp
is such a poor performer. Surely, other people process great big gobs
of text. What are they doing right that the code below does wrong?

The data file can be found at
http://www.metabang.com/resources/fake-data.txt.gz.

Any comments or clarity that the SBCL folks can shed would be super
appreciated...

thanks,


Lisp code:

(defun parse-text (filename)
  (with-open-file (in filename
                      :direction :input
                      :if-does-not-exist :error)
    (let ((ht (make-hash-table :test 'equal)))
      (loop for line = (read-line in nil)
            while line do
            (let ((fields (split-sequence #\~ line)))
              (when (= (length fields) 3)
                (let ((id (string-trim " " (first fields)))
                      (attribute (string-trim " " (second fields)))
                      (value (string-trim " " (third fields))))
                  (when (not (gethash id ht))
                    (setf (gethash id ht) (make-hash-table :test 'equal)))
                  (let ((fields-ht (gethash id ht)))
                    (setf (gethash attribute fields-ht) value))))))
      (print (hash-table-count ht))
      nil)))

Python code:

import string, time

def parsetext (filename):
    file = open(filename)
    dict = {}
    while 1:
        line = file.readline()
        if line == "":
            break
        splt = string.split(line, "~")
        if len(splt) != 3:
            print line
        else:
            (id, attribute, value) = splt
            id = string.strip(id)
            attribute = string.strip(attribute)
            value = string.strip(value)
            if not dict.has_key(id):
                dict[id] = {}
            field_dict = dict[id]
            field_dict[attribute] = value
    file.close()
    None


--
Gary Warren King, metabang.com
Cell: (413) 559 8738
Fax: (206) 338-4052
gwkkwg on Skype * garethsan on AIM

_______________________________________________
Sbcl-devel mailing list
Sbcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sbcl-devel

Zach Beane

Nov 21, 2007, 12:50:00 PM
to sbcl-...@lists.sourceforge.net
On Wed, Nov 21, 2007 at 12:33:53PM -0500, Gary King wrote:
> I needed to process a text file of about 220-MB in size where each
> line is of the form:
>
> ID~Attribute~lots of text...
>
> I want to read in the file, split the lines at the #\~ and then make
> hash-tables containing the data.

Is that the end goal? If not, what's the actual end goal? What you
give here looks like a description of a particular solution to an
unstated goal, but maybe there's a better Lisp way to achieve the goal
without doing something that looks exactly like this.

Zach

Peter Eddy

Nov 21, 2007, 1:27:04 PM
to Zach Beane, sbcl-...@lists.sourceforge.net
On Nov 21, 2007 12:50 PM, Zach Beane <xa...@xach.com> wrote:
> On Wed, Nov 21, 2007 at 12:33:53PM -0500, Gary King wrote:
> > I needed to process a text file of about 220-MB in size where each
> > line is of the form:
> >
> > ID~Attribute~lots of text...
> >
> > I want to read in the file, split the lines at the #\~ and then make
> > hash-tables containing the data.
>
> Is that the end goal? If not, what's the actual end goal? What you
> give here looks like a description of a particular solution to an
> unstated goal, but maybe there's a better Lisp way to achieve the goal
> without doing something that looks exactly like this.

<newbie de-lurking> Hi!

Does it really matter if it's the end goal or not? Shouldn't one be
able to do this in lisp?

I ran into similar issues trying to process large-ish time series data
in the past, and I switched to Python for the same reason, though I
would really have rather not had to.

- Peter

Zach Beane

Nov 21, 2007, 1:37:27 PM
to sbcl-...@lists.sourceforge.net
On Wed, Nov 21, 2007 at 01:27:04PM -0500, Peter Eddy wrote:

> <newbie de-lurking> Hi!
>
> Does it really matter if it's the end goal or not? Shouldn't one be
> able to do this in lisp?
>
> I ran into similar issues trying to process large-ish time series data
> in the past, and I switched to Python for the same reason, though I
> would really have rather not had to.

The difference is between "this task can't be done in Lisp!" and "the
first technique I thought of doesn't work well in Lisp!" I think it
matters, and there are ways to think of solutions that map to a Lisp
implementation's strengths instead of its weaknesses.

Zach

Nikodemus Siivola

Nov 21, 2007, 1:48:41 PM
to Gary King, sbcl-...@lists.sourceforge.net
On Nov 21, 2007 5:33 PM, Gary King <gwk...@metabang.com> wrote:

> I needed to process a text file of about 220-MB in size where each
> line is of the form:
>
> ID~Attribute~lots of text...
>
> I want to read in the file, split the lines at the #\~ and then make
> hash-tables containing the data. If I use the straight-forward Lisp
> approach (code is included below), LispWorks runs repeatedly in about
> 12-seconds; SBCL runs in about 26-seconds but cannot run a second
> time because it enters the LDB during GC. Python (code below), on the
> other hand, runs in 8-seconds and I can run it over and over again.

Caveat: I have not thought about this properly yet -- these are just my
first thoughts.

* What is the encoding actually used? You are using the default encoding of
your locale below.

* How does Python represent strings? The strings you are using below are
full unicode strings. If the file is actually ASCII, you're wasting a
lot of space -- using base-strings should do better.

* You're allocating a lot of hash-tables. If Python uses lighter-weight objects
there, then the difference can be just that.

* Which version of SBCL is this? There was a period around 1.0.7 where
GETHASH and (SETF GETHASH) consed -- which has since then been fixed.
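[Editorial sketch of the base-string suggestion above. All names here are illustrative and not from the thread, and whether READ-LINE actually returns base-strings for such a stream is exactly what gets questioned later in this thread:]

```lisp
;; Open the file declaring BASE-CHAR as the element type. If the
;; implementation honors this, each line can (in principle) be stored
;; one octet per character instead of four.
(defun count-lines (filename)          ; illustrative name
  (with-open-file (in filename
                      :direction :input
                      :element-type 'base-char
                      :external-format :ascii)
    (loop for line = (read-line in nil)
          while line
          count line)))
```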

Cheers,

-- Nikodemus

Tobias C. Rittweiler

Nov 21, 2007, 2:01:06 PM
to sbcl-...@lists.sourceforge.net
Gary King <gwk...@metabang.com> writes:

> I needed to process a text file of about 220-MB in size where each
> line is of the form:
>
> ID~Attribute~lots of text...
>
> I want to read in the file, split the lines at the #\~ and then make
> hash-tables containing the data. If I use the straight-forward Lisp
> approach (code is included below), LispWorks runs repeatedly in about
> 12-seconds; SBCL runs in about 26-seconds but cannot run a second
> time because it enters the LDB during GC. Python (code below), on the
> other hand, runs in 8-seconds and I can run it over and over again.

Does Python's readline() handle Unicode by default? If not, and the file
consists only of ASCII characters, I'd imagine a tremendous speedup if
you pass :ELEMENT-TYPE 'BASE-CHAR, declare the lines read to be of type
BASE-STRING, and perhaps also compile for SPEED.

-T.

Sidney Markowitz

Nov 21, 2007, 3:04:13 PM
to sbcl-...@lists.sourceforge.net
Gary King wrote, On 22/11/07 6:33 AM:

> I needed to process a text file of about 220-MB in size where each
[...]

> approach (code is included below), LispWorks runs repeatedly in about
> 12-seconds; SBCL runs in about 26-seconds but cannot run a second
> time because it enters the LDB during GC. Python (code below), on the
> other hand, runs in 8-seconds and I can run it over and over again.

So your "out-of-box" experience with LispWorks using the first approach
you think of is 12 seconds instead of 8 seconds. That does not seem bad
in comparing two languages. If the purpose is to decide which language
is more suitable for a large text processing application, that is close
enough to decide that either might be suitable, but doesn't tell you how
performance will compare in a complete performance-tweaked finished
application.

SBCL shows it did not handle the test case. But as other people pointed
out you are using unicode strings by default. If you were running on an
x86, the default heap size compiled into sbcl is 512MB. A 220MB
input file whose contents, converted to unicode, all go into a hash
table will just about fill up the available heap space, and I can
imagine that you did fill it up. How does the performance compare if you
use only half the input file? It might be nice if sbcl degraded better
in that situation, but it is not a Lisp vs Python language issue.

I don't have an x86-64 handy. I wonder how that would fare on this test,
as I think the default heap size in that version of sbcl is much bigger.

-- Sidney Markowitz
http://www.sidney.com

Ingvar

Nov 21, 2007, 3:42:51 PM
to Peter Eddy, sbcl-...@lists.sourceforge.net, ing...@hexapodia.net
> On Nov 21, 2007 12:50 PM, Zach Beane <xa...@xach.com> wrote:
> > On Wed, Nov 21, 2007 at 12:33:53PM -0500, Gary King wrote:
> > > I needed to process a text file of about 220-MB in size where each
> > > line is of the form:
> > >
> > > ID~Attribute~lots of text...
> > >
> > > I want to read in the file, split the lines at the #\~ and then make
> > > hash-tables containing the data.
> >
> > Is that the end goal? If not, what's the actual end goal? What you
> > give here looks like a description of a particular solution to an
> > unstated goal, but maybe there's a better Lisp way to achieve the goal
> > without doing something that looks exactly like this.
>
> <newbie de-lurking> Hi!
>
> Does it really matter if it's the end goal or not? Shouldn't one be
> able to do this in lisp?

Well, if there is another way of doing it, that circumvents the "running out
of memory" problem, it may be applicable to other problems.

The only half-ready code I have that mangles any serious amount of data
inhales just 900-odd KB of assorted inter-related specs and turns it
into data structures, so I suppose any data from that is skewed by being
"not very large".

//Ingvar

Gary King

Nov 21, 2007, 3:48:21 PM
to Zach Beane, sbcl-...@lists.sourceforge.net, David J. Neu, Ar...@sc8-sf-spam2.sourceforge.net
> Is that the end goal? If not, what's the actual end goal?

The end goal is to load a whole bunch of string data into memory
where it can be easily mucked about (e.g., frequency counts, machine
learning algorithms, etc...).

> unicode or ASCII

This is surely part of the problem and probably a bit more caveat
emptor on my part would have been a good thing.

The nub of this issue here is not Python versus Lisp. I would expect
any language / platform combination to eventually say "no more" and
run out of memory if I try to feed it too much data. There are two
things I find bothersome.

1. (Common) Lisp doesn't include any convenient destructive string
operations or even a buffered read-line approach (Franz has one now
but it's not standard). Having these around would, I think, be a good
thing.

2. Both SBCL and ACL got into a position where their garbage
collectors could not get unstuck even though I knew that everything I
had consed was garbage. Maybe it's just wishful thinking on my part,
but if I know that the stuff I've created is garbage, I want Lisp to
figure that out and let me keep going.

> SBCL version / bigger heap, etc.

I'm not sure of the version but I think it was recent...

I'm sure a larger heap would help.

Thanks for the comments so far! Please keep them coming. <smile>


--
Gary Warren King, metabang.com
Cell: (413) 559 8738
Fax: (206) 338-4052
gwkkwg on Skype * garethsan on AIM


Sidney Markowitz

Nov 21, 2007, 4:21:44 PM
to sbcl-...@lists.sourceforge.net
Gary King wrote, On 22/11/07 9:48 AM:

> 2. Both SBCL and ACL got into a position where their garbage
> collectors could not get unstuck even though I knew that everything I
> had consed was garbage. Maybe it's just wishful thinking on my part,
> but if I know that the stuff I've created is garbage, I want Lisp to
> figure that out and let me keep going.

I got an interesting result trying it on about the first half of the
test data file.

After calling parse-text once (room) showed about 512MB of heap space
used. (gc :full t) got rid of that.

It does seem that something is not right that the garbage doesn't go
away and sbcl dies when you don't have the explicit gc between
evaluations of the parse-text function.

-- Sidney Markowitz
http://www.sidney.com


Nathan Froyd

Nov 21, 2007, 4:22:21 PM
to Gary King, sbcl-...@lists.sourceforge.net
On Nov 21, 2007 3:48 PM, Gary King <gwk...@metabang.com> wrote:
> > unicode or ASCII
>
> This is surely part of the problem and probably a bit more caveat
> emptor on my part would have been a good thing.

On the other hand, SBCL is not as efficient as it could be even in
cases when you do as suggested earlier (asking for BASE-CHAR streams
and single-byte encodings)--although following that advice would
definitely help. So there is improvement to be made here.

> 1. (Common) Lisp doesn't include any convenient destructive string
> operations or even a buffered read-line approach (Franz has one now
> but it's not standard). Having these around would, I think, be a good
> thing.

The Python code you posted earlier doesn't use destructive string
operations--indeed, it can't, due to the nature of Python strings.

One technique suggested a year or two ago--by Ingvar, I think, and
probably by people before him as well--is to load large chunks of data
into memory at one go and use the omnipresent START and END keyword
arguments to sequence functions to muck around in those chunks. This
might not help much in your case, since you'd be pulling individual
strings out of those large chunks anyway, but it would help cut down
on consing a little bit. Being smarter about pulling bits out of
lines would help too. (I guess this is roughly the buffered read-line
approach you advocate above...)
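[Editorial sketch of the technique described above — scan for delimiters with POSITION and only materialize the substrings that are actually kept. Illustrative code, not from the thread; it handles only the three-field case:]

```lisp
;; Find the two tilde positions directly, allocating fresh strings
;; only for the pieces we keep. PARSE-LINE is an illustrative name.
(defun parse-line (line)
  (let* ((first-tilde (position #\~ line))
         (second-tilde (and first-tilde
                            (position #\~ line :start (1+ first-tilde)))))
    (when second-tilde
      (values (string-trim " " (subseq line 0 first-tilde))
              (string-trim " " (subseq line (1+ first-tilde) second-tilde))
              (string-trim " " (subseq line (1+ second-tilde)))))))

;; (parse-line "id1 ~ color ~ blue") => "id1", "color", "blue"
```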

-Nathan

Robert J. Macomber

Nov 21, 2007, 4:47:58 PM
to sbcl-...@lists.sourceforge.net
On Wed, Nov 21, 2007 at 04:22:21PM -0500, Nathan Froyd wrote:
> One technique suggested a year or two ago--by Ingvar, I think, and
> probably by people before him as well--is to load large chunks of data
> into memory at one go and use the omnipresent START and END keyword
> arguments to sequence functions to muck around in those chunks. This
> might not help much in your case, since you'd be pulling individual
> strings out of those large chunks anyway, but it would help cut down
> on consing a little bit. Being smarter about pulling bits out of
> lines would help too. (I guess this is roughly the buffered read-line
> approach you advocate above...)

If you don't absolutely need simple-strings, you can get
low-overhead substrings from a big one by using make-array with
:displaced-to. This is (IME) frequently easier than passing around
START and END everywhere.
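[Editorial sketch of the displaced-array idea above; the helper name is illustrative, not from the thread:]

```lisp
;; A "substring" that shares storage with STRING instead of copying.
;; Note the result is a STRING but not a SIMPLE-STRING, so SCHAR and
;; SIMPLE-STRING declarations won't apply to it.
(defun displaced-substring (string start end)
  (make-array (- end start)
              :element-type (array-element-type string)
              :displaced-to string
              :displaced-index-offset start))

;; (displaced-substring "ID~Attribute~text" 3 12) => "Attribute"
```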
--
Robert Macomber
sf-s...@rojoma.com

Peter Eddy

Nov 21, 2007, 5:10:02 PM
to Zach Beane, sbcl-...@lists.sourceforge.net
On Nov 21, 2007 1:37 PM, Zach Beane <xa...@xach.com> wrote:

> The difference is between "this task can't be done in Lisp!" and "the
> first technique I thought of doesn't work well in Lisp!" I think it
> matters, and there are ways to think of solutions that map to a Lisp
> implementation's strengths instead of its weaknesses.

Sorry, I should have been clearer. The unsettling part of this for me
is that the test could only be run once. Not the speed or memory
consumption.

Marco Antoniotti

Nov 21, 2007, 4:52:38 PM
to sbcl-devel
Hi

one of the problems you have is that SPLIT-SEQUENCE is a
"generic" operation, whereas Python's split operation works directly on
strings. POSITION is also a "generic" operation; however, I think
beating POSITION is hard.

The following code is faster than SPLIT-SEQUENCE (YMMV: this is on LW),
and it is especially faster when you stop after COUNT splits, which,
from what you say, is your case.

==================================
(deftype seq-index ()
  `(integer 0 ,array-dimension-limit)) ; Check this!


(defun f-string-split (str &optional (delim #\,)
                           (count array-dimension-limit))
  (declare (type string str)
           (type character delim)
           (type seq-index count))
  (loop for n of-type seq-index from 1
        for keep-going of-type boolean = (<= n count)
        for start of-type seq-index = 0 then (1+ dpos)
        for dpos of-type (or null seq-index) = (and keep-going
                                                    (position delim str
                                                              :test #'char=
                                                              :start start))
        collect (subseq str start (and keep-going dpos))
        while (and keep-going dpos)))
===================================

as per the memory problems and the Unicode issues, I cannot say much.

Cheers

--
Marco

--
Marco Antoniotti, Associate Professor
DISCo, Università Milano Bicocca U14 2043
Viale Sarca 336
I-20126 Milan (MI) ITALY

Please note that I am not checking my Spam-box anymore.

Juho Snellman

Nov 21, 2007, 5:39:29 PM
to Peter Eddy, sbcl-...@lists.sourceforge.net
"Peter Eddy" <peter...@gmail.com> writes:

> On Nov 21, 2007 1:37 PM, Zach Beane <xa...@xach.com> wrote:
>
> > The difference is between "this task can't be done in Lisp!" and "the
> > first technique I thought of doesn't work well in Lisp!" I think it
> > matters, and there are ways to think of solutions that map to a Lisp
> > implementation's strengths instead of its weaknesses.
>
> Sorry, I should have been clearer. The unsettling part of this for me
> is that the test could only be run once. Not the speed or memory
> consumption.

It is not a surprise, but a consequence of how the mostly copying
generational gc works. The ways of tuning the GC to avoid this have
been discussed on the list before.

As for the slowness issue, since everybody seems to be making wild
guesses about the reasons without actually doing any profiling, I'll
throw in one more. The average line length of the input data is larger
than the fd-stream input buffer. Thus we always take the slow path
through READ-LINE. Based on some old notes it looks like the slow path
is 5x slower than the fast one, which would probably make READ-LINE
the main bottleneck.

The easiest way to verify this theory would probably be to bump the
size of +ANSI-STREAM-IN-BUFFER-LENGTH+ to something like 8k, and
recompile SBCL. If that helps, it should not be hard to rewrite
READ-LINE to get rid of the slow path completely.

--
Juho Snellman

Richard M Kreuter

Nov 21, 2007, 5:59:26 PM
to Gary King, sbcl-...@lists.sourceforge.net, David J. Neu, Ar...@sc8-sf-spam2.sourceforge.net
Gary King writes:

> > unicode or ASCII
>
> This is surely part of the problem and probably a bit more caveat
> emptor on my part would have been a good thing.

No, it's the only problem, at least as far as Lisp running out of memory
goes.

> The nub of this issue here is not Python versus Lisp. I would expect
> any language / platform combination to eventually say "no more" and
> run out of memory if I try to feed it too much data. There are two
> things I find bothersome.
>
> 1. (Common) Lisp doesn't include any convenient destructive string
> operations or even a buffered read-line approach (Franz has one now
> but it's not standard). Having these around would, I think, be a good
> thing.

That won't help your current program at all, for two reasons: first, you
must allocate distinct strings for your hash keys and values, or else
you'd be destructively modifying hash keys and values, which has
undefined consequences. Second, you're saving the last attribute/value
pair for each id/attribute pair, and so in the limit case (such as the
data file you've posted), where each id/attribute pair is unique, you're
holding onto everything but the inter-field whitespace and delimiters.

Notice that you're doing exactly the same thing in Python:

$ cat test.py
#!/usr/bin/python

import sys
import string, time
import os

def parsetext (filename):
    file = open(filename)
    dict = {}
    while 1:
        line = file.readline()
        if line == "":
            break
        splt = string.split(line, "~")
        if len(splt) != 3:
            print line
        else:
            (id, attribute, value) = splt
            id = string.strip(id)
            attribute = string.strip(attribute)
            value = string.strip(value)
            if not dict.has_key(id):
                dict[id] = {}
            field_dict = dict[id]
            field_dict[attribute] = value
    file.close()
    None

parsetext(sys.argv[1])
# How are we doing?
os.system('ps u '+str(os.getpid()))

$ ./test.py fake-data.txt
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
kreuter 4818 48.5 18.9 200484 196316 pts/7 S+ 16:37 0:11 /usr/bin/python
^^^^^^ ^^^^^^
in KiB in KiB

The only significant difference between SBCL and Python here is that
SBCL's CHARACTERs occupy 4 octets each, and so by storing the entire
file's text in memory at once, you run out of memory. So you'll run out
of heap just the same in Python for files within a small scalar multiple
of the size of your sample (probably around 5x).
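[Back-of-the-envelope check of the claim above, under the assumptions stated here: 4 octets per SBCL CHARACTER, 1 octet per character in a Python byte string, ignoring per-object overhead:]

```lisp
;; 220 MB of text as 4-octet characters is ~880 MB -- already past a
;; 512 MB heap -- while Python stores the same text in ~220 MB.
(let ((file-mb 220))
  (values (* file-mb 4)    ; SBCL:   880
          (* file-mb 1)))  ; Python: 220
```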

> 2. Both SBCL and ACL got into a position where their garbage
> collectors could not get unstuck even though I knew that everything I
> had consed was garbage. Maybe it's just wishful thinking on my part,
> but if I know that the stuff I've created is garbage, I want Lisp to
> figure that out and let me keep going.

It's simply not true that any of the data in your program is garbage
(unreachable from the root set; sometimes called "syntactic garbage")
until after the function returns. The runtime can't possibly know
whether you will replace any value or remove any assocation in any of
the hash tables you're accumulating until after the routine returns. So
you really can't expect a runtime to do anything here.

(You might expect the compiler to be able to figure out that it doesn't
need to construct the hash tables inside the loop, but that's quite a
leap of inference.)

--
RmK

jwie...@gmail.com

Nov 21, 2007, 10:46:19 PM
to jo...@newartisans.com
On Nov 21, 2007, at 1:33 PM, Gary King wrote:

> * why Python runs so well "out of the box" where Lisp runs so poorly;

> Gary Warren King, metabang.com

I gave a try at implementing another solution to your problem, this
one optimized for the specific circumstance you described. On my
machine, the memory usage of the SBCL process never exceeded 34 Mb RSS
during the entire run. It still doesn't go as fast as the Python
solution, but as others mentioned, this may be due to Unicode support
in the SBCL case.

Anyway, here are the reference times on my system for Python:

Hermes:/Users/johnw/dl $ time python scanner.py fake-data.txt
python scanner.py fake-data.txt 5.29s user 0.85s system 99% cpu
6.155 total

And here is the data for SBCL:

Evaluation took:
9.949 seconds of real time
8.908618 seconds of user run time
0.843395 seconds of system run time
[Run times include 0.63 seconds GC run time.]
0 calls to %EVAL
0 page faults and
3,164,974,352 bytes consed.

The solution is below. As you can see, thanks to using SERIES the
code is very straightforward. The `declare' statement is the only
tricky part, needed to give sufficient hints to the SBCL optimizer (you
get a warning that it ought to be added, so this was not black magic).

John

(declaim (optimize (safety 0) (debug 0) (speed 3) (space 0)))

(defpackage :scanner
  (:use :common-lisp :series))

(in-package :scanner)

(series::install)

(defun parse-text (filename)
  (print
   (hash-table-count
    (multiple-value-bind (keys values)
        (map-fn
         '(values string string)
         #'(lambda (line)
             (declare (type (simple-array character (*)) line))
             (let ((first-tilde (position #\~ line))
                   (last-tilde (1+ (position #\~ line :from-end t))))
               (values (subseq line (1+ first-tilde) last-tilde)
                       (subseq line (1+ last-tilde)))))
         (scan-file filename #'read-line))
      (collect-hash keys values :test #'equal)))))

Richard M Kreuter

Nov 22, 2007, 12:19:35 AM
to Nikodemus Siivola, sbcl-...@lists.sourceforge.net
"Nikodemus Siivola" writes:

> * How does Python represent strings? The strings you are using below are
> full unicode strings. If the file is actually ASCII, you're wasting
> a lot of space -- using base-strings should do better.

Is there a convenient way to get base-strings from a stream? It looks
as though specifying an element-type of BASE-CHAR to OPEN gets a stream
whose element-type is CHARACTER, and that ANSI-STREAM-READ-LINE has no
provision for constructing base-strings to return. Is this the intended
behavior?

If not, the hack below seems sufficient to get READ-LINE to return
base-strings on streams opened with (OPEN ... :ELEMENT-TYPE 'BASE-CHAR).
(Of course, if the file contains extended-chars, errors will be
signalled.) Is this worth putting in after this month's release?

--
RmK

Index: src/code/fd-stream.lisp
===================================================================
RCS file: /cvsroot/sbcl/sbcl/src/code/fd-stream.lisp,v
retrieving revision 1.120
diff -u -r1.120 fd-stream.lisp
--- src/code/fd-stream.lisp 11 Oct 2007 13:13:08 -0000 1.120
+++ src/code/fd-stream.lisp 22 Nov 2007 05:06:45 -0000
@@ -1077,7 +1077,7 @@
(when (member external-format (first entry))
(return-from pick-input-routine
(values (symbol-function (third entry))
- 'character
+ type
1
(symbol-function (second entry))
(first (first entry)))))))
@@ -2333,7 +2333,7 @@
(defun open (filename
&key
(direction :input)
- (element-type 'base-char)
+ (element-type 'character)
(if-exists nil if-exists-given)
(if-does-not-exist nil if-does-not-exist-given)
(external-format :default)
Index: src/code/stream.lisp
===================================================================
RCS file: /cvsroot/sbcl/sbcl/src/code/stream.lisp,v
retrieving revision 1.94
diff -u -r1.94 stream.lisp
--- src/code/stream.lisp 10 Sep 2007 13:31:45 -0000 1.94
+++ src/code/stream.lisp 22 Nov 2007 05:06:45 -0000
@@ -253,14 +253,15 @@
:start %frc-index%)))
(when pos
(let* ((len (- pos %frc-index%))
- (res (make-string len)))
+ (res (make-string len :element-type (ansi-stream-element-type stream))))
(replace res %frc-buffer% :start2 %frc-index% :end2 pos)
(setf %frc-index% (1+ pos))
(done-with-fast-read-char)
(return-from ansi-stream-read-line res))))))
- (let ((res (make-string 80))
- (len 80)
- (index 0))
+ (let* ((element-type (ansi-stream-element-type stream))
+ (res (make-string 80 :element-type element-type))
+ (len 80)
+ (index 0))
(loop
(let ((ch (fast-read-char nil nil)))
(cond (ch
@@ -269,7 +270,7 @@
(return (values (%shrink-vector res index) nil)))
(when (= index len)
(setq len (* len 2))
- (let ((new (make-string len)))
+ (let ((new (make-string len :element-type element-type)))
(replace new res)
(setq res new)))
(setf (schar res index) ch)

Paul Khuong

Nov 22, 2007, 2:31:16 AM
to Richard M Kreuter, sbcl-...@lists.sourceforge.net
On 11/22/07, Richard M Kreuter <kre...@progn.net> wrote:
> "Nikodemus Siivola" writes:
>
> > * How does Python represent strings? The strings you are using below are
> > full unicode strings. If the file is actually ASCII, you're wasting
> > a lot of space -- using base-strings should do better.
>
> Is there a convenient way to get base-strings from a stream? It looks
> as though specifying an element-type of BASE-CHAR to OPEN gets a stream
> whose element-type is CHARACTER, and that ANSI-STREAM-READ-LINE has no
> provision for constructing base-strings to return. Is this the intended
> behavior?
>
> If not, the hack below seems sufficient to get READ-LINE to return
> base-strings on streams opened with (OPEN ... :ELEMENT-TYPE 'BASE-CHAR).
> (Of course, if the file contains extended-chars, errors will be
> signalled.) Is this worth putting in after this month's release?

This seems like a bad thing: strings returned by read-line can be
side-effected. Having read-line sometimes return normal strings (in
which any character can be stored) and sometimes base-strings (which
have restrictions on the element characters) depending on the input
stream's element-type, which is quite a dynamic property, would make
type propagation and writing correct code harder. If it is actually
useful, a completely separate function would be a better idea.

Paul Khuong


Ingvar

Nov 22, 2007, 2:45:02 AM
to sbcl-...@lists.sourceforge.net
Paul Khuong writes:
> On 11/22/07, Richard M Kreuter <kre...@progn.net> wrote:
> > "Nikodemus Siivola" writes:
> >
> > > * How does Python represent strings? The strings you are using below are
> > > full unicode strings. If the file is actually ASCII, you're wasting
> > > a lot of space -- using base-strings should do better.
> >
> > Is there a convenient way to get base-strings from a stream? It looks
> > as though specifying an element-type of BASE-CHAR to OPEN gets a stream
> > whose element-type is CHARACTER, and that ANSI-STREAM-READ-LINE has no
> > provision for constructing base-strings to return. Is this the intended
> > behavior?
> >
> > If not, the hack below seems sufficient to get READ-LINE to return
> > base-strings on streams opened with (OPEN ... :ELEMENT-TYPE 'BASE-CHAR).
> > (Of course, if the file contains extended-chars, errors will be
> > signalled.) Is this worth putting in after this month's release?
>
> This seems like a bad thing: strings returned by read-line can be
> side-effected. Having read-line sometimes return normal strings (in
> which any character can be stored) and sometimes base-strings (which
> have restrictions on the element characters) depending on the input
> stream's element-type, which is quite a dynamic property, would make
> type propagation and writing correct code harder. If it is actually
> useful, a completely separate function would be a better idea.

To be honest, I thought READ-LINE did return base-strings when reading from a
file with element-type BASE-CHAR, and strings when reading from files filled
with unrestricted characters. I don't have any code whose correctness relies
on this, but I do have code that uses more storage than necessary that way (of
course, I only specify element-types where I know I need them).

I think this is one of those cases where one can argue about what is expected,
what is correct, and what is best for the implementor without reaching any
firm conclusion.

//Ingvar

Christophe Rhodes

Nov 22, 2007, 2:55:43 AM
to Richard M Kreuter, sbcl-...@lists.sourceforge.net
Richard M Kreuter <kre...@progn.net> writes:

> If not, the hack below seems sufficient to get READ-LINE to return
> base-strings on streams opened with (OPEN ... :ELEMENT-TYPE 'BASE-CHAR).
> (Of course, if the file contains extended-chars, errors will be
> signalled.) Is this worth putting in after this month's release?

I suspect that this will be slow....

> - (res (make-string len)))
> + (res (make-string len :element-type (ansi-stream-element-type stream))))
> (replace res %frc-buffer% :start2 %frc-index% :end2 pos)

... because this will now cause the full slow path to REPLACE to be
taken, I think even after Juho's new transforms. Juho, instead of
(simple-base-string simple-character-string)
(simple-character-string simple-base-string)
would it be sensible to have
(simple-string simple-string)
and have some string-dispatch calls inside the expansion?

Cheers,

Christophe

Bruno Daniel

unread,
Nov 22, 2007, 4:21:51 AM11/22/07
to Juho Snellman, sbcl-...@lists.sourceforge.net
Juho Snellman writes:
> As for the slowness issue, since everybody seems to be making wild
> guesses about the reasons without actually doing any profiling, I'll
> throw in one more.

Here's the result SBCL's statistical profiler shows on my 64 bit machine
(The statistics are reproducible up to fluctuations of about 1%):

(sb-sprof:with-profiling (:max-samples 10000 :report :flat :loop nil)
(parse-text "/localdata/daniel/fake-data.txt"))

Profiler sample vector full (2280 traces / 100000 samples), doubling the size

86735

Number of samples: 2999
Sample interval: 0.01 seconds
Total sampling time: 29.99 seconds
Number of cycles: 0

Self Total Cumul
Nr Count % Count % Count % Calls Function
------------------------------------------------------------------------
1 540 18.0 540 18.0 540 18.0 - (FLET #:CLEANUP-FUN-[CALL-WITHOUT-INTERRUPTS]24)
2 335 11.2 1424 47.5 875 29.2 - SB-IMPL::VECTOR-SUBSEQ*
3 329 11.0 501 16.7 1204 40.1 - READ-LINE
4 276 9.2 656 21.9 1480 49.3 - SB-KERNEL:%FIND-POSITION
5 189 6.3 189 6.3 1669 55.7 - SB-KERNEL:HAIRY-DATA-VECTOR-SET
6 186 6.2 186 6.2 1855 61.9 - SB-IMPL::OPTIMIZED-DATA-VECTOR-SET
7 176 5.9 176 5.9 2031 67.7 - SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS
8 166 5.5 166 5.5 2197 73.3 - SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
9 150 5.0 150 5.0 2347 78.3 - EQL
10 106 3.5 137 4.6 2453 81.8 - SB-IMPL::FD-STREAM-READ-N-CHARACTERS/LATIN-1
11 93 3.1 93 3.1 2546 84.9 - SB-KERNEL:HAIRY-DATA-VECTOR-REF
12 87 2.9 87 2.9 2633 87.8 - SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
13 83 2.8 97 3.2 2716 90.6 - MAKE-ARRAY
14 55 1.8 55 1.8 2771 92.4 - SB-VM::GENERIC-+
15 52 1.7 52 1.7 2823 94.1 - IDENTITY
16 27 0.9 27 0.9 2850 95.0 - SB-KERNEL:%SHRINK-VECTOR
17 21 0.7 1427 47.6 2871 95.7 - SPLIT-SEQUENCE:SPLIT-SEQUENCE
18 12 0.4 21 0.7 2883 96.1 - SB-IMPL::GETHASH3
19 12 0.4 36 1.2 2895 96.5 - FIND
20 9 0.3 768 25.6 2904 96.8 - STRING-TRIM
21 8 0.3 656 21.9 2912 97.1 - POSITION
22 8 0.3 9 0.3 2920 97.4 - SUBSEQ
23 7 0.2 17 0.6 2927 97.6 - MAKE-HASH-TABLE
24 7 0.2 7 0.2 2934 97.8 - SB-KERNEL:UB32-BASH-COPY
25 7 0.2 7 0.2 2941 98.1 - TRUNCATE
26 6 0.2 6 0.2 2947 98.3 - SB-KERNEL:SEQUENCEP
27 6 0.2 7 0.2 2953 98.5 - CHAR=
28 4 0.1 2999 100.0 2957 98.6 - PARSE-TEXT
29 4 0.1 11 0.4 2961 98.7 - CEILING
30 4 0.1 9 0.3 2965 98.9 - SB-IMPL::%MAKE-HASH-TABLE
31 4 0.1 4 0.1 2969 99.0 - ARRAY-ELEMENT-TYPE
32 3 0.1 6 0.2 2972 99.1 - SB-IMPL::EQUAL-HASH
33 3 0.1 6 0.2 2975 99.2 - SB-KERNEL:%PUTHASH
34 3 0.1 3 0.1 2978 99.3 - NCONC
35 3 0.1 3 0.1 2981 99.4 - SB-IMPL::%VECTOR-WIDETAG-AND-N-BITS
36 2 0.1 2 0.1 2983 99.5 - SB-KERNEL:%SP-STRING-COMPARE
37 1 0.03 138 4.6 2984 99.5 - SB-INT:FAST-READ-CHAR-REFILL
38 1 0.03 2 0.1 2985 99.5 - SB-KERNEL:%SXHASH-SIMPLE-STRING
39 1 0.03 1 0.03 2986 99.6 - SB-IMPL::SYSREAD-MAY-BLOCK-P
40 1 0.03 1 0.03 2987 99.6 - (FLET SB-IMPL::TRICK)
41 1 0.03 3 0.1 2988 99.6 - SB-KERNEL:STRING=*
42 1 0.03 5 0.2 2989 99.7 - EQUAL
43 1 0.03 1 0.03 2990 99.7 - (LABELS SB-IMPL::EQUAL-AUX)
44 1 0.03 1 0.03 2991 99.7 - SB-KERNEL:%COERCE-CALLABLE-TO-FUN
45 1 0.03 1 0.03 2992 99.8 - (LABELS SB-IMPL::SXHASH-RECURSE)
46 1 0.03 1 0.03 2993 99.8 - LENGTH
47 1 0.03 1 0.03 2994 99.8 - SB-IMPL::CEIL-POWER-OF-TWO
48 0 0.0 2999 100.0 2994 99.8 - "Unknown component: #x1002A77190"
49 0 0.0 2999 100.0 2994 99.8 - EXECUTE-TEST
50 0 0.0 2999 100.0 2994 99.8 - TRY-IMPL
51 0 0.0 2999 100.0 2994 99.8 - SB-INT:SIMPLE-EVAL-IN-LEXENV
52 0 0.0 2999 100.0 2994 99.8 - "Unknown component: #x1004D06A60"
53 0 0.0 2999 100.0 2994 99.8 - (SB-PCL::FAST-METHOD SWANK-BACKEND:CALL-WITH-SYNTAX-HOOKS (T))
54 0 0.0 2999 100.0 2994 99.8 - SWANK::CALL-WITH-BUFFER-SYNTAX
55 0 0.0 2999 100.0 2994 99.8 - "Unknown component: #x10051954D0"
56 0 0.0 2999 100.0 2994 99.8 - (SB-PCL::FAST-METHOD SWANK-BACKEND:CALL-WITH-DEBUGGER-HOOK (T T))
57 0 0.0 2999 100.0 2994 99.8 - "Unknown component: #x1004E5D4A0"
58 0 0.0 2999 100.0 2994 99.8 - SWANK::CALL-WITH-REDIRECTED-IO
59 0 0.0 2999 100.0 2994 99.8 - SWANK::CALL-WITH-CONNECTION
60 0 0.0 2999 100.0 2994 99.8 - SWANK::HANDLE-REQUEST
61 0 0.0 2999 100.0 2994 99.8 - SWANK::PROCESS-AVAILABLE-INPUT
62 0 0.0 2999 100.0 2994 99.8 - (FLET SWANK::HANDLER)
63 0 0.0 2999 100.0 2994 99.8 - (LAMBDA (SWANK-BACKEND::_))
64 0 0.0 2999 100.0 2994 99.8 - SB-IMPL::SUB-SERVE-EVENT
65 0 0.0 2999 100.0 2994 99.8 - SB-SYS:WAIT-UNTIL-FD-USABLE
66 0 0.0 2999 100.0 2994 99.8 - SB-IMPL::REFILL-INPUT-BUFFER
67 0 0.0 2999 100.0 2994 99.8 - SB-IMPL::INPUT-CHAR/LATIN-1
68 0 0.0 2999 100.0 2994 99.8 - READ-CHAR
69 0 0.0 2999 100.0 2994 99.8 - READ-PRESERVING-WHITESPACE
70 0 0.0 2999 100.0 2994 99.8 - READ
71 0 0.0 2999 100.0 2994 99.8 - SB-IMPL::REPL-READ-FORM-FUN
72 0 0.0 2999 100.0 2994 99.8 - SB-IMPL::REPL-FUN
73 0 0.0 2999 100.0 2994 99.8 - "Unknown component: #x1000081E20"
74 0 0.0 2999 100.0 2994 99.8 - (LAMBDA NIL)
75 0 0.0 2999 100.0 2994 99.8 - SB-IMPL::%WITH-REBOUND-IO-SYNTAX
76 0 0.0 2999 100.0 2994 99.8 - SB-IMPL::TOPLEVEL-REPL
77 0 0.0 2999 100.0 2994 99.8 - SB-IMPL::TOPLEVEL-INIT
78 0 0.0 2999 100.0 2994 99.8 - (LABELS SB-IMPL::RESTART-LISP)
79 0 0.0 540 18.0 2994 99.8 - SB-UNIX::CALL-WITHOUT-INTERRUPTS
80 0 0.0 511 17.0 2994 99.8 - SB-KERNEL:SUB-GC
81 0 0.0 26 0.9 2994 99.8 - SB-SYS:INVOKE-INTERRUPTION
82 0 0.0 26 0.9 2994 99.8 - (FLET SB-UNIX::RUN-HANDLER)
83 0 0.0 2 0.1 2994 99.8 - SB-THREAD::CALL-WITH-RECURSIVE-SYSTEM-SPINLOCK
84 0 0.0 1 0.03 2994 99.8 - SB-THREAD::CALL-WITH-SYSTEM-MUTEX
85 0 0.0 1 0.03 2994 99.8 - SB-KERNEL:RUN-PENDING-FINALIZERS
------------------------------------------------------------------------
5 0.2 elsewhere

Best regards
Bruno Daniel

Stefan Lang

unread,
Nov 22, 2007, 6:01:34 AM11/22/07
to sbcl-...@lists.sourceforge.net
On Donnerstag, 22. November 2007, Bruno Daniel wrote:
> Juho Snellman writes:
> > As for the slowness issue, since everybody seems to be making
> > wild guesses about the reasons without actually doing any
> > profiling, I'll throw in one more.
>
> Here's the result SBCL's statistical profiler shows on my 64 bit
> machine (The statistics are reproducible up to fluctuations of
> about 1%):

Since SUBSEQ shows up very high on the list:
I've improved one or two entries on the Great Computer
Language Shootout (they are gone, alioth lost a few days'
data around that time) and along the way found that some
sequence functions, including SUBSEQ, can be improved by an
order of magnitude or more for simple array types.
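
As an illustration of the general point (this is not Stefan's patch): when the argument types are declared at the call site, SBCL can open-code SUBSEQ for the specific array type instead of going through the fully generic sequence function:

```lisp
;; Declaring the concrete string type lets the compiler pick a
;; type-specific copy loop for SUBSEQ rather than dispatching at
;; runtime on the sequence's type.
(defun field-between (line start end)
  (declare (type simple-base-string line)
           (type fixnum start end)
           (optimize speed))
  (subseq line start end))
```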

In my experience this is the main reason why direct
ports from Ruby/Python code perform badly under SBCL.

Stefan

Attila Lendvai

unread,
Nov 22, 2007, 9:19:20 AM11/22/07
to sbcl-...@lists.sourceforge.net
hi,

i've optimized it a bit; i think it's on par with the python speed
if not faster now. two things are attached to the mail: one is the
optimized split-sequence and the other is the diff to sbcl HEAD. this
latter diff contains the previously posted base-char changes to
read-line and an optimization to string-trim. (the big change in
stream.lisp is only the make-result-string macrolet and the (declare
(type index len index)) type declaration).

i was working with a smaller version of the file to keep it in the
disk cache and avoid swapping. it went down from 9.927 to 2.597.

the final form of the defun:

(defun parse-text (&optional (filename "/home/ati/fake-data.txt"))
  (declare (optimize speed (debug 0))
           (inline split-sequence:split-sequence string-trim))
  (with-open-file (in filename
                      :element-type 'base-char
                      :external-format :ascii
                      :direction :input
                      :if-does-not-exist :error)
    (let ((ht (make-hash-table :test 'equal)))
      (loop for line of-type simple-base-string = (read-line in nil)
            while line
            do (let ((fields (split-sequence:split-sequence #\~ line)))
                 (when (= (length fields) 3)
                   (let ((id (string-trim " " (the simple-base-string
                                                   (first fields))))
                         (attribute (string-trim " " (the simple-base-string
                                                          (second fields))))
                         (value (string-trim " " (the simple-base-string
                                                      (third fields)))))
                     (unless (gethash id ht)
                       (setf (gethash id ht) (make-hash-table :test 'equal)))
                     (let ((fields-ht (gethash id ht)))
                       (setf (gethash attribute fields-ht) value))))))
      ;; (print (hash-table-count ht))
      (values))))


i've recorded the interesting steps:

CL-USER> (progn
(sb-ext:gc :full t)
(time (parse-text)))

*** unoptimized:

Evaluation took:
9.927 seconds of real time
9.056566 seconds of user run time
0.856054 seconds of system run time
[Run times include 2.352 seconds GC run time.]


0 calls to %EVAL
0 page faults and

1,633,159,856 bytes consed.


*** after applying the patch for read-line to return base-string's:

difference is below measuring error (but it's important for the
split-sequence inlining)


*** after adding (declare (optimize speed (debug 0)))

difference is below measuring error


*** after optimizing and inlining split-sequence (it even had apply
#'position calls!)

Evaluation took:
4.916 seconds of real time
4.504282 seconds of user run time
0.408026 seconds of system run time
[Run times include 1.044 seconds GC run time.]


0 calls to %EVAL
0 page faults and

796,540,592 bytes consed.


*** after adding optimization to string-trim that avoids calling
subseq if there's nothing to be trimmed (note: this is explicitly
allowed by the spec):

Evaluation took:
3.372 seconds of real time
2.940184 seconds of user run time
0.428027 seconds of system run time
[Run times include 1.048 seconds GC run time.]


0 calls to %EVAL
0 page faults and

728,696,000 bytes consed.
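
the string-trim optimization above can be sketched as follows (my own reconstruction, not the actual patch; the spec allows STRING-TRIM to return the argument itself when nothing needs trimming):

```lisp
;; Returns the argument string unchanged (no consing) when there is
;; nothing to trim; otherwise falls back to SUBSEQ.
(defun trim-spaces (string)
  (declare (type simple-string string) (optimize speed))
  (let* ((len (length string))
         (start (or (position #\Space string :test-not #'eql) len))
         (end (if (= start len)
                  len
                  (1+ (position #\Space string :test-not #'eql
                                :from-end t)))))
    (if (and (zerop start) (= end len))
        string                          ; nothing to trim: no fresh copy
        (subseq string start end))))
```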


*** after inlining string-trim:

time difference is below measuring error, but consing went down a little.

Evaluation took:
3.28 seconds of real time
2.864179 seconds of user run time
0.396024 seconds of system run time
[Run times include 1.024 seconds GC run time.]


0 calls to %EVAL
0 page faults and

718,985,664 bytes consed.


*** annotating (the simple-base-string ...) for the args of the
string-trim calls:

Evaluation took:
3.107 seconds of real time
2.740172 seconds of user run time
0.360022 seconds of system run time
[Run times include 1.048 seconds GC run time.]


0 calls to %EVAL
0 page faults and

718,925,760 bytes consed.

*** at this point the profiling looks like this (sorry, it's
unreadable without fixed font, but plain text mails are preferred):

Rank Name Self% Cumul% Total%
1 READ-LINE 39.34 53.08 39.34
Callers
PARSE-TEXT 53.08
Calls
SB-INT:FAST-READ-CHAR-REFILL 9.00
SB-KERNEL:%SHRINK-VECTOR 3.32
SB-KERNEL:UB32-BASH-COPY 1.42
2 (FLET #:CLEANUP-FUN-[CALL-WITHOUT-INTERRUPTS]24) 24.64 24.64 63.98
3 SB-IMPL::FD-STREAM-READ-N-CHARACTERS/ASCII 7.58 9.00 71.56
Callers
SB-INT:FAST-READ-CHAR-REFILL 9.00
Calls
SB-IMPL::REFILL-INPUT-BUFFER 1.42
4 PARSE-TEXT 6.16 93.84 77.73
Callers
NIL 93.84
Calls
READ-LINE 53.08
SB-IMPL::GETHASH3 5.21
SB-VM::GENERIC-+ 4.74
MAKE-HASH-TABLE 3.32
SB-KERNEL:UB8-BASH-COPY 2.84
SB-KERNEL:%PUTHASH 0.95
5 SB-VM::GENERIC-+ 4.74 4.74 82.46
6 SB-KERNEL:%SHRINK-VECTOR 3.32 3.32 85.78
7 SB-KERNEL:UB8-BASH-COPY 2.84 2.84 88.63
8 SB-KERNEL:%SP-STRING-COMPARE 2.37 2.37 91.00
9 MAKE-HASH-TABLE 2.37 3.32 93.36
10 SB-KERNEL:STRING=* 1.42 3.79 94.79
11 SB-KERNEL:UB32-BASH-COPY 1.42 1.42 96.21
12 (FLET #:BODY-FUN-[GETHASH3]1076) 0.95 5.21 97.16
13 (FLET SB-IMPL::TRICK) 0.47 0.47 97.63
14 SB-IMPL::REFILL-INPUT-BUFFER 0.47 1.42 98.10
15 SB-IMPL::CEIL-POWER-OF-TWO 0.47 0.47 98.58
16 (FLET #:CLEANUP-FUN-[%PUTHASH]1409) 0.47 0.47 99.05
17 SB-KERNEL:%PUTHASH 0.47 0.95 99.53
18 SB-IMPL::%MAKE-HASH-TABLE 0.47 0.47 100.00

*** after raising +ansi-stream-in-buffer-length+ to 4096 and recompiling sbcl

note that consing fell drastically but due to the problem with
REPLACE that Christophe mentioned it got slower. but hopefully Juho has
something to say here.

Evaluation took:
3.59 seconds of real time
3.272205 seconds of user run time
0.316019 seconds of system run time
[Run times include 0.528 seconds GC run time.]


0 calls to %EVAL
0 page faults and

260,260,240 bytes consed.

Rank Name Self% Cumul% Total%
1 REPLACE 16.42 49.25 16.42
Callers
READ-LINE 49.25
REPLACE 0.37
Calls
SB-IMPL::OPTIMIZED-DATA-VECTOR-SET 10.07
SB-KERNEL:HAIRY-DATA-VECTOR-SET 9.70
SB-KERNEL:HAIRY-DATA-VECTOR-REF 7.84
SB-IMPL::OPTIMIZED-DATA-VECTOR-REF 2.61
SB-IMPL::OPTIMIZED-DATA-VECTOR-REF 2.24
LENGTH 0.37
REPLACE 0.37
2 SB-IMPL::FD-STREAM-READ-N-CHARACTERS/ASCII 11.57 16.42 27.99
Callers
SB-INT:FAST-READ-CHAR-REFILL 16.42
Calls
SB-IMPL::REFILL-INPUT-BUFFER 4.85
3 (FLET #:CLEANUP-FUN-[CALL-WITHOUT-INTERRUPTS]24) 10.82 10.82 38.81
4 READ-LINE 10.07 76.12 48.88
Callers
PARSE-TEXT 76.12
Calls
REPLACE 49.25
SB-INT:FAST-READ-CHAR-REFILL 16.42
SB-KERNEL:%SHRINK-VECTOR 0.37
5 SB-IMPL::OPTIMIZED-DATA-VECTOR-SET 10.07 10.07 58.96
Callers
REPLACE 10.07
6 SB-KERNEL:HAIRY-DATA-VECTOR-SET 9.70 9.70 68.66
7 PARSE-TEXT 7.84 95.90 76.49
8 SB-KERNEL:HAIRY-DATA-VECTOR-REF 7.84 7.84 84.33
9 SB-IMPL::OPTIMIZED-DATA-VECTOR-REF 2.61 2.61 86.94
10 SB-IMPL::OPTIMIZED-DATA-VECTOR-REF 2.24 2.24 89.18
11 SB-KERNEL:UB8-BASH-COPY 1.87 1.87 91.04
12 SB-KERNEL:STRING=* 1.49 2.24 92.54
13 SB-KERNEL:%PUTHASH 0.75 1.12 93.28
14 SB-KERNEL:%SP-STRING-COMPARE 0.75 0.75 94.03


*** after getting rid of the REPLACE problem with an ugly hack in
ansi-stream-read-line that helps propagating the type to REPLACE:

Evaluation took:
2.597 seconds of real time
2.356147 seconds of user run time
0.240015 seconds of system run time
[Run times include 0.532 seconds GC run time.]


0 calls to %EVAL
0 page faults and

260,262,768 bytes consed.

Rank Name Self% Cumul% Total%
1 PARSE-TEXT 16.40 96.83 16.40
2 SB-IMPL::FD-STREAM-READ-N-CHARACTERS/ASCII 14.81 18.52 31.22
Callers
SB-INT:FAST-READ-CHAR-REFILL 18.52
Calls
SB-IMPL::REFILL-INPUT-BUFFER 3.70
3 READ-LINE 13.76 61.90 44.97
Callers
PARSE-TEXT 61.90
Calls
REPLACE 29.63
SB-INT:FAST-READ-CHAR-REFILL 18.52
4 (FLET #:CLEANUP-FUN-[CALL-WITHOUT-INTERRUPTS]24) 13.76 13.76 58.73
Callers
SB-UNIX::CALL-WITHOUT-INTERRUPTS 13.76
5 REPLACE 8.47 29.63 67.20
Callers
READ-LINE 29.63
REPLACE 0.53
Calls
SB-KERNEL:HAIRY-DATA-VECTOR-REF 7.41
SB-KERNEL:HAIRY-DATA-VECTOR-SET 5.82
SB-IMPL::OPTIMIZED-DATA-VECTOR-SET 3.70
SB-IMPL::OPTIMIZED-DATA-VECTOR-REF 3.70
REPLACE 0.53
LENGTH 0.53
6 SB-KERNEL:HAIRY-DATA-VECTOR-REF 7.41 7.41 74.60
7 SB-KERNEL:HAIRY-DATA-VECTOR-SET 5.82 5.82 80.42
8 SB-KERNEL:UB8-BASH-COPY 4.23 4.23 84.66
9 SB-IMPL::OPTIMIZED-DATA-VECTOR-REF 3.70 3.70 88.36
10 SB-IMPL::OPTIMIZED-DATA-VECTOR-SET 3.70 3.70 92.06
11 (FLET #:BODY-FUN-[%PUTHASH]1355) 1.59 2.12 93.65
12 SB-KERNEL:%SP-STRING-COMPARE 1.06 1.06 94.71
13 (FLET #:BODY-FUN-[GETHASH3]1076) 1.06 2.65 95.77
14 LENGTH 0.53 0.53 96.30
15 SB-KERNEL::%MAKE-INSTANCE-WITH-LAYOUT 0.53 0.53 96.83
16 SB-IMPL::REFILL-INPUT-BUFFER 0.53 3.70 97.35
17 SB-KERNEL:%UNARY-TRUNCATE 0.53 0.53 97.88
18 SB-THREAD::MAKE-SPINLOCK 0.53 0.53 98.41
19 (FLET SB-IMPL::TRICK) 0.53 0.53 98.94
20 SB-KERNEL:%SXHASH-SIMPLE-STRING 0.53 1.06 99.47


*** and an interesting last piece: using (ppcre:split "~" line
:sharedp t), adding a (declare (type simple-base-string
target-string)) to split and avoiding apply #'subseq, it's hardly
slower:

Evaluation took:
3.028 seconds of real time
2.772173 seconds of user run time
0.256016 seconds of system run time
[Run times include 0.556 seconds GC run time.]


0 calls to %EVAL
0 page faults and

269,862,768 bytes consed.

--
attila

split-sequence.lisp
sbcl.diff

Nathan Froyd

unread,
Nov 22, 2007, 5:34:06 PM11/22/07
to Stefan Lang, sbcl-...@lists.sourceforge.net
On Nov 22, 2007 6:01 AM, Stefan Lang <langs...@gmx.at> wrote:
> Since SUBSEQ shows up very high on the list:
> I've improved one or two entries on the Great Computer
> Language Shootout (they are gone, alioth lost a few days
> data around that time) and on the way encountered that some
> sequence functions, including SUBSEQ, can be improved by an
> order of magnitude or more for simple array types.
>
> In my experience this is the main reason why direct
> ports from Ruby/Python code perform badly under SBCL.

For better or for worse, SBCL's philosophy is that the general
sequence functions are, well, general--with all the type dispatching
and such that implies. If you want speed, then you need to declare
types at the call site. This is in contrast to many "scripting"
languages, where the library functions are going to be pretty speedy.

Thinking out loud: would it be worthwhile to specialize things like
SUBSEQ--particularly on simple arrays--in the generic library code,
similar to the way Juho optimized array access? Doing this sort of
optimization for REPLACE/MISMATCH/etc. would probably require too much
space and handling all the keyword arguments in functions like FIND or
POSITION might be tricky, but the benefits to scripting-language-esque
code might be worth it.

Failing that, we might try being a little more careful in optimizing
things like string functions--I know that the core function for string
comparison can be sped up quite a bit by doing type dispatch on the
type of strings prior to actually doing the comparison to eliminate
jumps in the inner loops, for example.
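
The "dispatch once, then run a tight loop" idea can be sketched roughly like this (purely illustrative; SBCL's internal string-dispatch machinery is more elaborate):

```lisp
;; Dispatch on the concrete string types once, outside the character
;; loop, so each branch compiles to a tight type-specific loop with no
;; per-character type tests.
(defun same-string-p (a b)
  (declare (type simple-string a b) (optimize speed))
  (macrolet ((walk (type-a type-b)
               `(locally (declare (type ,type-a a) (type ,type-b b))
                  (let ((n (length a)))
                    (and (= n (length b))
                         (loop for i below n
                               always (char= (schar a i) (schar b i))))))))
    (if (typep a 'simple-base-string)
        (if (typep b 'simple-base-string)
            (walk simple-base-string simple-base-string)
            (walk simple-base-string simple-string))
        (if (typep b 'simple-base-string)
            (walk simple-string simple-base-string)
            (walk simple-string simple-string)))))
```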

-Nathan

Nikodemus Siivola

unread,
Nov 23, 2007, 7:08:39 AM11/23/07
to Nathan Froyd, sbcl-...@lists.sourceforge.net
On Nov 22, 2007 10:34 PM, Nathan Froyd <fro...@gmail.com> wrote:

> Thinking out loud: would it be worthwhile to specialize things like
> SUBSEQ--particularly on simple arrays--in the generic library code,
> similar to the way Juho optimized array access? Doing this sort of
> optimization for REPLACE/MISMATCH/etc. would probably require too much
> space and handling all the keyword arguments in functions like FIND or
> POSITION might be tricky, but the benefits to scripting-language-esque
> code might be worth it.

If we convert first to %REPLACE &co with required arguments only, then
the keyword handling doesn't take any space...

Cheers,

-- Nikodemus

Gary King

unread,
Nov 25, 2007, 3:33:17 PM11/25/07
to Attila Lendvai, sbcl-...@lists.sourceforge.net
Hi Attila,

Thanks for all this work. Interesting notes.

 attila<split-sequence.lisp><sbcl.diff>-------------------------------------------------------------------------

David J. Neu

unread,
Nov 30, 2007, 6:03:35 PM11/30/07
to Gary King, sbcl-...@lists.sourceforge.net, Attila Lendvai
Hi Attila,

Thanks for all of your work!

We're very interested in improving the performance of the program that
Gary posted, and have tried some comparisons with and without your
changes, and using Python.

I used SBCL 1.0.12.7 from the git repo, and the fake-data.tgz file
that Gary provided.

Here's what I found:

1. Using the code as Gary provided:

Evaluation took:
22.975 seconds of real time
20.319805 seconds of user run time
2.187422 seconds of system run time
[Run times include 7.712 seconds GC run time.]


0 calls to %EVAL
0 page faults and

4,192,713,792 bytes consed.


2. Applying your sbcl.diff, using your split-sequence, and using your
version of parse-text, I got:

Evaluation took:
12.328 seconds of real time
11.457298 seconds of user run time
0.800819 seconds of system run time
[Run times include 2.122 seconds GC run time.]


0 calls to %EVAL
0 page faults and

911,974,696 bytes consed

3. Python: 4.188586 seconds


I'm wondering if you or anyone else had any thoughts on the following
questions:

a. While this is a /dramatic/ improvement:
- It's still much slower than Python, and you mentioned that your
results were comparable to Python. Any thoughts why? Did you get
the same results on Gary's fake-data.tgz?

- The reduction in consing in the example, was dramatic, but didn't
compare to your results. Again, any thoughts why? Did you get
the same results on Gary's fake-data.tgz?

b. If I run (sb-ext:gc :full t) after each run, I can repeatedly call
PARSE-TEXT with no issues, if I don't make the gc call, I get dropped
into the debugger, is this to be expected?

c. Juho had mentioned that tuning the garbage collector was discussed
on this list, would this address point b., and could someone point me
to the appropriate post?

I found this post
http://sourceforge.net/mailarchive/message.php?msg_id=l37iskysld.fsf%40kyle.netcamp.se
but am not sure how to directly apply the discussion.

d. You wrote:
> >*** and an interesting last piece: using (ppcre:split "~" line
> >:sharedp t), adding a (declare (type simple-base-string
> >target-string)) to split and avoiding apply #'subseq, it's hardly
> >slower:

To clarify, did you put
(declare (type simple-base-string target-string))
in api.lisp?

Could you explain how you: "avoiding apply #'subseq"?

Again, many thanks for your work on this, and any answers you can
provide!

Cheers,
David


Attila Lendvai

unread,
Dec 1, 2007, 10:09:36 AM12/1/07
to dj...@acm.org, sbcl-...@lists.sourceforge.net
> Hi Attila,
>
> Thanks for all of your work!


i'm glad it helps. i was just pasting the results into a file and at
the end i've thought it may be interesting to others.


> I'm wondering if you or anyone else had any thoughts on the following
> questions:
>
> a. While this is a /dramatic/ improvement:
> - It's still much slower than Python, and you mentioned that your
> results were comparable to Python. Any thoughts why? Did you get
> the same results on Gary's fake-data.tgz?


i was using the fake-data file, but chopped off the end at around 40
MB, iirc. i didn't want swapping and disk reading to influence the
measurements.

at the end it was about 1/4th of the initial time, so that 23 secs for
you should have gone down to 5.75. it was on linux x86_64.

but please note that as i've mentioned, applying that
simple-base-string patch to sbcl disables optimizations for a REPLACE
call in ansi-stream-read-line. you can read that in the compiler notes
when C-cC-c'ing ansi-stream-read-line. it's relatively easy to rewrite
the code to reenable it by adding an ecase for the stream element type
and adding two distinct calls to REPLACE. that turned the 3x speedup
for me into a 4x speedup. but that's a kludge, so it's not included in
the patch i've sent. (see Christophe's comment on a possible proper
solution)


> - The reduction in consing in the example, was dramatic, but didn't
> compare to your results. Again, any thoughts why? Did you get
> the same results on Gary's fake-data.tgz?
>
> b. If I run (sb-ext:gc :full t) after each run, I can repeatedly call
> PARSE-TEXT with no issues, if I don't make the gc call, I get dropped
> into the debugger, is this to be expected?


that's a limitation of the current copying gc in sbcl, but i don't
know much more about the details.


> d. You wrote:
> > >*** and an interesting last piece: using (ppcre:split "~" line
> > >:sharedp t), adding a (declare (type simple-base-string
> > >target-string)) to split and avoiding apply #'subseq, it's hardly
> > >slower:
> To clarify, did you put
> (declare (type simple-base-string target-string))
> in api.lisp?


yes


> Could you explain how you: "avoiding apply #'subseq"?


i've replaced the (funcall substr-fn ...) with a simple (subseq ...) and
C-cC-c'd the function (as opposed to my bogus comment about #'apply i
wrote from memory). using funcall/apply hinders the application of
numerous compiler optimizations.
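
a tiny illustration of the difference (my own example, with SUBSEQ standing in for ppcre's substr-fn):

```lisp
;; Direct call: the compiler sees SUBSEQ on a simple-base-string and
;; can apply its transforms.
(defun direct-substr (s i j)
  (declare (type simple-base-string s) (optimize speed))
  (subseq s i j))

;; Indirect call through a function object: a full call at runtime, so
;; none of the SUBSEQ transforms apply.
(defun indirect-substr (s i j &optional (substr-fn #'subseq))
  (declare (type simple-base-string s) (optimize speed))
  (funcall substr-fn s i j))
```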

the key thing is using the profiler and checking the compiler notes
about failed optimizations when the (optimize (speed 3)) declaration
is in effect. i was using sb-sprof with Juho's brilliant
slime-profile-browser:
http://jsnell.iki.fi/blog/archive/2006-11-19-sb-sprof.html

hth,

--
attila

Juho Snellman

unread,
Dec 1, 2007, 5:58:52 PM12/1/07
to dj...@acm.org, sbcl-...@lists.sourceforge.net, Attila Lendvai
"David J. Neu" <davi...@gmail.com> writes:
> Hi Attila,
>
> Thanks for all of your work!
>
> We're very interested in improving the performance of the program that
> Gary posted, and have tried some comparisions with and without your
> changes, and using Python.

Ok. May I ask why you're interested in this? Reading text files and
manipulating strings are really the bread and butter features of
scripting languages, and will have been optimized accordingly. So it's
somewhat unreasonable to expect SBCL to be exactly as fast in the
generic case.

That said, I've attached a version of the benchmark program that runs
in 2.5s versus 4.3s for the Python version.

parse-text-mmap.lisp

Marco Antoniotti

unread,
Dec 3, 2007, 2:28:39 PM12/3/07
to sbcl-devel

> <parse-text-mmap.lisp>
> (You might object that the program is more complex than the original
> code. But if the factor of 2 difference in runtimes between the
> earlier Lisp solution and Python is an issue for you, surely a little
> extra complexity is a price worth paying for something that's a factor
> of 2 faster than the Python program.)

This is a very interesting result, but it raises the question: does
Python the language mmap the file internally when getting its lines?
If yes (I don't know), then your code is a perfect replacement.

Cheers

--
Marco Antoniotti

David J. Neu

unread,
Dec 3, 2007, 5:11:56 PM12/3/07
to Juho Snellman, sbcl-...@lists.sourceforge.net
Hi Juho,

Many thanks for taking the time to craft a solution - it runs
extremely fast - much faster than the Python script and dramatically
reduced the number of bytes consed:

Evaluation took:
1.937 seconds of real time
1.791011 seconds of user run time
0.119532 seconds of system run time
[Run times include 0.432 seconds GC run time.]


0 calls to %EVAL
0 page faults and

103,786,536 bytes consed.

However, on the third attempt to run it I get

System call error 12 (Cannot allocate memory)
[Condition of type SB-POSIX:SYSCALL-ERROR]

is there a way around this?

> Ok. May I ask why you're interested in this?

Sure, for two reasons:

1. We have a production app that processes large quantities of textual
and binary data. On the textual side, we're trying to tune some
learning algorithms, and want to work in CL.

2. Reading text files is a common task, and we're assuming that if
we're running into problems, it's likely someone is also, and this
could be a barrier to them adopting CL.

> Reading text files and manipulating strings are really the bread and
> butter features of scripting languages, and will have been optimized
> accordingly. So it's somewhat unreasonable to expect SBCL to be
> exactly as fast in the generic case.

OK, but:

LispWorks performs quite nicely compare to the Python program's
4.246053 seconds:

LispWorks against a text file w/out unicode using original code
as posted by Gary:
User time = 6.106
System time = 0.202
Elapsed time = 6.349
Allocation = 756144540 bytes
0 Page faults
Timing the evaluation of (PARSE-TEXT "fake-data.txt")

LispWorks against a text file w/out unicode using original code
as posted by Gary using cl-ppcre:split rather than split-sequence:
User time = 6.426
System time = 0.240
Elapsed time = 6.716
Allocation = 602020240 bytes
0 Page faults
Timing the evaluation of (PARSE-TEXT-USING-SPLIT "fake-data.txt")

SBCL 1.0.12.14, with --dynamic-space-size 1500, against a text file w/out unicode using original code
as posted by Gary:
TEXT-PARSER> (time (parse-text "fake-data.txt"))
Evaluation took:
19.346 seconds of real time
16.760124 seconds of user run time
2.497303 seconds of system run time
[Run times include 8.315 seconds GC run time.]


0 calls to %EVAL
0 page faults and

4,183,290,560 bytes consed.

SBCL 1.0.12.14 with --dynamic-space-size 1500, against a text file
w/out unicode using original code as posted by Gary using
cl-ppcre:split rather than split-sequence:
Evaluation took:
16.284 seconds of real time
13.648916 seconds of user run time
2.546111 seconds of system run time
[Run times include 8.323 seconds GC run time.]


0 calls to %EVAL
0 page faults and

3,356,522,032 bytes consed.

And then there's the gc issue - the original sbcl program can't be run
twice without dropping into the debugger during gc ... so thanks for the link about gc
characteristics and tuning:

> http://thread.gmane.org/gmane.lisp.steel-bank.devel/9879/focus=9883

unfortunately, after doing
(define-alien-variable gencgc-oldest-gen-to-gc char)
(setf gencgc-oldest-gen-to-gc 0)
at the top level, the program drops into the ldb before
completing even one run.

BTW, Attila's solution can be run repeatedly without memory issues, I
assume due to the "non-unicode" savings.

My question to the sbcl developers is: Is there a solution to reading
a plain (non-unicode) text file into a data structure, that can be
incorporated into sbcl, maybe something along the lines of what Attila
provided, that will result in speed and memory usage on par with
Python, or LispWorks, or in the best case Juho's program?

Again, many thanks for the help - we hope the solutions being offered
here are useful to others!

Cheers,
David

> ;;;; Utilities to mmap a file directly into an SBCL base string
>
> (defmacro with-mmaped-base-string ((string file) &body body)
> (let ((handle-var (gensym))
> (stream-var (gensym)))
> `(with-open-file (,stream-var ,file)
> (let ((,handle-var (mmap-as-base-string ,stream-var)))
> (unwind-protect
> (let ((,string (mmap-handle-string ,handle-var)))
> ,@body)
> (mmap-close ,handle-var))))))
>
> (defstruct mmap-handle
> (string (coerce "" 'base-string) :type simple-base-string)
> fd
> address
> length)
>
> (defun mmap-close (handle)
> (sb-posix:munmap (mmap-handle-address handle)
> (mmap-handle-length handle)))
>
> (defun mmap-as-base-string (stream)
>   (declare (optimize debug)
>            (notinline sb-posix::mmap))
>   (with-open-file (devnull "/dev/null")
>     (let* ((length (file-length stream))
>            (sap1 (sb-posix:mmap nil
>                                 (+ length 4096)
>                                 (logior sb-posix:prot-read sb-posix:prot-write)
>                                 sb-posix:map-private
>                                 (sb-impl::fd-stream-fd stream)
>                                 0))
>            (sap2 (sb-posix:mmap (sb-sys:sap+ sap1 4096)
>                                 length
>                                 sb-posix:prot-read
>                                 (logior sb-posix:map-private sb-posix:map-fixed)
>                                 (sb-impl::fd-stream-fd stream)
>                                 0))
>            (handle (make-mmap-handle :address sap1
>                                      :length length)))
>       ;; simple-base-string header word
>       (setf (sb-sys:sap-ref-word sap2 (- (* 2 sb-vm:n-word-bytes)))
>             sb-vm:simple-base-string-widetag)
>       ;; simple-base-string length word (as fixnum)
>       (setf (sb-sys:sap-ref-word sap2 (- (* 1 sb-vm:n-word-bytes)))
>             (ash length sb-vm:n-fixnum-tag-bits))
>       (setf (mmap-handle-string handle)
>             (sb-kernel:%make-lisp-obj
>              (logior sb-vm:other-pointer-lowtag
>                      (- (sb-sys:sap-int sap2) (* 2 sb-vm:n-word-bytes)))))
>       handle)))
>
> ;;;; Implement the benchmark
>
> (defun split-and-trim-sequence (delimiter string start-of-line end-of-line trim)
>   (declare (type simple-base-string string)
>            (type fixnum start-of-line end-of-line)
>            (optimize speed))
>   (loop for start = start-of-line then (1+ end)
>         for end = (position delimiter string :start start :end end-of-line)
>         for end-pos = (or end end-of-line)
>         do (loop while (< start end-pos)
>                  while (eql (aref string start) trim)
>                  do (incf start))
>         do (loop until (eql end-pos start)
>                  while (eql (aref string (1- end-pos)) trim)
>                  do (decf end-pos))
>         ;; Compute the length only after trimming, so the displaced
>         ;; array below doesn't cover the trimmed spaces.
>         collect (let ((length (- end-pos start)))
>                   (if (< length 64)
>                       ;; A displaced array takes around 64 bytes. For strings
>                       ;; shorter than 64 base-chars we might as well just
>                       ;; make a simple string.
>                       (subseq string start end-pos)
>                       (make-array length
>                                   :element-type 'base-char
>                                   :displaced-to string
>                                   :displaced-index-offset start)))
>         while (and end (< end end-of-line))))
>
> (defun parse-text (filename)
>   (declare (optimize speed))
>   (with-mmaped-base-string (string filename)
>     ;; Note that the contents of the hash-table won't be valid outside
>     ;; the dynamic scope of WITH-MMAPED-BASE-STRING, since we're
>     ;; displacing arrays to STRING.
>     (let ((ht (make-hash-table :test 'equal)))
>       (loop for start = 0 then (1+ end)
>             for end = (position #\Newline string :start start)
>             for end-pos = (or end (length string))
>             do (let ((fields (split-and-trim-sequence #\~ string start end-pos
>                                                       #\space)))
>                  (when (= (length (the list fields)) 3)
>                    (destructuring-bind (id attribute value) fields
>                      (unless (gethash id ht)
>                        (setf (gethash id ht) (make-hash-table :test 'equal)))
>                      (let ((fields-ht (gethash id ht)))
>                        (setf (gethash attribute fields-ht) value)))))
>             while end)
>       (print (hash-table-count ht))
>       nil)))

>
> (You might object that the program is more complex than the original
> code. But if the factor of 2 difference in runtimes between the
> earlier Lisp solution and Python is an issue for you, surely a little
> extra complexity is a price worth paying for something that's a factor
> of 2 faster than the Python program.)
>

> > b. If I run (sb-ext:gc :full t) after each run, I can repeatedly call
> > PARSE-TEXT with no issues, if I don't make the gc call, I get dropped
> > into the debugger, is this to be expected?
>

> Yes.


>
> > c. Juho had mentioned that tuning the garbage collector was discussed
> > on this list, would this address point b., and could someone point me
> > to the appropriate post?
> >
> > I found this post
> > http://sourceforge.net/mailarchive/message.php?msg_id=l37iskysld.fsf%40kyle.netcamp.se
> > but am not sure how to directly apply the discussion.
>

> That is a completely unrelated discussion. See for example:
>
> http://thread.gmane.org/gmane.lisp.steel-bank.devel/9879/focus=9883
>
> --
> Juho Snellman

John Wiegley

Dec 3, 2007, 10:17:58 PM12/3/07
to SBCL Devel-list
This might not be apropos, but I just came across this tech note from
Allegro, where they are doing a very similar thing with CL: using it to
parse strings from gigantic text files. The way they combated the
performance issue was to custom-build a non-consing version of
read-line. I wonder if such a thing would improve SBCL's performance in
this example?

http://www.franz.com/support/tech_corner/cons-tricks-121306.lhtml
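
The buffer-reuse idea can be sketched in a few lines of portable CL;
this is only an illustration of the technique, not the actual Franz
interface, and the function name is made up:

```lisp
;; Reuse a single adjustable buffer across lines instead of consing a
;; fresh string per READ-LINE call. FUNCTION sees the same buffer on
;; every call, so it must copy anything it wants to keep.
(defun map-lines-into-buffer (function stream)
  (let ((buffer (make-array 256 :element-type 'character
                                :adjustable t :fill-pointer 0)))
    (loop for char = (read-char stream nil nil)
          while char
          do (if (char= char #\Newline)
                 (progn
                   (funcall function buffer)
                   (setf (fill-pointer buffer) 0))
                 (vector-push-extend char buffer))
          ;; Hand over a final partial line with no trailing newline.
          finally (when (plusp (fill-pointer buffer))
                    (funcall function buffer)))))
```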

John

On Dec 3, 2007, at 6:11 PM, David J. Neu wrote:

> Many thanks for taking the time to craft a solution - it runs
> extremely fast - much faster than the Python script and dramatically
> reduced the number of bytes consed:
>
> Evaluation took:
> 1.937 seconds of real time
> 1.791011 seconds of user run time
> 0.119532 seconds of system run time
> [Run times include 0.432 seconds GC run time.]
> 0 calls to %EVAL
> 0 page faults and
> 103,786,536 bytes consed.


David J. Neu

Dec 4, 2007, 7:15:26 AM12/4/07
to John Wiegley, SBCL Devel-list
Yes, this would be very nice, and there have also been many suggestions
in this thread about "improving" read-line itself, such as:

- having its return type depend on its stream's element-type, so e.g.
that it would in this example return a base-string

- increasing +ANSI-STREAM-IN-BUFFER-LENGTH+
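
From the caller's side, the first suggestion would look something like
this (hypothetical behavior being proposed, not what SBCL does at this
point; the file name is the thread's test data):

```lisp
;; If READ-LINE honored the stream's element-type, opening the file
;; with :ELEMENT-TYPE 'BASE-CHAR would yield SIMPLE-BASE-STRING lines
;; (1 byte/char) instead of (SIMPLE-ARRAY CHARACTER) lines
;; (4 bytes/char on a Unicode build).
(with-open-file (in "fake-data.txt" :element-type 'base-char)
  (loop for line = (read-line in nil nil)
        while line
        count line))
```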

Cheers,
David

Juho Snellman

Dec 4, 2007, 11:15:40 AM12/4/07
to dj...@acm.org, sbcl-...@lists.sourceforge.net
"David J. Neu" <davi...@gmail.com> writes:
> However, on the third attempt to run it I get
>
> System call error 12 (Cannot allocate memory)
> [Condition of type SB-POSIX:SYSCALL-ERROR]
>
> is there a way around this?

Hmm... looks like an off-by-4096, the last mmaped page was never freed
-> address space becomes fragmented. I wonder why it didn't fail on
Linux. Maybe the following would help?

(defvar +page-size+ (sb-posix:getpagesize))

(defun mmap-as-base-string (stream)
  (declare (optimize debug)
           (notinline sb-posix::mmap))
  (with-open-file (devnull "/dev/null")
    (let* ((file-length (file-length stream))
           (map-length (+ file-length +page-size+))
           (sap1 (sb-posix:mmap nil
                                map-length
                                (logior sb-posix:prot-read sb-posix:prot-write)
                                sb-posix:map-private
                                (sb-impl::fd-stream-fd stream)
                                0))
           (sap2 (sb-posix:mmap (sb-sys:sap+ sap1 +page-size+)
                                file-length
                                sb-posix:prot-read
                                (logior sb-posix:map-private sb-posix:map-fixed)
                                (sb-impl::fd-stream-fd stream)
                                0))
           (handle (make-mmap-handle :address sap1
                                     :length map-length)))
      ;; simple-base-string header word
      (setf (sb-sys:sap-ref-word sap2 (- (* 2 sb-vm:n-word-bytes)))
            sb-vm:simple-base-string-widetag)
      ;; simple-base-string length word (as fixnum)
      (setf (sb-sys:sap-ref-word sap2 (- (* 1 sb-vm:n-word-bytes)))
            (ash file-length sb-vm:n-fixnum-tag-bits))
      (setf (mmap-handle-string handle)
            (sb-kernel:%make-lisp-obj
             (logior sb-vm:other-pointer-lowtag
                     (- (sb-sys:sap-int sap2) (* 2 sb-vm:n-word-bytes)))))
      handle)))

> > Ok. May I ask why you're interested in this?
> Sure, for two reasons:
>
> 1. We have a production app that processes large quantities of textual
> and binary data. As part of the textual side, we're trying to tune some
> learning algorithms, and want to work in CL.

It's not obvious to me that Gary's benchmark accurately models most
processes I've seen that process large quantities of text data. For
example, they usually don't hang on to all of the data for the duration
of the whole process.

> > Reading text files and manipulating strings are really the bread and
> > butter features of scripting languages, and will have been optimized
> > accordingly. So it's somewhat unreasonable to expect SBCL to be
> > exactly as fast in the generic case.
> OK, but:
>
> LispWorks performs quite nicely compared to the Python program's
> 4.246053 seconds:
>
> LispWorks against a text file w/out unicode using original code
> as posted by Gary:
> User time = 6.106

That's still basically crappy performance. Obviously not nearly as
crappy as SBCL, but it's still 2x slower than an idiomatically written
Python program would be.

> > http://thread.gmane.org/gmane.lisp.steel-bank.devel/9879/focus=9883
>
> unfortunately, after doing
> (define-alien-variable gencgc-oldest-gen-to-gc char)
> (setf gencgc-oldest-gen-to-gc 0)
> at the top level, the program drops into the ldb before
> completing even one run.

Sure. You're trying to stuff 230MB * 4 (Unicode) * 2 (copying GC) plus
a bit more (non-strings), roughly 1.9GB in total, into 1500MB, which can
reasonably be expected to fail.



> My question to the sbcl developers is: Is there a solution to reading
> a plain (non-unicode) text file into a data structure, that can be
> incorporated into sbcl, maybe something along the lines of what Attila
> provided

I'm going to be committing a nicer implementation of READ-LINE once I
get some machine with my ssl keys back on the net. This is basically
just fixing the issues that the current implementation has with long
lines (this reduces the amount of time for reading in the input file
from 5.5s to 3.5s).

I don't think doing any base-string specific handling in READ-LINE
would make much sense as long as the fd-stream character input buffer
is implemented as a (SIMPLE-ARRAY CHARACTER).

The simple solution would be to just wrap the mmap hack into a couple
of gray streams for base-char or (unsigned-byte 8) input. Should be
trivial, it's only one more data copy than that direct solution, but
it would hide all the ugliness. I would rather see this in a user
library though, since once fu-streams have been implemented (ha, ha),
those gray streams would basically be obsolete.

> that will result in speed and memory usage on par with Python

Unlikely to ever happen.

> or LispWorks,

Somewhat improbable in the near future on that exact program out of
the box. With slight rewriting and using a gray stream like above,
sure, right now.

> or in the best case Juho's program?

That program itself is obviously not going to go into SBCL :-) I'm not
wild about adding support for mmaping stuff as lisp arrays in SBCL
either, though it has been proposed before.

David J. Neu

Dec 4, 2007, 1:09:08 PM12/4/07
to Juho Snellman, sbcl-...@lists.sourceforge.net
Hi Juho,

On Tue, Dec 04, 2007 at 06:15:40PM +0200, Juho Snellman wrote:
> "David J. Neu" <davi...@gmail.com> writes:
> > However, on the third attempt to run it I get
> >
> > System call error 12 (Cannot allocate memory)
> > [Condition of type SB-POSIX:SYSCALL-ERROR]
> >
> > is there a way around this?
>
> Hmm... looks like an off-by-4096, the last mmaped page was never freed
> -> address space becomes fragmented. I wonder why it didn't fail on
> Linux. Maybe the following would help?
>

Yep, that fixed it - BTW, I'm running FreeBSD 6.2.

> It's not obvious to me that Gary's benchmark accurately models most
> processes I've seen that process large quantities of text
> data. Usually they for example don't hang on to all of the data for
> the duration of the whole process.

I can't say for sure, but it's sure fast when running experiments, or
for keeping a big lookup table in a production app.

> I'm going to be committing a nicer implementation of READ-LINE once I
> get some machine with my ssl keys back on the net. This is basically
> just fixing the issues that the current implementation has with long
> lines (this reduces the amount of time for reading in the input file
> from 5.5s to 3.5s).

Great - thanks!

> I don't think doing any base-string specific handling in READ-LINE
> would make much sense as long as the fd-stream character input buffer
> is implemented as a (SIMPLE-ARRAY CHARACTER).

Hmmm, does that imply that there's no way (using classic functions
like read-line) to pull in a plain old ASCII text file w/out ending up
with the unicode bloat issue?



> The simple solution would be to just wrap the mmap hack into a couple
> of gray streams for base-char or (unsigned-byte 8) input. Should be
> trivial, it's only one more data copy than that direct solution, but
> it would hide all the ugliness. I would rather see this in a user
> library though, since once fu-streams have been implemented (ha, ha),
> those gray streams would basically be obsolete.

Seems like a wonderfully powerful tool to have available.

> > or LispWorks,
>
> Somewhat improbable in the near future on that exact program out of
> the box. With slight rewriting and using a gray stream like above,
> sure, right now.

Can I ask why?



> > or in the best case Juho's program?
>
> That program itself is obviously not going to go into SBCL :-) I'm not
> wild about adding support for mmaping stuff as lisp arrays in SBCL
> either, though it has been proposed before.
>

Again, seems like a wonderfully powerful tool.

Bottom line: I understand that SBCL can't beat every language on every
benchmark, but on a task as common as reading a text file, it seems
that there'd be great benefit in:

- the mods to read-line that you suggested, as well as a way to get
plain vanilla ascii strings

- the mmaping stuff

- something like Franz has implemented
http://franz.com/support/tech_corner/cons-tricks-121306.lhtml

Many thanks for your help, we use SBCL here every day in production, and
truly value the improvements that are being made.

Cheers,
David

Juho Snellman

Dec 4, 2007, 8:25:07 PM12/4/07
to dj...@acm.org, sbcl-...@lists.sourceforge.net
"David J. Neu" <davi...@gmail.com> writes:
> > I don't think doing any base-string specific handling in READ-LINE
> > would make much sense as long as the fd-stream character input buffer
> > is implemented as a (SIMPLE-ARRAY CHARACTER).
>
> Hmmm, does that imply that there's no way (using classic functions
> like read-line) to pull in a plain old ASCII text file w/out ending up
> with the unicode bloat issue?

That would be the implication. The current stream implementation is
basically ugly as hell. While I'm all for improving it, I'd prefer
those improvements to reduce the number of kludges, not increase it.

I disagree with Christophe in that I don't think there's any
fundamental reason to not allow READ-LINE to return base-strings. It
seems quite reasonable to me for this to happen when a BASE-CHAR
element-type has been specified.



> > The simple solution would be to just wrap the mmap hack into a couple
> > of gray streams for base-char or (unsigned-byte 8) input. Should be
> > trivial, it's only one more data copy than that direct solution, but
> > it would hide all the ugliness. I would rather see this in a user
> > library though, since once fu-streams have been implemented (ha, ha),
> > those gray streams would basically be obsolete.
>
> Seems like a wonderfully powerful tool to have available.

To have available sure, but we already have two unsatisfying stream
implementations (fd-streams and sb-simple-streams) distributed with
SBCL. These new streams would essentially be no more satisfying than
the others (just a lot easier to implement, due to being more
specialized).

The one thing we really don't want to have in core SBCL is more
half-assed stream implementations, which we then have to support as an
exported interface for all eternity. What we want is a well-designed
and well-implemented replacement for the current ones, which would be
maintainable, perform well out of the box, and be easy to extend when
people have special needs.

(There's also the slight problem that some of the speed increase of
this hack comes from completely ignoring external formats. This would
be ok if base-strings were defined to contain iso-8859-1, but they're
actually defined to contain just ascii. Ok for a quick hack, less good
for something included with SBCL. There's always the option of
defining base-char to map to 8859-1 instead, but that's the discussion
we've had before. I haven't measured whether this is an effect that
actually matters.)

So like I said, it's not a bad idea as such. I'll happily write that
if somebody really wants it. I just don't want to maintain it.

> > > or LispWorks,
> >
> > Somewhat improbable in the near future on that exact program out of
> > the box. With slight rewriting and using a gray stream like above,
> > sure, right now.
>
> Can I ask why?

Because the program as written tickles too many weak spots in SBCL
(excessive copying of data in streams, a copying gc for which a
program that retains most of what it allocates is poison, the complete
lack of type declarations in the code). But ok, actually just slight
rewriting of the program would be sufficient to get that speed.

A slight rewriting of the benchmark program would fix some of those
issues (ok, looks like just that slight rewrite + the READ-LINE change
are enough to make it about as fast as Lispworks, the same slight
rewrite + a bare bones mmaped gray stream would be the same speed as
the suboptimally written python program).

But it is for example somewhat unlikely that the generic array
operations on arrays of undeclared types will suddenly become an order
of magnitude faster in the next few releases, and that would be
required for the program written exactly the way it is now written to
become that fast.

> Bottomline, I understand that SBCL can't beat every language on
> every benchmark, but on a task as common as reading a text file, it
> seems that there'd be great benefit in:
>
> - the mods to read-line that you suggested, as well as a way to get
> plain vanilla ascii strings

Fair enough.



> - something like Franz has implemented
> http://franz.com/support/tech_corner/cons-tricks-121306.lhtml

The general view is that the problems that simple-stream-read-line
solves are not interesting in the context of SBCL. As a first
approximation, allocating something that immediately becomes garbage is
free with the SBCL generational collector; only retaining something
has a cost. ACL uses a different GC, which must have different
performance properties.

As a rough benchmark, the aforementioned slight rewrite + gray streams
spends 2 seconds out of 4 doing GC. If the strings (after splitting)
are discarded instead of storing them into hash-tables, the time spent
in GC is 0.1s while the amount of consing is basically the same. Using
the Franz interface to avoid consing temporaries would save
approximately that 0.1s.

--
Juho Snellman

Brian Downing

Dec 4, 2007, 8:35:51 PM12/4/07
to Juho Snellman, sbcl-...@lists.sourceforge.net
On Wed, Dec 05, 2007 at 03:25:07AM +0200, Juho Snellman wrote:
> (There's also the slight problem that some of the speed increase of
> this hack comes from completely ignoring external formats. This would
> be ok if base-strings were defined to contain iso-8859-1, but they're
> actually defined to contain just ascii. Ok for a quick hack, less good
> for something included with SBCL. There's always the option of
> defining base-char to map to 8859-1 instead, but that's the discussion
> we've had before. I haven't measured whether this is an effect that
> actually matters.)

I would personally really like to see BASE-CHAR/STRING be able to hold
iso-8859-1. I've had several cases where I wanted to parse "plain ascii
with other random binary garbage I didn't care about", and doing that
in SBCL as it stands without taking the unicode hit is pretty painful.

If I could (read-sequence base-string stream :encoding 'iso-8859-1) and
have it just do a fast read/blit with no translation behind the scenes
that would be great. As it is I had to rewrite my parser to deal with
byte arrays instead to get the speed I required.

For my application, this was about a 10x speed increase.
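
The byte-array approach described above can be sketched as follows
(the function name is illustrative, not from the actual parser):

```lisp
;; Slurp a whole file into a (SIMPLE-ARRAY (UNSIGNED-BYTE 8)) with one
;; READ-SEQUENCE call, avoiding both per-element stream-accessor
;; overhead and the 4-bytes-per-character cost of a Unicode string.
(defun slurp-octets (pathname)
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (let ((buffer (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence buffer in)
      buffer)))
```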

I managed to hack BASE-CHAR to be able to hold 256 characters in a
Unicode-enabled SBCL, but the above read-sequence still took the slow
path.

-bcd

Faré

Dec 5, 2007, 10:50:56 AM12/5/07
to sbcl-...@lists.sourceforge.net
On 04/12/2007, Brian Downing <bd-sour...@lavos.net> wrote:
> On Wed, Dec 05, 2007 at 03:25:07AM +0200, Juho Snellman wrote:
> > (There's also the slight problem that some of the speed increase of
> > this hack comes from completely ignoring external formats. This would
> > be ok if base-strings were defined to contain iso-8859-1, but they're
> > actually defined to contain just ascii. Ok for a quick hack, less good
> > for something included with SBCL. There's always the option of
> > defining base-char to map to 8859-1 instead, but that's the discussion
> > we've had before. I haven't measured whether this is an effect that
> > actually matters.)
>
> I would personally really like to see BASE-CHAR/STRING be able to hold
> iso-8859-1. I've had several cases where I wanted to parse "plain ascii
> with other random binary garbage I didn't care about", and doing that
> in SBCL as it stands without taking the unicode hit is pretty painful.

Would a "SB-EXT:ASCII-CHAR == BASE-CHAR < SB-EXT:LATIN1-CHAR <
CHARACTER" hierarchy satisfy everyone? Would there be a reason to
reject a patch that did just that? (Now, finding someone to write it
is a different matter.)

[ François-René ÐVB Rideau | Reflection&Cybernethics | http://fare.tunes.org ]
You're currently going through a difficult transition period called "Life."

Juho Snellman

Dec 5, 2007, 11:19:06 AM12/5/07
to Faré, sbcl-...@lists.sourceforge.net
"Faré" <fah...@gmail.com> writes:
> Would a "SB-EXT:ASCII-CHAR == BASE-CHAR < SB-EXT:LATIN1-CHAR <
> CHARACTER" hierarchy satisfy everyone? Would there be a reason to
> reject a patch that did just that? (Now, to find someone to write it,
> is different).

I'd rather not add more simple-string types:

* Need to change a large number of places that make assumptions about
  the number of string types.
* Waste two more widetags.
* Make the type dispatch on (aref (the simple-string foo) bar) more
  expensive.

Just having base-char-code-limit be 256 would be preferable to me.

--
Juho Snellman

Richard M Kreuter

Dec 5, 2007, 11:59:47 AM12/5/07
to Juho Snellman, Faré, sbcl-...@lists.sourceforge.net
Juho Snellman writes:

> I'd rather not add more simple-string types:
>
> * Need to change a large number of places that make assumptions about
> the number of string types.
> * Waste two more widetags.
> * Make the type dispatch on (aref (the simple-string foo) bar) more
> expensive.
>
> Just having base-char-code-limit be 256 would be preferable to me.

Somebody correct me if I'm mistaken, but doesn't sb-unicode define
base-char-code-limit to be ASCII so that simple-base-strings don't need
encoding on the way out on systems where the common external formats are
upwardly compatible with ASCII's encoding?

--
RmK

James Y Knight

Dec 5, 2007, 1:03:22 PM12/5/07
to Brian Downing, sbcl-...@lists.sourceforge.net, Juho Snellman

On Dec 4, 2007, at 8:35 PM, Brian Downing wrote:

> On Wed, Dec 05, 2007 at 03:25:07AM +0200, Juho Snellman wrote:
>> (There's also the slight problem that some of the speed increase of
>> this hack comes from completely ignoring external formats. This would
>> be ok if base-strings were defined to contain iso-8859-1, but they're
>> actually defined to contain just ascii. Ok for a quick hack, less
>> good
>> for something included with SBCL. There's always the option of
>> defining base-char to map to 8859-1 instead, but that's the
>> discussion
>> we've had before. I haven't measured whether this is an effect that
>> actually matters.)
>
> I would personally really like to see BASE-CHAR/STRING be able to hold
> iso-8859-1. I've had several cases where I wanted to parse "plain
> ascii
> with other random binary garbage I didn't care about", and doing that
> in SBCL as it stands without taking the unicode hit is pretty painful.

I like that it only holds ASCII, in that it forces the programmer to
think, and use strings for what they're meant for: text. If you want
to store random bytes, you ought to be using a byte array. And if you
want to store text, it's pretty rare that you actually really only
want LATIN-1 (or if you do want that, it's pretty rare that you
*should* want that)

James

Brian Downing

Dec 5, 2007, 1:50:12 PM12/5/07
to James Y Knight, Brian Downing, Juho Snellman, sbcl-...@lists.sourceforge.net
On Wed, Dec 05, 2007 at 01:03:22PM -0500, James Y Knight wrote:
> I like that it only holds ASCII, in that it forces the programmer to
> think, and use strings for what they're meant for: text. If you want
> to store random bytes, you ought to be using a byte array. And if you
> want to store text, it's pretty rare that you actually really only
> want LATIN-1 (or if you do want that, it's pretty rare that you
> *should* want that)

I think this is a decent theoretical design. Unfortunately, I think it
pretty much sucks from a practical standpoint.

Let me use my earlier example. I was parsing RCS files. RCS is a
mostly-text-based grammar with some binary sections (i.e. the actual
file data). There is no way to know where the binary parts are without
parsing the text.

For performance and memory-consumption reasons, I could not deal with
using 4x the memory to load an RCS file into an SBCL unicode string.
And since the file can contain high-byte characters, I couldn't load it
into a base-string as they exist today. Further, I couldn't load the
file a "byte at a time", reading the right kind of element (character
or byte), as that was very slow.

So, I wound up slurping the whole thing into a byte array, as you
recommend above. I think this sucks. Here's why:

I had to rewrite my lexer to work on a byte array, not a string. This
means:

* The code is very clumsy, and not at all prototypical with what you'd
expect for string operations.

* I couldn't use any of the nice built-in CL string functions.

* I couldn't use any of the nice external CL libraries that deal in
strings. (*cough* CL-PPCRE *cough*)

The fact is the whole project would have been a lot easier had I been
able to reason about the file as a string, efficiently.

Frankly, it's very common to have arbitrary binary data with large amounts
of ASCII text. Making this a pain in the ass to deal with efficiently
in SBCL doesn't seem like a good practical decision.

Add more interesting games, like mmapping a file into memory and sticking
an SBCL base-string header in front of it, and I think you really want
to be able to handle 8-bit strings.

-bcd

Ingvar

Dec 5, 2007, 2:02:14 PM12/5/07
to sbcl-...@lists.sourceforge.net
James Y Knight writes:
> On Dec 4, 2007, at 8:35 PM, Brian Downing wrote:
[ SNIP ]

> > I would personally really like to see BASE-CHAR/STRING be able to hold
> > iso-8859-1. I've had several cases where I wanted to parse "plain
> > ascii
> > with other random binary garbage I didn't care about", and doing that
> > in SBCL as it stands without taking the unicode hit is pretty painful.
>
> I like that it only holds ASCII, in that it forces the programmer to
> think, and use strings for what they're meant for: text. If you want
> to store random bytes, you ought to be using a byte array. And if you
> want to store text, it's pretty rare that you actually really only
> want LATIN-1 (or if you do want that, it's pretty rare that you
> *should* want that)

For me, essentially every text-processing task I've done in the last 20
years fits quite well inside "just Latin-1", and frequently the source
data is encoded in Latin-1 anyway.

That said, I can certainly see the logic of "BASE-CHAR can only take ASCII",
and that covers probably 90% of the times I use strings anyway (for "random
binary data" I tend to use arrays of (UNSIGNED-BYTE 8)).

//INgvar

Christophe Rhodes

Dec 5, 2007, 4:22:37 PM12/5/07
to Brian Downing, SBCL
Brian Downing <bd-sour...@lavos.net> writes:

> * I couldn't use any of the nice external CL libraries that deal in
> strings. (*cough* CL-PPCRE *cough*)

So, as the major culprit for making the theoretical decision to have
base-strings handle ASCII only, I'll ask: how hard would it be to
adapt CL-PPCRE or similar libraries to work on octet vectors, using
the same programmer-friendly stringy input format that currently
exists?

The other point, parsing file contents where the encoding is not a
priori known, sounds like an excellent argument for implementing (setf
stream-external-format) and bivalent streams, but not so much for
stuffing arbitrary binary data into a sequence of characters.

I'll admit, I have a strong bias against pandering to a particular
interpretation of what happens to be convenient today when that is the
only argument that is advanced for doing something; if people could
convince me that latin-1 is the Right Thing in all circumstances for
BASE-CHAR to represent, then I would not argue against it. (Duh. :-)

Cheers,

Christophe

Brian Downing

Dec 5, 2007, 5:24:18 PM12/5/07
to Christophe Rhodes, Brian Downing, SBCL
On Wed, Dec 05, 2007 at 09:22:37PM +0000, Christophe Rhodes wrote:
> Brian Downing <bd-sour...@lavos.net> writes:
> > * I couldn't use any of the nice external CL libraries that deal in
> > strings. (*cough* CL-PPCRE *cough*)
>
> So, as the major culprit for making the theoretical decision to have
> base-strings handle ASCII only, I'll ask: how hard would it be to
> adapt CL-PPCRE or similar libraries to work on octet vectors, using
> the same programmer-friendly stringy input format that currently
> exists?

I dunno, but I do know that I didn't want to. :)

> The other point, parsing file contents where the encoding is not a
> priori known, sounds like an excellent argument for implementing (setf
> stream-external-format) and bivalent streams, but not so much for
> stuffing arbitrary binary data into a sequence of characters.

The thing is, what I really want is a "bivalent array." I wanted it
all in memory, ready to be accessed in either form. Calling stream
accessors per character was way too slow.

A latin-1 base-string is as close as I could figure to getting that and
still being able to do string operations on it.

> I'll admit, I have a strong bias against pandering to a particular
> interpretation of what happens to be convenient today when that is the
> only argument that is advanced for doing something; if people could
> convince me that latin-1 is the Right Thing in all circumstances for
> BASE-CHAR to represent, then I would not argue against it. (Duh. :-)

Well, latin-1 is already a higher-class citizen than any other encoding,
as the first 256 Unicode code points are latin-1. Also, it's not as if
the first 128 are any more stable. A lot of Japanese code pages have
#\¥ at #x5C, for instance. And then there's EBCDIC...

So I'm not sure assuming latin-1 for BASE-CHAR is any less sane than
assuming ASCII for BASE-CHAR. And it's quite a bit more convenient for
things like I just described.

-bcd

Gary King

Dec 9, 2007, 2:21:11 PM12/9/07
to Nathan Froyd, sbcl-...@lists.sourceforge.net
Hi Nathan,

Can you say more or point me towards a reference for how Python does
handle strings? Why can't Python be using destructive operations?

thanks,

On Nov 21, 2007, at 4:22 PM, Nathan Froyd wrote:

>> 1. (Common) Lisp doesn't include any convenient destructive string
>> operations or even a buffered read-line approach (Franz has one now
>> but it's not standard). Having these around would, I think, be a good
>> thing.
>
> The Python code you posted earlier doesn't use destructive string
> operations--indeed, it can't, due to the nature of Python strings.

--
Gary Warren King, metabang.com
Cell: (413) 559 8738
Fax: (206) 338-4052
gwkkwg on Skype * garethsan on AIM


Erik Huelsmann

Dec 9, 2007, 4:49:50 PM12/9/07
to Gary King, sbcl-...@lists.sourceforge.net
On Dec 9, 2007 8:21 PM, Gary King <gwk...@metabang.com> wrote:
> Hi Nathan,
>
> Can you say more or point me towards a reference for how Python does
> handle strings? Why can't Python be using destructive operations?

Because its strings are by definition immutable.
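
For contrast, Common Lisp strings are mutable, which is what makes
destructive string operations possible there in the first place; a
trivial illustration:

```lisp
;; CL strings can be modified in place; Python's str type has no
;; equivalent operation.
(let ((s (copy-seq "hello")))
  (setf (char s 0) #\H)
  s)
;; => "Hello"
```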


bye,


Erik.
