People have been writing scheme parsers for decades now, and I was
wondering if there is a consensus on what is the fastest and/or most
efficient method to parse scheme? What method, theoretical or
practical, leads to the fastest scheme parsers?
thx
- Adrian B
I can't think of anything that's likely to run faster than the
READ procedure.
Will
Cheers,
Adrian
The grammar of Scheme is quite straightforward. Therefore: Use a
parser generator. It is seldom worthwhile to write your own
parser.
--
Jens Axel Søgaard
The parsing problem for Scheme is Omega(n), where
n is the number of characters in the input. From
that it follows that any O(n) parsing algorithm
is within a constant factor of the best possible
performance. In most implementations, the read
procedure is implemented by a recursive descent
parser using one character of lookahead, although
a second character of lookahead may be used in a
few places.
> What
> implementation-tricks can be used to speed up parsing? etc. I am
> thinking of this more from an implementation perspective using a
> language like C or C++, but not specifically.
I think the most important trick is to write the
parser in Scheme instead of C or C++.
Consider the following benchmark results, obtained
on a SunBlade 1500 with no other users. All times
are in seconds, and are the average of three runs:
Chez Larceny MzScheme scheme48
******** ******** ******** ********
benchmark1 0.247 0.569 2.587 2.62
benchmark2 0.330 1.653 1.821 3.82
benchmark1 just uses read-char to read "nboyer.sch"
100 times, with no parsing at all. benchmark2 just
uses read to parse "nboyer.sch" 100 times.
In the two fastest systems, the read procedure is
written in Scheme. It therefore appears that the
most important trick, if you're after speed, is to
write the parser in Scheme instead of C or C++.
This trick doesn't always work, however, as shown
by the timing for scheme48. I suspect that scheme48's
read procedure is written in Scheme also, but it's
the slowest of the bunch.
MzScheme's read procedure is written in C or C++,
and runs faster than you can read the individual
characters from the file. That indicates a problem
with the performance of read-char in MzScheme; that
problem was one of the reasons they chose to write
the read procedure in C or C++ instead of Scheme.
Will
I haven't parsed scheme before, but I have parsed enough C, Java, Pascal, and
XML to know that for me it is always worthwhile to write your own parser. Not
from scratch, of course, one needs to have a suitable toolkit to build on,
essentially a decent tokenizer library.
The point though is that I have found the hard way that the effort of
specifying actions for grammar rules and providing the glue code to actually
use the parsing results is not much different from simply doing it all
yourself.
Properly designed parser code will have the appropriate methods that
correspond to the equivalent grammar elements, the result being that the
parsing code is just as readable as a grammar file itself.
There is the added benefit of flexiblity. Need n-token look-ahead for a
particular element? Something that is not quite context independent? Code
the custom work on the spot -- no tool limitations to deal with.
Versioning and licensing problems in using 3rd party parsing tools also just
go away.
The result is something that is as lightweight and flexible as I need it to
be, something that works perfectly with anything else I use since I have
complete control to change anything as needed.
--
Cheers, The Rhythm is around me,
The Rhythm has control.
Ray Blaak The Rhythm is inside me,
rAYb...@STRIPCAPStelus.net The Rhythm has my soul.
> In the two fastest systems, the read procedure is
> written in Scheme. It therefore appears that the
> most important trick, if you're after speed, is to
> write the parser in Scheme instead of C or C++.
And pick a decent Scheme compiler, yes.
> MzScheme's read procedure is written in C or C++,
> and runs faster than you can read the individual
> characters from the file. That indicates a problem
> with the performance of read-char in MzScheme; that
> problem was one of the reasons they chose to write
> the read procedure in C or C++ instead of Scheme.
But shouldn't we all code in C for efficiency? ;)
Aziz,,,
Of course. Even freshmen know that!
Will
Use Flex or similar to generate a lexer. Parsing is profoundly
easy in Lispy languages, and besides you want some ability to
generate, interpret, compile, or dynamic-load code at runtime
which complicates parser toolkits immensely, so it's not IMO
worthwhile to use a toolkit such as YACC to build a parser; it
was easier and simpler to make my own by hand.
Bear
In Chez 6.9c, on my iBook (800MHz G4), I get:
(time (many test-read ...))
2 collections
290 ms elapsed cpu time, including 10 ms collecting
405 ms elapsed real time, including 2 ms collecting
2665248 bytes allocated, including 2240992 bytes reclaimed
(time (many test-read-char ...))
no collections
80 ms elapsed cpu time
182 ms elapsed real time
108000 bytes allocated
I think Chez open-codes read-char at optimize-level 3. This
makes Chez, Larceny, and also my little compiler perform
read-char about only 3~4 times faster than read (for nboyer.sch,
100 times).
With the following numbers (if there is a way to scale them by
having Chez's times as the baseline), I would probably say the
3 fastest systems have their read implemented in Scheme, which
confirms your hypothesis.
(time (many test-read ...))
performed 6 collections
utime: 0.980s including gc: 0.010s
stime: 0.040s including gc: 0.000s
total: 1.020s including gc: 0.010s
rtime: 1.181s including gc: 0.018s
(time (many test-read-char ...))
performed 0 collections
utime: 0.210s including gc: 0.000s
stime: 0.030s including gc: 0.000s
total: 0.240s including gc: 0.000s
rtime: 0.315s including gc: 0.000s
Aziz,,,
Reading scheme involves two procedures: a lexer and a parser. The lexer
turns the stream of characters into tokens (such as left-paren, the
integer 42, the string "foobar", #f, etc.). The parser consumes these
tokens one at a time and assembles them into a (possibly nested) list.
When I tried implementing a reader for Scheme, I first wrote the lexer
in the most straight-forward fashion you can think of. Here are the
first few lines of the main lexer procedure. Essentially, it's a state
machine implemented using many mutually recursive procedures. For
example, when the lexer finds the double-quote character, it calls the
Str helper, which consumes all the characters up to the closing
double-quote (taking care of escape characters), and finally returns
(datum . "some string") for the parser to use. The entire lexer is
about 300 lines of Scheme.
(define lexer
(lambda (p)
(let ([c (read-char p)])
(cond
[(eof-object? c) 'eof]
[(char=? c #\() 'lp]
[(char=? c #\)) 'rp]
[(char=? c #\[) 'lb]
[(char=? c #\]) 'rb]
[(char-whitespace? c) (lexer p)]
[(char-numeric? c) (Num p (cons c '()))]
[(sign-char? c) (Sign p c)]
[(char=? c #\") (Str p '())]
[(char=? c #\;) (Semi p)]
[(char=? c #\.) (Dot p)]
[(char=? c #\') 'quote]
[(char=? c #\`) 'quasiquote]
[(char=? c #\,) (Unq p)]
[(char=? c #\#) (Hash p)]
[(char-symbolic? c) (Sym p (cons c '()))]
...
The parser is much simpler than the lexer because it does not have to
deal with all the special cases of the syntax. (ie. there are a zillion
things that start with # in my reader: #t, #f, #!eof, #(), #[], #23(23),
#\n, #\newline, #1=(#1#), #%prim, #2%prim, #3%prim, etc. I don't have
the full numeric tower so I don't deal with all the quirks of parsing
numbers yet). So, the parser is very small (around 80 lines of Scheme)
and very easy to write. Overall, the whole reader business is very
straightforward to do, and makes up very small percentage of any
implementation (less than 1%). I believe this is the fastest and
simplest way to parse scheme. Here are the first few lines of my
parser:
(define parse
(lambda (p)
(let ([t (get-token p)])
(case t
[(eof) '#!eof]
[(lp) (parse-list p)]
[(lb) (parse-list-b p)]
[(dot) (error 'read "unexpected dot")]
[(quote) (list 'quote (parse p))]
[(hash-quote) (list 'syntax (parse p))]
[(quasiquote) (list 'quasiquote (parse p))]
[(unquote) (list 'unquote (parse p))]
[(unquote-splicing) (list 'unquote-splicing (parse p))]
...
I hope this helps.
Aziz,,,
I suddenly remember the perfect place to start for Adrian.
The Summer '96 Scheme Workshop were on compilers. Eric Hilsdale and
Christopher Haynes were the instructors.
The Scanner/parser is described in
<http://www.cs.indiana.edu/eip/compile/scanparse.html>
The main page is here:
<http://www.cs.indiana.edu/eip/compile/>
The road map is here:
<http://www.cs.indiana.edu/eip/compile/roadmap.html>
And the coomplete code is here:
<http://www.cs.indiana.edu/eip/compile/code.html>
Note that the compiler produces sparc assembler - so they naturally
included a Sparc emulator written in Scheme...
Related papers:
"Destination-Driven Code Generation"
R. Kent Dybvig, Robert Hieb and Tom Butler.
<http://www.cs.indiana.edu/~dyb/papers/ddcg.ps>
"Compiler Construction Using Scheme".
Erik Hilsdale, J. Michael Ashley, R. Kent Dybvig and Daniel P. Friedman.
<http://repository.readscheme.org/ftp/papers/jmashley/fple95.pdf>
--
Jens Axel Søgaard
> I suddenly remember the perfect place to start for Adrian.
>
> The Summer '96 Scheme Workshop were on compilers. Eric Hilsdale and
> Christopher Haynes were the instructors.
>
> The Scanner/parser is described in
> <http://www.cs.indiana.edu/eip/compile/scanparse.html>
>
> The main page is here:
> <http://www.cs.indiana.edu/eip/compile/>
>
> The road map is here:
> <http://www.cs.indiana.edu/eip/compile/roadmap.html>
>
> And the coomplete code is here:
> <http://www.cs.indiana.edu/eip/compile/code.html>
>
> Note that the compiler produces sparc assembler - so they naturally
> included a Sparc emulator written in Scheme...
This is kind of, ..., what you call it, freaky. I have never seen these
pages before yet the code looks almost identical to the code I have in
my compiler. There must be only one way to write Scheme parsers in
Scheme. So, I agree. It's the perfect place to start for Adrian.
> Related papers:
>
> "Destination-Driven Code Generation"
> R. Kent Dybvig, Robert Hieb and Tom Butler.
> <http://www.cs.indiana.edu/~dyb/papers/ddcg.ps>
>
> "Compiler Construction Using Scheme".
> Erik Hilsdale, J. Michael Ashley, R. Kent Dybvig and Daniel P. Friedman.
> <http://repository.readscheme.org/ftp/papers/jmashley/fple95.pdf>
These I've seen, and they are very good papers to read. Just for the
record, the current undergraduate compilers course in IU extends the
5-pass compiler described in the second paper to a 50-pass compiler
that performs many optimizations and uses the graph-coloring register
allocator instead of the destination-driven code generator. The list
of passes making the "P423-compiler" is in:
"A Nanopass Framework for Compiler Education"
Dipanwita Sarkar, Oscar Waddell, and R. Kent Dybvig.
Journal of Functional Programming -- Educational Pearl. (To appear.)
http://www.cs.indiana.edu/~dyb/pubs/nano-jfp.pdf
Aziz,,,
> [...]
> Consider the following benchmark results, obtained
> on a SunBlade 1500 with no other users. All times
> are in seconds, and are the average of three runs:
>
> Chez Larceny MzScheme scheme48
> ******** ******** ******** ********
> benchmark1 0.247 0.569 2.587 2.62
> benchmark2 0.330 1.653 1.821 3.82
>
> benchmark1 just uses read-char to read "nboyer.sch"
> 100 times, with no parsing at all. benchmark2 just
> uses read to parse "nboyer.sch" 100 times.
Comparing them is only useful if they all do the same thing, which is
wrong in this case. The final product of read/read-char/etc would be
the same, but making the process of getting that result deal with
things like non-blocking IO and Unicode can add enough noise to make
these numbers useless.
On top of that, the actual act of reading a character is something
that you must take all the way down to the OS level -- so I tried a
"raw" implementation of benchmark1: a C program that uses system calls
to read the file. The numbers I got (Linux on an Intel machine, 200
reads in each run, median of three runs) are 1.25sec for `read-char',
and 4.27sec for the system call version (with about 25% spent on user
code). So you obviously need some form of buffering, which adds an
additional layer of noise on the numbers. (BTW, benchmark2 runs
slightly slower for me: 1.28sec.)
--
((lambda (x) (x x)) (lambda (x) (x x))) Eli Barzilay:
http://www.barzilay.org/ Maze is Life!
It appeared in 15(5):653-668.
I'm still anxiously waiting for them to release the code...
Lauri
> This is kind of, ..., what you call it, freaky. I have never seen these
> pages before yet the code looks almost identical to the code I have in
> my compiler. There must be only one way to write Scheme parsers in
> Scheme. So, I agree. It's the perfect place to start for Adrian.
;-)
>> Related papers:
>>
>> "Destination-Driven Code Generation"
>> R. Kent Dybvig, Robert Hieb and Tom Butler.
>> <http://www.cs.indiana.edu/~dyb/papers/ddcg.ps>
>>
>> "Compiler Construction Using Scheme".
>> Erik Hilsdale, J. Michael Ashley, R. Kent Dybvig and Daniel P.
>> Friedman.
>> <http://repository.readscheme.org/ftp/papers/jmashley/fple95.pdf>
>
> These I've seen, and they are very good papers to read.
Indeed. There were several ideas in the workshop code that fell
into place when I read the "Destination-Driven Code Generation"
paper.
--
Jens Axel Søgaard
> "William D Clinger" <cesu...@yahoo.com> writes:
>
>
>>[...]
>>Consider the following benchmark results, obtained
>>on a SunBlade 1500 with no other users. All times
>>are in seconds, and are the average of three runs:
>>
>> Chez Larceny MzScheme scheme48
>> ******** ******** ******** ********
>>benchmark1 0.247 0.569 2.587 2.62
>>benchmark2 0.330 1.653 1.821 3.82
>>
>>benchmark1 just uses read-char to read "nboyer.sch"
>>100 times, with no parsing at all. benchmark2 just
>>uses read to parse "nboyer.sch" 100 times.
>
>
> Comparing them is only useful if they all do the same thing, which is
> wrong in this case. The final product of read/read-char/etc would be
> the same,
And that's what counts for the user: the final product.
> but making the process of getting that result deal with
> things like non-blocking IO and Unicode can add enough noise to make
> these numbers useless.
That's an implementation detail. The end user doesn't care how you get
the chars. I don't think Unicode adds 5~10-fold overhead but I could be
wrong.
> On top of that, the actual act of reading a character is something
> that you must take all the way down to the OS level
Details and not true for buffered IO.
> -- so I tried a
> "raw" implementation of benchmark1: a C program that uses system calls
> to read the file. The numbers I got (Linux on an Intel machine, 200
> reads in each run, median of three runs) are 1.25sec for `read-char',
> and 4.27sec for the system call version (with about 25% spent on user
> code).
Are you saying that C makes a terrible language to implement IO? No
wonder then that the faster implementations do their IO in Scheme.
> So you obviously need some form of buffering, which adds an
> additional layer of noise on the numbers. (BTW, benchmark2 runs
> slightly slower for me: 1.28sec.)
Yes, buffering makes IO faster. Doesn't MzScheme use buffered IO?
Aziz,,,
> Eli Barzilay wrote:
>
> > "William D Clinger" <cesu...@yahoo.com> writes:
> >
>
> >>[...]
> >>Consider the following benchmark results, obtained
> >>on a SunBlade 1500 with no other users. All times
> >>are in seconds, and are the average of three runs:
> >>
> >> Chez Larceny MzScheme scheme48
> >> ******** ******** ******** ********
> >>benchmark1 0.247 0.569 2.587 2.62
> >>benchmark2 0.330 1.653 1.821 3.82
> >>
> >>benchmark1 just uses read-char to read "nboyer.sch"
> >>100 times, with no parsing at all. benchmark2 just
> >>uses read to parse "nboyer.sch" 100 times.
> > Comparing them is only useful if they all do the same thing, which is
> > wrong in this case. The final product of read/read-char/etc would be
> > the same,
>
> And that's what counts for the user: the final product.
Depends on your definition of "final product", and according to the
definition I used, then what counts is much more than the final
product. For example, I might care that the rest of the system
provides certain features even if `read-char' is slow as a result. I
might care about the behavior of the system if `read-char' is used on
an input terminal which is waiting for input (= if it busy-waits, I
don't want it). This whole thing about measuring benchmarks shows
that Will cares about the speed too.
> > but making the process of getting that result deal with things
> > like non-blocking IO and Unicode can add enough noise to make
> > these numbers useless.
>
> That's an implementation detail. The end user doesn't care how you
> get the chars. [...]
Same as above: some users defeinitely care that you *do* get unicode
chars.
> > -- so I tried a "raw" implementation of benchmark1: a C program
> > that uses system calls to read the file. The numbers I got (Linux
> > on an Intel machine, 200 reads in each run, median of three runs)
> > are 1.25sec for `read-char', and 4.27sec for the system call
> > version (with about 25% spent on user code).
>
> Are you saying that C makes a terrible language to implement IO? No
> wonder then that the faster implementations do their IO in Scheme.
You know, I'm all for Scheme propaganda, but this is plain silly. I
wrote some trivial C code to measure char-by-char reading speed. It
took more than 4 seconds, and 3 of that was spent in system time.
That was my point.
> > So you obviously need some form of buffering, which adds an
> > additional layer of noise on the numbers. (BTW, benchmark2 runs
> > slightly slower for me: 1.28sec.)
>
> Yes, buffering makes IO faster. Doesn't MzScheme use buffered IO?
Of course -- using read-char was faster than the raw system-calls
code. Since all Scheme's have a faster than system-call `read-char',
then they all use some buffering. This means that there is another
layer of dealing with IO before `read-char' -- that layer can be
implemented in very different ways (what kind of buffering? how big?
what system call is used to read/write chunks?). That layer is
probably different in different implementation, which means more noise
that in this case can be large enough to render the original measures
meaningless.
Let's get real. Do you honestly believe that Unicode is the
reason MzScheme's read-char is so slow?
I'll give you a hint: MzScheme's read-char is about as fast
as scheme48's read-char.
> You know, I'm all for Scheme propaganda, but this is plain silly.
We're having fun here, but our point is that it is plain
silly to think you can improve the speed of a parser by
writing it in C or C++. A secondary point is that the
main reason freshpeople *think* writing it in C or C++
is faster is because they are accustomed to comparing the
speed of compiled C/C++ against interpreted Scheme.
> Of course -- using read-char was faster than the raw system-calls
> code. Since all Scheme's have a faster than system-call `read-char',
> then they all use some buffering. This means that there is another
> layer of dealing with IO before `read-char' -- that layer can be
> implemented in very different ways (what kind of buffering? how big?
> what system call is used to read/write chunks?). That layer is
> probably different in different implementation, which means more noise
> that in this case can be large enough to render the original measures
> meaningless.
Do you honestly believe MzScheme's read-char is slow
because of poor buffering?
Will
Parser generators often produce bottom-up parsers, which are pretty
difficult to modify on the fly. So if run-time reader extensions are
desired, then a custom recursive-descent parser makes more sense,
especially since the s-expression grammar is (easiry transformed into)
LL(1).
Of course an extensible LL(1) parser could be produced by a generator,
too, but I don't think such tools are very common.
Lauri
>>Eli Barzilay wrote:
> Depends on your definition of "final product", and according to the
> definition I used, then what counts is much more than the final
> product. For example, I might care that the rest of the system
> provides certain features even if `read-char' is slow as a result. I
> might care about the behavior of the system if `read-char' is used on
> an input terminal which is waiting for input (= if it busy-waits, I
> don't want it). This whole thing about measuring benchmarks shows
> that Will cares about the speed too.
So, how much of the overhead of read-char is due to using non-blocking
IO in MzScheme?
>>>but making the process of getting that result deal with things
>>>like non-blocking IO and Unicode can add enough noise to make
>>>these numbers useless.
>>
>>That's an implementation detail. The end user doesn't care how you
>>get the chars. [...]
>
> Same as above: some users defeinitely care that you *do* get unicode
> chars.
How much does Unicode support cost read-char? I honestly would like to
know.
>>>-- so I tried a "raw" implementation of benchmark1: a C program
>>>that uses system calls to read the file. The numbers I got (Linux
>>>on an Intel machine, 200 reads in each run, median of three runs)
>>>are 1.25sec for `read-char', and 4.27sec for the system call
>>>version (with about 25% spent on user code).
>>
>>Are you saying that C makes a terrible language to implement IO? No
>>wonder then that the faster implementations do their IO in Scheme.
>
> You know, I'm all for Scheme propaganda, but this is plain silly. I
> wrote some trivial C code to measure char-by-char reading speed. It
> took more than 4 seconds, and 3 of that was spent in system time.
> That was my point.
I was not talking about ``Oh Scheme is the best/fastest language''. I'm
not for any propaganda. It's just that in my experience, I can squeeze
more performance out of Scheme when given a decent implementation than I
can using C. Consequently, I was agreeing with Will that implementing a
Scheme parser in Scheme is the key to getting good performance.
I still don't know what your point is regarding the trivial "raw" C code
that you wrote that spends 75% of its time in the system. My compiler
spends 15% of its time in the system. Yes, it's a different system, but
what's your point? You are not blaming the system for spending that
time, are you?
>>>So you obviously need some form of buffering, which adds an
>>>additional layer of noise on the numbers. (BTW, benchmark2 runs
>>>slightly slower for me: 1.28sec.)
>>
>>Yes, buffering makes IO faster. Doesn't MzScheme use buffered IO?
>
> Of course -- using read-char was faster than the raw system-calls
> code. Since all Scheme's have a faster than system-call `read-char',
> then they all use some buffering. This means that there is another
> layer of dealing with IO before `read-char' -- that layer can be
> implemented in very different ways (what kind of buffering? how big?
> what system call is used to read/write chunks?). That layer is
> probably different in different implementation, which means more noise
> that in this case can be large enough to render the original measures
> meaningless.
Maybe I'm very slow today. Are you saying that if one system spends
0.247 secs reading chars and another spends 2.587 secs, then you cannot
say anything about their performance and the whole difference in the
measurement is noise?
Aziz,,,
> Let's get real. Do you honestly believe that Unicode is the reason
> MzScheme's read-char is so slow?
No.
> > You know, I'm all for Scheme propaganda, but this is plain silly.
>
> We're having fun here, but our point is that it is plain silly to
> think you can improve the speed of a parser by writing it in C or
> C++.
I never said anything about parsers that use C/++ vs ones that use
Scheme. I only said that comparing IO behavior from implementations
that have wildly varying capabilities does not make much of a point.
> A secondary point is that the main reason freshpeople *think*
> writing it in C or C++ is faster is because they are accustomed to
> comparing the speed of compiled C/C++ against interpreted Scheme.
That's sort of right (I don't think that there are any major Scheme
implementation alive that do straightforward interpretation), but in
any case I had no argument in that direction. (And if I would make
any point in that area, then it should be clear what I'd favor.)
> > Of course -- using read-char was faster than the raw system-calls
> > code. Since all Scheme's have a faster than system-call
> > `read-char', then they all use some buffering. This means that
> > there is another layer of dealing with IO before `read-char' --
> > that layer can be implemented in very different ways (what kind of
> > buffering? how big? what system call is used to read/write
> > chunks?). That layer is probably different in different
> > implementation, which means more noise that in this case can be
> > large enough to render the original measures meaningless.
>
> Do you honestly believe MzScheme's read-char is slow because of poor
> buffering?
Not directly, but I believe that the IO capabilities I get in MzScheme
are rich enough that comparing it's `read-char' is not fair. First of
all, I get enough higher-level primitives that I think it has been
years since I used `read-char'. Secondly, even if doing some things
in C makes them faster, I'd still go with Scheme, and the exact same
argument holds for choosing a Scheme implementation -- there are some
things that make me choose an implementation even if it's slower.
Abdulaziz Ghuloum <aghu...@c-s-remove-dashes.indiana.edu> writes:
> So, how much of the overhead of read-char is due to using
> non-blocking IO in MzScheme?
I really don't know. The IO part of MzScheme is exceptionally hairy
(esp given that it works on different OSs). The only way I can think
of doing a fair comparison is to use the same underlying IO
implementation, and that's way too much work. (I actually made a
brief attempt with a Scheme interface for reading characters through a
raw system call, and then I discovered the horrendous time spent when
you do that.)
> > Same as above: some users defeinitely care that you *do* get unicode
> > chars.
>
> How much does Unicode support cost read-char? I honestly would like
> to know.
That's easy to try -- you just need to compare `read-char' and
`read-byte'. IIRC, the extra cost was indeed negligible, but like I
said above I just brought it as an example that a naive time
comparison is not saying much. (In any case, `read-char' is not as
useful when you have what you get with MzScheme.)
> I was not talking about ``Oh Scheme is the best/fastest language''.
> I'm not for any propaganda. It's just that in my experience, I can
> squeeze more performance out of Scheme when given a decent
> implementation than I can using C. Consequently, I was agreeing
> with Will that implementing a Scheme parser in Scheme is the key to
> getting good performance.
I'm really not convinced about that, certainly not by the posted
numbers. A bad Scheme programmer is still a bad programmer and will
generate a crappy parser, and a good C programmer is still a good C
programmer and will generate a good parser. For a fair comparison,
I'd sit down and go over a Scheme parser, and then I'd translate the
code to C -- that can get to a proper answer for why a Scheme faster
should be faster than a C parser, or to comparable C code that is
natural enough to have been used in the first place.
> I still don't know what your point is regarding the trivial "raw" C
> code that you wrote that spends 75% of its time in the system. My
> compiler spends 15% of its time in the system. Yes, it's a
> different system, but what's your point? You are not blaming the
> system for spending that time, are you?
No, I'm saying that there must be some buffering, and that buffering
will have a huge effect on the time, and since each implementation is
likely to use a different buffering the noise that is inserted to
timing comparisons makes it useless.
> Maybe I'm very slow today. Are you saying that if one system spends
> 0.247 secs reading chars and another spends 2.587 secs, then you cannot
> say anything about their performance and the whole difference in the
> measurement is noise?
The numbers were 0.330, 1.653, 1.821, 3.82, and yes, I'm saying that
using these numbers to draw such strong conclusions as "Scheme parsers
are faster than C parsers" is bogus.
> "Compiler Construction Using Scheme".
> Erik Hilsdale, J. Michael Ashley, R. Kent Dybvig and Daniel P. Friedman.
> <http://repository.readscheme.org/ftp/papers/jmashley/fple95.pdf>
Thanks for the links! I've been reading through some of these today
with a focus on scanning and parsing rather than code generation. The
above paper specifies a scanner using a deterministic finite automata
and a recursive descent parser. Are these considered the fastest way
to scan and parse scheme code? Or are they used in the paper simply
because they are either the simplest, or most elegant and
instructional?
I guess you're not getting the clues, so, let me explain.
MzScheme is not slow because of Unicode. MzScheme is not slow because
of bad IO buffers management. MzScheme is not slow because it targets
many platforms. MzScheme is slow because it's an interpreter. Whether
you want to call it a straight-forward interpreter or an optimizing
interpreter, it's still an interpreter. All interpreters suffer the
same artifact: implementor's code is fast and user's code is slow.
Let me spell all the benchmarks I ran using MzScheme and not comparing
it to any other implementation.
1. Reading&Parsing nboyer.sch 100 times:
cpu time: 2600 real time: 2748 gc time: 110
2. Reading nboyer.sch 100 times using read-char:
cpu time: 3560 real time: 3900 gc time: 10
(you said this was due to IO, non-blocking, and Unicode)
3. Reading nboyer.sch 100 times using read-byte:
cpu time: 3330 real time: 3531 gc time: 10
(this eliminates the Unicode overhead, but with the same results)
4. Reading nboyer.sch 100 times from a string-port using read-char:
cpu time: 3710 real time: 3920 gc time: 130
(this eliminates all the IO/nonblocking overhead, but same results)
5. Reading nboyer.sch 100 times from a string-port using read-byte:
cpu time: 3550 real time: 3694 gc time: 130
(this eliminates both Unicode and IO/nonblocking overhead, same results)
6. Iterating through a string containing nboyer.sch 100 times:
cpu time: 3620 real time: 3830 gc time: 0
(this bypasses the entire IO layer, but same results!)
So, MzScheme is slow because it doesn't like executing my code; it would
rather execute code that's built-in. The "I get enough higher-level
primitives that I think it has been years since I used `read-char'" does
not fly because you cannot dream that one implementation will provide
all the primitives that all its users will ever need. Users will always
need to write code to express whatever problem they're trying to solve.
It is sad how often and soon users of these ``high-level'' interpreters
face reality and get forced down to the C level.
Aziz,,,
PS. Here is my code if you want to rerun under your machine.
(define (test-read)
(let ([p (open-input-file "nboyer.sch")])
(let f ((p p))
(let ([x (read p)])
(cond
[(eof-object? x) (close-input-port p)]
[else (f p)])))))
(define (test-read-char)
(let ([p (open-input-file "nboyer.sch")])
(let f ((p p))
(let ([x (read-char p)])
(cond
[(eof-object? x) (close-input-port p)]
[else (f p)])))))
(define (test-read-byte)
(let ([p (open-input-file "nboyer.sch")])
(let f ((p p))
(let ([x (read-byte p)])
(cond
[(eof-object? x) (close-input-port p)]
[else (f p)])))))
(define (file->string fname)
(let ([p (open-input-file "nboyer.sch")]
[o (open-output-string)])
(let f ((p p))
(let ([x (read-char p)])
(cond
[(eof-object? x)
(close-input-port p)
(get-output-string o)]
[else
(write-char x o)
(f p)])))))
(define nboyer-string (file->string "nboyer.sch"))
(define (test-read-byte)
(let ([p (open-input-file "nboyer.sch")])
(let f ((p p))
(let ([x (read-byte p)])
(cond
[(eof-object? x) (close-input-port p)]
[else (f p)])))))
(define (test-read-char-from-string)
(let ([p (open-input-string nboyer-string)])
(let f ((p p))
(let ([x (read-char p)])
(cond
[(eof-object? x) (close-input-port p)]
[else (f p)])))))
(define (test-read-byte-from-string)
(let ([p (open-input-string nboyer-string)])
(let f ((p p))
(let ([x (read-byte p)])
(cond
[(eof-object? x) (close-input-port p)]
[else (f p)])))))
(define (test-string-ref)
(let f ((i 0))
(cond
[(= i (string-length nboyer-string)) 'nothing]
[else
(string-ref nboyer-string i)
(f (add1 i))])))
(define (many f n)
(unless (zero? n)
(f)
(many f (- n 1))))
(time (many test-read 100))
(time (many test-read-char 100))
(time (many test-read-byte 100))
(time (many test-read-char-from-string 100))
(time (many test-read-byte-from-string 100))
(time (many test-string-ref 100))
Related to the previous post. I'll take it that you are a good Scheme
programmer and a good programmer in general. Can you write a good (as
in fast) parser in Scheme using your implementation of choice? And how
would you go about doing that? This question is to go back on-topic.
Aziz,,,
> Eli Barzilay wrote:
> > ...
>
> I guess you're not getting the clues, so, let me explain.
>
> MzScheme is not slow because of Unicode. MzScheme is not slow
> because of bad IO buffers management. MzScheme is not slow because
> it targets many platforms. MzScheme is slow because it's an
> interpreter. Whether you want to call it a straight-forward
> interpreter or an optimizing interpreter, it's still an interpreter.
It has always performed compilation to byte code, is it still an
interpreter? Recently Matthew added a lightning-based JIT that
converts the byte code to native code, is it still an interpreter? If
you're answer is still yes, then I wonder what's your definition of
"interpreter".
> Let me spell all the benchmarks I ran using MzScheme and not
> comparing it to any other implementation.
> [...]
Just for your amusement, I ran your code in four forms. Single run
each since I'm getting tired of this, these are the cpu times:
(1) (2) (3) (4)
618 637 620 632
746 728 551 478
694 676 519 446
878 814 644 561
755 756 609 519
815 798 358 100
(1) your original code, as is (all of your tests unmodified)
(2) your code, wrapped in a (module ...)
(3) same as (1), running in the svn version with the lightning jit
(4) same as (2), running in the svn version with the lightning jit
What does that mean in regards to your argument? I really don't know
since I don't know what your argument is. My only argument so far did
not change -- timing different implementations against each other is
mostly useless.
> 4. Reading nboyer.sch 100 times from a string-port using read-char:
> cpu time: 3710 real time: 3920 gc time: 130
> (this eliminates all the IO/nonblocking overhead, but same results)
(It does not -- reading from a string from a user point of view should
be like reading from a file, so string ports are much more than a
simple port+index.)
> So, MzScheme is slow because it doesn't like executing my code; it
> would rather execute code that's built-in. The "I get enough
> higher-level primitives that I think it has been years since I used
> `read-char'" does not fly because you cannot dream that one
> implementation will provide all the primitives that all its users
> will ever need. Users will always need to write code to express
> whatever problem they're trying to solve. It is sad how often and
> soon users of these ``high-level'' interpreters face reality and get
> forced down to the C level.
I guess you're not getting the clues, so, let me explain. When I said
that I didn't use read-char in years, I wasn't trying to compare
MzScheme's high-levelness to other implementation's
whatever-levelness, I wasn't trying to compare compiled code,
bytecompiled code, and interpreted code, I wasn't trying to compare C
vs Scheme, or even raw C vs C + libc. I only said getting a single
character is unuseful enough that I hadn't used it in years. I would,
obviously, use it if my implementation had only that as a way to read
files. (IIRC, the Linux kernel interface doesn't have a get-char
operation.)
> Eli Barzilay wrote:
> > I'm really not convinced about that, certainly not by the posted
> > numbers. A bad Scheme programmer is still a bad programmer and
> > will generate a crappy parser, and a good C programmer is still a
> > good C programmer and will generate a good parser. For a fair
> > comparison, I'd sit down and go over a Scheme parser, and then I'd
> > translate the code to C -- that can get to a proper answer for why
> > a Scheme faster should be faster than a C parser, or to comparable
> > C code that is natural enough to have been used in the first
> > place.
>
> Related to the previous post. I'll take it that you are a good
> Scheme programmer and a good programmer in general. Can you write a
> good (as in fast) parser in Scheme using your implementation of
> choice?
I can only assume so, but won't have time for it.
> And how would you go about doing that?
I'd probably start from the reading list that Jens posted, since I
appreciate years of work people have spent in getting fast
implementations.
> This question is to go back on-topic.
Whew, thanks.
(And if I ever would do that, then to get right back to the off-topic,
I'd really want to investigate if there's anything that makes Scheme a
faster language than C for such code. I really, really wouldn't mind
having some concrete points for Scheme in the form of faster code.)
It's standard to use finite automatas as scanners. Some write them
by hand, and others generate them using "lex" or similar.
Recursive descent is a standard method. It is simple and, as Aziz
and Ray writes, easy to extend. Whether or not it is
(theoretically speaking) the fastest method, I don't know. But is
it important? The other phases of the compiler will most likely
me considerably slower. So slow that making the parser 10% faster
hardly will be noticeable by the end user.
Perhaps the other participants in the thread have hard numbers
for their compilers?
--
Jens Axel Søgaard
> I'd really want to investigate if there's anything that makes Scheme a
> faster language than C for such code.
Please do share your findings with us.
Aziz,,,
> Parser generators often produce bottom-up parsers, which are pretty
> difficult to modify on the fly. So if run-time reader extensions are
> desired, then a custom recursive-descent parser makes more sense,
> especially since the s-expression grammar is (easiry transformed into)
> LL(1).
>
> Of course an extensible LL(1) parser could be produced by a generator,
> too, but I don't think such tools are very common.
You don't really need a tool like that; you can just have a hook that
calls your user-defined code whenever the parser spits out something
as a lexical error and use any kind of base parser you want.
Bear
As I probably said elsewhere in the tread, when I implemented the
reader for my compiler, I did not have efficiency in mind. I just
wanted to write the most straight-forward code that takes the least
time to write so that I can get to the other parts of the compiler. I
wrote it as a "throw-away" code thinking that it will eventually be
rewritten once I put the rest of the compiler in good shape. So, it
was written as a quick and dirty solution to get the compiler to read
its source. The parser and the IO library (input/output file/string
ports) were written over a weekend perhaps. Needless to say that it
ran very slow at the time. It probably took a few seconds to read its
own source, but it served the purpose.
Fast-forward one year, fix the other parts of the compiler
(generational garbage collector, source-level optimizer, better
register-allocator, ...), and the parser/IO library automagically
became so much faster that there is still no point in rewriting them.
I prefer to optimize the rest of the compiler and let the parser ride
on the benefits rather than optimize the parser (by writing it in C or
whatever) to hide the inadequacies of the compiler. Also, reflecting
on this thread, I see no evidence that there is a faster way to write
a parser than to write it in Scheme and let your compiler do its
magic.
So, to Adrian, I recommend that you just write your parser in the most
straight-forward manner that's correct and pick a decent
implementation that delivers decent performance. I don't know which
implementation to recommend since my experience with good compilers is
limited to Chez Scheme (not free, unless you have access to it from
your school, assuming you're in school) and my compiler (not available
at this point in time). Writing a straight-forward parser would also
allow you to easily move from one implementation to another if you
want to evaluate their relative performance.
Aziz,,,
Recursive descent parsing is O(n), which is the fastest
possible. There is a moderately large (and now obsolete)
number of papers that claim recursive descent (and other
forms of LL(k) parsing) is faster than LR parsing, and an
equally large (and equally obsolete) literature claiming
the opposite. It is now generally accepted that all of
the standard O(n) methods, including recursive descent,
are equally fast in theory, and that any differences
that might be observed in practice are attributable to
interactions between programming style, the compilers
used to compile the parser, and the hardware used to run
the parser.
For those who haven't understood why so much of this thread
has been about performance of compilers and interpreters:
that is why. The parsing algorithm just doesn't matter
very much, so long as it's O(n).
> But is
> it important? The other phases of the compiler will most likely
> me considerably slower. So slow that making the parser 10% faster
> hardly will be noticeable by the end user.
That is the main reason I cited hard numbers. In most
systems, the speed of parsing is limited by the speed of
character i/o.
MzScheme and Larceny are two exceptions. The fact that
MzScheme's parser is *faster* than character i/o is both
amusing and instructive. The fact that Larceny's parser
is so slow, yet still outruns a parser written in C/C++,
is also instructive.
> Perhaps the other participants in the thread have hard numbers
> for their compilers?
We can use load to approximate the compile time, because
"nboyer.sch" performs virtually no computation as it is
loaded. Average of 3 runs on a slower Solaris machine
than the one I used last time:
Chez Larceny MzScheme
Scheme fast-safe (default)
opt=2
load 4.54 89.12 11.42
read .68 3.35 3.67
read-char (file) .37 .99 4.56
read-char (string) .22 1.08 4.68
string-ref .78 .12 4.73
read/load 15 % 4 % 32 %
Given Larceny's very slow compiler, its slow parser is fast
enough. Larceny's parser would not be fast enough for Chez
Scheme, however, because Chez Scheme's compiler is extremely
fast.
In the opinion of those who implemented MzScheme, a parser
written in interpreted MzScheme would not be fast enough.
Even though MzScheme's parser is written in C/C++, it still
accounts for a third of the compile time.
Will
****************************************************************
(define (test-load)
(load "nboyer.sch"))
(define (test-read)
(let ((p (open-input-file "nboyer.sch")))
(let f ((p p))
(let ((x (read p)))
(cond
((eof-object? x) (close-input-port p))
(else (f p)))))))
(define (test-read-char)
(let ((p (open-input-file "nboyer.sch")))
(let f ((p p))
(let ((x (read-char p)))
(cond
((eof-object? x) (close-input-port p))
(else (f p)))))))
(define (file->string fname)
(let ((p (open-input-file "nboyer.sch"))
(o (open-output-string)))
(let f ((p p))
(let ((x (read-char p)))
(cond
((eof-object? x)
(close-input-port p)
(get-output-string o))
(else
(write-char x o)
(f p)))))))
(define nboyer-string (file->string "nboyer.sch"))
(define (test-read-char-from-string)
(let ((p (open-input-string nboyer-string)))
(let f ((p p))
(let ((x (read-char p)))
(cond
((eof-object? x) (close-input-port p))
(else (f p)))))))
(define (test-read-byte-from-string)
(let ((p (open-input-string nboyer-string)))
(let f ((p p))
(let ((x (read-byte p)))
(cond
((eof-object? x) (close-input-port p))
(else (f p)))))))
(define (test-string-ref)
(let f ((i 0))
(cond
((= i (string-length nboyer-string)) 'nothing)
(else
(string-ref nboyer-string i)
(f (+ 1 i))))))
(define (many f n)
(if (not (zero? n))
(begin (f)
(many f (- n 1)))))
(define (run-benchmarks n)
(define (announce fname)
(display "Benchmarking ")
(write fname)
(newline))
(many (lambda ()
(announce "load")
(time (many test-load 100))
(announce "read")
(time (many test-read 100))
(announce "read-char (file)")
(time (many test-read-char 100))
; (time (many test-read-byte 100))
(announce "read-char (string)")
(time (many test-read-char-from-string 100))
; (time (many test-read-byte-from-string 100))
(announce "string-ref")
(time (many test-string-ref 100)))
n))
The "And if I ever would do that" was an important qualifier for the
above sentence.
In Larceny, on a SPARC, when reading a UTF-8 file consisting
entirely of ASCII characters, the cost for UTF-8 support is two
machine instructions per character (a compare and conditional
branch).
In ten measurements of the cost of reading nboyer.sch 100
times, Larceny's ASCII-only version of read-char ran your
benchmark in an average of 737 milliseconds, while a UTF-8
version ran in an average of 733 milliseconds. The sample
deviations were 166 and 74 milliseconds respectively, so
the cost of Unicode support is way too small to measure
with this kind of benchmark.
(To those who don't know: On many microprocessors, such
things as the detailed alignment of basic blocks and branch
targets on various multiword boundaries can make a big
difference, so adding a couple of instructions is almost as
likely to make this kind of benchmark run faster as to make
it run slower. Worrying about this level of micro-performance
is insane.)
Thank you, BTW, for your message in which you dismissed
the cost of Unicode and other obfuscatory red herrings that
have been raised in this thread. Your advice to the original
poster was exactly right.
Will
Recently I've been writing a C program this program needed to read and
write quite complex data. I decided to do this by embedding a scheme
reader and printer into it. I haven't implemented all of scheme read
syntax, the only big things I've missed though are the obscure number
formats and quasiquotation. I did mine using a non-deterministic
finite automata lexer and a small recursive-descent parser. Both
manually written in C. I haven't measured how fast it is, but it seems
OK.
A few notes, on this, some from doing this some from other parsers:-
* As others have pointed out IO before lexing is normally the
bottleneck.
* An IO bottleneck at the OS or library level can be avoided by reading
a large block of data at once into the program, then going through it
character by character.
* As others also pointed out parsing scheme is easy, lexing scheme is
harder though because of all the different number formats. My parser
has only 5 functions.
* I did not use flex, I wrote a manual NFA, which is complicated.
Avoid it if you can, I couldn't.
* Other character sets like Unicode can cause big problems unless
handled correctly. For example, if you use a DFA to do the lexing
then don't make it use 16-bit Unicode. If it does then it's tables
will be much bigger. Instead you can normally do the lexing on byte
characters by doing some conversion. There are other possibilities.
* Often parser speed does not matter because other things are slower,
such as networks files are stored on and code that uses the data
* When parsing programs there are lots of small files involved, so
quite a bit of time is spent opening and closing files rather than
reading them. The dreaded virus checker can make this slow on
windows.
The problem used to be that LALR parsing is table driven. Recursive
descent uses stack frames to store the same data, and generally lots of
stack space. On many old architectures there was a hard limit on the
size of the stack (eg 8086).
So, if you wanted to write an LL parser you had to make in table-driven
anyway, or you would run out of memory. But theres little point
writing an LL parser that way since using an LALR generator is easier.
These days, with no limits on stacks the problem is irrelevant.
> For those who haven't understood why so much of this thread
> has been about performance of compilers and interpreters:
> that is why. The parsing algorithm just doesn't matter
> very much, so long as it's O(n).
To an extent yes, hopefully the constant factor shouldn't be that bad
though.
> > But is
> > it important? The other phases of the compiler will most likely
> > me considerably slower. So slow that making the parser 10% faster
> > hardly will be noticeable by the end user.
>
> That is the main reason I cited hard numbers. In most
> systems, the speed of parsing is limited by the speed of
> character i/o.
Part of the reason people used to look at parser speed is because IO
used to be relatively faster than it is now. In recent times memory
and CPU speeds have gone up faster than IO latency.
> William D Clinger wrote:
>>For those who haven't understood why so much of this thread
>>has been about performance of compilers and interpreters:
>>that is why. The parsing algorithm just doesn't matter
>>very much, so long as it's O(n).
>
> To an extent yes, hopefully the constant factor shouldn't be
> that bad though.
Yes. There is the constant that's part of the different parsing
algorithms. There is a more important constant that's part of the
implementation you use to implement the parser. If your implementation
hits you with a 100-fold slowdown (not MzScheme), then you will produce
a crappy parser(TM), no matter how brilliant a programmer you are or how
clever your parsing algorithm is.
Aziz,,,
At Eli's request, I used my own LexGen and ParseGen
tools to generate a portable lexical analyzer and
parser for the subset of Scheme used in nboyer.sch,
and implemented character-level input from a string
representation of nboyer.sch. This gets rid of all
the variations between implementations of read-char
that were troubling Eli.
Here's the average of three runs, in seconds, parsing
nboyer.sch 1000 (I repeat, one thousand) times from a
string on a particular Solaris machine:
parse time sample relative
deviation performance
Chez Scheme 11.3 .06 100 %
v6.1 opt=2
Larceny 19.0 .02 59 %
v0.91a
MzScheme 277.0 1.3 4 %
v301
MzScheme 154.8 .4 7 %
v301.12
MzScheme v301.12 uses a just-in-time (JIT) compiler,
which provides a worthwhile improvement over interpreted
MzScheme: a factor of almost 2 on this benchmark.
The real value of this benchmark will come when I use
it to benchmark the C and Java code that LexGen and
ParseGen generate. I'll have to write some support
code first, though, and I didn't have time for that
today.
Will
First of all, going back to the original times that Will posted,
Matthew said that the IO features that makes `read-char' slower are
support for multiple port types, arbitrary peek-ahead, non-character
values, progress events, port position counting, etc. These features
pile up branches in the code, with no common-case shortcuts. (Another
point that Matthew raised is that most of "nboyer.sch" is whitespaces
and comments, so the `read' time is dominated by `read-char' -- if you
look at the `read'-`read-char' deltas, you get the time that is
actually spent on parsing -- Chez's parser being really fast, and
Larceny's being around the neighborhood of Scheme48. To do this for
MzScheme too, the time it takes for a C char-by-char reading should be
subtracted, but I'm over my quota of spending time on this...)
As for the code that Will produced and timed: first a clarification --
Will ran the code on the new MzScheme which has a JIT, but not for
Sparc. The speedup that Will's numbers show is due to some recent
optimization like loop unrolling and inlining that happen at the
bytecode level. I ran Will's code in MzScheme v301, in v301.12 with
the JIT disabled, and in v301.12 with the JIT enabled; I repeated the
whole thing when the code is put in a module (which allows more
optimizations). Each test is the same as Will's (1000 reads of
"nboyer.sch"), on a two year old Linux with a single 2.8GHz CPU.
Repeated 5 times, and averaged the middle three results. (BTW, I
needed to slightly fix the code -- there was one undefined `nextToken'
identifier; also the behavior was different than MzScheme's in that it
didn't parse `1-' as a single symbol, so I changed "nboyer.sch" to
have `sub1' instead.)
The results are:
toplevel module
301 144.4(*) 141.1
301.12(JIT disabled) 75.3(*) 71.5
301.12(+ JIT) 29.6 21.3
The two (*) numbers correspond to Will's numbers, and the ratio is
roughly the same as in Will's table. So, extending Will's table will
give:
> parse time sample relative
> deviation performance
>
> Chez Scheme 11.3 .06 100 %
> v6.1 opt=2
> Larceny 19.0 .02 59 %
> v0.91a
> MzScheme 277.0 1.3 4 %
> v301
> MzScheme 154.8 .4 7 %
> v301.12
| MzScheme 43.8 26 %
| v301.12
| +JIT, in a module
and the speedup factor of everything that went into the new version of
MzScheme is more than 6.
(Disclaimer: this text is posted as is, without warranty of any kind.
Conclusions, if you see any, are not intended and purely accidental.)
> First of all, going back to the original times that Will posted,
> Matthew said that the IO features that makes `read-char' slower are
> support for multiple port types, arbitrary peek-ahead, non-character
> values, progress events, port position counting, etc. These features
> pile up branches in the code, with no common-case shortcuts.
For those unfamiliar with Chez Scheme, it's worth noting that it
provides all of the above-mentioned features, except arbitrary
peek-ahead probably, and at the same time remains the fastest of
the bunch.
More info about Chez's IO is at: http://www.scheme.com/csug7/io.html
> (Another
> point that Matthew raised is that most of "nboyer.sch" is whitespaces
> and comments, so the `read' time is dominated by `read-char' -- if you
> look at the `read'-`read-char' deltas, you get the time that is
> actually spent on parsing
Wouldn't that be negative, since read took less time than read-char?
(I'm looking at the original figures that Will cited in
Message-ID: <1144427079.3...@z34g2000cwc.googlegroups.com>)
Aziz,,,
> Eli Barzilay wrote:
>
> > First of all, going back to the original times that Will posted,
> > Matthew said that the IO features that makes `read-char' slower
> > are support for multiple port types, arbitrary peek-ahead,
> > non-character values, progress events, port position counting,
> > etc. These features pile up branches in the code, with no
> > common-case shortcuts.
>
> For those unfamiliar with Chez Scheme, it's worth noting that it
> provides all of the above-mentioned features, except arbitrary
> peek-ahead probably, [...] http://www.scheme.com/csug7/io.html
(I don't see any mention of port types like network ports, pipe ports,
etc, non-character values are arbitrary Scheme values that can be
embedded in ports in MzScheme (eg picture values in DrScheme), nothing
on events (MzScheme has most of CML, with synchronizable event
objects), and port-position refers to counting lines and columns which
I don't see either.)
> > (Another point that Matthew raised is that most of "nboyer.sch" is
> > whitespaces and comments, so the `read' time is dominated by
> > `read-char' -- if you look at the `read'-`read-char' deltas, you
> > get the time that is actually spent on parsing
>
> Wouldn't that be negative, since read took less time than read-char?
> (I'm looking at the original figures that Will cited in
> Message-ID: <1144427079.3...@z34g2000cwc.googlegroups.com>)
(Yes it would, just like I said in the part that you did not quote:
> To do this for MzScheme too, the time it takes for a C char-by-char
> reading should be subtracted, [...]
)
(But that was in parenthesis because I didn't want to say more on
that, and this post is all in parenthesis since I really don't want
(and did not try) to get into a pissing contest.)
> (But that was in parenthesis because I didn't want to say more on
> that, and this post is all in parenthesis since I really don't want
> (and did not try) to get into a pissing contest.)
And you're putting this in parenthesis because you don't want to say
more on it. Got it. Sorry, I wasn't aware of this convention.
(Sorry if you felt you were being dragged into a pissing contest.
Remember that participating in contests is a voluntary action.)
Aziz,,,
BTW
I can post this implementation of read if anyone is interested,
copyright is not an issue.
That's unfortunate. Chez Scheme and Larceny support
extensible port types and various other features without
slowing the common case for read-char. Most systems
programmers try to make users pay for features only if
they use them, while optimizing the common cases. I
would recommend these principles to the original poster.
> (Another
> point that Matthew raised is that most of "nboyer.sch" is whitespaces
> and comments, so the `read' time is dominated by `read-char'
Whitespace and comments are indeed predominant in
well-written code, which is why compiler textbooks stress
the importance of scanning them quickly. Obviously, you
can't write a fast lexical analyzer without fast i/o.
And now for more plain silliness...
> (BTW, I
> needed to slightly fix the code -- there was one undefined `nextToken'
> identifier; also the behavior was different than MzScheme's in that it
> didn't parse `1-' as a single symbol, so I changed "nboyer.sch" to
> have `sub1' instead.)
nboyer.sch does not contain any occurrences of 1- as a
symbol. It did at one time, but that bug was corrected on
4 April 2001.
The nextToken procedure is never called by this parsing
benchmark, but I will remove the call to it in case some
compiler complains about references to undefined variables.
> > parse time sample relative
> > deviation performance
> >
> > Chez Scheme 11.3 .06 100 %
> > v6.1 opt=2
> > Larceny 19.0 .02 59 %
> > v0.91a
> > MzScheme 277.0 1.3 4 %
> > v301
> > MzScheme 154.8 .4 7 %
> > v301.12
> | MzScheme 43.8 26 %
> | v301.12
> | +JIT, in a module
I am pleased to see that MzScheme can parse from a string
about as fast as it can read characters using read-char.
Or does the JIT compiler speed up the read-char benchmark
as well?
You asked me to write this benchmark to remove dependence
upon the i/o subsystem, and also to ensure that all systems
are running exactly the same Scheme source code. Doesn't
your use of MzScheme's module construct sort of defeat the
purpose, from your alleged point of view?
Will
Every NFA is equivalent to some DFA, and DFAs are not
only easy to implement in Scheme, they are usually a
good deal more efficient when implemented in Scheme
than when implemented in C/C++/Java/whatever, which
must simulate the tail recursion of the more natural
implementation.
Before this thread is done, I will post links to my
LexGen and ParseGen tools, which generate efficient
lexers and parsers written in C, Java, or Scheme.
> * Other character sets like Unicode can cause big problems unless
> handled correctly. For example, if you use a DFA to do the lexing
> then don't make it use 16-bit Unicode. If it does then it's tables
> will be much bigger. Instead you can normally do the lexing on byte
> characters by doing some conversion. There are other possibilities.
The DFA will run faster if implemented procedurally
than if implemented as a table interpreter. When the
DFA is implemented procedurally, using Unicode (whether
16 or 21 bits) will make no measurable difference.
> Part of the reason people used to look at parser speed is because IO
> used to be relatively faster than it is now. In recent times memory
> and CPU speeds have gone up faster than IO latency.
How do you reconcile that with Eli's claim that the
read-char procedure, as implemented in MzScheme, is
slow because of CPU branch instructions and the like?
Will
> nboyer.sch does not contain any occurrences of 1- as a symbol. It
> did at one time, but that bug was corrected on 4 April 2001.
(I must have found an old copy then. I found the problem because your
benchmark was also verifying the correct result.)
> The nextToken procedure is never called by this parsing benchmark,
> but I will remove the call to it in case some compiler complains
> about references to undefined variables.
(Using modules in MzScheme does complain about any undefined
identifier.)
> > > parse time sample relative
> > > deviation performance
> > >
> > > Chez Scheme 11.3 .06 100 %
> > > v6.1 opt=2
> > > Larceny 19.0 .02 59 %
> > > v0.91a
> > > MzScheme 277.0 1.3 4 %
> > > v301
> > > MzScheme 154.8 .4 7 %
> > > v301.12
> > | MzScheme 43.8 26 %
> > | v301.12
> > | +JIT, in a module
>
> I am pleased to see that MzScheme can parse from a string about as
> fast as it can read characters using read-char. Or does the JIT
> compiler speed up the read-char benchmark as well?
Yes, but not much. The same test configurations have these numbers:
in module
v301 10.6 9.8
v301.12 -JIT 10.7 8.6
v301.12 +JIT 6.5 5.5
> You asked me to write this benchmark to remove dependence upon the
> i/o subsystem, and also to ensure that all systems are running
> exactly the same Scheme source code. Doesn't your use of MzScheme's
> module construct sort of defeat the purpose, from your alleged point
> of view?
Using a module is just a linguistic feature that allows the language
to perform more optimizations. Looking at the Chez manual, I see
this:
Optimize level 2 is safe, like optimize level 0, but allows the
compiler to assume that primitives like car always have their
original value. [...] If you wish to get a quick boost without
wrapping the program in a top-level let or module form and inserting
(import scheme) where appropriate, however, setting the optimization
level to 2 might still be useful.
Putting code in a MzScheme module serves a similar purpose, so it is
equivalent to your use of `opt=2'.
Yes. The program is question needs to work on Windows. When I started
writing it the only lexer generation tool that outputted C and I
trusted was flex. When I last tried I couldn't get flex to work
properly on Windows.
I know that DFAs are generally more efficient, but try building one by
hand!
> Before this thread is done, I will post links to my
> LexGen and ParseGen tools, which generate efficient
> lexers and parsers written in C, Java, or Scheme.
Thanks, I've searched the web and downloaded the code I'll give LexGen
a try.
> > * Other character sets like Unicode can cause big problems unless
> > handled correctly. For example, if you use a DFA to do the lexing
> > then don't make it use 16-bit Unicode. If it does then it's tables
> > will be much bigger. Instead you can normally do the lexing on byte
> > characters by doing some conversion. There are other possibilities.
>
> The DFA will run faster if implemented procedurally
> than if implemented as a table interpreter. When the
> DFA is implemented procedurally, using Unicode (whether
> 16 or 21 bits) will make no measurable difference.
Really?
As I see it there are two ways to begin a lexing routine, either
1) Take a character and do a table lookup. From that lookup either
execute code or lookup data
2) Go through a tree of conditional statements on a character.
As I see it #2 can only be faster if the table in #1 is very big, e.g.
in the case of 16bit or 21bit Unicode.
Is there something that makes this untrue?
> > Part of the reason people used to look at parser speed is because IO
> > used to be relatively faster than it is now. In recent times memory
> > and CPU speeds have gone up faster than IO latency.
>
> How do you reconcile that with Eli's claim that the
> read-char procedure, as implemented in MzScheme, is
> slow because of CPU branch instructions and the like?
I was talking about IO latency. The actual time hard-disks take to
move from one area on their surface to another is relatively very slow.
Once buffering is taken into account reading from any individual file
is very fast. So reading one long file can be very fast. Reading many
small files, which is common in program builds, is relatively slower.
Yes. If you go through a jump table, the jump will be
indirect. With most current microprocessors, indirect
jumps are slower than direct jumps. (And if you fetch
data from a table, you will need a second case dispatch
to interpret the data.)
The problem is essentially the same that compilers
face when compiling a case expression in Scheme or a
switch statement in C/C++/Java. If the number of cases
is very small, sequential search will probably be the
fastest way to implement the case expression. If the
number of cases is larger, a binary search will probably
be faster than a sequential search. If the number of
cases is very large, and dense within some range, then
a jump table will probably be faster than binary search.
The compiler probably knows more than you do about the
fastest way to implement a case expression on the target
architecture. That's why you should use case expressions,
and let the compiler figure out the fastest way. In all
probability, the fastest technique will vary from one
state of your DFA to another.
Regardless of the technique, whether your case expressions
dispatch on ASCII or Unicode characters makes hardly any
difference. You can see this by looking at the generated
code.
Will
> Using a module is just a linguistic feature that allows the language
> to perform more optimizations. Looking at the Chez manual, I see
> this:
>
> Optimize level 2 is safe, like optimize level 0, but allows the
> compiler to assume that primitives like car always have their
> original value. [...] If you wish to get a quick boost without
> wrapping the program in a top-level let or module form and inserting
> (import scheme) where appropriate, however, setting the optimization
> level to 2 might still be useful.
>
> Putting code in a MzScheme module serves a similar purpose, so it is
> equivalent to your use of `opt=2'.
Please Eli, wrapping things in a module does more than that. It allows
the compiler to perform interprocedural optimizations (like copy-
propagation, constant-folding, inlining, direct-call optimization,
specializing calling-conventions, redundant-checks elimination, and
much much more). I would not be surprized if Chez gained another order
of magnitude in speed by wrapping the read-char benchmark in a module
(or even a let) even at optimize-level 2. So, optimize-level 2 and
wrapping things in a module are definately NOT the same thing. Does
MzScheme have the equivalent of the ``--prim'' in mzc or optimize-level
2 in Chez? If there is one that delivers better performance without
*requiring* using the module system[*], well that would be useful.
Aziz,,,
[*] I don't have problem with module systems, it's just that I don't
like to have to remember the exact syntax of every implementation's
idea of what modules look like. The two implementations I use regularly
have the exact same idea of what modules are, so it's not hard for me to
context-switch between the two. Not to mention that I believe
compilers should work for the users and not impact their coding style
too much. Most Scheme users I know (your circle may be different) would
never touch a module system; they prefer code that's clean and simple
and left-flushed to the file. They do, however, add (optimize-level 2)
to the top of their files to get some 4x or 8x performance boost.
> I know that DFAs are generally more efficient, but try building one by
> hand!
You just keep where you are in the automata as part of the name of the
procedure representing that state. It's actually pretty
straight-forward. Here are the names of states I have in my parser:
(note: very simple integers only)
Semi ; for comments
Str ; for strings
Sym ; for symbols
SymB ; for |symbols|
Sign ; for +, -
Signed ; for a number after a sign.
Num ; for numbers with no sign
Dot ; for ... and the like
Unq ; for `, `@
Hash ; for things starting with #
Hash2 ; #2%prims
Hash3 ; #3%prims
HashBar ; #| commenets |#
HashBarBar ; #| comments | skipping over | bars |#
HashBang ; #!
HashBangE ; #!e
HashBandEO ; #!eo going for #!eof
HashS ; #\c
HashB ; #b binary numbers
HashBS ; #b+ and #b-
HashBSI ; #b10101011
HashX ; same for hexadecimal numbers
HashXS
HashXSI
Aziz,,,
> Eli Barzilay wrote:
>
> > Using a module is just a linguistic feature that allows the
> > language to perform more optimizations. Looking at the Chez
> > manual, I see this:
> > Optimize level 2 is safe, like optimize level 0, but allows the
> > compiler to assume that primitives like car always have their
> > original value. [...] If you wish to get a quick boost without
> > wrapping the program in a top-level let or module form and
> > inserting (import scheme) where appropriate, however, setting
> > the optimization level to 2 might still be useful.
> > Putting code in a MzScheme module serves a similar purpose, so it
> > is equivalent to your use of `opt=2'.
>
> Please Eli, wrapping things in a module does more than that. It
> allows the compiler to perform interprocedural optimizations (like
> copy- propagation, constant-folding, inlining, direct-call
> optimization, specializing calling-conventions, redundant-checks
> elimination, and much much more). I would not be surprized if Chez
> gained another order of magnitude in speed by wrapping the read-char
> benchmark in a module (or even a let) even at optimize-level 2.
I read the above paragraph as saying that opt=2 is not needed if
you're using a module. In fact, the paragraph on opt=2 begins with:
Another useful optimization level is 2, but its use is no longer
recommended. [...]
In any case, running the same code in Chez with opt=2 and with a
module (and with both) should be easy, so there's no point arguing on
what it does. Personally, I'm very impressed with what Chez does; if
I see the numbers of running the same code in a module in Chez, and
they indeed are an order of magnitude faster, I will still be
impressed.
Like I said, I never intended to get into a pissing contest. On the
other hand, you were the one who said:
| MzScheme is slow because it's an interpreter. Whether you want to
| call it a straight-forward interpreter or an optimizing interpreter,
| it's still an interpreter. [...]
> So, optimize-level 2 and wrapping things in a module are definately
> NOT the same thing.
If Chez + code-in-module is faster than opt=2, then I think that
modules should be used.
> Does MzScheme have the equivalent of the ``--prim'' in mzc or
> optimize-level 2 in Chez? If there is one that delivers better
> performance without *requiring* using the module system[*], well
> that would be useful.
Yes, there is a MzScheme equivalence for `--prim' -- you put the code
in a module. This gives you the equivalent power of `--prim', in a
much better way -- instead of a command-line argument that says "trust
me, I didn't play any funny games", you have a language feature that
makes it impossible to play such funny games. If you don't want to
put code in a module, then there is no equivalence for --prim, and
personally, I don't even want --prim since it always is a hack. I'd
just use CL instead of --prim, since they just forbid the kind of
changes that violate it.
> [*] I don't have problem with module systems, it's just that I don't
> like to have to remember the exact syntax of every implementation's
> idea of what modules look like. The two implementations I use
> regularly have the exact same idea of what modules are, so it's not
> hard for me to context-switch between the two. Not to mention that
> I believe compilers should work for the users and not impact their
> coding style too much. Most Scheme users I know (your circle may be
> different) would never touch a module system; they prefer code
> that's clean and simple and left-flushed to the file. They do,
> however, add (optimize-level 2) to the top of their files to get
> some 4x or 8x performance boost.
My circles are very different, yes. Indentation is just a cosmetic
issue (with a solution, if you really want one), but I don't see how
the compiler is working for me when I can use a module system to say
"trust me" in a formal way[*]. More than that -- I have many examples
of very useful code that would be completely impossible without a
module system, if all I had was that "clean and simple" toplevel.
[*] That is, this kind of "working for me" is something I don't want
it to do. It's the same kind of working-for-me as other flags like
"assume that x is an integer", or "assume that all arithmetics is
fixnum".
Well you can do that and it will work fine for a deterministic finite
automata, but what do you do when the finite automata becomes
non-deterministic?
For example imagine the lexer has read:
456
And is at the 6. The next character read is ".", at this stage the
number may be an integer or floating point, or symbol. After that if
another number is read then we have a floating point number, if a
delimiter is read then its a whole number, if it's a character that is
part of scheme number syntax is read then it's still a number of some
kind. If its a character that is not part of the syntax then it's a
symbol!
The simplest way to deal with these situations is to use an NFA, make
an assumption about what is being read, then if it's false fail to some
starts state and re-read. It isn't the fastest way to do it of-course.
For example given an DFA the state the lexer goes into after reading
the dot would be best described as:
"Main part of number has finished, dot read, might be number or symbol
(or error)"
Not the best name for a function or variable, and not easy to shorten
into anything meaningful. This is one reason why DFAs are generally
constructed by programs.
Sure. It depends on the amount of data to look through. If using an
indirect jump costs 10 cycles, then if you can fit a set of directly
branching tests into that time then they will be faster. For say 255
possibilities I suppose its possible that a direct branching test may
be as fast as an indirect one, particularly if it filters off the most
common possiblities first.
> The problem is essentially the same that compilers
> face when compiling a case expression in Scheme or a
> switch statement in C/C++/Java. If the number of cases
> is very small, sequential search will probably be the
> fastest way to implement the case expression. If the
> number of cases is larger, a binary search will probably
> be faster than a sequential search. If the number of
> cases is very large, and dense within some range, then
> a jump table will probably be faster than binary search.
>
> The compiler probably knows more than you do about the
> fastest way to implement a case expression on the target
> architecture. That's why you should use case expressions,
> and let the compiler figure out the fastest way.
Maybe. I think it depends on the compiler. I remember reading a
discussion on the GCC mailing list where someone said that no-one uses
case statements with more than 400 cases, and that the compiler could
more or less assume that any code that did would be broken and not
attempt to generate good code for it.
> In all
> probability, the fastest technique will vary from one
> state of your DFA to another.
Yep.
> Regardless of the technique, whether your case expressions
> dispatch on ASCII or Unicode characters makes hardly any
> difference. You can see this by looking at the generated
> code.
But if you use ASCII then there are only 255 possibilites, so its easy
to use an indirect branch table on the basis of a byte. If it can be
done faster with a linear or binary search then that's good.
If 16-bit unicode is used though then a jump table would be huge.
Indirect branching wouldn't be that practical, so some combination of
binary and linear searching would be needed. I can't see that this
would be fast, certainly not as fast as a table or a small number of
direct branches.
You could of-course forget about some large set of Unicode characters
and more or less treat it as ASCII.
I think you may be forgetting that the number of possible
characters on which you are dispatching is not what
matters. What matters is the number of case clauses,
and the number of subranges that select those clauses.
In a lexical analyzer for Scheme, for example, the most complex
state has about 15 possible actions. The number of subranges
is completely independent of the size of the character set,
so the size of the character set has zero influence on the
cost of the dispatch.
> Maybe. I think it depends on the compiler. I remember reading a
> discussion on the GCC mailing list where someone said that no-one uses
> case statements with more than 400 cases, and that the compiler could
> more or less assume that any code that did would be broken and not
> attempt to generate good code for it.
That sounds like something you would hear on a gcc
mailing list. (In fairness, the readers of that list would
have just as good a laugh at some of the things said
on this list.)
> If 16-bit unicode is used though then a jump table would be huge.
> Indirect branching wouldn't be that practical, so some combination of
> binary and linear searching would be needed. I can't see that this
> would be fast, certainly not as fast as a table or a small number of
> direct branches.
Perhaps an example would help. Consider the most complex
state of the lexical analyzer DFA for Scheme:
(define (state0 c)
(case c
((#\;) (consumechar) (state29 (scanchar)))
((#\+) (consumechar) (state28 (scanchar)))
((#\-) (consumechar) (state28 (scanchar)))
((#\.) (consumechar) (state16 (scanchar)))
((#\!) (consumechar) (state14 (scanchar)))
((#\$) (consumechar) (state14 (scanchar)))
((#\%) (consumechar) (state14 (scanchar)))
((#\&) (consumechar) (state14 (scanchar)))
((#\*) (consumechar) (state14 (scanchar)))
((#\/) (consumechar) (state14 (scanchar)))
((#\:) (consumechar) (state14 (scanchar)))
((#\<) (consumechar) (state14 (scanchar)))
((#\=) (consumechar) (state14 (scanchar)))
((#\>) (consumechar) (state14 (scanchar)))
((#\?) (consumechar) (state14 (scanchar)))
((#\^) (consumechar) (state14 (scanchar)))
((#\_) (consumechar) (state14 (scanchar)))
((#\~) (consumechar) (state14 (scanchar)))
((#\#) (consumechar) (state13 (scanchar)))
((#\") (consumechar) (state2 (scanchar)))
((#\() (consumechar) (accept 'lparen))
((#\)) (consumechar) (accept 'rparen))
((#\') (consumechar) (accept 'quote))
((#\`) (consumechar) (accept 'backquote))
((#\,) (consumechar) (state1 (scanchar)))
((#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)
(begin (consumechar) (state27 (scanchar))))
((#\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l #\m
#\n #\o #\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z
#\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M
#\N #\O #\P #\Q #\R #\S #\T #\U #\V #\W #\X #\Y #\Z)
(begin (consumechar) (state14 (scanchar))))
((#\space #\newline)
(begin (consumechar) (state30 (scanchar))))
(else
(scannererror))))
After going through the first two passes of Twobit, this
turns into:
(begin
(set! state0
(lambda (.c|1)
(let* ((.temp|2|5 .c|1)
(.n|144
(if (.char? .temp|2|5)
(let ((.n|145 (.char->integer:chr .temp|2|5)))
(if (.<:fix:fix .n|145 10)
0
(if (.<:fix:fix .n|145 128)
(.vector-ref:trusted
'#(28
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
28
5
20
19
6
7
8
23
21
22
9
2
25
3
4
10
26
26
26
26
26
26
26
26
26
26
11
1
12
13
14
15
0
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
0
0
0
16
17
24
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
27
0
0
0
18
0)
(.-:idx:idx .n|145 10))
0)))
0)))
(if (.<:fix:fix .n|144 14)
(if (.<:fix:fix .n|144 7)
(if (.<:fix:fix .n|144 3)
(if (.<:fix:fix .n|144 1)
(scannererror)
(if (.<:fix:fix .n|144 2)
(begin (consumechar) (state29 (scanchar)))
(begin (consumechar) (state28
(scanchar)))))
(if (.<:fix:fix .n|144 5)
(if (.<:fix:fix .n|144 4)
(begin (consumechar) (state28 (scanchar)))
(begin (consumechar) (state16 (scanchar))))
(if (.<:fix:fix .n|144 6)
(begin (consumechar) (state14 (scanchar)))
(begin (consumechar) (state14
(scanchar))))))
(if (.<:fix:fix .n|144 10)
(if (.<:fix:fix .n|144 8)
(begin (consumechar) (state14 (scanchar)))
(if (.<:fix:fix .n|144 9)
(begin (consumechar) (state14 (scanchar)))
(begin (consumechar) (state14
(scanchar)))))
(if (.<:fix:fix .n|144 12)
(if (.<:fix:fix .n|144 11)
(begin (consumechar) (state14 (scanchar)))
(begin (consumechar) (state14 (scanchar))))
(if (.<:fix:fix .n|144 13)
(begin (consumechar) (state14 (scanchar)))
(begin (consumechar) (state14
(scanchar)))))))
(if (.<:fix:fix .n|144 21)
(if (.<:fix:fix .n|144 17)
(if (.<:fix:fix .n|144 15)
(begin (consumechar) (state14 (scanchar)))
(if (.<:fix:fix .n|144 16)
(begin (consumechar) (state14 (scanchar)))
(begin (consumechar) (state14
(scanchar)))))
(if (.<:fix:fix .n|144 19)
(if (.<:fix:fix .n|144 18)
(begin (consumechar) (state14 (scanchar)))
(begin (consumechar) (state14 (scanchar))))
(if (.<:fix:fix .n|144 20)
(begin (consumechar) (state13 (scanchar)))
(begin (consumechar) (state2
(scanchar))))))
(if (.<:fix:fix .n|144 25)
(if (.<:fix:fix .n|144 23)
(if (.<:fix:fix .n|144 22)
(begin (consumechar) (accept 'lparen))
(begin (consumechar) (accept 'rparen)))
(if (.<:fix:fix .n|144 24)
(begin (consumechar) (accept 'quote))
(begin (consumechar) (accept 'backquote))))
(if (.<:fix:fix .n|144 27)
(if (.<:fix:fix .n|144 26)
(begin (consumechar) (state1 (scanchar)))
(begin (consumechar) (state27 (scanchar))))
(if (.<:fix:fix .n|144 28)
(begin (consumechar) (state14 (scanchar)))
(begin (consumechar) (state30
(scanchar)))))))))))
'state0)
This may look awful, but it runs pretty quick. There
are at most 7 fixnum comparisons in the dispatch.
You may think that's because there are 2^7 = 128
ASCII characters, but it's just a coincidence. The
dispatch code shown above works fine for 21-bit
Unicode.
The compiler that generated the code above is
smart enough to recognize dense subranges and
to treat them differently from sparse ranges---all
within the same case expression. That's why this
approach will continue to work well if non-ASCII
characters become legal, and are categorized
according to their Unicode properties.
Will
And that's because there is a performance bug in my
compiler that I hadn't recognized until now. Had the
compiler merged case clauses whose expressions
are the same, the generated code would have had
at most 6 fixnum comparisons in the dispatch. See
below.
Will
--------
(begin
(set! state0
(lambda (.c|1)
(let* ((.temp|2|5 .c|1)
(.n|69 (if (.char? .temp|2|5)
(let ((.n|70 (.char->integer:chr .temp|2|5)))
(if (.<:fix:fix .n|70 10)
0
(if (.<:fix:fix .n|70 128)
(.vector-ref:trusted
'#(13
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
13
12
5
4
12
12
12
8
6
7
12
2
10
2
3
12
11
11
11
11
11
11
11
11
11
11
12
1
12
12
12
12
0
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
0
0
0
12
12
9
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
12
0
0
0
12
0)
(.-:idx:idx .n|70 10))
0)))
0)))
(if (.<:fix:fix .n|69 7)
(if (.<:fix:fix .n|69 3)
(if (.<:fix:fix .n|69 1)
(scannererror)
(if (.<:fix:fix .n|69 2)
(begin (consumechar) (state29 (scanchar)))
(begin (consumechar) (state28 (scanchar)))))
(if (.<:fix:fix .n|69 5)
(if (.<:fix:fix .n|69 4)
(begin (consumechar) (state16 (scanchar)))
(begin (consumechar) (state13 (scanchar))))
(if (.<:fix:fix .n|69 6)
(begin (consumechar) (state2 (scanchar)))
(begin (consumechar) (accept 'lparen)))))
(if (.<:fix:fix .n|69 10)
(if (.<:fix:fix .n|69 8)
(begin (consumechar) (accept 'rparen))
(if (.<:fix:fix .n|69 9)
(begin (consumechar) (accept 'quote))
(begin (consumechar) (accept 'backquote))))
(if (.<:fix:fix .n|69 12)
(if (.<:fix:fix .n|69 11)
(begin (consumechar) (state1 (scanchar)))
(begin (consumechar) (state27 (scanchar))))
(if (.<:fix:fix .n|69 13)
(begin (consumechar) (state14 (scanchar)))
(begin (consumechar) (state30
(scanchar))))))))))
'state0)
>>Perhaps the other participants in the thread have hard numbers
>>for their compilers?
>
> We can use load to approximate the compile time, because
> "nboyer.sch" performs virtually no computation as it is
> loaded. Average of 3 runs on a slower Solaris machine
> than the one I used last time:
>
> Chez Larceny MzScheme
> Scheme fast-safe (default)
> opt=2
>
> load 4.54 89.12 11.42
> read .68 3.35 3.67
> read-char (file) .37 .99 4.56
> read-char (string) .22 1.08 4.68
> string-ref .78 .12 4.73
>
> read/load 15 % 4 % 32 %
>
>
> Given Larceny's very slow compiler, its slow parser is fast
> enough. Larceny's parser would not be fast enough for Chez
> Scheme, however, because Chez Scheme's compiler is extremely
> fast.
>
> In the opinion of those who implemented MzScheme, a parser
> written in interpreted MzScheme would not be fast enough.
> Even though MzScheme's parser is written in C/C++, it still
> accounts for a third of the compile time.
Thanks for providing numbers. I must admit I am surprised
to hear the number 32%. How to bring this down? Hmm. I think
I need to write some extra passes to bring the "load" time
up :-)
Wouldn't it be marginally better to measure the parsing time
by adding the time for reading and the time for expansion than
using the load time?
--
Jens Axel Søgaard
You caught me cooking the books. I am so ashamed.
If you turn off almost all compiler optimizations,
so that Larceny's compiler more closely approximates
that of MzScheme, then Larceny's load is still four or
five times as slow as MzScheme's. If you turn the
compiler off altogether and use Larceny's interpreter,
making the comparison even more apt, then Larceny's
load time will be about twice as fast as MzScheme's.
Since Larceny's interpreter is written in Scheme,
and MzScheme's is written in C/C++, we see once
again that Scheme runs faster than C/C++. (In
the interest of science, I should point out that
neither interpreter is a pure interpreter. A pure
interpreter would load even faster.)
Before anyone accuses me of selective leaking,
let me say that I just want the American people
to hear the truth. What's more, if anyone in the
Larceny administration is found to have leaked
classified intelligence to the running-dog so-called
media, I will decline comment until all legal appeals
have been exhausted and I have pardoned my good
friend and loyal American, uh...
> Wouldn't it be marginally better to measure the parsing time
> by adding the time for reading and the time for expansion than
> using the load time?
I don't think I understand the question, probably
because I'm from Texas. If you're suggesting I
measure either the macro-expansion time or compile
time separately from the load time, and then add
one of those times to the read time, then you're
suggesting something that can't be done portably.
Adding the read time to the compile time wouldn't
be measurably different from the load time anyway.
In Larceny, adding the read time to the macro
expansion time would yield a number only slightly
less than the load time using the interpreter,
which uses exactly the same macro expander as the
compiler, and performs only one very quick pass
over the macro-expanded code.
Will
> Wouldn't it be marginally better to measure the parsing time
> by adding the time for reading and the time for expansion than
> using the load time?
We can benchmark the expansion time by itself using MzScheme and
Petite-Chez by putting the following code in a file "time-expand.ss":
===CUT
(define nboyer-expr
(let ([p (open-input-file "nboyer.sch")])
(cons 'begin
(let f ()
(let ([x (read p)])
(cond
[(eof-object? x) (close-input-port p) '()]
[else (cons x (f))]))))))
(define (many f n)
(unless (zero? n)
(f)
(many f (- n 1))))
(define (test-expand)
(expand nboyer-expr))
(time (many test-expand 1000))
===CUT
On my laptop (iBook, 800MHz) I get:
Welcome to MzScheme version 301, Copyright (c) 2004-2005 PLT Scheme Inc.
> (load "time-expand.ss")
cpu time: 53240 real time: 63707 gc time: 12340
Petite Chez Scheme Version 6.9c
Copyright (c) 1985-2003 Cadence Research Systems
> (load "time-expand.ss")
(time (many test-expand ...))
133 collections
3580 ms elapsed cpu time, including 120 ms collecting
4361 ms elapsed real time, including 116 ms collecting
143830656 bytes allocated, including 143738496 bytes reclaimed
To plagiarize Will Clinger:
Since Petite's expander is written in Scheme,
and MzScheme's is written in C/C++, we see once
again that Scheme runs faster than C/C++.
Aziz,,,
[Sorry Eli for more pissing-contest related posts.]
Well, I asked for the numbers and you gave exactly
the numbers I asked for. But when I tried to make sense
of them I realized the percentage numbers didn't say
that much after all.
> If you turn off almost all compiler optimizations,
> so that Larceny's compiler more closely approximates
> that of MzScheme, then Larceny's load is still four or
> five times as slow as MzScheme's. If you turn the
> compiler off altogether and use Larceny's interpreter,
> making the comparison even more apt, then Larceny's
> load time will be about twice as fast as MzScheme's.
That makes the numbers more comparable indeed.
> Since Larceny's interpreter is written in Scheme,
> and MzScheme's is written in C/C++, we see once
> again that Scheme runs faster than C/C++. (In
> the interest of science, I should point out that
> neither interpreter is a pure interpreter. A pure
> interpreter would load even faster.)
>
> Before anyone accuses me of selective leaking,
> let me say that I just want the American people
> to hear the truth. What's more, if anyone in the
> Larceny administration is found to have leaked
> classified intelligence to the running-dog so-called
> media, I will decline comment until all legal appeals
> have been exhausted and I have pardoned my good
> friend and loyal American, uh...
:-)
>>Wouldn't it be marginally better to measure the parsing time
>>by adding the time for reading and the time for expansion than
>>using the load time?
>
> I don't think I understand the question, probably
> because I'm from Texas. If you're suggesting I
> measure either the macro-expansion time or compile
> time separately from the load time, and then add
> one of those times to the read time, then you're
> suggesting something that can't be done portably.
That's what I had in mind. And even though it isn't portable,
it is not uncommon to see expand and friends:
<http://docs.plt-scheme.org/mzscheme/mzscheme-Z-H-12.html#node_sec_12.6.1>
I wrote "marginally better" because some expands might do more work
than strictly necessary to help the later passes.
--
Jens Axel Søgaard
Well, I asked for the numbers and you gave exactly
the numbers I asked for. But when I tried to make sense
of them I realized the percentage numbers didn't say
that much after all.
> If you turn off almost all compiler optimizations,
> so that Larceny's compiler more closely approximates
> that of MzScheme, then Larceny's load is still four or
> five times as slow as MzScheme's. If you turn the
> compiler off altogether and use Larceny's interpreter,
> making the comparison even more apt, then Larceny's
> load time will be about twice as fast as MzScheme's.
That makes the numbers more comparable indeed.
> Since Larceny's interpreter is written in Scheme,
> and MzScheme's is written in C/C++, we see once
> again that Scheme runs faster than C/C++. (In
> the interest of science, I should point out that
> neither interpreter is a pure interpreter. A pure
> interpreter would load even faster.)
>
> Before anyone accuses me of selective leaking,
> let me say that I just want the American people
> to hear the truth. What's more, if anyone in the
> Larceny administration is found to have leaked
> classified intelligence to the running-dog so-called
> media, I will decline comment until all legal appeals
> have been exhausted and I have pardoned my good
> friend and loyal American, uh...
:-)
>>Wouldn't it be marginally better to measure the parsing time
>>by adding the time for reading and the time for expansion than
>>using the load time?
>
> I don't think I understand the question, probably
> because I'm from Texas. If you're suggesting I
> measure either the macro-expansion time or compile
> time separately from the load time, and then add
> one of those times to the read time, then you're
> suggesting something that can't be done portably.
That's what I had in mind. And even though it isn't portable,
On my system the pair read-syntax and expand-syntax are
10% faster.
(define nboyer-expr
(let ([p (open-input-file "nboyer.sch")])
(cons 'begin
(let f ()
(let ([x (read-syntax "nboyer.sch" p)])
(cond
[(eof-object? x) (close-input-port p) '()]
[else (cons x (f))]))))))
(define (many f n)
(unless (zero? n)
(f)
(many f (- n 1))))
(define (test-expand-syntax)
(expand nboyer-expr))
(collect-garbage)
(time (many test-expand-syntax 1000))
Still does these numbers make more or less sense than
the read/load ratio?
--
Jens Axel Søgaard
> Eli Barzilay wrote:
>
> > Using a module is just a linguistic feature that allows the language
> > to perform more optimizations. Looking at the Chez manual, I see
> > this:
> > Optimize level 2 is safe, like optimize level 0, but allows the
> > compiler to assume that primitives like car always have their
> > original value. [...] If you wish to get a quick boost without
> > wrapping the program in a top-level let or module form and inserting
> > (import scheme) where appropriate, however, setting the optimization
> > level to 2 might still be useful.
> > Putting code in a MzScheme module serves a similar purpose, so it is
> > equivalent to your use of `opt=2'.
>
> Please Eli, wrapping things in a module does more than that. It
> allows the compiler to perform interprocedural optimizations [...]
> Does MzScheme have the equivalent of the ``--prim'' in mzc or
> optimize-level 2 in Chez? If there is one that delivers better
> performance without *requiring* using the module system[*], well
> that would be useful.
Two more notes that I forgot when I posted yesterday, that are
relevant for people who suffer from modulophobia[*]:
1. At the MzScheme toplevel, (require mzscheme) changes things so that
the mzscheme bindings are read-only, so it buys some additional
performace even without putting the code in a module. (Matthew
clarified that mzc's --prim does just that.)
2. The code that Will generated is all in a single function definition
anyway, which also allows optimizations that a module wrapper does.
(I forgot that the code itself was not posted.)
Running the benchmark again shows that prefixing the code with a
(require mzscheme) results in exactly the same runtimes.
[*] While this post shows that you get the same performance in
MzScheme without putting your code in a module, it should not be taken
as medical advise. Please consult your doctor if you suffer from
modulophobia, or if you observe related symptoms. (Symptoms include
allergic reaction to some forms of indentation, swelling of the code,
premature optimization, and code-rot in some extreme cases.)
> 1. At the MzScheme toplevel, (require mzscheme) changes things so that
> the mzscheme bindings are read-only, so it buys some additional
> performace even without putting the code in a module. (Matthew
> clarified that mzc's --prim does just that.)
Thanks to Matthew for showing how to get the effect of (import scheme)
or (optimize-level 2) in MzScheme without using modules. Also thanks to
Matthew for clarifying what --prim does (it makes sense now). It seems
like Matthew's ideas of what should or should not be in the language
make far more sense than your ideals from yesterday's post:
Yes, there is a MzScheme equivalence for `--prim' -- you put the code
in a module. This gives you the equivalent power of `--prim', in a
much better way -- instead of a command-line argument that says "trust
me, I didn't play any funny games", you have a language feature that
makes it impossible to play such funny games. If you don't want to
put code in a module, then there is no equivalence for --prim, and
personally, I don't even want --prim since it always is a hack. I'd
just use CL instead of --prim, since they just forbid the kind of
changes that violate it.
Aziz,,,
> Thanks to Matthew for showing how to get the effect of (import scheme)
> or (optimize-level 2) in MzScheme without using modules. Also thanks to
> Matthew for clarifying what --prim does (it makes sense now). It seems
> like Matthew's ideas of what should or should not be in the language
> make far more sense than your ideals from yesterday's post:
>
> Yes, there is a MzScheme equivalence for `--prim' -- you put the code
> in a module. This gives you the equivalent power of `--prim',
This equivalence is still true.
> in a much better way -- instead of a command-line argument that
> says "trust me, I didn't play any funny games", you have a
> language feature that makes it impossible to play such funny
> games. If you don't want to put code in a module, then there is
> no equivalence for --prim, and personally, I don't even want
> --prim since it always is a hack. I'd just use CL instead of
> --prim, since they just forbid the kind of changes that violate
> it.
And here I don't talk about optimization at all, but about the
advantages of using a module system, and that still holds. (I can't
speak for Matthew and say what he thinks about a module system, but
he's the one who developed it (the composable & compilable thing), so
it's safe to assume that he agrees with these advantages and prefers
moduled code too.)
> [...]
>
> Welcome to MzScheme version 301, Copyright (c) 2004-2005 PLT Scheme Inc.
> > (load "time-expand.ss")
> cpu time: 53240 real time: 63707 gc time: 12340
>
> Petite Chez Scheme Version 6.9c
> Copyright (c) 1985-2003 Cadence Research Systems
>
> > (load "time-expand.ss")
> (time (many test-expand ...))
> 133 collections
> 3580 ms elapsed cpu time, including 120 ms collecting
> 4361 ms elapsed real time, including 116 ms collecting
> 143830656 bytes allocated, including 143738496 bytes reclaimed
>
> To plagiarize Will Clinger:
>
> Since Petite's expander is written in Scheme,
> and MzScheme's is written in C/C++, we see once
> again that Scheme runs faster than C/C++.
The syntax system differences are an order or magnitude bigger than
the I/O system differences.
> [Sorry Eli for more pissing-contest related posts.]
That's OK. And something to post more about -- exapnding code in
MzScheme is slower than compiling it. (Yes, expansion is part of
compilation, but during compilation some stuff that `expand' does is
not needed.)
> The syntax system differences are an order or magnitude bigger than
> the I/O system differences.
Probably you need to consult with the *experts* on the effects these
differences have on performance.
Aziz,,,
That's interesting. The C/C++ code for MzScheme's syntax
system is based on Kent Dybvig's portable implementation
of syntax-case in Scheme, which is derived from the code
used in Chez Scheme.
> That's OK. And something to post more about -- exapnding code in
> MzScheme is slower than compiling it.
But not by a lot.
> (Yes, expansion is part of
> compilation, but during compilation some stuff that `expand' does is
> not needed.)
That's true in Larceny, too. Again, the difference is small.
Will
> Eli Barzilay wrote:
> > > Since Petite's expander is written in Scheme,
> > > and MzScheme's is written in C/C++, we see once
> > > again that Scheme runs faster than C/C++.
> >
> > The syntax system differences are an order or magnitude bigger
> > than the I/O system differences.
>
> That's interesting. The C/C++ code for MzScheme's syntax
> system is based on Kent Dybvig's portable implementation
> of syntax-case in Scheme, which is derived from the code
> used in Chez Scheme.
It keeps information like source location (pos, line, col, span) which
take much more space and more time to manage, and it checks a code
inspector and code certificates that are used to forbid disassembling
a macro result and using it in a different context.
Now that you've explained it to me I understand the problem much
better.
In the case of Scheme you're certainly right, few cases are needed, it
seems only a few more than are needed to implement an NFA with
conditional statements. This makes a DFA much more attractive.
Looking at the source code for MzScheme (schpriv.h),
I see that this information might double the space used
to represent a syntax object (Scheme_Stx).
If a doubling of the space required to represent syntax
objects truly accounts for a significant portion of the
10:1 ratio in macro expansion time between MzScheme
and Chez Scheme that Aziz observed on the PowerPC and
I verified on the Sparc, then it highlights one of the
performance advantages of writing your parser, macro
expander, and compiler in Scheme:
Allocation, initialization, and deallocation of heap
storage is usually faster in Scheme than in C, C++,
or Java.
Will
> Allocation, initialization, and deallocation of heap
> storage is usually faster in Scheme than in C, C++,
> or Java.
Really? We should be careful about generalizations:
>
> Will
>
Welcome to MzScheme version 301, Copyright (c) 2004-2005 PLT Scheme Inc.
> (load "~/prelude.scm")
> (time (do-n 10000000 (cons 1 2)))
cpu time: 3619 real time: 3629 gc time: 1096
(1 . 2)
----
public class Cons {
static class Pair {
Object car, cdr;
public Pair(Object car, Object cdr) {
this.car=car;
this.cdr=cdr;
}
}
public static void main(String[] args) {
long start=System.currentTimeMillis();
for (int i=0; i<10000000; i++) {
Pair p=new Pair(new Integer(1), new Integer(2));
}
System.err.println(System.currentTimeMillis()-start);
}
}
----
scgmille@ferrite:/tmp$ java Cons
521
Scott G. Miller wrote:
> > Allocation, initialization, and deallocation of heap
> > storage is usually faster in Scheme than in C, C++,
> > or Java.
>
> Really? We should be careful about generalizations:
>
> Welcome to MzScheme version 301, Copyright (c) 2004-2005 PLT Scheme Inc.
> > (load "~/prelude.scm")
> > (time (do-n 10000000 (cons 1 2)))
> cpu time: 3619 real time: 3629 gc time: 1096
> (1 . 2)
>
> scgmille@ferrite:/tmp$ java Cons
> 521
I suspect you were comparing interpreted Scheme with
(JIT-) compiled Java, a practice I decried earlier in this
thread. Comparing apples to apples, and oranges to
oranges, we get the following timings (in seconds) as
the average of three runs on a Solaris machine:
Compiled:
Chez Scheme v6.1 (opt=2) .6
Larceny v0.91a (fast-safe) .6
java Cons (Hotspot) 1.5
Interpreted systems:
Larceny v0.91a interpreter 2.2
MzScheme v301 (w/module) 11.9
java -Xint Cons (Hotspot) 17.5
Since Larceny's interpreter is written in Scheme, while
the MzScheme and Hotspot interpreters are written in
C/C++, these numbers suggest yet another interesting
hypothesis: the fastest way to interpret Scheme is to
write your interpreter in Scheme.
Before we conclude that, however, I'd like to point out
that Scott's benchmark is not very realistic. It would be
slightly better to compare the performance of these systems
on the GCBench synthetic benchmark, which was originally
written in Java and is perhaps the most widely used storage
management benchmark within the Java community. It has
been translated into C, C++, and Scheme, and the code for
these should not be hard to find. Enjoy.
Will
Into this:
new Pair(Integer.valueOf(1), Integer.valueOf(2));
I get a 60% boost, just from this. And using "static final" constants
instead, gives another 15% boost over it. Using plain ints... well that
wouldn't be anything near fair to Scheme.
Why be fair to Scheme?
Compiled:
java ConsInt (HotSpot) .5
Chez Scheme v6.1 (opt=2) .6
Larceny v0.91a (fast-safe) .6
java Cons (HotSpot) 1.5
Interpreted:
Larceny v0.91a interpreter 2.2
java -Xint ConsInt (HotSpot) 6.2
MzScheme v301 (w/module) 11.9
java -Xint Cons (Hotspot) 17.5
Will
--------
public class ConsInt {
static class Pair {
int car, cdr;
public Pair(int car, int cdr) {
this.car=car;
this.cdr=cdr;
}
}
public static void main(String[] args) {
long start=System.currentTimeMillis();
for (int i=0; i<10000000; i++) {
Pair p=new Pair(1, 2);
}
System.err.println(System.currentTimeMillis()-start);
}
}
--------
(define (cons-benchmark . rest)
(let ((n (if (null? rest) 10000000 (car rest))))
(run-benchmark (string-append "cons:" (number->string n))
n
(lambda () (cons 1 2))
(lambda (result) (equal? result '(1 . 2))))))
But why would that be? In the case of C++, I can think of a few
possibilities:
- Frequent, small allocation and deallocation requests to the
underlying OS heap API
- Calls to object destructors for each small deallocation
These issues can be eliminated by the use of custom memory allocators
or pools. For example, you could construct a pool that preallocated
larger chunks of memory and that only allocates objects of a specific
size (that of the syntax object for example). The pool could also
designed such that the memory chunks are only deallocated as a whole at
the end of parsing without any call to each object's destructor,
resulting in a couple of deallocations rather than tens of thousands.
This is just speculation of course since I've learned that what you
think is intuitively true about optimization and performance often
turns out to be quite different when you actually go and measure
performance.
Any other ideas why Scheme would be faster than C++ and Java for heap
allocation? I thought Java would at least be somewhat equivalent to
Scheme given that they both use garbage collection.
The cost of allocation in Scheme[*] is the cost of incrementing a
pointer plus the cost of overflow checks. Overflow checks are not
necessary at every allocation since any series of conses, or closures,
or any objects of known size require only one check. The overflow
check can almost be aliminated altogether if the operating system
provides means of preserving the state of the system on segfaults.
A protected page is placed on top of the heap and if any allocation is
prefixed with a memory-set to the highest word in the object, then an
overflow can trigger the GC via a signal handler.
For example, in my compiler, the procedure (lambda (x y) (cons x y))
is compiled to (omitting error-checks and irrelevant code):
3: (ppc:addi acr ap 8) ; increment ap by 8, store result in acr
4: (ppc:cmpw 0 acr eap) ; compare acr to eap register
5: (ppc:ble 15) ; if enough room, goto 15
6: instructions for saving state and invoking
14: the garbage collector then falls through
15: (ppc:addi p0 ap 3) ; p0 is the tagged cons
16: (ppc:addi ap ap 8) ; ap is moved up
17: (ppc:stw ac p0 -3) ; store the car
18: (ppc:stw p1 p0 1) ; store the cdr
19: (ppc:blr) ; return
Now if I were really smart, I could've shaved another instruction or
two from this code; but I'm not, so it stays.
So, ignoring the return, we have 7 instructions: 3 for the overflow
check and 4 for the cons. The code has no memory references, only
two memory sets, which is the bare minimum since you have to store
the car and the cdr.
Now I have written a decent amount of high-performance C code for my
garbage collector, and from my experience, C is not easy to optimize.
You just cannot convince a C compiler to place values (like ap) in
a registers across calls that modify it and return the updated ap in
a register. A C compiler keeps on saving it on the stack and passing
a pointer to it, then loading it back from the stack. If you place it
in a global variable, then it loses all chances of being in a register.
So, everything you do allocation-wise will involve many memory saves and
loads that are not needed in the equivalent compiled Scheme code.
I don't know what a Java compiler would do. My impression is that you
cannot allocate a 2-word cons cell in Java. A pair has to be in a class
and must point to its class descriptor so it has to be 3 to 4 words in
size at the minimum. Doubling the memory footprint for something so
commonly used in Scheme will degrade performance (I cannot guess by how
much though).
Aziz,,,
[*] I'm talking about good native-code scheme compilers; not scheme
interpreters, compilers for byte-code interpreters, or JIT compilers
for byte-code objects.
I think that the advantage that you're seeing here is not an advantage
of the code Scheme implementations create, but of the higher level of
the language.
When I wrote my reader and printer in C there were several places where
I could improve the algorithm. I could either improve the algorithm at
the cost of more testing, at the cost of more code or of less clear
code. None of these things seemed worth it, because that part of the
program was difficult enough to understand anyway.
I think this is the real benefit. The code, if well written, begins by
being simpler so more optimization of it can be done without
compromising understandability. Someone with sufficient determination
could make something better performing in C but it would take longer
and be harder to maintain.
I agree. The remark you quoted was of the form "If
<unlikely-thing-I-didn't-want-to-argue-about>, then
<relatively-unknown-fact-that's-true-anyway>."
MzScheme's security certificates are a much more
likely time sink than its source code information.
> When I wrote my reader and printer in C there were several places where
> I could improve the algorithm. I could either improve the algorithm at
> the cost of more testing, at the cost of more code or of less clear
> code. None of these things seemed worth it, because that part of the
> program was difficult enough to understand anyway.
>
> I think this is the real benefit. The code, if well written, begins by
> being simpler so more optimization of it can be done without
> compromising understandability.
Exactly right!
The macro expansion benchmark doesn't involve any
syntax-case or other low-level macros, so the security
problem that motivates MzScheme's security certificates
cannot possibly arise. Once again, users are being made
to pay for a feature (low-level macros) even when they
don't use it.
You would think it should be easy to modify the macro
expander so it omits those security checks for programs
that don't use low-level macros. It should be, and it
probably would be if MzScheme's macro expander were
written in Scheme, but it isn't.
> Someone with sufficient determination
> could make something better performing in C but it would take longer
> and be harder to maintain.
Maybe, but I doubt it. C and C++ really do have some
intrinsic performance disadvantages for the applications
we're talking about in this thread, so I doubt whether
they can truly match Scheme's performance. They could
do a lot better than we're accustomed to seeing, though.
Adrian's post is related to this. I'll reply to his post
when I have a little time.
Will
I think its fairly unlikely that if a large quantity of time were
available to write the reader that C could be beaten. The C programmer
has the possibility of doing much low level optimization, some specific
Scheme implementations like Stalin allow things like this, but don't
make them possible quite to the extent C does. Also, assuming we're
talking about the insides of a Scheme implementation use of C allows
the programmer to allocate and deallocate memory manually, or to use
the GC. A Scheme implementation could allow the same though, as Stalin
does.
I think the best indicator that Scheme could be beaten is that most
Scheme compilers actually output C. Meaning there is at least one C
implementation the same speed as the Scheme one, the one the compiler
used.
None of this means that much in practice about readers though, because
nobodys going to write a special performance reader in C.
It does mean something about Scheme implementations though: If
implementation writers want underlying algorithms written in Scheme
then they should provide functions to do the quite low-level things C
allows.
It comes down to whether you get more performance by
bumming the low-level details that C lets you bum (while
being unable to do anything about the low-level details C
won't let you bum), or by bumming the high-level details
that you could be bumming instead, while enjoying the
efficiency with which the low-level details are handled by
a performant Scheme system (which cannot be matched by C).
> Also, assuming we're
> talking about the insides of a Scheme implementation use of C allows
> the programmer to allocate and deallocate memory manually, or to use
> the GC.
Don't forget that a C program cannot use garbage-collected
memory without registering its pointers with the collector,
or by relying on a conservative collector. Both of those
approaches have costs that performant Scheme systems need
not incur.
> I think the best indicator that Scheme could be beaten is that most
> Scheme compilers actually output C. Meaning there is at least one C
> implementation the same speed as the Scheme one, the one the compiler
> used.
I agree that an implementation of Scheme that compiles
to C cannot outperform the C code it generates. For the
applications we've been discussing in this thread, the
fastest implementations of Scheme do not compile to C.
On this class of applications, compiling to C instead of
to native code generally seems to cost about a factor
of two in performance, which is large enough to swamp
the performance advantages of Scheme. You will see
this in the benchmark results I will cite when replying
to Adrian's latest post.
> It does mean something about Scheme implementations though: If
> implementation writers want underlying algorithms written in Scheme
> then they should provide functions to do the quite low-level things C
> allows.
I disagree. Exposing low-level details, and making them
programmable, tends to interfere with the high-level
optimizations that are more important for the kind of
applications we've been talking about in this thread.
C's hostility to garbage collection is just one example
of that.
Will
I can't think of any high-level optimization that you can do with
Scheme that can't be done with C. There are some low-level ones, but
not many.
> while enjoying the
> efficiency with which the low-level details are handled by
> a performant Scheme system
What is a performant Scheme system though? Almost all scheme
implementations I've used are very slow. The only exception is Stalin
which is unreliable, half the programs I write in it (and half its test
suit) just crash, and compiling takes an age.
> (which cannot be matched by C).
What evidence do you have for that??
> > Also, assuming we're
> > talking about the insides of a Scheme implementation use of C allows
> > the programmer to allocate and deallocate memory manually, or to use
> > the GC.
>
> Don't forget that a C program cannot use garbage-collected
> memory without registering its pointers with the collector,
> or by relying on a conservative collector. Both of those
> approaches have costs that performant Scheme systems need
> not incur.
It's true that if pointer do not have to be registered, then the
collector must be conservative. But what garbage collector is not
conservative these days? Almost every scheme implementor has decided
that a conservative GC is the best way to achieve performance.
> > I think the best indicator that Scheme could be beaten is that most
> > Scheme compilers actually output C. Meaning there is at least one C
> > implementation the same speed as the Scheme one, the one the compiler
> > used.
>
> I agree that an implementation of Scheme that compiles
> to C cannot outperform the C code it generates. For the
> applications we've been discussing in this thread, the
> fastest implementations of Scheme do not compile to C.
Original SPARC Larceny?
> On this class of applications, compiling to C instead of
> to native code generally seems to cost about a factor
> of two in performance,
What is it then that causes this slowdown?
It seems most likely to me that this improvement is more because the
implementation in question just implements read better than others do.
Is there really a problem with outputting C so great that it causes a 2
times slowdown in a reader!?
> which is large enough to swamp
> the performance advantages of Scheme. You will see
> this in the benchmark results I will cite when replying
> to Adrian's latest post.
OK, sounds interesting.
> > It does mean something about Scheme implementations though: If
> > implementation writers want underlying algorithms written in Scheme
> > then they should provide functions to do the quite low-level things C
> > allows.
>
> I disagree. Exposing low-level details, and making them
> programmable, tends to interfere with the high-level
> optimizations that are more important for the kind of
> applications we've been talking about in this thread.
> C's hostility to garbage collection is just one example
> of that.
The problem with looking at things that way is that it prevents using
Scheme for things that really require speed. Unfortunately, to make
algorithms go really fast you need to be able to remove the overhead of
things like boxed objects and GC.
> I can't think of any high-level optimization that you can do with
> Scheme that can't be done with C. There are some low-level ones, but
> not many.
It's not the question of *whether* it can/cannot be done in C. The
question is: What high-level optimizations *are* done in C?
>>while enjoying the
>>efficiency with which the low-level details are handled by
>>a performant Scheme system
>
> What is a performant Scheme system though? Almost all scheme
> implementations I've used are very slow.
It may be that all the implementations that *you* used are very slow.
This does not preclude the existence of faster Scheme implementations.
(Some have been in existence for some 20 years btw; some are not out
of research labs yet.)
>>(which cannot be matched by C).
>
> What evidence do you have for that??
These have been stated over and over again in this thread.
>>>Also, assuming we're
>>>talking about the insides of a Scheme implementation use of C allows
>>>the programmer to allocate and deallocate memory manually, or to use
>>>the GC.
>>
>>Don't forget that a C program cannot use garbage-collected
>>memory without registering its pointers with the collector,
>>or by relying on a conservative collector. Both of those
>>approaches have costs that performant Scheme systems need
>>not incur.
>
>
> It's true that if pointer do not have to be registered, then the
> collector must be conservative.
No. Native-code compilers often annotate the code with liveness
information that the collector uses to trace objects. This gives
the collector as much information as is decided by the compiler's
liveness-analysis passes. For example, every return-point may have
a bit-vector attached before it saying how large the frame was and
which slots are live and which are not. This way, the overhead of
keeping this information at runtime is exactly 0. A C compiler
cannot do this for you so you have to inform the GC at every point
in the program about which values are live and which are dead.
So, Will was saying that a Scheme system need not incur the cost of
registering pointers with the GC or using a conservative GC. The
fact that some implementations do incur these costs and pass it down
to the user should give you an idea about what to expect from such
implementations (performance-wise).
While your statement above is false, its converse is true: If the
collector is conservative, then pointers do not have to be registered.
And this is why conservative GC is popular: it's easy to get it working
with little initial effort.
> But what garbage collector is not conservative these days?
Huh? You're not talking about ``precise GC is undecidable'', right?
My compiler implements the "Don't Stop the BiBoP" generational garbage
collector which is described in [1] and implemented as part of the Chez
Scheme environment. Will Clinger has done extensive research in the
area as well and I'm sure you can easily find it.
> Almost every scheme implementor has decided
> that a conservative GC is the best way to achieve performance.
Huh? Conservative GC for performance? Well, that's news. I assume
that by conservative GC you mean something like Boehm and friends, and
NO, these never outperform any reasonable generational garbage
collector.
>>>I think the best indicator that Scheme could be beaten is that most
>>>Scheme compilers actually output C. Meaning there is at least one C
>>>implementation the same speed as the Scheme one, the one the compiler
>>>used.
>>
>>I agree that an implementation of Scheme that compiles
>>to C cannot outperform the C code it generates. For the
>>applications we've been discussing in this thread, the
>>fastest implementations of Scheme do not compile to C.
>
> Original SPARC Larceny?
How about you actually read the discussion thread instead of asking
about what was said over and over ... .
>>On this class of applications, compiling to C instead of
>>to native code generally seems to cost about a factor
>>of two in performance,
>
> What is it then that causes this slowdown?
> It seems most likely to me that this improvement is more because the
> implementation in question just implements read better than others do.
Yes. The trick for implementing a better faster reader is to pick the
right language (and implementation) for doing so.
> Is there really a problem with outputting C so great that it causes a 2
> times slowdown in a reader!?
No. The C code may be optimal as far as C codes go. It's just that the
Scheme readers are twice as fast (for this application).
> The problem with looking at things that way is that it prevents using
> Scheme for things that really require speed. Unfortunately, to make
> algorithms go really fast you need to be able to remove the overhead of
> things like boxed objects and GC.
Do you really think the memory management has an overhead in Scheme but
none in C? Actually, let me paraphrase: Which environment do you think
has lower memory-management overhead, C (malloc/free) or Scheme
(alloc/gc)? Why sort of overhead is associated with boxed objects?
(Hint: The only case I can think of is a boxed floating-point number.
For all other cases, I see no overhead.)
Aziz,,,
[1]
@techreport{ dybvig94dont,
author = "R. Kent Dybvig and David Eby and Carl Bruggeman",
title = "Don't Stop the {BIBOP}: Flexible and Efficient Storage
Management for Dynamically-Typed Languages",
number = "400",
address = "\{dyb,deby,druggema\}@cs.indiana.edu",
year = "1994",
url = "citeseer.ist.psu.edu/dybvig94dont.html" }
That's interesting, since several of them have already been
described in this thread.
> What is a performant Scheme system though? Almost all scheme
> implementations I've used are very slow.
Your experience is typical. Most Scheme programmers
have never used a fast implementation of Scheme, and
that is the main reason they believe Scheme to be slow.
> > (which cannot be matched by C).
>
> What evidence do you have for that??
Twenty-five years of research. Some of it has actually
been published.
> It's true that if pointer do not have to be registered, then the
> collector must be conservative. But what garbage collector is not
> conservative these days? Almost every scheme implementor has decided
> that a conservative GC is the best way to achieve performance.
Chez Scheme, Larceny, Petit Larceny, Gambit, Chicken, MzScheme,
and no doubt many others use precise or partially precise garbage
collectors. (MzScheme once used a conservative collector, but
has added a (partially?) precise collector, presumably because it
performs better.)
Most of the commercial implementations of Java and C# use
precise collectors, because they perform better.
> > I agree that an implementation of Scheme that compiles
> > to C cannot outperform the C code it generates. For the
> > applications we've been discussing in this thread, the
> > fastest implementations of Scheme do not compile to C.
>
> Original SPARC Larceny?
I was thinking of Chez Scheme, but it is true that Larceny
compiles to native Sparc code, and Petit Larceny compiles
to IA32 assembly code. Petit Larceny isn't as convenient
as Larceny, and the native code it generates isn't as well
optimized, but both of those problems will soon be fixed.
> > On this class of applications, compiling to C instead of
> > to native code generally seems to cost about a factor
> > of two in performance,
>
> What is it then that causes this slowdown?
A combination of things. I've been too busy addressing the
wild guesses to write a systematic essay on this, but Adrian
deserves a considered response, and I intend to give him one.
> It seems most likely to me that this improvement is more because the
> implementation in question just implements read better than others do.
What evidence do you have to support that guess?
> Is there really a problem with outputting C so great that it causes a 2
> times slowdown in a reader!?
Yes. At least three problems have already been mentioned
in this thread. There are others.
> The problem with looking at things that way is that it prevents using
> Scheme for things that really require speed. Unfortunately, to make
> algorithms go really fast you need to be able to remove the overhead of
> things like boxed objects and GC.
As Aziz has noted, there is virtually no overhead from boxed
objects when reading, parsing, or macro expanding; with few
exceptions, the representations used in Scheme are at least
as efficient as the representations that would be used in C
or C++.
The idea that garbage collection has overhead, but explicit
deallocation does not, is one of the most pervasive and
destructive myths about programming. For the tree-like
objects that dominate during parsing, macro expansion,
and related activities, garbage collection is usually faster
than explicit deallocation. I urge you to measure this, as
I and other gc researchers have done, instead of repeating
the conventional mythology.
Will
Calls to the OS heap API are not significant, because the
C/C++ memory allocator asks the OS for memory in large
chunks, and those requests are infrequent.
Calls to object destructors can be significant for C++,
but the benchmark data below and on my web pages show that
the extra overhead for calls to destructors is insignificant
compared to the overhead for the call to free/delete.
> These issues can be eliminated by the use of custom memory allocators
> or pools. For example, you could construct a pool that preallocated
> larger chunks of memory and that only allocates objects of a specific
> size (that of the syntax object for example). The pool could also
> designed such that the memory chunks are only deallocated as a whole at
> the end of parsing without any call to each object's destructor,
> resulting in a couple of deallocations rather than tens of thousands.
Excellent thinking! This is commonly done. The problem
with it is that you can't deallocate the pool until *all*
of the objects that reside within it have become useless.
If there are even a few objects within the pool that need
to be passed on to the next phase, then you would create
dangling pointers by deallocating the pool.
You could, of course, copy the few surviving objects out
of the pool before deleting the pool. This works fine,
but it's extremely error-prone because you have to update
all of the pointers to those objects. If you miss even
one, your program will probably crash, though not for a
while.
And how do you know which of the objects survive? Many
C programs determine this by maintaining reference counts.
This too is error-prone, and the overhead of updating those
reference counts is larger than you might think.
The state-of-the-art solution to these problems is use
precise generational garbage collection. Generational
garbage collectors allocate objects in pools, automatically
determine which objects within the pools are still reachable
from program registers and global variables, and automatically
update all pointers to an object when the object is copied
out of a pool prior to reclaiming the pool. Slick, huh?
It's too bad these collectors can't be used in languages
like C and C++. The problem is that you can't copy an
object unless you can reliably find and update *all*
pointers to the object, and the ridiculously low level
and lax typing rules of C/C++ make that unreliable.
Hence garbage collectors for C/C++ must be conservative,
and must never copy any objects.
> This is just speculation of course since I've learned that what you
> think is intuitively true about optimization and performance often
> turns out to be quite different when you actually go and measure
> performance.
How true! I've been measuring this kind of thing for
25 years. The measurements presented below are recent.
> Any other ideas why Scheme would be faster than C++ and Java for heap
> allocation? I thought Java would at least be somewhat equivalent to
> Scheme given that they both use garbage collection.
Right again! See below.
A few days ago, in response to Scott Miller's setup, I
mentioned the GCBench program, which was originally
written in Java as a garbage collection benchmark. It
is too small to serve that purpose these days, but it
still serves reasonably well as a general benchmark for
allocation, initialization, and deallocation of trees
of various sizes [1].
Since creating, initializing, and deallocating abstract
syntax trees happens a lot when you parse or macro-expand
a Scheme program, GCBench is directly relevant to our
discussion.
Here are some timings obtained on a Solaris machine in
our undergraduate lab, with no other users, averaging
three runs:
Compiled:
Larceny v0.91a (fast-safe) 2.9 (4 MB nursery, 35 MB heap)
Chez Scheme v6.1 (opt=2) 3.4
Larceny v0.91a (fast-safe) 3.6 (1 MB nursery, 33 MB heap)
java GCBench (HotSpot) 3.9
Petit Larceny v0.90 4.9 (4 MB nursery, 35 MB heap)
Petit Larceny v0.90 5.6 (1 MB nursery, 33 MB heap)
cc -fast -native gcbench.c 8.0
gcc -O4 gcbench.c 8.4
g++ -O4 gcbench.cpp 8.4
cc gcbench.c 9.2
gcc gcbench.c 9.2
g++ gcbench.cpp 9.3
Scheme compiler A 18.3
Interpreted:
java -Xint GCBench (HotSpot) 12.5
MzScheme v301 (w/module) 20.3
Larceny v0.91a (interpreted) 29.0
Petit Larceny (interpreted) 57.0
Scheme interpreter B 62.0
Scheme interpreter C 70.3
Scheme interpreter A 144.5
There are no surprises here. The fastest systems are
those that compile to native code without going through
C, and also use precise generational collection.
On the Sparc, Petit Larceny compiles to C, but otherwise
uses the same compiler as the version of Larceny that
generates native code directly, and also uses exactly
the same garbage collector. As you can see, Petit
Larceny is almost twice as slow as native Larceny.
The C and C++ code is slower still, because they use
explicit deallocation instead of automatic garbage
collection.
I tried to benchmark several other Scheme compilers,
but Scheme compiler A was the only one I could get to
run. It compiles to C, and the code it generates was
slower than the idiomatic code written in C and C++.
Indeed, it was slower than interpreted Java.
You will find a few other comparisons between Scheme
and Java or C on my web pages [2,3]. Several of those
comparisons show the performance gained by using the
kind of custom allocators that you described for C.
Look especially at the takl, ntakl, diviter, divrec,
gcbench, and perm9 benchmarks.
Will
[1] http://www.hpl.hp.com/personal/Hans_Boehm/gc/gc_bench.html
[2] http://www.ccs.neu.edu/home/will/Twobit/benchmarks1_safe.html
[3] http://www.ccs.neu.edu/home/will/Twobit/benchmarks1_unsafe.html
Remember that the task is to parse a Scheme program, so
the representation of Scheme data has to remain plausible.
The source code, standard input file (nboyer.sch), and code
for the run-benchmark procedure are in
http://www.ccs.neu.edu/home/will/Twobit/Benchmarks/Parsing.java
http://www.ccs.neu.edu/home/will/Twobit/Benchmarks/parsing.sch
http://www.ccs.neu.edu/home/will/Twobit/Benchmarks/nboyer.sch
http://www.ccs.neu.edu/home/will/Twobit/Benchmarks/run-benchmark.chez
Thanks.
Will
In many cases yes. The context of the conversation with William was
the possibility of performance. As you say, often few optimizations
are done in C code because optimizing is relatively more difficult than
in higher level languages. I made this point 5 posts up.
> >>while enjoying the
> >>efficiency with which the low-level details are handled by
> >>a performant Scheme system
> >
> > What is a performant Scheme system though? Almost all scheme
> > implementations I've used are very slow.
>
> It may be that all the implementations that *you* used are very slow.
> This does not preclude the existence of faster Scheme implementations.
> (Some have been in existence for some 20 years btw; some are not out
> of research labs yet.)
OK. Dare I ask which ones they are?
> >>(which cannot be matched by C).
> >
> > What evidence do you have for that??
>
> These have been stated over and over again in this thread.
No it hasn't. What has been given in this thread is a comparison
between a certain C reader implementation, that in mzscheme and in
petite chez. There is not much evidence that the mzscheme reader has
been written with care for efficiency. In fact it looks like its been
written with features first and foremost in the programmers mind.
> >>>Also, assuming we're
> >>>talking about the insides of a Scheme implementation use of C allows
> >>>the programmer to allocate and deallocate memory manually, or to use
> >>>the GC.
> >>
> >>Don't forget that a C program cannot use garbage-collected
> >>memory without registering its pointers with the collector,
> >>or by relying on a conservative collector. Both of those
> >>approaches have costs that performant Scheme systems need
> >>not incur.
> >
> >
> > It's true that if pointer do not have to be registered, then the
> > collector must be conservative.
I was speaking about C here, not generally.
If you're using C with a normal C compiler then if you write it so
pointers don't have to be registered you must write a conservative
allocator.
> No. Native-code compilers often annotate the code with liveness
> information that the collector uses to trace objects. This gives
> the collector as much information as is decided by the compiler's
> liveness-analysis passes. For example, every return-point may have
> a bit-vector attached before it saying how large the frame was and
> which slots are live and which are not. This way, the overhead of
> keeping this information at runtime is exactly 0. A C compiler
> cannot do this for you so you have to inform the GC at every point
> in the program about which values are live and which are dead.
Yes. (As I expect you know you can get even cleverer with the liveness
information).
> So, Will was saying that a Scheme system need not incur the cost of
> registering pointers with the GC or using a conservative GC. The
> fact that some implementations do incur these costs and pass it down
> to the user should give you an idea about what to expect from such
> implementations (performance-wise).
Some of the fastest Scheme implementation use conservative GC, such as
Bigloo.
> While your statement above is false, its converse is true: If the
> collector is conservative, then pointers do not have to be registered.
> And this is why conservative GC is popular: it's easy to get it working
> with little initial effort.
Yes.
> > But what garbage collector is not conservative these days?
>
> Huh? You're not talking about ``precise GC is undecidable'', right?
No. I was just saying that in general implementations use conservative
GC.
In particular the following use conservative GC:
Guile
Bigloo
Stalin (as far as I can tell)
mzscheme
Some of the above implementations are fast, especially Bigloo and
Stalin.
I didn't know there were any scheme implementations that didn't use
conservative GC except Scheme 48.
> My compiler implements the "Don't Stop the BiBoP" generational garbage
> collector which is described in [1] and implemented as part of the Chez
> Scheme environment. Will Clinger has done extensive research in the
> area as well and I'm sure you can easily find it.
Thanks, I'll have a look.
> > Almost every scheme implementor has decided
> > that a conservative GC is the best way to achieve performance.
>
> Huh? Conservative GC for performance? Well, that's news. I assume
> that by conservative GC you mean something like Boehm and friends, and
> NO, these never outperform any reasonable generational garbage
> collector.
Well, by conservative GC I mean conservative GC not necessarily Boehm's
GC.
I'm sure a generational collector will outperform a normal conservative
GC, but aren't most generatational GCs also conservative?
> > Original SPARC Larceny?
>
> How about you actually read the discussion thread instead of asking
> about what was said over and over ... .
I can tell you're talking about Petite Chez. But you also talk about
another implementation without naming it, which implementation is that?
> > Is there really a problem with outputting C so great that it causes a 2
> > times slowdown in a reader!?
>
> No. The C code may be optimal as far as C codes go. It's just that the
> Scheme readers are twice as fast (for this application).
I doubt that. Using C it is possible to precisely control almost
everything going on in a program, and therefore extract very high
performance. The same thing is not possible in standard Scheme. The
problem is the lengths that the programmer must go to to get the
performance.
> > The problem with looking at things that way is that it prevents using
> > Scheme for things that really require speed. Unfortunately, to make
> > algorithms go really fast you need to be able to remove the overhead of
> > things like boxed objects and GC.
>
> Do you really think the memory management has an overhead in Scheme but
> none in C? Actually, let me paraphrase: Which environment do you think
> has lower memory-management overhead, C (malloc/free) or Scheme
> (alloc/gc)?
You don't only have the option of using malloc/free as they are in C,
there are many memory allocation methods you can use.
Anyway, the answer is it depends on the code. When properly used
malloc/free have very little overhead. Similarly, when GC is used
properly it has very little overhead.
The problem in both cases is figuring out how to use them properly.
> Why sort of overhead is associated with boxed objects?
> (Hint: The only case I can think of is a boxed floating-point number.
> For all other cases, I see no overhead.)
The overhead I mean is the overhead in interrogating the type
information in the object before doing an operation. The memory
overhead isn't that large.
The point I was making above was this:
If the programmer has a high degree of control over the code outputted
then very high-performance is possible, at the cost of the programmer
spending time and complexity using this control. Another associated
problem is that not many programmers know how to use it.
If the programmer has less control then less performance is possible.
This is why things like OS kernels are written in C not Scheme.
> I'm sure a generational collector will outperform a normal
> conservative GC, but aren't most generatational GCs also
> conservative?
Most generational GCs distinguish generations by address intervals,
i.e. they physically move objects at least from the youngest
generation to others.
(At least those in Glasgow Haskell, OCaml, Sun's Java, and Microsoft's
.NET.)
Copying and conservative designs are incompatible.
--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
* The possibility of using a precise generational GC.
I don't understand why this isn't possible for an implementation that
outputs C.
Such an implementation could easily be written to tag every pointer for
GC, couldn't it?
* Aziz mentioned also not being able to allocate registers across
calls. This is generally true though some C compilers are actually
able to do it. I've never seen a situation where it is a problem
though. Regarding how it affects allocation, it could be a problem if
you wanted to do lots of small allocations. But -at least in C- that
isn't really good practice, it's best to allocate a decent sized chunk
at once.
I can't find the third thing mentioned. (Alignment would be an obvious
one, but I don't think it's applicable to this problem).
> > What is a performant Scheme system though? Almost all scheme
> > implementations I've used are very slow.
>
> Your experience is typical. Most Scheme programmers
> have never used a fast implementation of Scheme, and
> that is the main reason they believe Scheme to be slow.
Correct, that would be my experience.
> > > (which cannot be matched by C).
> >
> > What evidence do you have for that??
>
> Twenty-five years of research. Some of it has actually
> been published.
Wow, where?
> > It's true that if pointer do not have to be registered, then the
> > collector must be conservative. But what garbage collector is not
> > conservative these days? Almost every scheme implementor has decided
> > that a conservative GC is the best way to achieve performance.
>
> Chez Scheme, Larceny, Petit Larceny, Gambit, Chicken, MzScheme,
> and no doubt many others use precise or partially precise garbage
> collectors. (MzScheme once used a conservative collector, but
> has added a (partially?) precise collector, presumably because it
> performs better.)
>
> Most of the commercial implementations of Java and C# use
> precise collectors, because they perform better.
Interesting, I had no idea precise collectors were so widespread. I
tried to find the GC method used by some of those implementation in the
past and couldn't because it's not mentioned in the manuals.
> > >For the
> > > applications we've been discussing in this thread, the
> > > fastest implementations of Scheme do not compile to C.
> >
> > Original SPARC Larceny?
>
> I was thinking of Chez Scheme, but it is true that Larceny
> compiles to native Sparc code, and Petit Larceny compiles
> to IA32 assembly code. Petit Larceny isn't as convenient
> as Larceny, and the native code it generates isn't as well
> optimized, but both of those problems will soon be fixed.
>
> > It seems most likely to me that this improvement is more because the
> > implementation in question just implements read better than others do.
>
> What evidence do you have to support that guess?
It is a guess, as you say :)
But my general experience of programming has been if you really want
something to go fast then you write it in C and pay great attention to
the algorithms used.
Most code that I've seen is slow either because the programmer did not
know how to make it fast, because they did not care, or because the
pain of doing so was too great.
> > The problem with looking at things that way is that it prevents using
> > Scheme for things that really require speed. Unfortunately, to make
> > algorithms go really fast you need to be able to remove the overhead of
> > things like boxed objects and GC.
>
> As Aziz has noted, there is virtually no overhead from boxed
> objects when reading, parsing, or macro expanding; with few
> exceptions, the representations used in Scheme are at least
> as efficient as the representations that would be used in C
> or C++.
Well, I was talking about programming in general there. I was thinking
about using the objects, not just reading, parsing and macro expanding.
But if you're going to do those things the data must be boxed anyway.
> The idea that garbage collection has overhead, but explicit
> deallocation does not, is one of the most pervasive and
> destructive myths about programming.
I agree, I never said it didn't.
> For the tree-like
> objects that dominate during parsing, macro expansion,
> and related activities, garbage collection is usually faster
> than explicit deallocation. I urge you to measure this, as
> I and other gc researchers have done, instead of repeating
> the conventional mythology.
Well, the results from research on the subject are mixed. Wolfram
Gloger found that GC using Boehms GC was much slower than using several
different mallocs across a range of programs. (Though this could be the
fault of Boehms GC of-course).
I listened to all the stuff about how GC can be faster than manual
allocation up until when I actually started using it. I've found that
GC causes slowdown in my programs but I've not had the same problem
with manual allocation. This may be because I know a little more about
how to use manual allocation than many programmers seem to. I could
also be because I've never come across a precise generational GC
before.
Interesting thread this, I'll have to try one of these new GC and see
how they work.
Generational garbage collection was invented by Lieberman
and Hewitt in 1981 and published in CACM in 1983. The book
by Jones and Lin will give you some idea of its history.
The GC FAQ at http://www.iecc.com/gclist/GC-faq.html will
give you some idea without having to go to a library.
For my own published research on this subject, you can
search Google Scholar for Clinger, garbage. To give you
an idea of my unpublished involvement with this subject,
look at Pat Caudill's and Allen Wirfs-Brock's paper at
the first OOPSLA. I was working at Tektronix Computer
Laboratory by then, but can't say much about my research
there.
> > As Aziz has noted, there is virtually no overhead from boxed
> > objects when reading, parsing, or macro expanding; with few
> > exceptions, the representations used in Scheme are at least
> > as efficient as the representations that would be used in C
> > or C++.
>
> Well, I was talking about programming in general there. I was thinking
> about using the objects, not just reading, parsing and macro expanding.
> But if you're going to do those things the data must be boxed anyway.
Please note that my inter-language performance claims in
this thread have all been qualified by reference to the
tasks we've been discussing: reading, parsing, and macro
expansion.
> > For the tree-like
> > objects that dominate during parsing, macro expansion,
> > and related activities, garbage collection is usually faster
> > than explicit deallocation. I urge you to measure this, as
> > I and other gc researchers have done, instead of repeating
> > the conventional mythology.
>
> Well, the results from research on the subject are mixed. Wolfram
> Gloger found that GC using Boehms GC was much slower than using several
> different mallocs across a range of programs. (Though this could be the
> fault of Boehms GC of-course).
I'm not talking about a broad range of programs. I'm talking
about specific kinds of programs, and these specific programs
allocate many small objects. Hans Boehm has conducted many
experiments that show his conservative GC is usually faster
than explicit deallocation for these kinds of programs, and
I once verified some of his claims using GCBench and several
other benchmarks, some of which I wrote or improved.
Note, however, that my claim was that, for programs such as
Scheme readers, parsers, and macro expanders, "garbage
collection is usually faster than explicit deallocation."
I said nothing about the performance of conservative garbage
collection, which is primarily used for C/C++ and other
gc-hostile environments. While conservative collection is
indeed used by several implementations of Scheme, primarily
because the Boehm-Demers-Weiser collector is well-engineered
and performs adequately out of the box, the implementations
of Scheme that perform best for the class of applications we
are considering do not use conservative collection. Drawing
conclusions from the performance of conservative garbage
collection is like drawing conclusions from the performance
of an unusually inefficient implementation of malloc/free.
Finally, note that Wolfram Gloger is concerned with C and C++.
Because his interest is limited to those languages, he has no
interest in benchmarking a modern garbage collector.
> I listened to all the stuff about how GC can be faster than manual
> allocation up until when I actually started using it. I've found that
> GC causes slowdown in my programs but I've not had the same problem
> with manual allocation. This may be because I know a little more about
> how to use manual allocation than many programmers seem to. I could
> also be because I've never come across a precise generational GC
> before.
You said you have never used a fast implementation of Scheme.
Your only experience with modern garbage collection may have
come in languages like Java and C#, and you probably haven't
benchmarked those languages very carefully for this class of
applications.
Now to address some of the things you wrote in your response
to Aziz.
First of all, Chez Scheme is not the same thing as Petite
Chez. Although they probably use the same garbage collector,
Chez Scheme is a far more performant system.
> No. I was just saying that in general implementations use
> conservative GC.
It may be true that most implementations of Scheme use
conservative GC, but we have already agreed that most
implementations of Scheme are slow. A couple of the
faster implementations of Scheme also use conservative GC,
but they are not the fastest implementations of Scheme for
the kind of applications we're talking about here, partly
because they use conservative GC.
> I doubt that. Using C it is possible to precisely control almost
> everything going on in a program, and therefore extract very high
> performance.
C does not allow you to control the things that matter most
for the class of applications we're talking about here. In
reality, C interferes with your control of those things:
proper tail recursion, interprocedural register allocation,
and the gc invariants on which modern collectors rely.
By the way, all three of those things have been mentioned
before in this thread.
Will
> * The possibility of using a precise generational GC.
> I don't understand why this isn't possible for an implementation that
> outputs C.
It is definitely possible, various language implementations do that
(Chicken, Glasgow Haskell Compiler, my Kogut compiler).
But it's impractical to have a precise GC for manually written C code.
Automatically generated C code which uses a precise GC requires a
discipline of putting variables in places where the GC can find them.
A common technique (used by the three compilers mentioned above)
is to keep function arguments and local variables on a custom stack,
A non-standard calling convention implied by this (which also enables
other features like tail call optimization and lightweight threads)
means that it's not applicable to usual C programs, and writing large
programs by hand this way would be too painful to be practical.
> Well, the results from research on the subject are mixed. Wolfram
> Gloger found that GC using Boehms GC was much slower than using
> several different mallocs across a range of programs. (Though this
> could be the fault of Boehms GC of-course).
Boehm GC is usually slower than malloc, while a generational GC
is often faster.
> I listened to all the stuff about how GC can be faster than manual
> allocation up until when I actually started using it. I've found that
> GC causes slowdown in my programs but I've not had the same problem
> with manual allocation.
Which language implementations do you mean?
GNU Emacs and XEmacs use precise GC internally. Objects that need to
be protected in GC need to be surrounded by labelling macros. This is
incredibly evil. But, the emacsen work that way, and they contain
quite a lot of C code written this was successfully. Of-course neither
have generational GC.
There are also mostly-copying collectors. These are generational and
conservative, and they don't require the stack to be scanned precisely.
http://www.cs.cornell.edu/Info/People/fms/mcc/gc.html
I expect writing a C program that uses a generational GC would be very
nasty, but not impossible.
> > Well, the results from research on the subject are mixed. Wolfram
> > Gloger found that GC using Boehms GC was much slower than using
> > several different mallocs across a range of programs. (Though this
> > could be the fault of Boehms GC of-course).
>
> Boehm GC is usually slower than malloc, while a generational GC
> is often faster.
>
> > I listened to all the stuff about how GC can be faster than manual
> > allocation up until when I actually started using it. I've found that
> > GC causes slowdown in my programs but I've not had the same problem
> > with manual allocation.
>
> Which language implementations do you mean?
Schemes and lisps. Emacs lisp, CMUCL, guile etc.
So, let me get this straight:
First, you are a good C programmer who knows alot about fast allocation
in C/C++.
Next, you try some of the slowest interpreters for scheme/elisp (don't
know how cmucl compares to acl/sbcl/etc). Your programs are now dog-
slow, surprised?
Finally, all fog is cleared and you unravel the mystery of the lost
performance: "I've found that GC causes slowdown in my programs".
Great!
Aziz,,,
No :)
> Finally, all fog is cleared and you unravel the mystery of the lost
> performance: "I've found that GC causes slowdown in my programs".
I try to choose a good implementation for the job of development.
The Lisp and scheme implementations I mentioned have fairly good
documentation, which is important. Often more important than speed.
I've tried using some of the faster implementations, such as Stalin and
SBCL, but had problems. I'll probably try again if their development
settles down in the future. I'll certainly try Petite Larceny sometime
soon.
Part of the problem with GC is not the existence of good GCs, but that
they're relatively rare.
> Finally, all fog is cleared and you unravel the mystery of the lost
> performance: "I've found that GC causes slowdown in my programs".
In my experience, that's not correct. The slowdown can be traced
back to the lack of semicolons in the language.
In C and C++ you are encouraged to use semicolons at the end
of every statement. In Scheme, nobody asks you to do that. The
result? an amazing slowdown in all scheme programs. Running
the following test with and without semicolons results in times of
0 ms and 2690 ms respectively. That's a HUGE difference.
> (define (with-semicolons n);
(cond;
((= n 0);
'done;
);
(else; (with-semicolons (- n 1));
'done;
)))
> (define (without-semicolons n)
(cond
((= n 0)
'done
)
(else (without-semicolons (- n 1))
'done
)))
> (time (with-semicolons 1000000))
(time (with-semicolons 1000000))
no collections
0 ms elapsed cpu time
0 ms elapsed real time
0 bytes allocated
done
> (time (without-semicolons 1000000))
(time (without-semicolons 1000000))
22 collections
2690 ms elapsed cpu time, including 1630 ms collecting
2815 ms elapsed real time, including 1693 ms collecting
24485136 bytes allocated, including 100784 bytes reclaimed
done
I urge you all to start using semicolons in your scheme
programs so we can get close to the performance of
C/C++
-ggem.
http://portal.acm.org/citation.cfm?id=1065027
sam th
That methodology is even suspect for explicit deallocation, since there
tend to be large cache interactions between the allocator and the
client. If the allocator causes more of the newly allocated object to
end up in the cache, the allocator looks bad, but in the real world the
client would run appreciably faster.
Hans