how to adapt BC code for Racket CS?

Matthew Butterick

Feb 22, 2020, 11:05:35 AM
to Racket Users
Of the convergence of Racket BC and CS performance, Matthew Flatt wrote: [1]

> development to date has been aimed at making Racket CS match the performance of Racket BC on a code base that was developed and tuned for Racket BC, and that obviously gives Racket BC an advantage. For example, Racket BC makes dormant code relatively cheap, so Racket libraries generate a lot of code. Future Racket libraries will likely shift to take advantage of Racket CS’s cheaper function calls and dramatically cheaper continuations.

1) As a package maintainer with zero exposure to Chez Scheme, how do I start to optimize for Racket CS? Are there certain patterns and idioms from BC that should be systematically eradicated? For instance, how would I "take advantage of ... cheaper function calls"?

2) With Racket BC, I've generally found that the JIT optimizes my code better than I do. So I write the simplest code and ignore the details. Is that less true with CS? Should I be thinking more about manual tuning?

3) Lurking here, I gather, is the bugaboo that even though core Racket runs on CS, the existing base libraries are optimized for BC. Is it considered important to optimize these for CS, in order to get the most benefit from the switch? How can Racket developers contribute to this?

4) Building both BC and CS is easy (thank you). What would be the most reliable technique for doing performance comparisons as one refactors code? Is it as simple as running `raco test ···` and `racocs test ···` in succession?




[1] https://blog.racket-lang.org/2020/02/racket-on-chez-status.html

Jack Firth

Feb 22, 2020, 5:13:56 PM
to Racket Users
I'm no expert, but I have a feeling that optimizing for Racket CS will involve relying more on function calls and indirection, especially to avoid generating (or just writing) large amounts of code. So I suspect we'll see more of the following:

1. Implementing macro-generating macros by having the root macro expand to calls to a helper function to create the generated macro. For example, `(define-syntax generated-macro (make-my-generated-macro ...))` instead of `(define-syntax-rule (generated-macro ...) ...)`. This is already a good idea in Racket BC, because it vastly cuts down on expanded code size. Racket CS will probably make the effect more pronounced because the function call will be faster and the price of generating unnecessary code will rise.
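A minimal sketch of that shape (the transformer logic and names here are just illustrative):

```racket
#lang racket/base
(require (for-syntax racket/base))

;; One ordinary phase-1 function holds the transformer logic; each
;; generated macro is a small call to it, instead of a fresh copy of
;; a large syntax-rules template.
(begin-for-syntax
  (define (make-incrementer-transformer n)
    (lambda (stx)
      (syntax-case stx ()
        [(_ e) #`(+ e #,n)]))))

;; The `(define-syntax generated-macro (make-my-generated-macro ...))` style:
(define-syntax add-two (make-incrementer-transformer 2))
(define-syntax add-ten (make-incrementer-transformer 10))

(add-two 5) ; => 7
(add-ten 5) ; => 15
```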

2. More reliance on generic interfaces. The dynamic dispatch of generic interfaces is much cheaper on Racket CS, so passing around sequences instead of lists suddenly doesn't seem like such a bad idea anymore. Maybe we'll even start using generic collections.
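For example, here's a small generic interface whose dynamic dispatch on `size` is exactly the cost in question (names are illustrative):

```racket
#lang racket/base
(require racket/generic)

;; A tiny generic interface; calling `size` dispatches dynamically on
;; whichever struct implements gen:measurable.
(define-generics measurable
  (size measurable))

(struct crate (items)
  #:methods gen:measurable
  [(define (size c) (length (crate-items c)))])

(struct sack (count)
  #:methods gen:measurable
  [(define (size s) (sack-count s))])

(map size (list (crate '(a b c)) (sack 10))) ; => '(3 10)
```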

3. Less avoidance of generators and similar features that use continuations under the hood, since continuations won't tank performance as much anymore.
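For instance, a generator-based loop like this would have been comparatively expensive on BC:

```racket
#lang racket/base
(require racket/generator)

;; A generator suspends and resumes via continuations under the hood,
;; and first-class continuations are dramatically cheaper on Racket CS.
(define fibs
  (generator ()
    (let loop ([a 0] [b 1])
      (yield a)
      (loop b (+ a b)))))

(define first-eight (for/list ([i (in-range 8)]) (fibs)))
first-eight ; => '(0 1 1 2 3 5 8 13)
```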

Further down the line, we might also see more opportunities to leverage parallelism. Making the Racket BC VM allow parallel execution (so Racket code could use multiple threads without firing up a separate place - and thus an entire separate VM - for each thread) would be an infeasible amount of work, but on Chez Scheme the possibilities look brighter.

For mere mortal users of Racket, I think a good way to contribute to this effort would be to explore the performance costs of generic interfaces on Racket CS. Make some benchmarks, try using collections other than just lists and hashes, and see what happens in real-world codebases. Don't just look at the speed differences either: consider what impact the flexibility of generic interfaces has on the structure of the codebase.
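A rough starting point for such a benchmark might be as simple as this (sizes and names are arbitrary), run under both BC and CS builds:

```racket
#lang racket/base

;; Micro-benchmark sketch: compare a list-specific loop against the
;; generic-sequence loop over the same data. The generic path pays a
;; dispatch cost; the interesting question is how much smaller that
;; cost is on CS.
(define data (build-list 1000000 values))

(define (sum-list lst)               ; list-specific iteration
  (for/sum ([x (in-list lst)]) x))

(define (sum-sequence seq)           ; generic sequence dispatch
  (for/sum ([x seq]) x))

(time (sum-list data))
(time (sum-sequence data))
```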

Alexis King

Feb 23, 2020, 1:57:51 AM
to Racket Users, Matthew Butterick, Jack Firth
I have a related but distinct question about the blog post: I’m curious what guarantees I can come to expect from the Racket CS optimizer. The post includes the following example:

(for/fold ([v #f]) ([i (in-range N)]) i)

On both Racket BC and Racket CS, I’d expect the optimizer to turn this into a tight recursive loop that looks something like this:

(let loop ([v #f]
           [i 0])
  (define j (add1 i))
  (if (< j N)
      (loop i j)
      i))

But my understanding is that Chez’s optimizer is much more sophisticated than Racket BC’s. At the very least, I’d expect it to be able to identify that `v` is never used here and simplify it to this:

(let loop ([i 0])
  (define j (add1 i))
  (if (< j N)
      (loop j)
      i))

Now I’d wonder: if N is a known constant, could the optimizer just delete the loop? Could the whole thing be optimized to (sub1 N)? Given the performance numbers in the blog post, I’m guessing the answer is no. Is the reason just that Chez doesn’t perform this kind of optimization, or is there something more fundamental?

Matthew Flatt

Feb 23, 2020, 7:58:36 AM
to Matthew Butterick, Jack Firth, Alexis King, Racket Users
[Replying to three messages]

At Sat, 22 Feb 2020 08:05:28 -0800, Matthew Butterick wrote:
> 1) As a package maintainer with zero exposure to Chez Scheme, how do I start
> to optimize for Racket CS? Are there certain patterns and idioms from BC that
> should be systematically eradicated?

No. For most Racket users, you should *not* try to optimize for Racket
CS, and Racket CS should just do the right thing.

> For instance, how would I "take advantage of ... cheaper function
> calls"?

Jack's answer applies here. That is, you could take advantage of
cheaper function calls by worrying about them even less.

> 2) With Racket BC, I've generally found that the JIT optimizes my code better
> than I do. So I write the simplest code and ignore the details. Is that less
> true with CS? Should I be thinking more about manual tuning?

You should treat Racket BC and CS the same in this way.

> 3) Lurking here, I gather, is the bugaboo that even though core
> Racket runs on CS, the existing base libraries are optimized for BC.
> Is it considered important to optimize these for CS, in order to get
> the most benefit from the switch? How can Racket developers
> contribute to this?

I'd say existing libraries are optimized for BC in the sense that they
avoid some things that turn out to be cheaper in CS, so they might not
have avoided those things if started on CS. Meanwhile, for things that
turned out to be cheaper in BC, my strategy has been to make them
cheaper in CS, too.

> 4) Building both BC and CS is easy (thank you). What would be the
> most reliable technique for doing performance comparisons as one
> refactors code? Is it as simple as running `raco test ···` and
> `racocs test ···` in succession?

Yes.


At Sat, 22 Feb 2020 14:13:55 -0800 (PST), Jack Firth wrote:
> I'm no expert, but I have a feeling that optimizing for Racket CS will
> involve relying more on function calls and indirection, especially to avoid
> generating (or just writing) large amounts of code.

I'm reluctant to call that "optimizing for Racket CS", though. It's
closer to "not optimizing manually".

> I think a good way to contribute to this
> effort would be to explore the performance costs of generic interfaces on
> Racket CS. Make some benchmarks, try using collections other than just
> lists and hashes, and see what happens in real-world codebases. Don't just
> look at the speed differences either: consider what impact the flexibility
> of generic interfaces has on the structure of the codebase.

Agreed.


At Sun, 23 Feb 2020 12:27:42 +0530, Alexis King wrote:
> I’m curious what guarantees I can come to expect from the Racket CS
> optimizer.

I think you have in mind what I would call high-level optimizations,
and there's not much difference between Racket CS and BC at that level.

> But my understanding is that Chez’s optimizer is much more
> sophisticated than Racket BC’s.

I wouldn't characterize it that way. If anything, Racket BC is more
sophisticated and aggressive than Chez Scheme on high-level
optimizations, especially around type reconstruction and cross-module
optimizations. (For Racket CS, Gustavo added a type-reconstruction pass
to Chez Scheme, and schemify handles cross-module optimization.)

Chez Scheme's advantages are more on the back end: register allocation,
calling conventions, and continuation management.

> At the very least, I’d expect it to be able to identify that `v`
> is never used here and simplify it to this:
>
>     (let loop ([i 0])
>       (define j (add1 i))
>       (if (< j N)
>           (loop j)
>           i))

None of Racket BC, Racket CS, or Chez Scheme by itself will optimize
away the unused loop argument. That kind of interprocedural dead-code
elimination is out of reach for the current compilers, except to the
degree that inlining turns them into intraprocedural questions.

> Now I’d wonder: if N is a known constant, could the optimizer just delete the
> loop?

Only for short loops: Racket BC will turn that into a constant for N up
to 15. Racket CS will turn it into a constant only for N as 0. (Well,
almost; both of them leave a `check-range` call behind, because they're
not willing to inline it.) The difference between Racket CS and BC in
this case reflects more aggressive inlining in Racket BC.

> Could the whole thing be optimized to (sub1 N)? Given the performance
> numbers in the blog post, I’m guessing the answer is no. Is the reason just
> that Chez doesn’t perform this kind of optimization, or is there something
> more fundamental?

Converting loops into closed form requires more sophisticated and
general induction reasoning than is typical in a compiler. Or it needs
ad hoc pattern matching to cover a few cases --- which I have seen gcc
do, but I don't think that's very common.

Gustavo Massaccesi

Feb 23, 2020, 9:25:20 PM
to Matthew Flatt, Matthew Butterick, Jack Firth, Alexis King, Racket Users
Some minor remarks:

On Sun, Feb 23, 2020 at 9:58 AM Matthew Flatt <mfl...@cs.utah.edu> wrote:
> [Replying to three messages]
>
> At Sat, 22 Feb 2020 08:05:28 -0800, Matthew Butterick wrote:
> > 1) As a package maintainer with zero exposure to Chez Scheme, how do I start
> > to optimize for Racket CS? Are there certain patterns and idioms from BC that
> > should be systematically eradicated?
>
> No. For most Racket users, you should *not* try to optimize for Racket
> CS, and Racket CS should just do the right thing.


I agree, 99.9% of the time just write nice idiomatic code and hope the optimizer will do the right thing. 


 
> > But my understanding is that Chez’s optimizer is much more
> > sophisticated than Racket BC’s.
>
> I wouldn't characterize it that way. If anything, Racket BC is more
> sophisticated and aggressive than Chez Scheme on high-level
> optimizations, especially around type reconstruction and cross-module
> optimizations. (For Racket CS, Gustavo added a type-reconstruction pass
> to Chez Scheme, and schemify handles cross-module optimization.)
>
> Chez Scheme's advantages are more on the back end: register allocation,
> calling conventions, and continuation management.

About the type recovery part: RacketCS can get the type information from more primitives than RacketBC.
It was very handy that Chez Scheme had a list with the signatures of all the primitives.
On the other hand, RacketCS still doesn't have some tricks like
    (bitwise-and <exact-integer?> <positive-fixnum?>)  ==>  <positive-fixnum?> 


 
> > At the very least, I’d expect it to be able to identify that `v`
> > is never used here and simplify it to this:
> >
> >     (let loop ([i 0])
> >       (define j (add1 i))
> >       (if (< j N)
> >           (loop j)
> >           i))
>
> None of Racket BC, Racket CS, or Chez Scheme by itself will optimize
> away the unused loop argument. That kind of interprocedural dead-code
> elimination is out of reach for the current compilers, except to the
> degree that inlining turns them into intraprocedural questions.

I'm more optimistic, and I think that detecting that an argument is unused and deleting it is not so difficult, at least in tight loops. It is similar to the lifting of closures that is already done.
If this pattern is common enough, I think this can be added in a week as an independent pass (that's like a month of wall-clock time :) ).
(This is one of the advantages of RacketCS. It's much easier to write complex transformations. Writing this in the traditional Racket is theoretically possible, but it would be more difficult.)
(It would be better to add it as part of cp0 or as part of cptypes, or as part of a more complete pass about loops. So don't expect to see this in a month.)



Gustavo

Alexis King

Feb 23, 2020, 10:25:42 PM
to Matthew Flatt, Racket Users
> On Feb 23, 2020, at 18:28, Matthew Flatt <mfl...@cs.utah.edu> wrote:
>
> None of Racket BC, Racket CS, or Chez Scheme by itself will optimize
> away the unused loop argument. That kind of interprocedural dead-code
> elimination is out of reach for the current compilers, except to the
> degree that inlining turns them into intraprocedural questions.

Got it, thanks, that makes sense. For perspective, I am coming from GHC-land, where such optimizations are near-guaranteed: the demand analyzer would see the argument is unused and drop it after a worker/wrapper split. That said, I realize the luxury of purity and non-strict evaluation makes it easier for GHC to be much more aggressive.

> Converting loops into closed form requires more sophisticated and
> general induction reasoning than is typical in a compiler. Or it needs
> ad hoc pattern matching to cover a few cases --- which I have seen gcc
> do, but I don't think that's very common.

Yes, that makes sense, and this part is not shocking. I don’t think GHC manages that optimization, either.

Hendrik Boom

Feb 24, 2020, 7:31:25 AM
to Racket Users
On Sun, Feb 23, 2020 at 05:58:30AM -0700, Matthew Flatt wrote:

>
> Converting loops into closed form requires more sophisticated and
> general induction reasoning than is typical in a compiler. Or it needs
> ad hoc pattern matching to cover a few cases --- which I have seen gcc
> do, but I don't think that's very common.
>

Of interest here, though not really converting loops into closed form:

A language I saw proposed in the 70's has a loop construct where the
request to start a new iteration was done by an 'again' statement.

loop i := 3
    ...
    ...
    if ...
    then
        ...
        ...
        again i+1
    endif
endloop

Of course the again could occur in positions where it was a genuine
recursive call. In most situations, though, it would be recognised as
tail recursion and be treated like a normal iteration.

And yes, there could be more than one loop variable, which would
translate into more than one function parameter.

In a syntactic situation like this, it would be easier to eliminate a
loop variable that was not used, because all the calls to the loop body
would be explicitly present within the syntax.
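In Racket terms, this construct corresponds closely to a named let, where every 'again' is an explicit tail call (example values are illustrative):

```racket
#lang racket/base

;; Each recursive call to `loop` plays the role of "again": all calls
;; to the loop body are explicitly present in the syntax, so an unused
;; loop variable would be easy to spot.
(define (count-up-to limit)
  (let loop ([i 3] [acc '()])
    (if (< i limit)
        (loop (add1 i) (cons i acc))  ; the "again i+1" step
        (reverse acc))))

(count-up-to 7) ; => '(3 4 5 6)
```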

-- hendrik

Matthew Butterick

Feb 25, 2020, 9:49:02 AM
to Matthew Flatt, Racket Users
What can you say about places & parallelism under CS vs. BC? This is one area where I find CS reliably underperforms BC (by about a factor of 2). Place creation seems slower under CS. More interestingly, however, the utilization of multiple cores seems inefficient. For instance, when I run a full parallel Pollen render under BC, my cores jump to 90-100% and stay there. Fans speed up, struggling to eject the heat of this awesome computation. Under CS, by contrast, the cores saunter along in a more leisurely range of about 40-50% (which would itself account for the 2x perf difference).

Matthew Flatt

Feb 25, 2020, 10:05:49 AM
to Matthew Butterick, Racket Users
At Tue, 25 Feb 2020 06:48:53 -0800, Matthew Butterick wrote:
> What can you say about places & parallelism under CS vs. BC? This is
> one area that I find CS to reliably underperform BC (by about a
> factor of 2). Place creation seems slower under CS. More
> interestingly however, the utilization of multiple cores seems
> inefficient.

Place creation probably appears slower because loading code is slower.
The creation of places is actually around 10 times as fast. To check
that, try

#lang racket/base
(require racket/place)

(time
 (for ([i (in-range 10)])
   (place-wait (dynamic-place '(file "/tmp/go.rkt") 'void))))

where "/tmp/go.rkt" starts as

(module m '#%kernel
  (#%provide void))

The loop should run around 10 times as fast in CS. But change
`'#%kernel` in "go.rkt" to `racket/base`, and the CS is only about 2
times as fast. Change it to `racket`, and CS is about the same or
slower.


Utilization of multiple cores is lower primarily (I think) due to a
shared allocation heap in CS:

* BC gives each place its own allocation heap (with one extra heap for
shared objects, such as place channels), so different places can
allocate and GC independently.

At least, that's true up to the limits of a shared page table. That
limit becomes significant at around 10 places, because generational
GC relies on page-level protection, and then OS pagetable operations
become a bottleneck.

* CS has a single heap with a single-threaded, stop-the-world GC ---
so allocation and GC easily become a bottleneck.

If GHC's experience is any guide, making the GC itself multithreaded
may address much of this problem.

Locks on shared system data structures may also be a significant
obstacle to CPU utilization with places in CS, but I'm not sure.

Matthew Butterick

Feb 25, 2020, 12:03:32 PM
to Matthew Flatt, Racket Users

> On Feb 25, 2020, at 7:05 AM, Matthew Flatt <mfl...@cs.utah.edu> wrote:
>
> * CS has a single heap with a single-threaded, stop-the-world GC ---
> so allocation and GC easily become a bottleneck.
>
> If GHC's experience is any guide, making the GC itself multithreaded
> may address much of this problem.
>
> Locks on shared system data structures may also be a significant
> obstacle to CPU utilization with places in CS, but I'm not sure.



FWIW some quick timings on a Pollen render of practicaltypography.com. Though extra cores have diminishing net returns under Racket BC, the returns are still positive. Under Racket CS, by contrast, net performance degrades with more than 4 cores.

Racket BC

single core
real 4m21.191s
user 3m37.940s
sys 0m42.388s

parallel @ 2 cores
real 2m46.235s
user 4m22.160s
sys 0m56.270s

parallel @ 3 cores
real 1m54.134s
user 4m10.330s
sys 0m54.533s

parallel @ 4 cores
real 1m43.055s
user 4m46.933s
sys 1m5.948s

parallel @ 6 cores
real 1m34.783s
user 6m8.522s
sys 1m32.125s

parallel @ 8 cores
real 1m18.137s
user 6m24.778s
sys 1m38.617s

parallel @ 12 cores
real 1m14.924s
user 8m30.239s
sys 2m14.671s

Racket CS

single core
real 5m1.422s
user 4m16.300s
sys 0m44.253s

parallel @ 2 cores
real 3m25.016s
user 4m45.385s
sys 0m54.634s

parallel @ 3 cores
real 2m52.780s
user 4m57.951s
sys 1m3.184s

parallel @ 4 cores
real 2m42.471s
user 5m22.796s
sys 1m17.889s

parallel @ 6 cores
real 2m44.513s
user 6m26.700s
sys 1m54.549s

parallel @ 8 cores
real 2m56.782s
user 8m4.029s
sys 2m58.554s

parallel @ 12 cores
real 3m2.116s
user 9m34.846s
sys 5m5.443s



John Cowan

unread,
Feb 25, 2020, 12:13:07 PM2/25/20
to Matthew Butterick, Matthew Flatt, Racket Users
Perhaps separate OS processes would be a win in this case?


Siddhartha Kasivajhula

Apr 17, 2020, 9:09:18 PM
to John Cowan, Matthew Butterick, Matthew Flatt, Racket Users
I've been regularly running "smoke" benchmarks for the relation library, and just updated from Racket 7.5 (BC) to Racket CS 7.6. In case the data is useful to anybody, here are the before/after times. These benchmarks exercise elementary operations like comparisons for equality and order, function composition, type transformations, algebraic composition of numbers, strings, and so on. On these benchmarks, Racket CS seems to be faster across the board:

Equivalence relations:
BC: 529ms
CS: 444ms

BC: 1661ms
CS: 1347ms

Order relations:
BC: 778ms
CS: 539ms

BC: 4621ms
CS: 3389ms

Functional operations:
BC: 442ms
CS: 435ms

BC: 2562ms
CS: 2159ms

Type transformations:
BC: 410ms
CS: 354ms

BC: 599ms
CS: 598ms

Algebraic operators:
BC: 600ms
CS: 494ms

BC: 8876ms
CS: 6740ms





