[PATCH] v4 recycle kernel cache via weak pointer

27 views
Skip to first unread message

Qian Yun

unread,
May 15, 2026, 12:28:23 AMMay 15
to fricas-devel
Added support for clisp, cmucl, ecl, sbcl.

Benchmark (execution time and memory footprint) for mapleok.input:
(after patch vs before patch)

====
)time long
)storage long
systemCommand "read src/input/mapleok.input"
====

sbcl:
10.2s 7.5GB vs 14.4s 13.5GB

cmucl:
17.4s 5.3GB vs 24.9s 8.3GB

clisp:
132s 7.7GB vs 222s 13.7GB

ecl:
46s vs 79s


I've added debugging facility, so that you can test its effects
without compiling twice:

by using "setExpandCacheThreshold(6000)$SCACHE Kernel EXPR INT"
you effectively disable recycle on kernel cache.


I'll try to find more benchmarks as well.

BTW, lispworks and clozurecl doesn't have weak pointers,
but lispworks has weak hashtable and clozurecl has weak array.
They can be implemented, but a bit more complicated.

- Qian
recycle-kernel-cache-v4.patch

Qian Yun

unread,
May 15, 2026, 5:54:48 AMMay 15
to fricas-devel
Benchmark on integ.input (replace the 2 "testcase" with
"testcaseNoClear"):

Kernels drop from 4461 to 831 with patch.

(before vs after)

clisp:
78.6s (18.0s GC) 4.4GB vs 70.1s (16.8s GC) 4.2GB

sbcl:
8.9s (0.3s GC) 7.5GB vs 8.0s (0.6s GC) 7.3GB

cmucl:
18.4s (2.9s GC) 8.1GB vs 14.2 (2.5s GC) 4.8GB

ecl:
27.7s vs 23.7s

- Qian

Qian Yun

unread,
May 15, 2026, 11:46:53 PMMay 15
to fricas-devel
Simplified the code a bit more. See attachment.

- Qian
recycle-kernel-cache-v5.patch

Qian Yun

unread,
May 19, 2026, 10:15:43 AMMay 19
to fricas-devel
Fixed 2 bugs:

1. potential dangling weak pointer left in COMPACT_WEAK_ARRAY.

2. search could insert new kernels, which could trigger compact
process, so use new function 'get_index' to handle this.

- Qian
recycle-kernel-cache-v6.patch

Qian Yun

unread,
May 21, 2026, 12:30:13 AMMay 21
to fricas-devel
Another benchmark, regarding "Slow formal derivatives",
for 'src/input/derham.input',

sbcl (before vs after):

n:=8, kernel count drop from 8866 to 187

5.5s (0.04s GC) 2.18GB
2.6s (0.33s GC) 0.69GB

n:=9, kernel count drop from 9484 to 236

9.6s (0.05s GC) 4.32GB
4.1s (0.43s GC) 1.28GB

n:=10, kernel count drop from 12976 to 292

16.3s (0.14s GC) 7.98GB
6.2s (0.52s GC) 2.23GB

BTW, I'm testing on reusing dummy variables names instead using
new symbols every time, this should further decrease the number
of new kernels.

- Qian

Waldek Hebisch

unread,
Jun 24, 2026, 9:39:08 AM (2 days ago) Jun 24
to fricas...@googlegroups.com
On Thu, May 21, 2026 at 12:30:08PM +0800, Qian Yun wrote:
> Another benchmark, regarding "Slow formal derivatives",
> for 'src/input/derham.input',
>
> sbcl (before vs after):
>
> n:=8, kernel count drop from 8866 to 187
>
> 5.5s (0.04s GC) 2.18GB
> 2.6s (0.33s GC) 0.69GB
>
> n:=9, kernel count drop from 9484 to 236
>
> 9.6s (0.05s GC) 4.32GB
> 4.1s (0.43s GC) 1.28GB
>
> n:=10, kernel count drop from 12976 to 292
>
> 16.3s (0.14s GC) 7.98GB
> 6.2s (0.52s GC) 2.23GB

Some comments.

First, I have applied your patch to my private tree. I intend to do
some heavier tests, but ATM in light testing I see no problems and
there is nice speedup on mapleok.input.

I am a bit concerned with GC time. Your timings indicate that
it is significantly higher than before. Usualy savings on
arithmetic will be bigger than increase in GC time, but in
some cases (basically when operator names differ) we can
do comparisons without arithmetic and in some cases
artithmetic may be cheap. So there is risk of increased
runtime in such cases.

ATM it seems to me that triggering expansion in geometric
progression (like the old code did) is safer than linear
increments. IIUC you use high threshold (that is 300)
so you normally get just one increment. But with really
large number of active kernels there would be a lot of
increments and lot of GC-s.

Another question is tactic that you use, namely triggering
full GC to collect unneded kernels. IIUC this optimizes
speed in case when there is moderate number of kernels
and no pressure to collect unneded kernels. But IIUC
intended use of weak pointers is that GC happen naturally
and collect data available via weak pointers. So, normal
use would be to have weak pointers all the time and check
for dangling pointer on each access. That clearly would
need more code, so possibly increase time needed to access
kernels in normal case. But is should lower GC time. In
particular, full GC is expensive and in sbcl most GC-s are
incremental, so much cheaper. I do not know if such variant
is better, but it would be good to test.

--
Waldek Hebisch

Qian Yun

unread,
Jun 24, 2026, 10:37:31 AM (2 days ago) Jun 24
to fricas...@googlegroups.com
On 6/24/26 9:39 PM, Waldek Hebisch wrote:
>
> Some comments.
>
> First, I have applied your patch to my private tree. I intend to do
> some heavier tests, but ATM in light testing I see no problems and
> there is nice speedup on mapleok.input.
>
> I am a bit concerned with GC time. Your timings indicate that
> it is significantly higher than before. Usualy savings on
> arithmetic will be bigger than increase in GC time, but in
> some cases (basically when operator names differ) we can
> do comparisons without arithmetic and in some cases
> artithmetic may be cheap. So there is risk of increased
> runtime in such cases.

With my recent changes to optimize =$EXPR, the price of
kernel comparison is cheaper, so we can tolerate for
larger kernel cache threshold, so fewer numbers of GCs,
fewer time spent in GC.

> ATM it seems to me that triggering expansion in geometric
> progression (like the old code did) is safer than linear
> increments. IIUC you use high threshold (that is 300)
> so you normally get just one increment. But with really
> large number of active kernels there would be a lot of
> increments and lot of GC-s.

The old way is "cache_size := 2*cache_size + 10":

0 -> 10 -> 30 -> 70 -> 150 -> 310 -> 630 -> 1270 -> 2550

If the gap grows to hundreds, or thousands, then it will
be a long time to trigger next kernel recycle.

I can test and compare both methods.

Also, if we assume that in a long run of computation,
latter computation does not depends on previous ones,
then the the number of kernels should grows linearly,
thus the recycle of kernels should also be done linearly.

As I said in above, because of optimization in =$EXPR,
current linear threshold could be 600 instead of 300.

> Another question is tactic that you use, namely triggering
> full GC to collect unneded kernels. IIUC this optimizes
> speed in case when there is moderate number of kernels
> and no pressure to collect unneded kernels. But IIUC
> intended use of weak pointers is that GC happen naturally
> and collect data available via weak pointers. So, normal
> use would be to have weak pointers all the time and check
> for dangling pointer on each access. That clearly would
> need more code, so possibly increase time needed to access
> kernels in normal case. But is should lower GC time. In
> particular, full GC is expensive and in sbcl most GC-s are
> incremental, so much cheaper. I do not know if such variant
> is better, but it would be good to test.
>

If we do not use full gc, the default generational GC
will not do a full recycle, causing the kernel cache to
remain relatively big, thus the time saving is not great.
(I tested that before. So instead of shrinking 3000 kernels
to 300, it could only shrink to 2000 instead.)

- Qian

Waldek Hebisch

unread,
Jun 24, 2026, 11:28:28 AM (2 days ago) Jun 24
to fricas...@googlegroups.com
On Wed, Jun 24, 2026 at 10:37:26PM +0800, Qian Yun wrote:
> On 6/24/26 9:39 PM, Waldek Hebisch wrote:
> >
> > Some comments.
> >
> > First, I have applied your patch to my private tree. I intend to do
> > some heavier tests, but ATM in light testing I see no problems and
> > there is nice speedup on mapleok.input.
> >
> > I am a bit concerned with GC time. Your timings indicate that
> > it is significantly higher than before. Usualy savings on
> > arithmetic will be bigger than increase in GC time, but in
> > some cases (basically when operator names differ) we can
> > do comparisons without arithmetic and in some cases
> > artithmetic may be cheap. So there is risk of increased
> > runtime in such cases.
>
> With my recent changes to optimize =$EXPR, the price of
> kernel comparison is cheaper, so we can tolerate for
> larger kernel cache threshold, so fewer numbers of GCs,
> fewer time spent in GC.

Part of my worry is looking into future. On bigger machines
we probably can handle more kernels. Fixed threshold can
be quite good today, but is likely to be suboptimal in
changed use. Good adaptive strategy has chance of working
with over much wider range of parameters.

> > ATM it seems to me that triggering expansion in geometric
> > progression (like the old code did) is safer than linear
> > increments. IIUC you use high threshold (that is 300)
> > so you normally get just one increment. But with really
> > large number of active kernels there would be a lot of
> > increments and lot of GC-s.
>
> The old way is "cache_size := 2*cache_size + 10":
>
> 0 -> 10 -> 30 -> 70 -> 150 -> 310 -> 630 -> 1270 -> 2550
>
> If the gap grows to hundreds, or thousands, then it will
> be a long time to trigger next kernel recycle.
>
> I can test and compare both methods.
>
> Also, if we assume that in a long run of computation,
> latter computation does not depends on previous ones,
> then the the number of kernels should grows linearly,
> thus the recycle of kernels should also be done linearly.

Typical reasoning uses notion of "working set". We want
whole working set in the cache and we want to collect
kernels not in the working set. We do not know how
big is the working set, so we approvimate size by say
power of 2. If we get above our current estimate of
the size we collect. If after collection too many
kernels survive we increase the estimate, if much
smaller number survive we decrese the estimate.

Assuming that number of kernels is proportional to size
of all date cost of GC will be proportinal to number
of kernels. Using linear strategy to trigger GC
gives quagratic run time. OTOH increasing number of
kernels in cache by constant factor should only
modestly increase search time. That is when binary
search works it is just 1 or two extra comparisons.
When using linear search number of comparisons
increases by the same factor as cache size. So
bigger cache means modest increase in search time.
But it cuts number of garbage collections so that total
time spent on garbage collection remains linear.

> As I said in above, because of optimization in =$EXPR,
> current linear threshold could be 600 instead of 300.
>
> > Another question is tactic that you use, namely triggering
> > full GC to collect unneded kernels. IIUC this optimizes
> > speed in case when there is moderate number of kernels
> > and no pressure to collect unneded kernels. But IIUC
> > intended use of weak pointers is that GC happen naturally
> > and collect data available via weak pointers. So, normal
> > use would be to have weak pointers all the time and check
> > for dangling pointer on each access. That clearly would
> > need more code, so possibly increase time needed to access
> > kernels in normal case. But is should lower GC time. In
> > particular, full GC is expensive and in sbcl most GC-s are
> > incremental, so much cheaper. I do not know if such variant
> > is better, but it would be good to test.
> >
>
> If we do not use full gc, the default generational GC
> will not do a full recycle, causing the kernel cache to
> remain relatively big, thus the time saving is not great.
> (I tested that before. So instead of shrinking 3000 kernels
> to 300, it could only shrink to 2000 instead.)

But did you try approach suggested above? IIUC you have
weak pointers only when you decide if you need to expand
the cache. It is likely that there were several incremental
GC between time that kernel is created and time when
you try to GC kernels. If a piece of data sorvives few
(maybe only 2) incremental GC it is promoted to stable
pool and only collected by full GC. Note: I am writing
"incremental" and "full" as if there were only two
possibilities. In general generational GC may have
several generations, but this does not change the
principle: if you do not collect garbage prompty, then
you need to do more work to collect it.

Using weak pointers all the time means that kernels can by
collected during incremental GC.
--
Waldek Hebisch

Qian Yun

unread,
Jun 24, 2026, 7:24:53 PM (2 days ago) Jun 24
to fricas...@googlegroups.com
On 6/24/26 11:28 PM, Waldek Hebisch wrote:
>
> Part of my worry is looking into future. On bigger machines
> we probably can handle more kernels. Fixed threshold can
> be quite good today, but is likely to be suboptimal in
> changed use. Good adaptive strategy has chance of working
> with over much wider range of parameters.
>

I think the constant threshold is already adaptive:
let's say threshold is 1000. The majority of cost should
not be binary search, but the linear search when inserting
new kernels.

So when kernel count grows from 0 to 1000, half a million
comparison is needed. From 1000 to 2000, 1.5 million comparison.

So when the kernel count is large, not doing kernel recycle
has a larger cost. So the constant threshold makes more
sense than exponential one.

Also, full gc is not that expensive, certainly cheaper than
millions of comparison, which itself could use enough memory
to trigger gc.

>
> But did you try approach suggested above? IIUC you have
> weak pointers only when you decide if you need to expand
> the cache. It is likely that there were several incremental
> GC between time that kernel is created and time when
> you try to GC kernels. If a piece of data sorvives few
> (maybe only 2) incremental GC it is promoted to stable
> pool and only collected by full GC. Note: I am writing
> "incremental" and "full" as if there were only two
> possibilities. In general generational GC may have
> several generations, but this does not change the
> principle: if you do not collect garbage prompty, then
> you need to do more work to collect it.
>
> Using weak pointers all the time means that kernels can by
> collected during incremental GC.

OK, so you want to utilize the normal gc (gc triggered by
computation elsewhere) to do the recycle, instead of the
manually triggered gc.

One potential problem is the dangling weak pointer could
complicate the binary search process.

Anyway, I doubt that GC time will cause the recycle
kernels approach to be noticeably slower, even in the
case of repeatedly creating different simple new kernels.

- Qian

Reply all
Reply to author
Forward
0 new messages