[PATCH] v4 recycle kernel cache via weak pointer

Qian Yun

unread,

May 15, 2026, 12:28:23 AMMay 15

to fricas-devel

Added support for clisp, cmucl, ecl, sbcl.

Benchmark (execution time and memory footprint) for mapleok.input:
(after patch vs before patch)

====
)time long
)storage long
systemCommand "read src/input/mapleok.input"
====

sbcl:
10.2s 7.5GB vs 14.4s 13.5GB

cmucl:
17.4s 5.3GB vs 24.9s 8.3GB

clisp:
132s 7.7GB vs 222s 13.7GB

ecl:
46s vs 79s

I've added debugging facility, so that you can test its effects
without compiling twice:

by using "setExpandCacheThreshold(6000)$SCACHE Kernel EXPR INT"
you effectively disable recycle on kernel cache.

I'll try to find more benchmarks as well.

BTW, lispworks and clozurecl doesn't have weak pointers,
but lispworks has weak hashtable and clozurecl has weak array.
They can be implemented, but a bit more complicated.

- Qian

recycle-kernel-cache-v4.patch

Qian Yun

unread,

May 15, 2026, 5:54:48 AMMay 15

to fricas-devel

Benchmark on integ.input (replace the 2 "testcase" with
"testcaseNoClear"):

Kernels drop from 4461 to 831 with patch.

(before vs after)

clisp:
78.6s (18.0s GC) 4.4GB vs 70.1s (16.8s GC) 4.2GB

sbcl:
8.9s (0.3s GC) 7.5GB vs 8.0s (0.6s GC) 7.3GB

cmucl:
18.4s (2.9s GC) 8.1GB vs 14.2 (2.5s GC) 4.8GB

ecl:
27.7s vs 23.7s

- Qian

Qian Yun

unread,

May 15, 2026, 11:46:53 PMMay 15

to fricas-devel

Simplified the code a bit more. See attachment.

- Qian

recycle-kernel-cache-v5.patch

Qian Yun

unread,

May 19, 2026, 10:15:43 AMMay 19

to fricas-devel

Fixed 2 bugs:

1. potential dangling weak pointer left in COMPACT_WEAK_ARRAY.

2. search could insert new kernels, which could trigger compact
process, so use new function 'get_index' to handle this.

- Qian

recycle-kernel-cache-v6.patch

Qian Yun

unread,

May 21, 2026, 12:30:13 AMMay 21

to fricas-devel

Another benchmark, regarding "Slow formal derivatives",
for 'src/input/derham.input',

sbcl (before vs after):

n:=8, kernel count drop from 8866 to 187

5.5s (0.04s GC) 2.18GB
2.6s (0.33s GC) 0.69GB

n:=9, kernel count drop from 9484 to 236

9.6s (0.05s GC) 4.32GB
4.1s (0.43s GC) 1.28GB

n:=10, kernel count drop from 12976 to 292

16.3s (0.14s GC) 7.98GB
6.2s (0.52s GC) 2.23GB

BTW, I'm testing on reusing dummy variables names instead using
new symbols every time, this should further decrease the number
of new kernels.

- Qian

Waldek Hebisch

unread,

Jun 24, 2026, 9:39:08 AMJun 24

to fricas...@googlegroups.com

On Thu, May 21, 2026 at 12:30:08PM +0800, Qian Yun wrote:
> Another benchmark, regarding "Slow formal derivatives",
> for 'src/input/derham.input',
>
> sbcl (before vs after):
>
> n:=8, kernel count drop from 8866 to 187
>
> 5.5s (0.04s GC) 2.18GB
> 2.6s (0.33s GC) 0.69GB
>
> n:=9, kernel count drop from 9484 to 236
>
> 9.6s (0.05s GC) 4.32GB
> 4.1s (0.43s GC) 1.28GB
>
> n:=10, kernel count drop from 12976 to 292
>
> 16.3s (0.14s GC) 7.98GB
> 6.2s (0.52s GC) 2.23GB

Some comments.

First, I have applied your patch to my private tree. I intend to do
some heavier tests, but ATM in light testing I see no problems and
there is nice speedup on mapleok.input.

I am a bit concerned with GC time. Your timings indicate that
it is significantly higher than before. Usualy savings on
arithmetic will be bigger than increase in GC time, but in
some cases (basically when operator names differ) we can
do comparisons without arithmetic and in some cases
artithmetic may be cheap. So there is risk of increased
runtime in such cases.

ATM it seems to me that triggering expansion in geometric
progression (like the old code did) is safer than linear
increments. IIUC you use high threshold (that is 300)
so you normally get just one increment. But with really
large number of active kernels there would be a lot of
increments and lot of GC-s.

Another question is tactic that you use, namely triggering
full GC to collect unneded kernels. IIUC this optimizes
speed in case when there is moderate number of kernels
and no pressure to collect unneded kernels. But IIUC
intended use of weak pointers is that GC happen naturally
and collect data available via weak pointers. So, normal
use would be to have weak pointers all the time and check
for dangling pointer on each access. That clearly would
need more code, so possibly increase time needed to access
kernels in normal case. But is should lower GC time. In
particular, full GC is expensive and in sbcl most GC-s are
incremental, so much cheaper. I do not know if such variant
is better, but it would be good to test.

--
Waldek Hebisch

Qian Yun

unread,

Jun 24, 2026, 10:37:31 AMJun 24

to fricas...@googlegroups.com

On 6/24/26 9:39 PM, Waldek Hebisch wrote:
>
> Some comments.
>
> First, I have applied your patch to my private tree. I intend to do
> some heavier tests, but ATM in light testing I see no problems and
> there is nice speedup on mapleok.input.
>
> I am a bit concerned with GC time. Your timings indicate that
> it is significantly higher than before. Usualy savings on
> arithmetic will be bigger than increase in GC time, but in
> some cases (basically when operator names differ) we can
> do comparisons without arithmetic and in some cases
> artithmetic may be cheap. So there is risk of increased
> runtime in such cases.

With my recent changes to optimize =$EXPR, the price of
kernel comparison is cheaper, so we can tolerate for
larger kernel cache threshold, so fewer numbers of GCs,
fewer time spent in GC.

> ATM it seems to me that triggering expansion in geometric
> progression (like the old code did) is safer than linear
> increments. IIUC you use high threshold (that is 300)
> so you normally get just one increment. But with really
> large number of active kernels there would be a lot of
> increments and lot of GC-s.

The old way is "cache_size := 2*cache_size + 10":

0 -> 10 -> 30 -> 70 -> 150 -> 310 -> 630 -> 1270 -> 2550

If the gap grows to hundreds, or thousands, then it will
be a long time to trigger next kernel recycle.

I can test and compare both methods.

Also, if we assume that in a long run of computation,
latter computation does not depends on previous ones,
then the the number of kernels should grows linearly,
thus the recycle of kernels should also be done linearly.

As I said in above, because of optimization in =$EXPR,
current linear threshold could be 600 instead of 300.

> Another question is tactic that you use, namely triggering
> full GC to collect unneded kernels. IIUC this optimizes
> speed in case when there is moderate number of kernels
> and no pressure to collect unneded kernels. But IIUC
> intended use of weak pointers is that GC happen naturally
> and collect data available via weak pointers. So, normal
> use would be to have weak pointers all the time and check
> for dangling pointer on each access. That clearly would
> need more code, so possibly increase time needed to access
> kernels in normal case. But is should lower GC time. In
> particular, full GC is expensive and in sbcl most GC-s are
> incremental, so much cheaper. I do not know if such variant
> is better, but it would be good to test.
>

If we do not use full gc, the default generational GC
will not do a full recycle, causing the kernel cache to
remain relatively big, thus the time saving is not great.
(I tested that before. So instead of shrinking 3000 kernels
to 300, it could only shrink to 2000 instead.)

- Qian

Waldek Hebisch

unread,

Jun 24, 2026, 11:28:28 AMJun 24

to fricas...@googlegroups.com

On Wed, Jun 24, 2026 at 10:37:26PM +0800, Qian Yun wrote:
> On 6/24/26 9:39 PM, Waldek Hebisch wrote:
> >
> > Some comments.
> >
> > First, I have applied your patch to my private tree. I intend to do
> > some heavier tests, but ATM in light testing I see no problems and
> > there is nice speedup on mapleok.input.
> >
> > I am a bit concerned with GC time. Your timings indicate that
> > it is significantly higher than before. Usualy savings on
> > arithmetic will be bigger than increase in GC time, but in
> > some cases (basically when operator names differ) we can
> > do comparisons without arithmetic and in some cases
> > artithmetic may be cheap. So there is risk of increased
> > runtime in such cases.
>
> With my recent changes to optimize =$EXPR, the price of
> kernel comparison is cheaper, so we can tolerate for
> larger kernel cache threshold, so fewer numbers of GCs,
> fewer time spent in GC.

Part of my worry is looking into future. On bigger machines
we probably can handle more kernels. Fixed threshold can
be quite good today, but is likely to be suboptimal in
changed use. Good adaptive strategy has chance of working
with over much wider range of parameters.

> > ATM it seems to me that triggering expansion in geometric
> > progression (like the old code did) is safer than linear
> > increments. IIUC you use high threshold (that is 300)
> > so you normally get just one increment. But with really
> > large number of active kernels there would be a lot of
> > increments and lot of GC-s.
>
> The old way is "cache_size := 2*cache_size + 10":
>
> 0 -> 10 -> 30 -> 70 -> 150 -> 310 -> 630 -> 1270 -> 2550
>
> If the gap grows to hundreds, or thousands, then it will
> be a long time to trigger next kernel recycle.
>
> I can test and compare both methods.
>
> Also, if we assume that in a long run of computation,
> latter computation does not depends on previous ones,
> then the the number of kernels should grows linearly,
> thus the recycle of kernels should also be done linearly.

Typical reasoning uses notion of "working set". We want
whole working set in the cache and we want to collect
kernels not in the working set. We do not know how
big is the working set, so we approvimate size by say
power of 2. If we get above our current estimate of
the size we collect. If after collection too many
kernels survive we increase the estimate, if much
smaller number survive we decrese the estimate.

Assuming that number of kernels is proportional to size
of all date cost of GC will be proportinal to number
of kernels. Using linear strategy to trigger GC
gives quagratic run time. OTOH increasing number of
kernels in cache by constant factor should only
modestly increase search time. That is when binary
search works it is just 1 or two extra comparisons.
When using linear search number of comparisons
increases by the same factor as cache size. So
bigger cache means modest increase in search time.
But it cuts number of garbage collections so that total
time spent on garbage collection remains linear.

> As I said in above, because of optimization in =$EXPR,
> current linear threshold could be 600 instead of 300.
>
> > Another question is tactic that you use, namely triggering
> > full GC to collect unneded kernels. IIUC this optimizes
> > speed in case when there is moderate number of kernels
> > and no pressure to collect unneded kernels. But IIUC
> > intended use of weak pointers is that GC happen naturally
> > and collect data available via weak pointers. So, normal
> > use would be to have weak pointers all the time and check
> > for dangling pointer on each access. That clearly would
> > need more code, so possibly increase time needed to access
> > kernels in normal case. But is should lower GC time. In
> > particular, full GC is expensive and in sbcl most GC-s are
> > incremental, so much cheaper. I do not know if such variant
> > is better, but it would be good to test.
> >
>
> If we do not use full gc, the default generational GC
> will not do a full recycle, causing the kernel cache to
> remain relatively big, thus the time saving is not great.
> (I tested that before. So instead of shrinking 3000 kernels
> to 300, it could only shrink to 2000 instead.)

But did you try approach suggested above? IIUC you have
weak pointers only when you decide if you need to expand
the cache. It is likely that there were several incremental
GC between time that kernel is created and time when
you try to GC kernels. If a piece of data sorvives few
(maybe only 2) incremental GC it is promoted to stable
pool and only collected by full GC. Note: I am writing
"incremental" and "full" as if there were only two
possibilities. In general generational GC may have
several generations, but this does not change the
principle: if you do not collect garbage prompty, then
you need to do more work to collect it.

Using weak pointers all the time means that kernels can by
collected during incremental GC.
--
Waldek Hebisch

Qian Yun

unread,

Jun 24, 2026, 7:24:53 PMJun 24

to fricas...@googlegroups.com

On 6/24/26 11:28 PM, Waldek Hebisch wrote:
>
> Part of my worry is looking into future. On bigger machines
> we probably can handle more kernels. Fixed threshold can
> be quite good today, but is likely to be suboptimal in
> changed use. Good adaptive strategy has chance of working
> with over much wider range of parameters.
>

I think the constant threshold is already adaptive:
let's say threshold is 1000. The majority of cost should
not be binary search, but the linear search when inserting
new kernels.

So when kernel count grows from 0 to 1000, half a million
comparison is needed. From 1000 to 2000, 1.5 million comparison.

So when the kernel count is large, not doing kernel recycle
has a larger cost. So the constant threshold makes more
sense than exponential one.

Also, full gc is not that expensive, certainly cheaper than
millions of comparison, which itself could use enough memory
to trigger gc.

>
> But did you try approach suggested above? IIUC you have
> weak pointers only when you decide if you need to expand
> the cache. It is likely that there were several incremental
> GC between time that kernel is created and time when
> you try to GC kernels. If a piece of data sorvives few
> (maybe only 2) incremental GC it is promoted to stable
> pool and only collected by full GC. Note: I am writing
> "incremental" and "full" as if there were only two
> possibilities. In general generational GC may have
> several generations, but this does not change the
> principle: if you do not collect garbage prompty, then
> you need to do more work to collect it.
>
> Using weak pointers all the time means that kernels can by
> collected during incremental GC.

OK, so you want to utilize the normal gc (gc triggered by
computation elsewhere) to do the recycle, instead of the
manually triggered gc.

One potential problem is the dangling weak pointer could
complicate the binary search process.

Anyway, I doubt that GC time will cause the recycle
kernels approach to be noticeably slower, even in the
case of repeatedly creating different simple new kernels.

- Qian

Qian Yun

unread,

Jun 26, 2026, 8:20:57 AMJun 26

to fricas...@googlegroups.com

On 6/25/26 7:24 AM, Qian Yun wrote:
> On 6/24/26 11:28 PM, Waldek Hebisch wrote:
>>
>> Part of my worry is looking into future. On bigger machines
>> we probably can handle more kernels. Fixed threshold can
>> be quite good today, but is likely to be suboptimal in
>> changed use. Good adaptive strategy has chance of working
>> with over much wider range of parameters.
>>
>
> I think the constant threshold is already adaptive:
> let's say threshold is 1000. The majority of cost should
> not be binary search, but the linear search when inserting
> new kernels.
>
> So when kernel count grows from 0 to 1000, half a million
> comparison is needed. From 1000 to 2000, 1.5 million comparison.
>
> So when the kernel count is large, not doing kernel recycle
> has a larger cost. So the constant threshold makes more
> sense than exponential one.
>
> Also, full gc is not that expensive, certainly cheaper than
> millions of comparison, which itself could use enough memory
> to trigger gc.

Another thought on adaptive:

Say the kernel cache reaches 3000 (n1) and after recycle
it remains 800 (n2), then the threshold could be set to
the double of that -- 1600 (2*n2). (Or another multiplier.)

To prevent frequent full gc when the kernel cache is small,
it could be set to min(2*n2, 600).

>>
>> But did you try approach suggested above? IIUC you have
>> weak pointers only when you decide if you need to expand
>> the cache. It is likely that there were several incremental
>> GC between time that kernel is created and time when
>> you try to GC kernels. If a piece of data sorvives few
>> (maybe only 2) incremental GC it is promoted to stable
>> pool and only collected by full GC. Note: I am writing
>> "incremental" and "full" as if there were only two
>> possibilities. In general generational GC may have
>> several generations, but this does not change the
>> principle: if you do not collect garbage prompty, then
>> you need to do more work to collect it.
>>
>> Using weak pointers all the time means that kernels can by
>> collected during incremental GC.
>
> OK, so you want to utilize the normal gc (gc triggered by
> computation elsewhere) to do the recycle, instead of the
> manually triggered gc.
>
> One potential problem is the dangling weak pointer could
> complicate the binary search process.
>

One way to prevent dangling weak pointer would be calling
a function in post-gc-hook to remove those after gc.

Not all common lisp have "post-gc-hook", only sbcl/cmucl/ccl.

This could be implemented as an additional feature,
as it is not universal. It could save a few full gc calls,
but not much, I guess.

- Qian

Waldek Hebisch

unread,

Jun 26, 2026, 10:24:10 AMJun 26

to fricas...@googlegroups.com

On Fri, Jun 26, 2026 at 08:20:52PM +0800, Qian Yun wrote:
> On 6/25/26 7:24 AM, Qian Yun wrote:
> > On 6/24/26 11:28 PM, Waldek Hebisch wrote:
> >>
> >> Part of my worry is looking into future. On bigger machines
> >> we probably can handle more kernels. Fixed threshold can
> >> be quite good today, but is likely to be suboptimal in
> >> changed use. Good adaptive strategy has chance of working
> >> with over much wider range of parameters.
> >>
> >
> > I think the constant threshold is already adaptive:
> > let's say threshold is 1000. The majority of cost should
> > not be binary search, but the linear search when inserting
> > new kernels.
> >
> > So when kernel count grows from 0 to 1000, half a million
> > comparison is needed. From 1000 to 2000, 1.5 million comparison.
> >
> > So when the kernel count is large, not doing kernel recycle
> > has a larger cost. So the constant threshold makes more
> > sense than exponential one.

No. Which searches to do depends only on the program. Cost of
the search depends on number of kernels in the cache at time
of the search. Doubling number of kernels should lead to
at most doubling search time. OTOH cost of full GC depends on
total size of data. Assiming that total size is proportional
to number of kernels linearly increasing threshold will lead
to linearly growing number of full GC-s, that is quadratic
cost. Currently you are using large starting treshold so you
need quite big computaion to see this growth, but linear
increase of threshold will not scale.

> > Also, full gc is not that expensive, certainly cheaper than
> > millions of comparison, which itself could use enough memory
> > to trigger gc.

Not "certainly". By in normal build using sbcl we have relativly
low limit on total size. Still, whit 500 Mb of total data full
GC is likely to need 500 Mb copy. If you compare plain symbols
comparisons are cheap and millions of comparisons should be
doable in time needed for copy. And I have done computations
that used tens of Gb of memory.

mapleok.input is somewhat atypical, it contains a lot of tests,
each independent from others. Since the integrals there are
definite integrals, there is a lot of limit computations
which tends to produce significant number of new kernels.
So rate of producing new kernels is high while data size on
average is quite modest. In real computation much more data
would stay live. Data you showed indicate that cost of
increases significantly and is nontrivial. In the specific
case there still win from having less comparisons, but
with more live data the opposite could be true.

Having said the above, I understand that in case of mapleok
it is preferable to do more frequent GC. So factor of 2
may be too skewed towards saving GC time. But no law of
nature says to 2 is correct factor. Other factor, for example
1.5 may be better. Or, as I have suggested collect kernels
mostly as byproduct of normal GC.

> Another thought on adaptive:
>
> Say the kernel cache reaches 3000 (n1) and after recycle
> it remains 800 (n2), then the threshold could be set to
> the double of that -- 1600 (2*n2). (Or another multiplier.)

Yes, the idea is that we recycle first and only after that
we decide if we need to expand (or possibly shrink).

I do not think dangling weak pointers are very hard to handle.
Linear search can simply skip them. Binary search should
use one of neigbours for comparison, but track bounds so that
position of dangling pointer is excluded from later iterations.
We already have code to account for kernels inserted during
search. AFAICS handling of dangling weak pointers is needed
in the same places, leading to managable complication.

--
Waldek Hebisch

Qian Yun

unread,

Jun 26, 2026, 10:49:55 AMJun 26

to fricas...@googlegroups.com

This is where we disagree. You consider the time consuming part
in Kernel is binary search, which is log(n). I consider most
time spent in Kernel is during creation, aka the linear search part.

Let's say we are about to do some computation, which requires to
create 200 new kernels.

If current kernel cache length is 1000, then 1000*1200/2=0.6 million
kernel comparison ("triage") are needed.

If current kernel cache length is 10000, then 10000*10200/2=51 million
kernel comparison ("triage") are needed.

Compared to that, a few full GC is not significant at all.

- Qian

Qian Yun

unread,

Jun 26, 2026, 11:07:06 AMJun 26

to fricas...@googlegroups.com

Oops, the numbers should be:
(1000+1200)*200/2 = 220K vs. (10000+10200)*200/2=2.02M

- Qian

Waldek Hebisch

unread,

Jun 26, 2026, 11:50:36 AMJun 26

to fricas...@googlegroups.com

I do not see how do you arrive at those numbers. Inserting 200
new kernels requires 200 searches. Since kernels are new the
searches are linear searches. So, in the first case linear
search costs at most 1000 comparisons, we get 1000*200 in the
first case (and 10000*200 in the second case).

Maybe you are looking at total number of comparisons to get
to given level? In such case numbers are slightly different
but that would be the case before your patch, that is with
_no_ removal of unneded kernels.

Anyway, consider absurdly large number of kernels, say
1000000. Add say 1200. With linear theshold of 600
you will do 2 GC. Assume that first GC was a succes
and trimmed cache back to 1000000. So the cost is
proportional to 2*(2*1000000 + 300)*300/2. With theshold
1200 cost is (2*1000000 + 600)*600/2. The difference is
600*300/2. Compared to total this is 0.00015. That is
second variant is more expensive by factor of 0.00015.
OTOH second variant is doing half of GC-s of the first
variant.

> Compared to that, a few full GC is not significant at all.

Having 10000 kernels in cache when we need 1000, that is 10
times as much as needed clearly leads to significant (that is
10 times) increase in search time. If we have 2 times more
then needed then it is 2 times increase. Full GC has cost
proportinal to total size of live data and size may be quite
nontrivial. You are right that behaviour of kernel cache may
be quite bad for large number of kernels, but assuming that GC
cost does not matter in comparison is not valid. There is a
tradeoff depending on amount of data and cost of comparisons.
In some cases comparisons are cheap and in those cases
binary search is valid, so in principle we could make
those cases much cheaper.

--
Waldek Hebisch

Qian Yun

unread,

Jun 26, 2026, 7:35:40 PMJun 26

to fricas...@googlegroups.com

Notice that the difference is irrelevant to the total (1000000).
We should not look at the ratio (0.00015), instead the
absolute value of difference, it determines the extra time.
So on one hand, we have 90k comparisons, on the other hand,
we have an extra full GC.

So the above is an argument for linear threshold!

>> Compared to that, a few full GC is not significant at all.
>
> Having 10000 kernels in cache when we need 1000, that is 10
> times as much as needed clearly leads to significant (that is
> 10 times) increase in search time. If we have 2 times more
> then needed then it is 2 times increase. Full GC has cost
> proportinal to total size of live data and size may be quite
> nontrivial. You are right that behaviour of kernel cache may
> be quite bad for large number of kernels, but assuming that GC
> cost does not matter in comparison is not valid. There is a
> tradeoff depending on amount of data and cost of comparisons.
> In some cases comparisons are cheap and in those cases
> binary search is valid, so in principle we could make
> those cases much cheaper.
>

Well, when kernel cache grows linearly, the total comparisons
needed are growing by quadratic, let's assume the heap size to
grow linearly. Also the heap size has an upper limit, the
default SBCL heap is 4GB by our configuration.

Let's take [sin x for x in 1..10^6] for example, I estimate it
to be finished in 70000s. Adding kernel cache recycle will not
help it because nothing will be recycled. For linear threshold
600, that's 1666 full GCs, estimated to take 100s.
So 0.14% overhead.

- Qian

Reply all

Reply to author

Forward