[Sbcl-devel] Failure on Arm32 with 78f6a97560f6c2a13f37b6933f11a5ffa7057440

Bruce O'Neel

Oct 17, 2023, 1:07:47 PM
to SBCL Devel
Hi,

I get the following error on ARM32.

; compilation finished in 0:00:05.608
; compiling file "src/code/external-formats/enc-win.lisp" (written 17 OCT 2023 03:13:01 PM):
fatal error encountered in SBCL pid 6164:
unboxed object in scavenge_control_stack: 0xf73114e4->2, start=0xf7310000, end=0xf7313704

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
(GC in progress, oldspace=0, newspace=1)
ldb>        

Git bisect claims it is this change.

Thanks!
cheers
bruce

$ git bisect good                                     
78f6a97560f6c2a13f37b6933f11a5ffa7057440 is the first bad commit               
commit 78f6a97560f6c2a13f37b6933f11a5ffa7057440                                
Author: Stas Boukarev <stas...@gmail.com>                                     
Date:   Tue Oct 17 02:35:07 2023 +0300                                         
                                                                               
    Delay transforming irrational functions to alien calls.                    
                                                                               
    To produce better type derivation after constraint propagation.            
                                                                               
float-math.lisp-expr         | 2 ++                                           
src/compiler/float-tran.lisp | 8 +++++---                                     
tests/float-2.pure.lisp      | 7 ++++++-                                      
3 files changed, 13 insertions(+), 4 deletions(-)     
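
A bisect like the one above can be automated with `git bisect run`, which treats exit status 0 as good, 125 as "skip this commit", and any other status from 1 to 127 as bad. The sketch below is an assumption about how such a run could be scripted, not a record of what was actually done in this thread: the function name `bisect_status`, the log file name, and the decision to skip unrelated build failures are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper for `git bisect run`.
# $1 = build log file, $2 = exit status of the build command.
# Prints the exit status the bisect wrapper should return.
bisect_status() {
    log=$1
    status=$2
    if grep -q 'unboxed object in scavenge_control_stack' "$log"; then
        echo 1      # the crash under investigation: mark the commit bad
    elif [ "$status" -eq 0 ]; then
        echo 0      # build completed: mark the commit good
    else
        echo 125    # some unrelated failure: ask bisect to skip the commit
    fi
}
```

A wrapper script would run the build with output captured (for example `SBCL_ARCH=arm sh make.sh >build.log 2>&1 </dev/null`), save `$?` in a variable, and finish with `exit "$(bisect_status build.log "$status")"`; it would be started with `git bisect run sh ./wrapper.sh`. If the crash is non-deterministic, a single clean build does not prove a commit is good, so the result of such a run should be treated as a hint only.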

Stas Boukarev

Oct 17, 2023, 1:29:41 PM
to Bruce O'Neel, SBCL Devel
I can't reproduce that. And it's unlikely that calls to LOG might have caused a gc problem, it's probably non-deterministic.

_______________________________________________
Sbcl-devel mailing list
Sbcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sbcl-devel

Bruce O'Neel

Oct 17, 2023, 2:25:43 PM
to Stas Boukarev, SBCL Devel
Hi,

The following works for me:

--------------

diff --git a/src/compiler/float-tran.lisp b/src/compiler/float-tran.lisp
index ec5acf1ce..86f78c33f 100644
--- a/src/compiler/float-tran.lisp
+++ b/src/compiler/float-tran.lisp
@@ -482,7 +482,7 @@
              `(progn
                (deftransform ,name ((x) (single-float) ,rtype :node node)
                  (delay-ir1-transform node :ir1-phases)
-                 `(%single-float (,',prim (%double-float x))))
+                 `(coerce (,',prim (coerce x 'double-float)) 'single-float))
                (deftransform ,name ((x) (double-float) ,rtype :node node)
                  (delay-ir1-transform node :ir1-phases)
                  `(,',prim x)))))

------------

As you can see I kept the delay-ir1-transform but restored the original more complex code.  This is kind of a monkey substitution since I don't really understand what is going on here.  I tested also on x86-64 and it doesn't seem to break.

cheers

bruce

Stas Boukarev

Oct 17, 2023, 3:36:43 PM
to Bruce O'Neel, SBCL Devel
That even further demonstrates that LOG itself is not the problem.

Douglas Katzman via Sbcl-devel

Oct 17, 2023, 4:39:55 PM
to Stas Boukarev, SBCL Devel
Could it be a storage-class mixup in a vop ?

Stas Boukarev

Oct 17, 2023, 4:41:06 PM
to Douglas Katzman, SBCL Devel
Quite possible. But, since I can't reproduce it locally, I have no idea how to diagnose this.

Stas Boukarev

Oct 17, 2023, 7:49:55 PM
to Douglas Katzman, SBCL Devel
I have a crash in layouts.pure when running everything, but not when testing just layouts.pure.lisp. But it's not from the stack.

Stas Boukarev

Oct 17, 2023, 8:27:58 PM
to Douglas Katzman, SBCL Devel
It crashes after gc-smoketest.pure.lisp which does layoutless-instance-no-crash. So that's probably not related.
./run-tests.sh gc-smoketest.pure.lisp layouts.pure.lisp crashes on x86-64 too.

Christophe Rhodes

Oct 18, 2023, 5:17:04 PM
to Stas Boukarev, SBCL Devel
Stas Boukarev <stas...@gmail.com> writes:

> I can't reproduce that. And it's unlikely that calls to LOG might have
> caused a gc problem, it's probably non-deterministic.

I tried reproducing today and failed too. I did have to mess with GCC
-march command-line options to get the system to build at all,
particularly with respect to convincing the C compiler that it should
compile with floating point instructions. Is it possible that there
might be a mismatch between the way the SBCL runtime is compiled and the
way the C library expects floating point arguments?

Cheers,

Christophe
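
One way to look into the float-ABI question raised above is to compare the ARM build attributes of the SBCL runtime and of the C library. The sketch below assumes `readelf -A` output is piped in on stdin and only checks for the `Tag_ABI_VFP_args: VFP registers` attribute that hard-float objects normally carry; the function name `float_abi` is hypothetical, and the absence of the tag is reported as "soft-or-unknown" rather than interpreted further.

```shell
#!/bin/sh
# Hypothetical helper: classify `readelf -A` output read from stdin.
# Hard-float (VFP register argument passing) objects normally carry
# the Tag_ABI_VFP_args attribute; anything else is left undecided here.
float_abi() {
    if grep -q 'Tag_ABI_VFP_args: VFP registers'; then
        echo hard
    else
        echo soft-or-unknown
    fi
}
```

It could be used as `readelf -A src/runtime/sbcl | float_abi` and again on the system's libc (on a Debian armhf install this is likely `/lib/arm-linux-gnueabihf/libc.so.6`, but that path is an assumption). Differing answers would point towards the mismatch suggested above.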

Stas Boukarev

Oct 18, 2023, 5:33:56 PM
to Christophe Rhodes, SBCL Devel
Well, there are no floating point arguments involved, since just changing (%single-float x) to (coerce x 'single-float) stops the error from appearing, and (coerce x 'single-float) is transformed to %single-float.
Transforming less stuff influences the frequency of GCing.

Stas Boukarev

Oct 19, 2023, 5:50:58 PM
to Douglas Katzman, SBCL Devel
It's hard for me to imagine a bad sc actually going to the stack. Choosing a wrong register happens pretty often, that would be hard to catch.

On Fri, Oct 20, 2023 at 12:34 AM Douglas Katzman <do...@google.com> wrote:
Is there a relatively uniform way - such as injecting code into all vops that can possibly store to the control stack - to put in a debugging-only assertion that the value 2 is never stored? It would be massively slow, but it certainly appears that some vop is capable of doing just that.


Douglas Katzman via Sbcl-devel

Oct 19, 2023, 5:59:53 PM
to Bruce O'Neel, SBCL Devel

Can you see if anything is different when this diff is applied? And I infer that the error is perfectly repeatable?
Since it's not a heap invariant loss, I doubt the diff will reveal anything, but I would always start there for GC errors.

--- a/src/runtime/gc-common.c
+++ b/src/runtime/gc-common.c
@@ -2776,10 +2776,10 @@ uword_t primitive_object_size(lispobj ptr) {
 /* We hunt for pointers to old-space, when GCing generations >= verify_gen.
  * Set verify_gens to HIGHEST_NORMAL_GENERATION + 2 to disable this kind of
  * check. */
-generation_index_t verify_gens = HIGHEST_NORMAL_GENERATION + 2;
+generation_index_t verify_gens = 0;
 
 /* Should we do a pre-scan of the heap before it's GCed? */
-int pre_verify_gen_0 = 0;
+int pre_verify_gen_0 = 1;
 

Douglas Katzman via Sbcl-devel

Oct 19, 2023, 6:00:02 PM
to Stas Boukarev, SBCL Devel

Bruce O'Neel

Oct 20, 2023, 5:18:10 AM
to Douglas Katzman, SBCL Devel
Hi,

In my case it is every build after the above change.

This is a Pi 400 running the Debian 11.7 version of Raspberry Pi OS from May.  I'll soon update to the Debian 12 version from October.

One consequence of this is that even though I installed the 32 bit version, the kernel is 64 bit.   Therefore when I build sbcl I type

time SBCL_ARCH=arm sh make.sh

because uname -m and friends all return aarch64.

And thanks for the patch, it does build correctly now on

2.3.9.80-78f6a9756

and

2.3.9.83-139f0d205

cheers

bruce
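
The `SBCL_ARCH=arm` override described in the message above is needed because a 32-bit userland on a 64-bit kernel still reports `aarch64` from `uname -m`. As a rough sketch of how that choice could be scripted, the function below picks a value from the kernel machine name and the userland word size (as reported by `getconf LONG_BIT`). The function name `pick_sbcl_arch` is hypothetical, and the mapping only covers the ARM cases discussed here; anything else returns an empty string so that the build script can make its own guess.

```shell
#!/bin/sh
# Hypothetical helper: choose an SBCL_ARCH value from the userland word
# size rather than from the kernel architecture alone.
# $1 = kernel machine name (as printed by `uname -m`)
# $2 = userland word size  (as printed by `getconf LONG_BIT`)
pick_sbcl_arch() {
    machine=$1
    bits=$2
    case "$machine:$bits" in
        aarch64:32|armv*:32) echo arm ;;     # 32-bit ARM userland
        aarch64:64)          echo arm64 ;;   # native 64-bit ARM
        *)                   echo "" ;;      # leave the choice to make.sh
    esac
}
```

For example, `SBCL_ARCH=$(pick_sbcl_arch "$(uname -m)" "$(getconf LONG_BIT)") sh make.sh` would reproduce the manual override on the setup described above, assuming `getconf` is available.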

Douglas Katzman via Sbcl-devel

Oct 20, 2023, 8:56:21 AM
to Bruce O'Neel, SBCL Devel
But the patch should not have fixed anything, only diagnosed. This suggests irreproducibility for sure.

Stas Boukarev

Oct 20, 2023, 9:02:43 AM
to Douglas Katzman, SBCL Devel
I tried injecting GC at short intervals, but arm32 doesn't appear to be interrupt safe, I'm getting PC pointing into random locations.

Bruce O'Neel

Oct 20, 2023, 12:18:02 PM
to Douglas Katzman, SBCL Devel
Hi,

So I have the full log of a build.  What should I be looking for as an error message?

cheers
bruce

Douglas Katzman via Sbcl-devel

Oct 20, 2023, 12:25:47 PM
to Bruce O'Neel, SBCL Devel
I would expect an error that is different from "unboxed object in scavenge_control_stack:"
Is it just magically working to completion without your patch to float-tran but with the debugging enabled? If so, it is really surprising, because adding debugging does not alter any action the GC takes in collection per se. At least, it never has.
Can you try with only one of the variables changed from its non-default value to see which one is the miraculous one (if it's one)?

Bruce O'Neel

Oct 20, 2023, 2:04:19 PM
to Douglas Katzman, SBCL Devel
Hi,

Sorry, I must have made a mistake.

With the gc-common patch I get this:

; wrote /home/edoneel/tmp/sbcl/obj/from-host/src/compiler/locall.fasl-tmp
; compilation finished in 0:00:00.996
; compiling file "/home/edoneel/tmp/sbcl/src/compiler/ir1opt.lisp" (written 17 OCT 2023 03:13:01 PM):
fatal error encountered in SBCL pid 30567:
unexpected forwarding pointer in scavenge @ 0xffce0714

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
(GC in progress, oldspace=0, newspace=1)
ldb>

Thanks.

bruce

Douglas Katzman via Sbcl-devel

Oct 20, 2023, 2:14:41 PM
to Bruce O'Neel, SBCL Devel
That is a GC error during make-host-1 which means the host compiler was already built wrong, and if that wrong host built a target, then the target could be built wrong as well.
Are you able to produce the "new" error (the "unboxed object" error during compilation of enc-win, which happens in make-target-2) from a lisp build with a host that is definitely good?  There are more moving pieces than previously suspected.

Bruce O'Neel

Oct 21, 2023, 1:47:47 PM
to Douglas Katzman, SBCL Devel
So in trying to do something more ordered:

1. Built 2.3.9 from source and installed.
2. Rebuilt 2.3.9 from source again and re-installed.  Then started building the assorted git snapshots I've downloaded, all with 2.3.9.
3. 2.3.9.6-06404cda5 - OK.
4. 2.3.9.56-6b0e43beb  - OK - OK with gc-common patch applied too -- I noticed there were a lot more Verify after GC messages.
5. 2.3.9.59-dccfdbc5b - OK - OK with the gc-common patch applied as well.
6. 2.3.9.80-78f6a9756 - failed as above (unboxed object in scavenge_control_stack)  - with gc-common patch it did not fail.  So clearly something is slightly perturbed and that hides the problem.  Lovely.
7. 2.3.9.83-139f0d205 - failed as above (unboxed object in scavenge_control_stack)  - with gc-common patch of course it doesn't fail.
8. 2.3.9.89-777d37a3f - This fails in monitor.c with reg_CSP as already reported and fixed.
9. 2.3.9.93-60c142d3f - OK - Ok with gc-common patch.
10. 2.3.9.95-152c56702 - Ok - Ok with gc-common patch

So I'll keep poking at this.   The base version of SBCL that you build with seems to be a dependency as well. 

I'm guessing that I'm the only one who runs on ARM32 so I suspect it is not worth others spending time on it.

Thanks again for all of this work.

cheers

bruce

Douglas Katzman via Sbcl-devel

Oct 29, 2023, 1:21:35 PM
to Bruce O'Neel, SBCL Devel
If you can give a recipe that will fairly consistently crash with the same kind of crash (either the "unexpected forwarding pointer" or "unboxed object"), I should be able to work with that; but so far nobody but you has been able to see any crash. Could it therefore depend on something about your libc and compiler? 
As far as I know, the only machine we have for testing on is gcc117 but if you can produce it under QEMU that might be even better, despite the slowness, because then we could be sure to run the same thing you're running.
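
Since the request above is for a recipe that crashes "fairly consistently", it may help to measure how often a candidate recipe actually fails before sharing it. The following is a minimal sketch under that assumption; the function name `count_failures` is hypothetical, and it simply runs a given command a fixed number of times with stdin closed and counts non-zero exits, without distinguishing between kinds of failure.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and print how many runs failed.
# $1 = number of runs, remaining arguments = the command to run.
# stdin is redirected from /dev/null so an interactive prompt cannot block.
count_failures() {
    n=$1
    shift
    fails=0
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 </dev/null || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails"
}
```

For instance, `count_failures 5 env SBCL_ARCH=arm sh make.sh` would report how many of five full builds failed. A count well below the number of runs would fit the suspicion that the crash is non-deterministic.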

Bruce O'Neel

Oct 29, 2023, 2:43:21 PM
to Douglas Katzman, SBCL Devel
Hi,

Thanks, but I can't come up with a failure now that I've moved to Debian 12.2.  All the previous failure cases work.

Debian 12.2 includes gcc 12.2 as well as libc 2.36-9.

So I think that your idea that it is gcc and/or libc version dependent is correct.

Again I will keep poking at it but I may have been lucky before to trigger it with Debian 11.

Thanks

cheers

bruce