Stuck in unittest.lsp while trying to build on FreeBSD

Peer Stritzinger

Feb 21, 2012, 2:36:53 PM
to juli...@googlegroups.com
Hi,

I'm trying to get julia built on FreeBSD (FreeBSD 8.2-RELEASE amd64).
After a few tweaks I'm now stuck at:

llvm[4]: Installing Release Archive Library
/usr/home/peer/julia/external/root/lib/libprofile_rt.a
    CC src/support/dirpath.o
    LINK src/support/libsupport.a
    CC src/flisp/flisp
assert-failed: (equal? +nan.0 +nan.0)
in file unittest.lsp
#0 (lambda)
gmake[2]: *** [flisp] Error 1
gmake[1]: *** [flisp/libflisp.a] Error 2
gmake: *** [julia-release] Error 2

Any ideas what would cause this?

How could I further debug it?

Cheers,
Peer Stritzinger

Jason E. Aten

Feb 21, 2012, 11:30:11 PM
to juli...@googlegroups.com
Hi Peer,

hmmm.... this is really a long shot, but maybe FreeBSD doesn't set bit 7 of MXCSR for its processes?

If you compile and run it, what does the following program display?

#include <stdio.h>
// file: diagnose_mxcsr.c
// compile with: gcc -g -mtune=native -o diag diagnose_mxcsr.c
int main(int argc, char** argv) {
       unsigned int mxcsr;
       asm ("stmxcsr %0" : "=m" (*&mxcsr));
       printf("mxcsr was 0x%x\n",mxcsr);
       return 0;
}

- Jason

Peer Stritzinger

Feb 22, 2012, 4:39:52 AM
to juli...@googlegroups.com
> hmmm.... this is really a long shot, but maybe FreeBSD doesn't set bit 7 of
> MXCSR for its processes?
> If you compile and run it, what does the following program display?


$ gcc -g -mtune=native -o diag diagnose_mxcsr.c
$ ./diag
mxcsr was 0x1f80

So it looks as if bit 7 is set ...
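
For reference, a minimal sketch of how the reported value decodes. On x86, bits 7 through 12 of MXCSR are the exception-mask bits, and bit 7 is the invalid-operation mask; 0x1f80 has all six set. The helper and macro names below are made up for illustration and are not part of julia or flisp.

```c
/* MXCSR exception-mask bits occupy bits 7..12 (IM, DM, ZM, OM, UM, PM). */
#define MXCSR_IM        (1u << 7)     /* invalid-operation mask (bit 7) */
#define MXCSR_ALL_MASKS (0x3fu << 7)  /* all six mask bits, i.e. 0x1f80 */

/* Hypothetical helper: nonzero if the invalid-operation exception is masked. */
int mxcsr_invalid_masked(unsigned int mxcsr)
{
    return (mxcsr & MXCSR_IM) != 0;
}

/* Hypothetical helper: nonzero if all six SSE exceptions are masked. */
int mxcsr_all_masked(unsigned int mxcsr)
{
    return (mxcsr & MXCSR_ALL_MASKS) == MXCSR_ALL_MASKS;
}
```

Passing 0x1f80 to both helpers yields a nonzero result, which matches the reading that bit 7 is set.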

Cheers,
-- Peer

Jason E. Aten

Feb 22, 2012, 6:00:24 AM
to juli...@googlegroups.com
yep. no clue here then.  maybe comment out the nan tests and see if that is the only issue?

Peer Stritzinger

Feb 22, 2012, 10:54:10 AM
to juli...@googlegroups.com
On Wed, Feb 22, 2012 at 12:00 PM, Jason E. Aten <j.e....@gmail.com> wrote:
> yep. no clue here then.  maybe comment out the nan tests and see if that is
> the only issue?

This took a while since I had trouble rebuilding (I needed a "make cleanall";
otherwise it seems to skip the tests).

If I comment out only the offending NaN test, the other NaN tests seem
to pass and then I get:


CC src/flisp/flisp
assert-failed: (equal? 0.0 0.0)


in file unittest.lsp
#0 (lambda)
gmake[2]: *** [flisp] Error 1
gmake[1]: *** [flisp/libflisp.a] Error 2
gmake: *** [julia-release] Error 2

It seems to be a problem with equal? when comparing identical literal float values.

So I commented out the (equal? 0.0 0.0) assert as well.

Now the remaining tests seem to pass, but later on the build can't read flisp.boot:

CC src/flisp/flmain.o
CC src/flisp/flisp
FLISP src/julia_flisp.boot
fatal error:
(io-error "file: could not open \"flisp.boot\"")
gmake[1]: *** [julia_flisp.boot] Error 1

Summary:
=======

On FreeBSD 8.2 on an amd64 system I have to disable these tests:

diff --git a/src/flisp/unittest.lsp b/src/flisp/unittest.lsp
index 9ebd491..3b0df0e 100644
--- a/src/flisp/unittest.lsp
+++ b/src/flisp/unittest.lsp
@@ -77,7 +77,7 @@
 (assert (equal? (string 'sym #byte(65) #wchar(945) "blah") "symA\u03B1blah"))

 ; NaNs
-(assert (equal? +nan.0 +nan.0))
+;;;(assert (equal? +nan.0 +nan.0))
 (assert (not (= +nan.0 +nan.0)))
 (assert (not (= +nan.0 -nan.0)))
 (assert (equal? (< +nan.0 3) (> 3 +nan.0)))
@@ -92,7 +92,7 @@

 ; -0.0 etc.
 (assert (not (equal? 0.0 0)))
-(assert (equal? 0.0 0.0))
+;;;(assert (equal? 0.0 0.0))
 (assert (not (equal? -0.0 0.0)))
 (assert (not (equal? -0.0 0)))
 (assert (not (eqv? 0.0 0)))

and then I'm stuck at the next step.

Jeff Bezanson

Feb 22, 2012, 1:06:12 PM
to juli...@googlegroups.com
src/flisp/flisp.boot is part of the source tree, so it should be
there. If the file got deleted somehow, you can recover it with git
checkout -f flisp.boot.
The other problem is mysterious and I don't have any ideas yet.

Peer Stritzinger

Feb 22, 2012, 1:11:09 PM
to juli...@googlegroups.com
On Wed, Feb 22, 2012 at 7:06 PM, Jeff Bezanson <jeff.b...@gmail.com> wrote:
> src/flisp/flisp.boot is part of the source tree, so it should be
> there. If the file got deleted somehow, you can recover it with git
> checkout -f flisp.boot.

But flisp.boot is there, readable and unchanged:

$ ls -l src/flisp/flisp.boot
-rw-r--r-- 1 peer staff 36288 Feb 22 14:59 src/flisp/flisp.boot

Jeff Bezanson

Feb 22, 2012, 1:19:42 PM
to juli...@googlegroups.com
Oh, I bet the problem there is my /proc filesystem trick for locating
the executable.
Is there anywhere that will let me run a FreeBSD VM to connect to?

Peer Stritzinger

Feb 22, 2012, 2:31:10 PM
to juli...@googlegroups.com
On Wed, Feb 22, 2012 at 7:19 PM, Jeff Bezanson <jeff.b...@gmail.com> wrote:
> Oh, I bet the problem there is my /proc filesystem trick for locating
> the executable.

You won the bet :-)

This pointed me in the right direction: on FreeBSD this is usually
done with sysctl.
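
A rough sketch of the two approaches side by side, for readers who have not seen them. This is not the get_exename code from the julia tree; the function name exe_path_sketch is hypothetical. It assumes the FreeBSD KERN_PROC_PATHNAME sysctl is available, and elsewhere falls back to reading the /proc/self/exe symlink as Linux provides it.

```c
#if !defined(__FreeBSD__)
#define _POSIX_C_SOURCE 200809L   /* for readlink() on non-BSD systems */
#endif
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>
#if defined(__FreeBSD__)
#include <sys/sysctl.h>
#endif

/* Hypothetical helper: writes the path of the running executable into buf.
 * Returns 0 on success, -1 on failure. */
int exe_path_sketch(char *buf, size_t len)
{
#if defined(__FreeBSD__)
    /* FreeBSD: ask the kernel; -1 as the pid means "the current process". */
    int mib[4] = { CTL_KERN, KERN_PROC, KERN_PROC_PATHNAME, -1 };
    size_t n = len;
    return sysctl(mib, 4, buf, &n, NULL, 0) == 0 ? 0 : -1;
#else
    /* Linux-style /proc trick: the symlink points at the executable. */
    ssize_t n;
    if (len == 0)
        return -1;
    n = readlink("/proc/self/exe", buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';   /* readlink() does not NUL-terminate */
    return 0;
#endif
}
```

The /proc branch fails on a stock FreeBSD system when procfs is not mounted, which would be consistent with the flisp.boot error above.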

I added a FreeBSD version of get_exename, and after a few more hacks:

$ uname -s -r -m
FreeBSD 8.2-RELEASE amd64

./julia
               _
   _       _ _(_)_     |
  (_)     | (_) (_)    |  A fresh approach to technical computing
   _ _   _| |_  __ _   |
  | | | | | | |/ _` |  |  Version 0.0.0-prerelease
  | | |_| | | | (_| |  |  Commit 3c3e0aecef (2012-02-21 06:58:08)*
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

julia> 1+2
3

There still is the issue with the broken unit test ...

I'll submit my changes to get to this point and write up a short
description of how to get there.

Jeff Bezanson

Feb 22, 2012, 3:05:22 PM
to juli...@googlegroups.com
Beautiful, it's great to have this fix in place. The flisp float
comparing stuff is concerning, but it might not matter since we just
use that for symbolic stuff in the compiler front-end.
If I can get my hands on a freebsd box I will look at it though.

Peer Stritzinger

Feb 22, 2012, 5:41:15 PM
to juli...@googlegroups.com
On Wed, Feb 22, 2012 at 9:05 PM, Jeff Bezanson <jeff.b...@gmail.com> wrote:
> Beautiful, it's great to have this fix in place. The flisp float
> comparing stuff is concerning, but it might not matter since we just
> use that for symbolic stuff in the compiler front-end.
> If I can get my hands on a freebsd box I will look at it though.

Sent pull request https://github.com/JuliaLang/julia/pull/448

With this and following the description I added to the README.md it
should be easy to reproduce this.

If I can help out debugging the flisp tests please let me know.

-- Peer

Jeff Bezanson

Feb 25, 2012, 10:46:34 PM
to juli...@googlegroups.com
I managed to get a freebsd shell account and I can reproduce the
issue. I get the +nan.0 bug in optimized builds, and not debug builds.
gdb doesn't seem to be working; once the program starts it just hangs,
and "b main" doesn't even work.
I am using memcmp() to compare the data of the 2 arguments; while this
is a bit strange I'm not sure why it wouldn't work. The two NaNs
originate from the same bit pattern in the reader.
Not sure what to try next.

Peer Stritzinger

Feb 26, 2012, 8:43:05 AM
to juli...@googlegroups.com
On Sun, Feb 26, 2012 at 4:46 AM, Jeff Bezanson <jeff.b...@gmail.com> wrote:
> I managed to get a freebsd shell account and I can reproduce the
> issue. I get the +nan.0 bug in optimized builds, and not debug builds.
> gdb doesn't seem to be working; once the program starts it just hangs,
> and "b main" doesn't even work.

GDB works for me for julia and flisp -- any suggestions how to debug this?

$ gdb ./julia
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
(gdb) b main
Breakpoint 1 at 0x41aaf4
(gdb) r
Starting program: /usr/home/peer/julia/julia
[New LWP 100331]
[New Thread 801a041c0 (LWP 100331)]
[Switching to Thread 801a041c0 (LWP 100331)]

Breakpoint 1, 0x000000000041aaf4 in main ()

$ gdb src/flisp/flisp
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...(no debugging
symbols found)...
(gdb) b main
Breakpoint 1 at 0x41dfd0
(gdb) r
Starting program: /usr/home/peer/julia/src/flisp/flisp
(no debugging symbols found)...(no debugging symbols found)...(no
debugging symbols found)...
Breakpoint 1, 0x000000000041dfd0 in main ()
(gdb) s
Single stepping until exit from function main,
which has no line number information.
; _
; |_ _ _ |_ _ | . _ _
; | (-||||_(_)|__|_)|_)
;-------------------|----------------------------------------------------------


> I am using memcmp() to compare the data of the 2 arguments; while this
> is a bit strange I'm not sure why it wouldn't work. The two NaNs
> originate from the same bit pattern in the reader.
> Not sure what to try next.

Let me know if I can help.

-- Peer

Peer Stritzinger

Feb 26, 2012, 9:19:46 AM
to juli...@googlegroups.com
I played around with it a bit, and this is what I've found so far:

If I build flisp-debug by running "gmake debug" in the flisp directory,
(equal? +nan.0 +nan.0) still returns #f, as in the non-debug version.

The debug version also runs under gdb, and it returns #f there
as well.

Without understanding any of flisp's internals, I found that a breakpoint in
equal_lispvalue is triggered by the equal? call. Actually, it's
triggered multiple times before the REPL prints the #f. I can't make
anything of the values passed since they are only shown as numbers;
all I can say is that sometimes they are the same and sometimes they are
different.

If you can tell me where to look, this would be no problem at all.

-- Peer

Jeff Bezanson

Feb 26, 2012, 3:47:30 PM
to juli...@googlegroups.com
OK I found something worth trying. On src/support/operators.c:184:

return *(uint64_t*)&da == *(uint64_t*)&db;

I've seen problems with this kind of type-punning before. Could you
try replacing the pointer cast with a union?

Jason E. Aten

Feb 26, 2012, 5:43:25 PM
to juli...@googlegroups.com

One idea, looking at conv_to_double() at operators.c:11: this function seems to read high-order bytes from possibly uninitialized garbage memory (depending on the compiler's data layout and optimization level) when it casts *data, say from a 32-bit float to a 64-bit double.  If *data is a float, then it would have only 32 bits of info behind it, no?  The (double) cast will read and convert 64 bits no matter what.  If NaN always becomes a double, though, this wouldn't explain everything; it would only matter if the reader could generate 32-bit NaN floats. Is that possible?


double conv_to_double(void *data, numerictype_t tag)
{
    double d=0;
    switch (tag) {
    case T_INT8:   d = (double)*(int8_t*)data; break;
    case T_UINT8:  d = (double)*(uint8_t*)data; break;
    case T_INT16:  d = (double)*(int16_t*)data; break;
    case T_UINT16: d = (double)*(uint16_t*)data; break;
    case T_INT32:  d = (double)*(int32_t*)data; break;
    case T_UINT32: d = (double)*(uint32_t*)data; break;
    case T_INT64:
        d = (double)*(int64_t*)data;
        if (d > 0 && *(int64_t*)data < 0)  // can happen!
            d = -d;
        break;
    case T_UINT64: d = (double)*(uint64_t*)data; break;
    case T_FLOAT:  d = (double)*(float*)data; break;
    case T_DOUBLE: return *(double*)data;
    }
    return d;
}

Jeff Bezanson

Feb 26, 2012, 5:50:16 PM
to juli...@googlegroups.com
* is never applied to data until it has been cast to the correct
pointer type, so I don't think this happens. If data points to a
float, I do *(float*)data, which only reads 32 bits, and then that
value is cast to double without touching memory again.
But, as you surmise, the +nan.0 is a double.
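
The point about the float case can be sketched in isolation. The helper name below is made up for illustration: the load goes through a float pointer, so only sizeof(float) bytes are read, and the widening to double is a value conversion that does not touch memory again.

```c
/* Hypothetical helper mirroring the T_FLOAT case of conv_to_double(). */
double load_float_as_double(const void *data)
{
    float f = *(const float *)data;  /* reads exactly sizeof(float) bytes */
    return (double)f;                /* value conversion, no memory access */
}
```

Whatever bytes happen to follow the float in memory therefore have no influence on the result.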

Peer Stritzinger

Feb 28, 2012, 9:24:49 AM
to juli...@googlegroups.com
I think I've found the problem:

On Sun, Feb 26, 2012 at 9:47 PM, Jeff Bezanson <jeff.b...@gmail.com> wrote:
> OK I found something worth trying. On src/support/operators.c:184:
>
> return *(uint64_t*)&da == *(uint64_t*)&db;
>
> I've seen problems with this kind of type-punning before. Could you
> try replacing the pointer cast with a union?

Once you mentioned this, it smelled like a strict aliasing violation to
me. So I built julia with the flag -fno-strict-aliasing in CFLAGS.

And voilà, julia builds without any failing test case.

This also explains why you can't see the problem when building for
debugging (in gcc, strict aliasing only has an effect at -O2 or
higher).

Basically, strict aliasing means the compiler may assume that
pointers to objects of different types never refer to the same
memory location when dereferenced.
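
A small sketch of the kind of assumption involved. The function name is made up, and what a particular compiler does with it depends on version and flags; the comments describe what the rule permits, not a guaranteed outcome.

```c
/* Hypothetical example: two pointers to different types. */
int store_both(int *ip, float *fp)
{
    *ip = 1;
    *fp = 2.0f;  /* under strict aliasing, assumed not to modify *ip */
    return *ip;  /* so the compiler may return 1 without reloading *ip */
}
```

Called with two distinct objects, this is well defined and returns 1. Called with both pointers aimed at the same storage (for example via a cast), the behavior is undefined, and an optimized build may return 1 even though the memory now holds the bits of 2.0f.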

Best explanation of this can be found here:
http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html

More info here:
http://stackoverflow.com/questions/98650/what-is-the-strict-aliasing-rule

The C standard's aliasing rules (restated in C99 in terms of effective
types) permit this assumption, and gcc now relies on it by default when
optimizing. This makes some optimizations possible, like keeping
pointed-to values in registers, because the compiler can assume that
the same memory location is not modified through a pointer of another type.

The way around this is consistent use of unions for all these cases
all over the code. There is a warning in gcc that can show you some
points where this is violated. Unfortunately gcc is very inconsistent
with these warnings (IIRC there are two levels of the warning: one
that shows too few and one that shows too many problems; gcc might
even miss some points in the second setting).

My suggestion is:

* switch on -fno-strict-aliasing and then either:

- work with the warnings and maybe other tools to root out all
aliasing problems then switch it off again.

- decide you want to be able to use an aliasing style and keep it switched on.

Personally I'm in favor of keeping -fno-strict-aliasing, since
aliasing problems cause hard-to-detect bugs (I'm burnt from having
debugged a real-time OS scheduler for 6 weeks full-time because of an
aliasing problem). With -fno-strict-aliasing, the code basically behaves
the way C is still believed to work (but by default no longer does).

OTOH performance of C code might improve if strict aliasing is observed.

If performance is chosen over the risk of incorrectness here, some
mechanism should be in place to keep aliasing from creeping back in.

Stefan Karpinski

Feb 28, 2012, 10:26:34 AM
to juli...@googlegroups.com
It's interesting to me that gcc on different platforms behaves differently with respect to this matter. I would have expected more problems going between different versions of gcc or between gcc and clang.

peerst

Feb 28, 2012, 10:34:49 AM
to julia-dev

On Feb 28, 4:26 pm, Stefan Karpinski <ste...@karpinski.org> wrote:
> It's interesting to me that gcc on different platforms behaves differently
> with respect to this matter. I would have expected more problems going
> between different versions of gcc or between gcc and clang.

Well, different versions of gcc might be involved. Also, some
platforms may choose to switch on -fno-strict-aliasing in the gcc
specs so that it's the compiler default (usually because the system itself
has problems with memory-aliasing code).

For maximum compatibility and best performance, all type punning through
pointer casts should be eliminated, since the C standard does not
guarantee that it works (and optimizing compilers do take advantage of that).

BTW I tried the -Wstrict-aliasing warning levels; they found a few
places but not the one that caused this problem.

Basically every cast between pointers to different object types is suspect
(casts to char * for byte access are the exception).

Also unions with different pointer types as members.

-- Peer