Segfault when building indices on large catalog.


Sergey Koposov

Apr 1, 2013, 7:34:23 AM
to astro...@googlegroups.com
Hi Dustin,

I'm trying to build custom indices on a large subset of SDSS DR9 data, and I'm consistently getting segfaults. Initially I saw the bug with astrometry.net 0.42, but svn-22501 shows it as well.
Here are the short and full gdb backtraces:

(gdb) bt
#0  memcpy () at ../sysdeps/x86_64/memcpy.S:119
#1  0x0000000000485d6a in read_chunk (fb=0x21316a00, chunk=0x21317090) at fitsbin.c:511
#2  0x0000000000485f90 in fitsbin_read (fb=0x21316a00) at fitsbin.c:551
#3  0x0000000000413d89 in quadfile_switch_to_reading (qf=0x21316cc0) at quadfile.c:230
#4  0x0000000000404391 in step_hpquads (p=0x7fffffffdcc0, p_codes=0x7fffffffdab8, p_quads=0x7fffffffdab0,
    p_codefn=0x7fffffffda80, p_quadfn=0x7fffffffda88, starkd=0x6afa2dd0, skdtfn=0x0, tempfiles=0x3ded6c0)
    at build-index.c:81
#5  0x0000000000405d2a in build_index (catalog=0x3de6010, p=0x7fffffffdcc0, p_index=0x7fffffffdba0,
    indexfn=0x0) at build-index.c:579
#6  0x0000000000406312 in build_index_files (infn=0x7fffffffe206 "stars.fits",
    indexfn=0x7fffffffe214 "index-130329000.fits", p=0x7fffffffdcc0) at build-index.c:672
#7  0x0000000000406e9f in main (argc=21, argv=0x7fffffffdec8) at build-index-main.c:279


(gdb) bt full
#0  memcpy () at ../sysdeps/x86_64/memcpy.S:119
No locals.
#1  0x0000000000485d6a in read_chunk (fb=0x21316a00, chunk=0x21317090) at fitsbin.c:511
        i = 0
        tabstart = 0
        tabsize = 0
        ext = 0
        expected = 18446744072229333200
        mode = 64928016
        flags = 0
        mapstart = 140737488345232
        mapoffset = 0
        table_nrows = 175921805
        table_rowsize = 16
        inmemext = 0x1022c41f0
        __func__ = "read_chunk"
#2  0x0000000000485f90 in fitsbin_read (fb=0x21316a00) at fitsbin.c:551
        chunk = 0x21317090
        i = 0
#3  0x0000000000413d89 in quadfile_switch_to_reading (qf=0x21316cc0) at quadfile.c:230
        __func__ = "quadfile_switch_to_reading"
#4  0x0000000000404391 in step_hpquads (p=0x7fffffffdcc0, p_codes=0x7fffffffdab8, p_quads=0x7fffffffdab0,
    p_codefn=0x7fffffffda80, p_quadfn=0x7fffffffda88, starkd=0x6afa2dd0, skdtfn=0x0, tempfiles=0x3ded6c0)
    at build-index.c:81
        codes = 0x3dedb20
        quads = 0x21316cc0
        quadfn = 0x0
        codefn = 0x0
        __func__ = "step_hpquads"
#5  0x0000000000405d2a in build_index (catalog=0x3de6010, p=0x7fffffffdcc0, p_index=0x7fffffffdba0,
    indexfn=0x0) at build-index.c:579
        uniform = 0x3ded700
        starkd = 0x6afa2dd0
        startag = 0x1191412d0
        codes = 0x0
        quads = 0x0
        codekd = 0x0
        starkd2 = 0x0
        quads2 = 0x0
        startag2 = 0x0
        quads3 = 0x0
        codekd2 = 0x0
        tempfiles = 0x3ded6c0
        unifn = 0x0
        skdtfn = 0x0
        quadfn = 0x0
        codefn = 0x0
        ckdtfn = 0x0
        skdt2fn = 0x0
        quad2fn = 0x0
        quad3fn = 0x0
        ckdt2fn = 0x0
        __PRETTY_FUNCTION__ = "build_index"
        __func__ = "build_index"
#6  0x0000000000406312 in build_index_files (infn=0x7fffffffe206 "stars.fits",
    indexfn=0x7fffffffe214 "index-130329000.fits", p=0x7fffffffdcc0) at build-index.c:672
        index = 0x6c2340
        catalog = 0x3de6010
        __func__ = "build_index_files"
#7  0x0000000000406e9f in main (argc=21, argv=0x7fffffffdec8) at build-index-main.c:279
        argchar = -1
        infn = 0x7fffffffe206 "stars.fits"
        indexfn = 0x7fffffffe214 "index-130329000.fits"
        inindexfn = 0x0
        myp = {racol = 0x497768 "RA", deccol = 0x49776b "DEC", jitter = 0.40000000000000002,
          sortcol = 0x7fffffffe23e "r", sortasc = 1 '\001', brightcut = -inf, bighp = -1, bignside = 0,
          sweeps = 100, dedup = 1, margin = 0, UNside = 1760, Nside = 1760, hpquads_sort_data = 0x0,
          hpquads_sort_func = 0, hpquads_sort_size = 0, qlo = 2, qhi = 2.7999999999999998, passes = 16,
          Nreuse = 8, Nloosen = 20, scanoccupied = 1 '\001', dimquads = 4, indexid = 130329000,
         inmemory = 1 '\001', delete_tempfiles = 1 '\001', tempdir = 0x49776f "/tmp", args = 0x7fffffffdec8,
          argc = 21}
        p = 0x7fffffffdcc0
        loglvl = 2
        i = 0
        preset = 0
        __func__ = "main"

The command line used was:
../astrometrynet/bin/build-index -M -i stars.fits -o index-${P}00.fits -I ${P}00 -P 0 -S r -n 100 -L 20 -E -j 0.4 -r 1
The catalog has ~100e6 objects, and the FITS file is 8 GB in size.

I'm seeing the bug on a Debian 6.0 system with a large amount of RAM.

Any suggestions?

Cheers,
          Sergey
PS: I ran the code before on the same kind of file, but with a smaller number of objects (DR7 and a smaller magnitude range), and it ran fine.

Dustin Lang

Apr 1, 2013, 8:36:41 AM
to astro...@googlegroups.com
Hi Sergey,

It's probably a 4 GB file size error.  The qfits library I use fails on big files due to the usual int vs size_t problems.  I thought I had fixed them, but as you can imagine it's hard to completely fix a library with a problem like that.

Oh, could you try without the "-M" option?

One temporary hack would be to use the "hpsplit" program to split your stars.fits into 12 or 48 healpix tiles (Nside=1 or Nside=2), and then run build-index on each of those.  That also has the benefit that you can make use of initial RA,Dec guesses if you have them.

hpsplit stars.fits -o stars-hp%02i.fits -n 1 -m 1

(or -n 2); "-m 1" says to add a margin of 1 degree around the healpix boundaries.  Oh, and you'll also need to add a "-c" entry for each of the extra FITS columns you want to copy (so that they eventually end up in the index files).

Then

for ((hp=0; hp<12; hp++)); do
  build-index -i "$(printf 'stars-hp%02i.fits' "$hp")" -H "$hp" -s 1 -o "index-${P}00-$(printf '%02i' "$hp")"  # plus your usual args
done

where "-s 1" must match "-n 1" in hpsplit.
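
The tile counts quoted above (12 for Nside=1, 48 for Nside=2) follow from the HEALPix rule of 12 * Nside^2 tiles over the whole sky. A minimal C sketch of that count, and of how the "stars-hp%02i.fits" pattern expands, is given here; the function names healpix_tile_count and tile_filename are made up for illustration and are not part of astrometry.net.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Number of HEALPix tiles covering the whole sky at a given Nside.
 * Hypothetical helper, not an astrometry.net function. */
int healpix_tile_count(int nside)
{
    assert(nside >= 1);
    return 12 * nside * nside;
}

/* Expand the output pattern used in the hpsplit example for tile `hp`.
 * Returns the number of characters snprintf() would have written. */
int tile_filename(char *buf, size_t buflen, int hp)
{
    return snprintf(buf, buflen, "stars-hp%02i.fits", hp);
}
```

With "-n 2" the loop bound in the shell snippet would therefore be 48 rather than 12, and build-index would get "-s 2".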

Sorry for the trouble.

cheers,
dustin

Sergey Koposov

Apr 1, 2013, 11:36:38 AM
to astro...@googlegroups.com
Hi Dustin,

Thanks for the answer. I'm going to try one of your recipes (although since the index building takes ~1 day, it's a bit painful).
I have also looked at qfits, and managed to fix a few obvious qfits bugs with large files. I ran the qfits test suite on my big FITS table, which allowed me to spot them.
After my fix, the qfits regression test at least passes on 8 GB files.
I still don't know whether that's going to fix the build-index run (maybe some other bugs are lurking elsewhere).
But my fix should rectify qfits_query() calls which are used quite intensively in the qfits code.

See the attachment.
Cheers,
        S
zz.diff

Dustin Lang

Apr 2, 2013, 9:55:24 AM
to astro...@googlegroups.com
Hi Sergey,

Thanks for the stack trace, that was really helpful!  I think I found the immediate bug -- notice that the "expected" value is crazy, since I didn't cast the ints to size_t before multiplying them.  Sigh.  (And I was wrong to blame qfits this time, it was in my code.)

Would you be able to test the svn trunk version?  I'm running a build-index on a big input file, but haven't hit the bug yet...

svn co http://astrometry.net/svn/trunk/src

thanks,
dustin
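
The "crazy" expected value in the first backtrace can be reproduced from the two locals shown there, table_nrows = 175921805 and table_rowsize = 16. The sketch below assumes 32-bit int and 64-bit size_t (as on the x86_64 system in the trace) and simulates the wrap-around explicitly, because signed overflow itself is undefined behaviour in C. The function names are illustrative only and do not come from fitsbin.c.

```c
#include <assert.h>
#include <stdint.h>

/* What a 32-bit signed multiply followed by widening to an unsigned
 * 64-bit type yields on a two's-complement machine. The wrap is
 * simulated with unsigned arithmetic to stay within defined behaviour. */
uint64_t wrapped_product_as_u64(int32_t nrows, int32_t rowsize)
{
    /* Keep only the low 32 bits of the exact product. */
    uint32_t low = (uint32_t)((uint64_t)nrows * (uint64_t)rowsize);
    /* Reinterpret those bits as a signed 32-bit quantity. */
    int64_t as_signed = (low & 0x80000000u) ? (int64_t)low - 4294967296LL
                                            : (int64_t)low;
    /* Widening a negative value to unsigned 64 bits sign-extends it. */
    return (uint64_t)as_signed;
}

/* The fix described in the message: widen the operands before multiplying. */
uint64_t widened_product(int32_t nrows, int32_t rowsize)
{
    assert(nrows >= 0 && rowsize >= 0);
    return (uint64_t)nrows * (uint64_t)rowsize;
}
```

For the values in the trace, wrapped_product_as_u64(175921805, 16) gives 18446744072229333200, the value gdb printed for expected, while widened_product(175921805, 16) gives 2814748880, about 2.8 GB.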

Sergey Koposov

Apr 2, 2013, 10:13:37 AM
to astro...@googlegroups.com

Hi Dustin,

On Tuesday, 2 April 2013 14:55:24 UTC+1, Dustin Lang wrote:
Hi Sergey,

Thanks for the stack trace, that was really helpful!  I think I found the immediate bug -- notice that the "expected" value is crazy, since I didn't cast the ints to size_t before multiplying them.  Sigh.  (And I was wrong to blame qfits this time, it was in my code.)

 
Thanks. I'm going to try SVN now (I'll know tomorrow if it works). But do you think that the qfits code I've fixed is not used in build-index? Because it was clearly wrong as well.

Cheers,
      S

Dustin Lang

Apr 2, 2013, 10:20:28 AM
to astro...@googlegroups.com
Hi Sergey,

I started replacing some of the qfits code (in particular the qfits_query, which has kind of a weird design anyway), so I believe that code path is not used.  I have a similar fix in the "anqfits.c" code that replaces it; search "data_bytes" in
  http://trac.astrometry.net/browser/trunk/src/astrometry/qfits-an/src/anqfits.c

--dstn



Dustin Lang

Apr 3, 2013, 9:53:14 AM
to astro...@googlegroups.com
With this fix, I ran build-index on a 4.2 GB input file which produced a 3.3 GB index file.  Sergey, how did your re-run go?

Sergey Koposov

Apr 3, 2013, 10:08:19 AM
to astro...@googlegroups.com


On Wednesday, 3 April 2013 14:53:14 UTC+1, Dustin Lang wrote:
With this fix, I ran build-index on a 4.2 GB input file which produced a 3.3 GB index file.

Cool.

  Sergey, how did your re-run go?
 
I'm still running it...

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
15098 koposov   20   0 15.0g  14g 1056 R  100 25.3   1417:08 build-index    

It should finish in the next couple of hours.

Dustin Lang

Apr 3, 2013, 10:23:45 AM
to astro...@googlegroups.com
Somebody should multi-thread that code :)

Sergey Koposov

Apr 3, 2013, 10:30:01 AM
to astro...@googlegroups.com


On Wednesday, 3 April 2013 15:23:45 UTC+1, Dustin Lang wrote:
Somebody should multi-thread that code :)


)
Yes, I was thinking the same, although I guess index building is something that needs to be done so rarely (except for debugging purposes) that it's probably not worth the effort.

Sergey Koposov

Apr 3, 2013, 3:33:07 PM
to astro...@googlegroups.com


On Wednesday, 3 April 2013 14:53:14 UTC+1, Dustin Lang wrote:
With this fix, I ran build-index on a 4.2 GB input file which produced a 3.3 GB index file.  Sergey, how did your re-run go?


Hi Dustin,
I still get a segfault.
The full backtrace is below, but the error clearly comes from fitsbin.c line 511.

There you have i * chunk->itemsize, and i is an int.
Since chunk->itemsize is 16, the product overflows at i=134217728.
So you only need to change i in fitsbin.c:508 to long int.
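
The threshold quoted above follows directly from the range of int: with 32-bit int, INT_MAX / 16 is 134217727, so i = 134217728 is the first index whose byte offset no longer fits. Here is a small sketch of that arithmetic, plus a copy loop that keeps the index in a wide type. The names first_overflowing_index, item_offset and copy_items are hypothetical, and the loop is not the actual code at fitsbin.c:508-511.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Smallest index i for which i * itemsize exceeds INT_MAX. */
long first_overflowing_index(int itemsize)
{
    assert(itemsize > 0);
    return (long)(INT_MAX / itemsize) + 1;
}

/* Byte offset of item i; both operands are size_t, so the multiply is
 * carried out in the wide type. */
size_t item_offset(size_t i, size_t itemsize)
{
    return i * itemsize;
}

/* Copy nitems fixed-size items one at a time, indexing with size_t
 * rather than int (illustrative loop only). */
void copy_items(char *dst, const char *src, size_t nitems, size_t itemsize)
{
    for (size_t i = 0; i < nitems; i++)
        memcpy(dst + item_offset(i, itemsize),
               src + item_offset(i, itemsize), itemsize);
}
```

On a platform with 32-bit int, first_overflowing_index(16) returns 134217728, which matches the value of i in the backtrace below.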


 (gdb) bt full
#0  memcpy () at ../sysdeps/x86_64/memcpy.S:119
No locals.
#1  0x0000000000485dfc in read_chunk (fb=0x21316a00, chunk=0x21317090) at fitsbin.c:511
        i = 134217728
        tabstart = 0
        tabsize = 0
        ext = 0
        expected = 2814748880
        mode = 556887312
        flags = 0
        mapstart = 556888192
        mapoffset = 0
        table_nrows = 175921805
        table_rowsize = 16
        inmemext = 0x16d9ddf00
        __func__ = "read_chunk"
#2  0x0000000000486056 in fitsbin_read (fb=0x21316a00) at fitsbin.c:552
#6  0x0000000000406312 in build_index_files (infn=0x7fffffffe205 "stars.fits", indexfn=0x7fffffffe213 "index-130329000.fits", p=0x7fffffffdcc0)
    at build-index.c:672
        index = 0x6c2340
        catalog = 0x3de6010
        __func__ = "build_index_files"
#7  0x0000000000406e9f in main (argc=21, argv=0x7fffffffdec8) at build-index-main.c:279
        argchar = -1
        infn = 0x7fffffffe205 "stars.fits"
        indexfn = 0x7fffffffe213 "index-130329000.fits"
        inindexfn = 0x0
        myp = {racol = 0x497828 "RA", deccol = 0x49782b "DEC", jitter = 0.40000000000000002, sortcol = 0x7fffffffe23d "r", sortasc = 1 '\001',
          brightcut = -inf, bighp = -1, bignside = 0, sweeps = 100, dedup = 1, margin = 0, UNside = 1760, Nside = 1760, hpquads_sort_data = 0x0,
          hpquads_sort_func = 0, hpquads_sort_size = 0, qlo = 2, qhi = 2.7999999999999998, passes = 16, Nreuse = 8, Nloosen = 20,
          scanoccupied = 1 '\001', dimquads = 4, indexid = 130329000, inmemory = 1 '\001', delete_tempfiles = 1 '\001', tempdir = 0x49782f "/tmp",

Dustin Lang

Apr 3, 2013, 3:47:40 PM
to astro...@googlegroups.com
Ouch!  Apologies for having wasted another day for you.

I put in [the equivalent of] that fix; svn up.  I think maybe I should write a unit test for this rather than making you suffer through 24-hour debug cycles!

By the way, does it work if you don't use the "in-memory" (-M) flag?

--dstn

Dustin Lang

Apr 3, 2013, 5:08:37 PM
to astro...@googlegroups.com
I added a unit test of the quadfile module -- it successfully writes and then reads 1e9 quads after the fixes you sent.

Thanks for your patience and testing!

--dstn

Sergey Koposov

Apr 3, 2013, 5:15:35 PM
to astro...@googlegroups.com
No problem.

      S

Sergey Koposov

Apr 4, 2013, 3:11:48 PM
to astro...@googlegroups.com


On Wednesday, 3 April 2013 22:08:37 UTC+1, Dustin Lang wrote:
Hi Dustin,

Apparently the problems aren't over. The run using temporary files finished before the memory-only run and produced the following error:

 fitsbin.c:523:read_chunk: Expected table size (5629497760 => 463379 FITS blocks) is not equal to size of table "codes" (1334531264 => 463378 FITS blocks).
codefile.c:182:codefile_open: Failed to open codes file
codetree.c:56:codetree_files: Failed to read code file .//tmp.code.7pjEta
build-index.c:140:step_codetree: codetree failed

Looks like a weird off-by-one error (463379 vs 463378)...

The command line was
 ../../astrometrynet2/bin/build-index -i ../stars.fits -o index-${P}00.fits -I ${P}00 -P 0 -S r -n 100 -L 20 -E -j 0.4 -r 1 -t ./

Cheers,
      S

Dustin Lang

Apr 4, 2013, 4:35:45 PM
to astro...@googlegroups.com
That could be due to a bug in the code that pads the FITS files out to a whole number of FITS blocks.

Argh.  I am going to write another unit test...

Oh, I see, I thought the 'fitsbin.c' code used my 'anqfits' big-file-safe code, but it uses the vanilla qfits code.  Oops!  I am working on a fix.

--dstn
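
The numbers in the error message are consistent with 32-bit truncation rather than a true off-by-one, although the thread does not confirm exactly where the truncation happens. FITS files are organised in 2880-byte blocks, so the 5629497760-byte "codes" table really needs 1954687 blocks; the two block counts reported in the message are what comes out when byte counts are first reduced to their low 32 bits. The sketch below only checks that arithmetic. The helper names fits_blocks_needed_64 and low32 are hypothetical and are not the functions used in fitsbin.c or qfits.

```c
#include <assert.h>
#include <stdint.h>

#define FITS_BLOCK_SIZE 2880u  /* bytes per FITS block */

/* Whole blocks needed to hold nbytes, rounding up, in 64-bit arithmetic. */
uint64_t fits_blocks_needed_64(uint64_t nbytes)
{
    return (nbytes + FITS_BLOCK_SIZE - 1) / FITS_BLOCK_SIZE;
}

/* The low 32 bits of a byte count: what survives if the value passes
 * through a 32-bit integer somewhere along the way. */
uint32_t low32(uint64_t nbytes)
{
    return (uint32_t)nbytes;
}
```

With these helpers, fits_blocks_needed_64(5629497760) is 1954687. The low 32 bits of 5629497760 are 1334530464, which rounds up to 463379 blocks, the "expected" count in the message. The table padded to whole blocks is 1954687 * 2880 = 5629498560 bytes, whose low 32 bits are 1334531264, exactly the size reported for the "codes" table; that value contains 463378 complete blocks.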


Dustin Lang

Apr 4, 2013, 5:50:02 PM
to astro...@googlegroups.com
Try now...

Sergey Koposov

Apr 4, 2013, 6:10:56 PM
to astro...@googlegroups.com


On Thursday, 4 April 2013 22:50:02 UTC+1, Dustin Lang wrote:
Try now...

Thanks.
I'll tell you in a day ) whether that works.

Sergey Koposov

Apr 5, 2013, 8:01:07 PM
to astro...@googlegroups.com


On Thursday, 4 April 2013 22:50:02 UTC+1, Dustin Lang wrote:
Try now...


Hi Dustin,
Just FYI, svn r22527 still produces the same issue:

codetree: building KD tree for .//tmp.code.Z6AhBU
       will write KD tree file .//tmp.ckdt.GE4YT1
Reading codes...

fitsbin.c:523:read_chunk: Expected table size (5629497760 => 463379 FITS blocks) is not equal to size of table "codes" (1334531264 => 463378 FITS blocks).
codefile.c:182:codefile_open: Failed to open codes file
codetree.c:56:codetree_files: Failed to read code file .//tmp.code.Z6AhBU

build-index.c:140:step_codetree: codetree failed

I've updated to the latest revision, 22530, and restarted.

Sergey Koposov

Apr 9, 2013, 4:05:06 PM
to astro...@googlegroups.com
Hi Dustin,

Just to close the issue -- with the 22530 revision, I've finally been able to finish index creation (at least in the not-in-memory regime), so
it looks like most of the bugs related to big files have been fixed.
Thanks.
Cheers, 
          S