Creation of indexes based on GAIA EDR 3

Rafael Morales

Nov 5, 2021, 8:05:16 AM
to astrometry
Dear all,

  I have created a set of indexes based on Gaia EDR3 following the steps described in: "http://astrometry.net/doc/build-index.html"
  I want to share the process in case anyone wants to create their own indexes.
  I have two main concerns:

    a) The resulting indexes are not as compact as the Tycho-2 + Gaia-DR2 indexes in: "https://portal.nersc.gov/project/cosmo/temp/dstn/index-5200/LITE/"

    b) The resulting indexes are not valid FITS according to fitsverify ("https://heasarc.gsfc.nasa.gov/docs/software/ftools/fitsverify/"). However, the same is true for the indexes provided in:
       "http://data.astrometry.net/4200/"
       "https://portal.nersc.gov/project/cosmo/temp/dstn/index-5200/LITE/"

    I think the problem is in the program "build-astrometry-index", but I do not know where.

  Any comments will be appreciated.
  All scripts were run on GNU/Linux.

  //--------------------------------------
  1)   First, you need a csv file with the list of Gaia EDR3 sources. In my case, I downloaded the sources from
  "http://cdn.gea.esac.esa.int/Gaia/gedr3/gaia_source/" and imported them into a NoSQL database (MongoDB: https://www.mongodb.com/).
  This is the script for creating the csv file from the database:

  #create csv
  DATABASE=YOUR_DATABASE
  COLLECTION=YOUR_COLLECTION
  OUTPUT_FILENAME="gaia_edr_3.csv"
  FIELD_LIST=_id,ra,dec,phot_g_mean_flux


  ./mongoexport --config=./config.yaml --db=$DATABASE --collection=$COLLECTION --type=csv --fields=$FIELD_LIST --query '{ "_id": { "$gte": 0  } }' --out=$OUTPUT_FILENAME


   gaia_edr_3.csv file size: 137349819612 B => 127.92 GiB

  //--------------------------------------
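  2) Now sort the csv file numerically by ra,dec
   #the "g" suffix compares the keys as numbers instead of text; LC_ALL=C keeps "." as the decimal point
   LC_ALL=C sort  -t","  -k2,2g -k3,3g   -o gaia_edr_3_sorted.csv gaia_edr_3.csv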
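
A note on the sort keys: a plain `-k2,2` key compares the ra column as text, while `-k2,2g` (GNU sort's general-numeric mode) compares it as a number. A minimal sketch with three made-up rows, in the same column order as the export, shows the difference. The rows are invented for illustration and are not Gaia data.

```shell
#!/bin/bash
# Three made-up rows in the export's column order: _id,ra,dec,flux
rows='1,10.5,-3.0,100
2,9.25,45.0,200
3,100.0,0.5,300'

# Plain key: the ra column is compared as text
text_order=$(printf '%s\n' "$rows" | LC_ALL=C sort -t, -k2,2 -k3,3 | cut -d, -f2 | paste -sd' ' -)

# General-numeric key: the ra column is compared as a number
numeric_order=$(printf '%s\n' "$rows" | LC_ALL=C sort -t, -k2,2g -k3,3g | cut -d, -f2 | paste -sd' ' -)

echo "text order:    $text_order"
echo "numeric order: $numeric_order"
```

With the text comparison, 9.25 ends up after 100.0; with the numeric key the rows come out in increasing ra.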

  //--------------------------------------
  3) Create a FITS binary table from the csv file
  Due to the large number of sources (1811709771), traditional tools do not work, so I've created a new tool called "gecibat" that creates the FITS file in stream mode: "https://gitlab.com/rmorales_iaa/gecibat"
 
   ./gecibat gaia_edr_3_sorted.csv gaia_edr_3.fits

   gaia_edr_3.fits file size 57974719680 B => 53.99 GiB
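
As a sanity check on that size, here is a small arithmetic sketch. It is not taken from gecibat itself and rests on three assumptions: each row holds four 8-byte values (_id, ra, dec, phot_g_mean_flux), the file has one 2880-byte block for the primary header plus one for the table header, and the data is padded to a whole number of 2880-byte FITS blocks. Under those assumptions the result matches the reported file size.

```shell
#!/bin/bash
rows=1811709771      # number of sources quoted in step 3
bytes_per_row=32     # assumption: 4 columns x 8 bytes each
block=2880           # FITS files are organised in 2880-byte blocks
header_blocks=2      # assumption: primary header + binary table header, one block each

data_bytes=$(( rows * bytes_per_row ))
data_blocks=$(( (data_bytes + block - 1) / block ))   # round up to whole blocks

total=$(( (data_blocks + header_blocks) * block ))
echo "expected FITS size: $total B"
awk -v b="$total" 'BEGIN { printf "%.2f GiB\n", b / (1024 * 1024 * 1024) }'
```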
  //--------------------------------------
  4) Split the FITS binary table into one file per healpix

   rm -fr split_heal_pix
   mkdir split_heal_pix
   cd split_heal_pix
   hpsplit ../gaia_edr_3.fits -o  gaia_ed3_heal_pix-%02i.fits -n 2 -m 0.01

   First 6 heal pix:
   153714240 Nov  4 11:56 gaia_ed3_heal_pix-00.fits
   577376640 Nov  4 11:56 gaia_ed3_heal_pix-01.fits
   817413120 Nov  4 11:56 gaia_ed3_heal_pix-02.fits
   948913920 Nov  4 11:56 gaia_ed3_heal_pix-03.fits
   146597760 Nov  4 11:56 gaia_ed3_heal_pix-04.fits
   304107840 Nov  4 11:56 gaia_ed3_heal_pix-05.fits
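
For reference, the number of files to expect follows from the HEALPix rule of 12 * Nside^2 pixels over the whole sky. The sketch below only does that arithmetic and prints the file names implied by the `-o` pattern used above; it does not call hpsplit.

```shell
#!/bin/bash
nside=2
n_pix=$(( 12 * nside * nside ))   # HEALPix: 12 * Nside^2 pixels on the sphere

echo "Nside=$nside -> $n_pix healpixes (0..$(( n_pix - 1 )))"

# File names implied by the pattern gaia_ed3_heal_pix-%02i.fits
expected=()
for (( hp = 0; hp < n_pix; hp++ )); do
  expected+=( "$(printf 'gaia_ed3_heal_pix-%02i.fits' "$hp")" )
done
echo "first: ${expected[0]}  last: ${expected[n_pix - 1]}"
```

With Nside = 2 this gives healpixes 0 to 47, which is the range the indexing script in the next step loops over.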
  //--------------------------------------
  5) Create the indexes (final size: 43.6 GiB) using the following script:

#!/bin/bash
#------------------------------------------------------------------------------
#Create the indexes files used by astrometry.net
#it requires gnu "parallel" : https://www.gnu.org/software/bash/manual/html_node/GNU-Parallel.html
#debian: sudo apt install parallel
#fedora: sudo dnf install parallel
#------------------------------------------------------------------------------
INPUT_DIR=split_heal_pix/
INPUT_FILE_PREFIX=gaia_ed3_heal_pix-
OUTPUT_DIR=output
#-----------------------------------------------------------------------------
#set -xv  #DEBUG ON
#-----------------------------------------------------------------------------
#to set scale, see: http://astrometry.net/doc/readme.html#getting-index-files

#healpix NSide = 2
MIN_HEAL_PIX=0
MAX_HEAL_PIX=47

#from images about 6 arcmin across (MIN_SCALE=0) to about 2 degrees across (MAX_SCALE=8)
MIN_SCALE=0   
MAX_SCALE=8

MARGIN_DEGREES=0.01
JITTER_ARCOSEC=0.1

HEAL_PIX_N_SIDE=2

SORTING_FIELD=FLUX #if you are using a magnitude field use '-J' instead '-f' in the command line below
DEDUPLICATION_RADIUS_ARCOSEC=1
MAX_REUSES=16
#------------------------------------------------------------------------------
startTime=$(date +'%s')
echo "========================================================================="
echo $(date +"%Y-%m-%d:%Hh:%Mm:%Ss") 'Starting script 3'
echo "========================================================================="
#------------------------------------------------------------------------------
rm -fr $OUTPUT_DIR
mkdir $OUTPUT_DIR

DATE=$(date '+%y%m%d')
echo "SCRIPT 3. CURRENT PATH        : '$PWD'"
echo "SCRIPT 3. Building GAIA indexes on directory:'$OUTPUT_DIR'"

for ((HEAL_PIX=MIN_HEAL_PIX; HEAL_PIX<=MAX_HEAL_PIX; HEAL_PIX++)); do
  echo "============>heal pix:$HEAL_PIX starts <============"
  for ((SCALE=$MIN_SCALE; SCALE<=$MAX_SCALE; SCALE++)); do
     
    echo "------>heal pix:$HEAL_PIX scale:$SCALE <-------"
    
    HEALPIX_FORMATTED=$(printf "%02d" $HEAL_PIX)
    SCALE_FORMATTED=$(printf "%02d" $SCALE)

    #input: one hpsplit tile; output: one index per (healpix, scale) pair
    INPUT_FITS=$INPUT_DIR/$INPUT_FILE_PREFIX$HEALPIX_FORMATTED.fits
    ONAME=$OUTPUT_DIR/index-$HEALPIX_FORMATTED-$SCALE_FORMATTED.fits

    #unique index id: healpix number followed by scale number
    ID=$HEALPIX_FORMATTED$SCALE_FORMATTED
    
    sem -j +0 ./build-astrometry-index \
    -I $ID \
    -i $INPUT_FITS \
    -o $ONAME \
    -H $HEAL_PIX \
    -P $SCALE \
    -s $HEAL_PIX_N_SIDE \
    -S $SORTING_FIELD \
    -f \
    -j $JITTER_ARCOSEC \
    -r $DEDUPLICATION_RADIUS_ARCOSEC \
    -L $MAX_REUSES \
    -m $MARGIN_DEGREES \
    -M    
            
  done
  sem --wait
  echo "============>heal pix:$HEAL_PIX ends <============"
done  

#-------------------------------------------------------------------------------
echo "========================================================================="
echo $(date +"%Y-%m-%d:%Hh:%Mm:%Ss") 'End of script 3'
echo "========================================================================="
echo "Elapsed time: $(($(date +'%s') - $startTime))s"
echo "-------------------------------------------------------------------------"



First indexes for heal pix 0
408234240 Nov  4 19:54 index-00-00.fits
248892480 Nov  4 19:50 index-00-01.fits
131970240 Nov  4 19:48 index-00-02.fits
 66602880 Nov  4 19:48 index-00-03.fits
 33157440 Nov  4 19:47 index-00-04.fits
 16704000 Nov  4 19:47 index-00-05.fits
  8349120 Nov  4 19:47 index-00-06.fits
  4230720 Nov  4 19:47 index-00-07.fits
  2142720 Nov  4 19:47 index-00-08.fits
 


That's all :-)

Dustin Lang

Nov 5, 2021, 8:22:06 AM
to Rafael Morales, astrometry
For DR2, we just downloaded each of the CSV files, converted each one to FITS, and then fed those to hpsplit -- no need for mongo or to create a single giant csv then a single giant FITS :)  But whatever works!
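
The per-file route described above could be sketched roughly as follows. This is a dry run that only prints the commands it would issue. `csv_to_fits` is a hypothetical placeholder for whatever CSV-to-FITS converter is used, the two input file names are made up, and passing several input catalogs to a single hpsplit call is an assumption worth checking against `hpsplit -h`. The hpsplit options are the ones from step 4 of the original post.

```shell
#!/bin/bash
# Dry-run sketch: prints the commands instead of running them.
# csv_to_fits is a HYPOTHETICAL placeholder, not a real tool name.

plan_pipeline() {
  local out_dir=$1; shift
  local csv fits
  local fits_list=()
  for csv in "$@"; do
    # one small FITS table per downloaded CSV file
    fits="$out_dir/$(basename "${csv%.csv.gz}").fits"
    echo "csv_to_fits $csv $fits"
    fits_list+=( "$fits" )
  done
  # a single hpsplit call over all the small FITS tables (assumed to be accepted)
  echo "hpsplit ${fits_list[*]} -o gaia_ed3_heal_pix-%02i.fits -n 2 -m 0.01"
}

# made-up file names, for illustration only
plan=$(plan_pipeline fits GaiaSource_A.csv.gz GaiaSource_B.csv.gz)
echo "$plan"
```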

One issue: Gaia-EDR3 (like previous releases) is not complete at the bright end (around 11th-12th mag).  For the 5200-series, I merged Tycho-2 and Gaia (carefully) to create a more complete input catalog before hpsplit and indexing.

Regarding the 'invalid FITS' complaint -- yes, we abuse the FITS convention, storing binary data as strings, and apparently fitsverify doesn't like non-ASCII strings.  Presumably we could tell it we are storing uint8 and it would not complain.

cheers,
--dustin




Rafael Morales

Nov 5, 2021, 9:08:37 AM
to astrometry
Hi Dustin,

Thank you for your quick response.

I do not understand why my final indexes have sizes from 2 MiB to 485 MiB (the Gaia-DR2 ones are always around 385 MiB). Is it because of the Tycho-2 catalog?
One more question: is it necessary to sort the big FITS file before splitting?

Thanks in advance.

Rafa.

Dustin Lang

Nov 5, 2021, 9:19:43 AM
to Rafael Morales, astrometry
The file sizes look fine -- index scale 00 selects ~2x as many stars as index scale 01, and so on.
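
The factor of roughly two can be checked against the sizes listed for healpix 0 in the first message. The sketch below just divides each file size by the next one; file size is only a rough proxy for the number of stars selected, so the ratios are indicative rather than exact.

```shell
#!/bin/bash
# Sizes in bytes of index-00-00 .. index-00-08, copied from the listing in the first message
sizes="408234240 248892480 131970240 66602880 33157440 16704000 8349120 4230720 2142720"

# Ratio between each scale and the next one
ratios=$(echo "$sizes" | awk '{
  for (i = 1; i < NF; i++)
    printf "scale %02d -> %02d: %.2f\n", i - 1, i, $i / $(i + 1)
}')
echo "$ratios"
```

The ratios are close to 2 from scale 02 upward and somewhat lower between the two or three smallest scales.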

You don't need to sort your big FITS file because you're giving hpsplit the "-S" (sort) command-line arg.



Rafael Morales

Nov 8, 2021, 2:38:45 AM
to astrometry
Thank you! Now it is clear.