Hi everyone,
Current telescopes and missions are producing very large data sets. For instance, Gaia DR1 has over 1.1 billion entries in the initial release, and this only includes basic information on the observed stars. There is a lot more to come in the future, particularly when LSST and the SKA come online. Astronomy is increasingly becoming a science of analysing these data sets to find unusual and interesting objects. Two new branches of information theory / statistics / astronomy, Astrostatistics and Astroinformatics, now exist to try to make sense of the data avalanche.
To illustrate the size of the problem, even with the restricted data in Gaia DR1, I tried to run a query on the database to count the number of stars observed by Gaia in 0.1 magnitude bins. Fairly easy, one would think, but the query timed out after 30 minutes of CPU time without completing. Having got some advice from the Gaia Data Team, I ran a revised query on a random 1/1000th of the data, which only took a few minutes to complete. I am not sure how one manages to do anything complicated on a large area.
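A query of this kind could look roughly like the sketch below. It is not the query that was actually run, which is not reproduced in this message. The table name gaiadr1.gaia_source and the columns phot_g_mean_mag and random_index are assumed from the public Gaia DR1 archive schema, the row total is only the approximate "over 1.1 billion" figure quoted above, and the helper build_binned_count_query is purely illustrative.

```python
def build_binned_count_query(bin_width=0.1,
                             sample_fraction=0.001,
                             total_rows=1_100_000_000,
                             table="gaiadr1.gaia_source"):
    """Build an ADQL query that counts stars per magnitude bin.

    Assumptions (not taken from the original message):
      * random_index is a random permutation of 0 .. total_rows - 1,
        so keeping rows below a cutoff gives a random subsample;
      * total_rows is only approximate ("over 1.1 billion").
    """
    # Number of rows to keep, e.g. 0.1% of the catalogue.
    cutoff = round(total_rows * sample_fraction)
    return (
        f"SELECT FLOOR(phot_g_mean_mag / {bin_width}) AS mag_bin, "
        f"COUNT(*) AS n_stars\n"
        f"FROM {table}\n"
        f"WHERE random_index < {cutoff}\n"
        f"GROUP BY mag_bin\n"
        f"ORDER BY mag_bin"
    )


if __name__ == "__main__":
    # Print the query text so it can be pasted into an ADQL query form.
    print(build_binned_count_query())
```

Multiplying each returned count by 1 / sample_fraction would then give an approximate count for the full catalogue. Setting sample_fraction to 1.0 gives the full-table version, which is the kind of query that timed out.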
As this is the new reality of astronomical research, a knowledge of the tools available is useful if not vital. To this end the University of Sydney have written a MOOC on Data-driven Astronomy which covers some of the theory and practicalities of analysing large data sets. A number of experts in the field are making contributions, including Prof Karen Masters from Portsmouth (which is how I got to hear about the course). If you are interested in keeping up with how to mine the data in the modern world of astronomy, the details of the course are at: https://www.coursera.org/learn/data-driven-astronomy. It starts on 13th March 2017, so there is plenty of time to enrol.
Hopefully the article I am writing using the Gaia DR1 statistics will be complete before I start this course.
Good data mining,
John Murrell
--
You received this message because you are subscribed to the Google Groups "Altair_B" group.
Hello James,
The challenge is that we are moving from the era where you could download the data to your local machine and then analyse it, back to the era where the computing has to be done where the data is stored. Part of the challenge is how the CPU time and the data storage are paid for. At present ESA seem willing to allow people half an hour of free usage per query.
I did some analysis of the original Hipparcos data for a project in 1999. Getting the data was relatively easy: one just sent some money to ESA and they sent you a book describing the data and a folder of CD-ROMs. The problem then was analysing the data, as the size of the data set was at or beyond what PCs could handle at the time.
Now I can download the data over the internet, analyse it and print a graph of the results in well under an hour.
However, the Gaia data set is so large that the internet is not fast enough to deliver it, and the cost of local storage is probably prohibitive. SKA and LSST will be even worse, of course.
It is interesting that I would not be able to run the same analysis I did on the Hipparcos data on the Gaia data, as half an hour is nowhere near long enough to run the query. From what I can see of other queries being run on the Gaia data, they are looking at very small areas, so the indexing must make extracting the data for small areas more efficient. That is where skill in information management is required, to minimise the amount of CPU and disk time.
The query I eventually ran to calculate the number of stars observed by Gaia relied on the provision of a random index. Selecting 0.1% of the data using the random index should give me 'approximate' statistics for the whole data set, assuming that the index is in fact random!
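The idea can be tried out on made-up data. The sketch below is only an illustration under stated assumptions: the magnitudes are synthetic (not Gaia data), the random index is simulated as a shuffled list of row numbers, and the function names bin_counts and estimate_from_sample are hypothetical. It bins a random fraction of the rows and scales the counts back up to estimate the full-catalogue counts.

```python
import random
from collections import Counter


def bin_counts(mags, bin_width=0.1):
    """Count values per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    return Counter(int(m // bin_width) for m in mags)


def estimate_from_sample(mags, fraction=0.001, bin_width=0.1, seed=1):
    """Estimate full-catalogue bin counts from a random subsample.

    The 'random index' is simulated here as a random permutation of the
    row numbers; rows whose index falls below a cutoff form the sample.
    """
    rng = random.Random(seed)
    n = len(mags)
    index = list(range(n))
    rng.shuffle(index)
    cutoff = max(1, round(n * fraction))
    sample = [m for m, idx in zip(mags, index) if idx < cutoff]
    scale = n / len(sample)  # scale sample counts up to the full data set
    return {b: c * scale for b, c in bin_counts(sample, bin_width).items()}


if __name__ == "__main__":
    # Synthetic magnitudes with an arbitrary, made-up distribution.
    rng = random.Random(0)
    mags = [rng.triangular(5.0, 21.0, 19.0) for _ in range(100_000)]
    exact = bin_counts(mags, bin_width=1.0)
    approx = estimate_from_sample(mags, fraction=0.01, bin_width=1.0)
    for b in sorted(exact):
        print(b, exact[b], round(approx.get(b, 0.0)))
```

The well-populated bins should come out close to the exact counts, while sparsely populated bins (the very bright stars, in the real catalogue) will be noisy or missing from a small sample. All of this rests on the index being genuinely random.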
Challenging times ahead
John