Data Driven Astronomy & Gaia DR1

John Murrell

Mar 1, 2017, 5:35:46 PM

to alta...@googlegroups.com

Hi everyone,

Current telescopes and missions are producing very large data sets. For instance, Gaia DR1 has over 1.1 billion entries in its initial release, and this includes only basic information on the observed stars. There is a lot more to come, particularly when LSST and the SKA come online. Astronomy is increasingly becoming a science of analysing these data sets to find unusual and interesting objects. Two new branches spanning information theory, statistics and astronomy now exist, Astrostatistics and Astroinformatics, to try to make sense of the data avalanche.

To illustrate the size of the problem, even with the restricted data in Gaia DR1, I tried to run a query on the database to count the number of stars observed by Gaia in 0.1 magnitude bins. Fairly easy, one would think, but the query timed out after 30 minutes of CPU time without completing. Having got some advice from the Gaia Data Team, I ran a revised query on a random 1/1000th of the data, which took only a few minutes to complete. I am not sure how one manages to do anything complicated over a large area.
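For anyone curious what such a sampled query might look like, here is a minimal sketch in Python that only builds the ADQL text. It is not the query actually run and has not been tested against the archive. The table name gaiadr1.gaia_source, the columns phot_g_mean_mag and random_index, and the idea that random_index runs from 0 up to the number of rows are assumptions about the DR1 schema. The total of 1.1 billion is the rounded figure from this post, and the function name is purely illustrative.

```python
# Sketch only: schema names and the row count are assumptions, not checked
# against the live Gaia archive.
APPROX_TOTAL_SOURCES = 1_100_000_000  # rounded "over 1.1 billion" from the post


def build_binned_count_query(one_in=1000, bin_width=0.1,
                             table="gaiadr1.gaia_source",
                             total_sources=APPROX_TOTAL_SOURCES):
    """Return ADQL text that counts sources per G-magnitude bin
    on a random 1-in-`one_in` subsample of the catalogue."""
    if one_in < 1:
        raise ValueError("one_in must be a positive integer")
    # Assumes random_index is a random permutation of 0..total_sources-1,
    # so a simple range cut picks a random subset.
    cutoff = total_sources // one_in
    return (
        f"SELECT FLOOR(phot_g_mean_mag / {bin_width}) AS mag_bin, "
        f"COUNT(*) AS n_stars\n"
        f"FROM {table}\n"
        f"WHERE random_index < {cutoff}\n"
        f"GROUP BY mag_bin\n"
        f"ORDER BY mag_bin"
    )


if __name__ == "__main__":
    print(build_binned_count_query())
```

A range cut on random_index is used here, rather than a test on every row, on the assumption that the column is indexed and the archive can then skip most of the table.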

As this is the new reality of astronomical research, a knowledge of the tools available is useful if not vital. To this end the University of Sydney has written a MOOC on Data-driven Astronomy which covers some of the theory and practicalities of analysing large data sets. A number of experts in the field are making contributions, including Prof Karen Masters from Portsmouth (which is how I got to hear about the course). If you are interested in keeping up with how to mine the data in the modern world of astronomy, the details of the course are at https://www.coursera.org/learn/data-driven-astronomy. It starts on 13th March 2017, so there is plenty of time to enrol.

Hopefully the article I am writing using the Gaia DR1 statistics will be complete before I start this course.

Good data mining,

John Murrell

J R

Mar 2, 2017, 3:32:15 AM

to alta...@googlegroups.com

Interesting, John. Big data seems to crop up all over these days, from marketing information (your Tesco Clubcard) to gene analysis, forensic accountancy audits, counter-terrorism and criminal investigations, and population censuses, not to mention Amazon and Google predicting what you want to buy next or find out about. I wonder if this is a subject area where a cross-disciplinary approach would be extremely useful. Otherwise there will be much reinventing of wheels, many of which won't be quite as round as they could be.

As Hilbert the mathematician is meant to have said, after he was consulted by Einstein on general relativity and then published ahead of him, physics is too important to be left to physicists.

James

John Murrell

Mar 2, 2017, 4:21:26 AM

to alta...@googlegroups.com

Hello James,

The challenge is that we are moving from the era where you could download the data to your local machine and then analyse it, back to the era where the computing has to be done where the data is stored. Part of the challenge is how the CPU time and the data storage are paid for. At present ESA seems willing to allow people ½ hour of free usage per query.

I did some analysis of the original Hipparcos data for a project in 1999. Getting the data was relatively easy: one just sent some money to ESA and they sent you a book describing the data and a folder of CD-ROMs. The problem then was analysing the data, as the size of the data set was at or beyond what PCs could handle at the time.

Now I can download the Hipparcos data over the internet, analyse it and print a graph of the results in well under an hour.

However, the Gaia data set is so large that the internet is not fast enough to deliver it, and the cost of local storage is probably prohibitive. SKA and LSST will be even worse, of course.

It is interesting that I would not be able to run the same analysis on the Gaia data that I did on the Hipparcos data, as ½ hour is nowhere near long enough to run the query. From what I can see of other queries being run on the Gaia data, they are looking at very small areas; the indexing must make extracting the data for small areas more efficient. That is where skill in information management is required, to minimise the amount of CPU and disk time.

The query I eventually ran to calculate the number of stars observed by Gaia relied on the provision of a random index. Selecting 0.1% of the data using the random index should give me ‘approximate’ statistics for the whole data set, assuming that the index is in fact random!
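To put a rough number on ‘approximate’, here is a small Python sketch of how counts from a 1-in-1000 sample could be scaled up to estimates for the whole catalogue. It assumes the sample really is random, so that each bin count behaves like a Poisson count with uncertainty of about its square root. The function name and the example counts are made up for illustration and are not Gaia results.

```python
import math


def scale_sample_counts(sample_counts, one_in=1000):
    """Scale per-bin counts from a random 1-in-`one_in` sample up to
    full-catalogue estimates.

    Returns {bin: (estimated_total, approx_one_sigma)}, treating each
    sample count n as Poisson, i.e. with uncertainty sqrt(n).
    """
    return {
        mag_bin: (n * one_in, math.sqrt(n) * one_in)
        for mag_bin, n in sample_counts.items()
    }


if __name__ == "__main__":
    # Hypothetical sample counts keyed by magnitude bin index (not real data).
    example = {150: 400, 205: 10}
    for mag_bin, (total, sigma) in scale_sample_counts(example).items():
        print(f"bin {mag_bin}: about {total} +/- {sigma:.0f}")
```

The point of the sketch is that well-populated bins scale up with a small relative error (400 in the sample is about 5%), while sparse bins at the bright or faint ends are much less certain (10 in the sample is about 30%).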

Challenging times ahead

John

J R

Mar 3, 2017, 5:23:46 AM

to alta...@googlegroups.com

Hi John

You are right. It is all in the indexing. One audit I was involved in produced a database of approaching half a million emails in no particular order, this being a data dump of everything across the large organisation between two dates, much of it irrelevant. This was the best they could do. They said!

People cleverer than I used linking index information such as date, subject and recipients to recreate much order from chaos. Back in the day, paper files clustered relevant information in chronological order, and file registries allowed you to do quick searches for the clusters.

Not these days, in my experience of many electronic filing systems, evidently now including star catalogues. As with any taxonomy facilitating meaningful data subsets, the tricky part is finding the optimal indexing and data-linking classifications. It is never perfect, though. As a young boy sent out to buy birthday cake candles from Woolworths, I foolishly went to the cooking section but was redirected to the household section where the bigger candles were stored.