Synthetic training samples


Colin Navin

May 24, 2016, 2:17:27 AM
to astroML-general
Hi all

Sorry, this is probably a naive question; I'm a newbie to machine learning.

I'm interested in finding clusters of stars in survey datasets. The datasets I am looking at are relatively large (the smallest is about 70,000 stars) with various combinations of measurements and errors (positions, magnitudes, radial velocities, stellar parameters, metallicities, chemical abundances, etc.).

I was looking at using support vector machine or k-nearest-neighbour methods to find my cluster members. However, I find that the datasets often have only a few (or no) known members to use as a training set. Even if they do have a few known members, the distributions of the parameters of interest can often be skewed from the literature values of the cluster (presumably in part because of the small numbers). So normally in this case I guess you would move on to unsupervised methods?

However, the literature values and dispersions of the cluster parameters are mostly well established and the data errors are quoted. I think it would be feasible to construct expected PDFs of radial velocities etc. of cluster stars and sample them to produce a training sample. I was wondering if it is a reasonable approach to produce a "synthetic cluster" sample to use as a training set? Or is this going against some fundamental principle in data analysis?
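
To make the idea concrete, here is a minimal sketch of drawing such a "synthetic cluster" in radial velocity. It assumes the intrinsic velocity distribution is Gaussian with a literature mean and dispersion, and that per-star errors can be resampled from the survey's quoted errors. The function name, parameters, and all numbers are hypothetical placeholders, not values from any real cluster or survey.

```python
import numpy as np

def sample_synthetic_cluster(n_stars, rv_mean, rv_dispersion, rv_errors, seed=0):
    """Draw observed radial velocities for a hypothetical synthetic cluster.

    Assumption: intrinsic velocities are Gaussian (rv_mean, rv_dispersion),
    and each star's measurement error is resampled from the quoted errors.
    """
    rng = np.random.default_rng(seed)
    rv_errors = np.asarray(rv_errors, dtype=float)

    # Assign each synthetic star an error bar drawn from the quoted errors.
    errs = rng.choice(rv_errors, size=n_stars, replace=True)

    # Intrinsic ("true") velocities from the literature mean and dispersion.
    true_rv = rng.normal(rv_mean, rv_dispersion, size=n_stars)

    # Scatter each true velocity by its own Gaussian measurement error.
    observed_rv = true_rv + rng.normal(0.0, errs)
    return observed_rv, errs

# Placeholder values, for illustration only (km/s).
rv_obs, rv_err = sample_synthetic_cluster(
    n_stars=5000, rv_mean=100.0, rv_dispersion=5.0, rv_errors=[1.0, 2.0, 3.0]
)
```

The observed spread is wider than the intrinsic dispersion because the measurement errors add in quadrature, which is one reason a handful of real members can look skewed relative to literature values.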

cheers
Colin

Jake Vanderplas

May 24, 2016, 12:41:56 PM
to Colin Navin, astroML-general
Hi Colin,
When you are relying on synthetic training samples, there are often better approaches than supervised learning. Any supervised method trained on such samples and applied to real-world data can never exceed the accuracy of the synthetic model itself, and in most cases will be slightly worse. With that in mind, it's usually much better to use the synthetic modeling process *directly* as a forward-model of your data.

For forward-modeling, I find Bayesian approaches to be most useful. If you need a refresher on these, you might check out the material from the workshop on that subject that I gave at the last AAS conference: https://github.com/jakevdp/BayesianAstronomy
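
As a rough illustration of using the model directly (a sketch under stated assumptions, not the workshop's code), the snippet below computes a per-star posterior membership probability from a two-component mixture in radial velocity. It assumes both the cluster and a hypothetical field population are Gaussian, that measurement errors add in quadrature to each dispersion, and that the prior cluster fraction is known. All names and numbers are placeholders.

```python
import numpy as np

def gaussian_pdf(x, mean, sigma):
    """Normal probability density evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def membership_probability(rv, rv_err, cluster_mean, cluster_disp,
                           field_mean, field_disp, cluster_fraction):
    """Posterior probability that each star belongs to the cluster.

    Assumption: a two-component Gaussian mixture (cluster + field), with
    each star's error convolved into the component dispersions.
    """
    rv = np.asarray(rv, dtype=float)
    rv_err = np.asarray(rv_err, dtype=float)

    # Convolving two Gaussians adds their widths in quadrature.
    sigma_cluster = np.hypot(cluster_disp, rv_err)
    sigma_field = np.hypot(field_disp, rv_err)

    # Prior-weighted likelihood of each star under each component.
    like_cluster = cluster_fraction * gaussian_pdf(rv, cluster_mean, sigma_cluster)
    like_field = (1.0 - cluster_fraction) * gaussian_pdf(rv, field_mean, sigma_field)

    # Bayes' theorem: normalise by the total mixture likelihood.
    return like_cluster / (like_cluster + like_field)

# Placeholder values, for illustration only (km/s).
p_member = membership_probability(
    rv=[100.0, 0.0], rv_err=[1.0, 1.0],
    cluster_mean=100.0, cluster_disp=5.0,
    field_mean=0.0, field_disp=40.0,
    cluster_fraction=0.05,
)
```

In a fuller analysis the cluster fraction and the field parameters would themselves be fitted (for example by sampling their posterior) rather than fixed, and more observables than radial velocity would enter the likelihood.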

Hope that helps, and best of luck!
   Jake


Colin Navin

May 25, 2016, 7:30:26 PM
to astroML-general, co...@navinator.net
Hi Jake
Thanks for that. I wasn't even sure if the idea of a synthetic sample was a thing. I will have a look at forward modelling and the workshop material.
cheers
Colin