Hi all
Sorry if this is a naive question; I'm a newbie to machine learning.
I'm interested in finding clusters of stars in survey datasets. The datasets I'm looking at are relatively large (the smallest has about 70,000 stars), with various combinations of measurements and errors (positions, magnitudes, radial velocities, stellar parameters, metallicities, chemical abundances, etc.).
I was looking at using support vector machine or k-nearest-neighbour methods to identify cluster members. However, I find that the datasets often have only a few (or no) known members to use as a training set. Even when a few known members exist, the distributions of the parameters of interest are often skewed relative to the literature values for the cluster (presumably in part because of the small numbers). So normally in this case I guess you would move on to unsupervised methods?
However, the literature values and dispersions of the cluster parameters are mostly well established, and the data errors are quoted. I think it would be feasible to construct expected PDFs of the radial velocities (and other parameters) of cluster stars and sample from them to produce a training set. Is it reasonable to build a "synthetic cluster" sample like this and use it as a training set (see the sketch below for roughly what I have in mind)? Or does this go against some fundamental principle of data analysis?
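To make that concrete, here's a minimal sketch of the idea, assuming Gaussian PDFs for radial velocity and [Fe/H]. All the numbers (means, dispersions, errors) are made up for illustration, and I've used scikit-learn's SVC just as an example classifier:

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)

# Hypothetical literature values for the cluster (illustrative only)
rv_mean, rv_disp = -45.0, 1.5      # radial velocity [km/s]: mean, intrinsic dispersion
feh_mean, feh_disp = -0.70, 0.05   # [Fe/H]: mean, intrinsic dispersion
rv_err, feh_err = 2.0, 0.10       # typical quoted measurement errors

n_synth = 2000

# Sample a synthetic cluster: intrinsic spread convolved with measurement
# error (for Gaussians this just means adding the variances)
rv_cluster = rng.normal(rv_mean, np.hypot(rv_disp, rv_err), n_synth)
feh_cluster = rng.normal(feh_mean, np.hypot(feh_disp, feh_err), n_synth)

# Synthetic "field" sample drawn from much broader, made-up distributions
rv_field = rng.normal(0.0, 60.0, n_synth)
feh_field = rng.normal(-0.3, 0.4, n_synth)

X_train = np.vstack([
    np.column_stack([rv_cluster, feh_cluster]),
    np.column_stack([rv_field, feh_field]),
])
y_train = np.concatenate([np.ones(n_synth), np.zeros(n_synth)])

# Standardise the features so the SVM isn't dominated by the km/s scale
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X_train, y_train)

# X_survey would be the real (rv, [Fe/H]) measurements from the survey:
# membership_prob = clf.predict_proba(X_survey)[:, 1]

The field class here is just a broad Gaussian; in practice I imagine I'd draw the non-member class from the survey data itself rather than inventing it.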
cheers
Colin