GMM from mean and Stdev

Ryan

Sep 16, 2014, 5:10:28 PM
to accor...@googlegroups.com
Hello, 

I apologize for the simplicity of this question, but I am still trying to grasp the fundamentals of the math and the framework.

I currently have a mean and a corresponding standard deviation for each person in a list. From this data, I am trying to create a hierarchical GMM to find top, middle, and bottom performers.

I understand I can make an array of NormalDistribution objects using the mean and stdev.
From there, I create a Mixture<NormalDistribution> with the array.

However, at this point I am lost as to what to do next. The example code, as well as the constructors for the GaussianMixtureModel and GaussianCluster classes, all seem to assume you have the samples.

Any guidance would be much appreciated.

--Ryan

César

Sep 17, 2014, 4:59:33 AM
to accor...@googlegroups.com
Hi Ryan,

First and foremost, thanks for the interest in the framework! Could you please elaborate a bit more on what you are trying to do? What do you have, and what exactly is the goal?

For example, please tell me a bit more about what kind of data you have. Do you have a <mean, stdDev> pair for each of the persons? Let's say that you have 50 persons. Does this mean that you would have 50 different means and 50 different standard deviations? What do those means represent?

If I understood your data correctly, I might be able to give you a suggestion, but I am not sure my suggestion would have anything to do with a hierarchical Gaussian mixture model. I am not sure how to apply the hierarchical part in such a case. But in any case, here it goes:

If you just want to find the worst, median and best performers with the data you have, perhaps one option would be to do just as you said: create 50 NormalDistribution objects and store them in an array. Then pass this array to initialize a Mixture<NormalDistribution>. Finally, check the distribution's quartiles by using

DoubleRange quartiles = mixture.Quartiles;

The quartile interval will give you an indication of how to separate your dataset. People whose performance is between quartiles.Min and quartiles.Max could be considered average, as they would make up roughly 50% of your data. People whose performance is lower than quartiles.Min would be the worst, and people whose performance is higher than quartiles.Max would be the best.
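The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the means and standard deviations are placeholder values, the Classify helper and the QuartileSketch class are hypothetical names, the mixture is built without explicit weights, and a higher mean is taken to mean a better performance (for rank data, where lower is better, the labels would be swapped).

```csharp
using System;
using Accord.Statistics.Distributions.Univariate;

public static class QuartileSketch
{
    // Hypothetical helper: labels a mean relative to the interquartile
    // interval [lower, upper], assuming higher values are better.
    public static string Classify(double mean, double lower, double upper)
    {
        if (mean < lower) return "bottom";
        if (mean > upper) return "top";
        return "middle";
    }

    public static void Main()
    {
        // Placeholder (mean, stdDev) pairs, one per person.
        double[] means = { 4.0, 6.5, 9.0 };
        double[] stdDevs = { 1.0, 1.5, 0.8 };

        // One NormalDistribution per person.
        var components = new NormalDistribution[means.Length];
        for (int i = 0; i < means.Length; i++)
            components[i] = new NormalDistribution(means[i], stdDevs[i]);

        // Pool the per-person distributions into a single mixture
        // (no explicit mixing weights are passed).
        var mixture = new Mixture<NormalDistribution>(components);

        // Lower and upper quartiles of the pooled distribution.
        var quartiles = mixture.Quartiles;

        for (int i = 0; i < means.Length; i++)
            Console.WriteLine("Person {0}: {1}", i + 1,
                Classify(means[i], quartiles.Min, quartiles.Max));
    }
}
```

Comparing only each person's mean against the interval ignores that person's own spread; whether that is acceptable depends on how different the standard deviations are.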

However, for practical purposes it might be more interesting to consider a different percentile rather than the quartiles. If you wish, you can call the GetRange method, passing another value, such as 0.05, as its argument. Using 0.05 would return the interval in which 95% of the average person's performances would be. You could consider anything below or above this interval to be truly poor or exceptional, as events occurring this far from the center of the distribution are traditionally considered unlikely.
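A rough sketch of the GetRange idea, again with placeholder distributions and hypothetical names (RangeSketch, Coverage). Because it is easy to misjudge how much probability mass a given argument actually covers, the sketch measures the mass inside the returned interval from the mixture's cumulative distribution function instead of assuming it.

```csharp
using System;
using Accord.Statistics.Distributions.Univariate;

public static class RangeSketch
{
    // Hypothetical helper: probability mass between min and max,
    // given a cumulative distribution function.
    public static double Coverage(Func<double, double> cdf, double min, double max)
    {
        return cdf(max) - cdf(min);
    }

    public static void Main()
    {
        // Placeholder per-person distributions pooled into one mixture.
        var mixture = new Mixture<NormalDistribution>(
            new NormalDistribution(4.0, 1.0),
            new NormalDistribution(6.5, 1.5),
            new NormalDistribution(9.0, 0.8));

        // Interval for the chosen percentile argument.
        var range = mixture.GetRange(0.05);

        // Check how much of the pooled distribution the interval holds.
        double mass = Coverage(x => mixture.DistributionFunction(x),
                               range.Min, range.Max);

        Console.WriteLine("Range: [{0:F2}, {1:F2}], mass inside: {2:P1}",
            range.Min, range.Max, mass);
    }
}
```

Printing the covered mass makes it easy to try a few arguments and pick the one that gives the cut-off you actually want.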

Hope it helps! If I didn't understand your question properly, please let me know.

Best regards,
Cesar

Ryan

Sep 17, 2014, 9:26:33 AM
to accor...@googlegroups.com
Thank you Cesar for the reply.


My data is currently a set of rankings of sports players for the upcoming match. The rankings are determined from the consensus of a number of experts. So, as an example, my data would look like this:

Player   | Best Rank By Any Expert | Worst Rank By Any Expert | Average Rank | Std Dev
Player 1 | 1                       | 5                        | 1.7          | 0.8
Player 2 | 1                       | 18                       | 2.3          | 1.6
Player 3 | 1                       | 15                       | 3.0          | 1.3
.
.
.

The goal would be to figure out "how much better" one player is than another, that is, the distance from Player 1 to Player 2 based on this data.

For self-teaching purposes, I am trying to re-create an example of this that has already been done here:
http://www.nytimes.com/2013/10/11/sports/football/turning-advanced-statistics-into-fantasy-football-analysis.html
and 
http://www.borischen.co/
and an example of the data can be found here: http://www.fantasypros.com/nfl/rankings/rb-cheatsheets.php


Your suggestion to use the quartiles is a good one, and I believe it might achieve the same goal. I understand both approaches conceptually, but I cannot explain why the author of the NY Times article decided to use a GMM instead of quartiles.


Thank you again for the help. I have learned a great deal in a very short period of time with this project.

--Ryan




