clustering algorithms

83 views

Skip to first unread message

Antonio Piccolboni

unread,

Mar 26, 2013, 10:43:28 PM3/26/13

to rha...@googlegroups.com

Hi,

in the dev branch of rmr2 under pkg/examples I added an initial implementation of a single pass clustering algorithm based on mclust. The idea, inspired by similar ongoing work in the Mahout project, is to cluster data as early as possible in reasonable chunks and then merge the results. With a generative model like the one generated by mclust, one merger strategy is pretty obvious: simulate datasets from each of the partial clusters, take the union and cluster again. I found mclust hard to use though or downright incomplete. It searches for the best number of clusters using exhaustive search between 1 and 9. Specifying a different range doesn't seem to work very well, at least the plots are incomplete and anyway a complete algorithm should not rely on magic parameters. Imagine if a software engineer delivered a sorting algorithm that works up to length 9 or needs additional input. If anybody has experience with mclust or can suggest a clustering algorithm of wide applicability, efficient and model based that would be very helpful. Also if anybody is working on clustering large datasets and wants to take this algorithm for a spin (as soon as the nagging issues are fixed, right now I can break it by myself) I'd appreciate the feedback.

Antonio

Antonio Piccolboni

unread,

Mar 27, 2013, 6:32:13 PM3/27/13

to rha...@googlegroups.com

Well, apparently mclust and I don't get along very well, but I made some progress in two directions

1. When an algorithm fails, build a framework. Since frameworks can't run, they can't fail. So I wrote a generic mapreduce clustering function that takes a clustering function and cluster merging function as arguments.

2. Since the above is cheating, I instantiated the framework based on the clustering function clara, which, unlike mcclust, works. That made the mapreduce version work as well. Unfortunately, one needs to specify the number of clusters in advance and the compute time seems to rise quite drastically with the number of clusters.

The file is cluster.mr.R under examples in the branch dev for now. That alone should be a reminder that this code is highly experimental. Enjoy