Why are items in multiple clusters in the K-Means chapter for K=10?


Tomas Jansson

Aug 16, 2015, 3:10:16 PM
to Machine Learning Projects for DotNET Developers
In the section where you find the "optimal" number of clusters I find the end result a little bit confusing. I thought that "items" could only be in one cluster, but in the final result here we have

javascript in cluster 2, 3, 4
html in cluster 2, 4
jquery in cluster 3, 4
mysql in cluster 2, 9

Didn't the algorithm add each item to the closest centroid? How do items end up in multiple clusters?

Mathias Brandewinder

Aug 16, 2015, 5:15:41 PM
to Machine Learning Projects for DotNET Developers
Thanks for the question, Tomas!

First, you are correct - items can only belong to one cluster. I think the problem is that I probably wasn't clear enough on what an item is.

In our context, items are "StackOverflow users", so an individual user will belong only to one cluster of users who have a similar profile; in this case this means that a cluster groups users who tend to use the same tags a lot, and also tend to be inactive in the same set of tags.

Note that items are users, and not tags. Two very different groups of users could share a high level of activity in the same tag. A tag doesn't belong to one cluster: a tag might show unusually high usage in one or more clusters.

I made up a small, fictional example below, to illustrate how that could work. Suppose I just had 2 tags, SQL and mySQL. I am going to assume that the world contains 3 types of people: people who don't care about databases, and people who do, split between mySQL users and users of, say, SQL Server, with no overlap. It's a bit exaggerated, but hopefully not entirely unrealistic.

In this case, this is what I would see: 3 clusters of individuals.
- on the left/middle, you have people who don't care: both SQL and mySQL are low.
- on the right/top, you have the mySQL DBAs: their usage of mySQL is high, but they also use the tag SQL a lot.
- on the right/bottom, you have the non-mySQL DBAs: their usage of mySQL is low, but they also use the tag SQL a lot.

In that case, both clusters on the right share a high usage of the SQL tag, but they are very different on another dimension (mySQL); an individual user will be assigned to only one cluster, but multiple clusters can have a high usage in the same tag.
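To make the fictional example concrete, here is a minimal sketch in plain Python (not the F# code from the book). The user names, the activity values, and the 0.5 "high usage" threshold are all made up for illustration; the point is only to show that each user lands in exactly one cluster while the same tag can be high in several clusters.

```python
from math import dist

# Hypothetical users described by (SQL activity, mySQL activity).
# All values are invented to mirror the three groups described above.
users = {
    "u1": (0.1, 0.0), "u2": (0.0, 0.1), "u3": (0.1, 0.1),  # don't care about databases
    "u4": (0.9, 0.9), "u5": (1.0, 0.8), "u6": (0.8, 1.0),  # mySQL DBAs
    "u7": (0.9, 0.1), "u8": (1.0, 0.0), "u9": (0.8, 0.1),  # non-mySQL DBAs
}

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Basic k-means loop: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        assignment = {name: nearest(p, centroids) for name, p in points.items()}
        updated = []
        for i, old in enumerate(centroids):
            members = [points[n] for n, c in assignment.items() if c == i]
            if members:
                # New centroid is the per-dimension average of the cluster members.
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                # Keep an empty cluster's centroid where it was.
                updated.append(old)
        centroids = updated
    # Final assignment against the final centroids.
    assignment = {name: nearest(p, centroids) for name, p in points.items()}
    return assignment, centroids

# Arbitrary starting centroids: one user picked from each group.
start = [users["u1"], users["u4"], users["u7"]]
assignment, centroids = kmeans(users, start)

# A user maps to exactly one cluster index...
for name, cluster in sorted(assignment.items()):
    print(name, "-> cluster", cluster)

# ...but a tag can be "high" (here: centroid value above 0.5) in several clusters.
high_sql = [i for i, c in enumerate(centroids) if c[0] > 0.5]
high_mysql = [i for i, c in enumerate(centroids) if c[1] > 0.5]
print("clusters with high SQL usage:", high_sql)
print("clusters with high mySQL usage:", high_mysql)
```

With these made-up numbers, every user appears once in `assignment`, yet `high_sql` lists two clusters: the same situation as javascript showing up in clusters 2, 3 and 4 in the chapter's output.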


I hope this helps! Thanks again for the question,


Mathias

