Using d3 to visualize the k-means clustering algorithm

1,985 views
Skip to first unread message

Eric Scott Bullington

unread,
Feb 4, 2012, 8:00:31 PM2/4/12
to d3-js
I'd like to enlist the group's help with a visualization I'm creating
with the help of d3 of a machine learning algorithm called k-means
clustering. I'm in the process of creating visualizations for several
machine learning algorithms that I plan to publish on my blog, but
this particular visualization has not come out like I had hoped. So
I'm hoping to get some good advice here on how to improve the
animation and display of data. I'm pretty new to d3, and I've not
found much material published yet on best practices for animated
transitions, so I'm sure my visualization has a lot of room for
improvement.

The k-means clustering implementation I'm currently using is based on
a solid JS machine learning library by Heather Arthur, but I plan to
eventually switch it out for another implementation I'm writing that
is modeled after the k-means algorithm in the Python sci-kit learn
library.

You can see the visualization in action here:

http://jsfiddle.net/esbullington/Wyjgh/

A full-screen version is here:

http://jsfiddle.net/esbullington/Wyjgh/embedded/result/

My original commented coffeescript is here:

https://gist.github.com/1739860
(I originally tried to display the script with the wonderful
bl.ocks.org, but it chokes on my js, which is admittedly not very
optimized).

A decent explanation of k-means clustering can be found in wikipedia:

http://en.wikipedia.org/wiki/K-means_clustering

A brief description of my visualization before you begin (it's a bad
sign that I feel like I have to explain it first, I know): k-means is
all about grouping data, and you're looking at the algorithm as it
goes through multiple iterations, and as it grows closer to properly
clustering the data into groups. The k-means algorith first spits out
a (very complex) nested Javascript array, then d3 runs through and
displays the iterations (did this instead of dynamically displaying
the algorithm for reasons too tedious to go into here).

My primary question is: how to improve the visualization? This did
not come out like I had imagined it. The main problem is that the
clusters change only around the very edges, and it's very hard to
perceive that the cluster membership of certain nodes is changing as a
result of the new centroids calculated (the big red circles). I need
to find a way to highlight that these nodes are changing from
iteration to iteration.

Any suggestions on how to better visualize this, given the particular
way I have chosen to do so?

Some particular questions on the implementation:

1. As mentioned above, I would like to visually emphasize those nodes
that change cluster membership from one iteration to the next. But
how to do this? From what I understand, d3 does not cache data. Is
the only way for me to do this dynamically to put the setInterval()
timer in its own loop, and cache the needed data in the loop from one
iteration to the next. I suppose I could also create a new category
of data to track whether or not a node changes membership from the
previous iteration, but this will be my last resort.

2. Is the transition method the only method that allows the display
of nodes to be delayed? I'd like to delay the changing cluster
membership by about a half-second after the red centroids are
displayed, but I don't want to animate their transition. Any way I
can just delay that without a transition?

Thanks in advance for reading all this and I look forward to your
input.

Eric

Kai Chang

unread,
Feb 4, 2012, 8:04:25 PM2/4/12
to d3...@googlegroups.com
It's kind of tough to track the centroid. Maybe a line connecting each subsequent iteration, and the newest centroid as slightly larger or a different color?

Ian Johnson

unread,
Feb 5, 2012, 2:17:52 AM2/5/12
to d3...@googlegroups.com
First of all I'd like to say I really like the direction you are going with this, I think visualizing algorithms can make a huge difference for people trying to learn them, especially for complex and abstract topics like data mining!

Without looking at the code I have a couple suggestions you may try.

1) I would suggest you have an animation for each node that changes membership like so: if a circle changes from blue to green you have the circle grow and transition the color then shrink back down, maybe have a semi-transparent ring (of the original color) grow out of it and fade away. This will visually emphasize the change while giving the viewer time to process the change (an instant color flip is hard to follow) Also when looking at the visualization at a distance you should see how the centroids moving affects the groupings.

2) I think what you want to do here is good, and you should be able to accomplish it by "manually" keeping track of a delay. Instead of setInterval I'd recommend using d3.timer. Whenever a centroid movement happens, you save the current time in a variable, then when the difference between the saved time and the current time is greater than half a second (or whatever your threshold) you trigger your other events.

As to your question about keeping track of which nodes are changing, I would do it the d3 way, bind your data to each node along with it's current grouping. Each time your centroids are updated selectAll your nodes and have each one check against the other centroids to obtain its updated grouping. If the new grouping is different from the old grouping you can trigger the color change animation.

It's a little on the late side of a saturday night for me to get into the code, but I'll check it out soon. very cool, please keep everyone posted!
Ian 

On Sat, Feb 4, 2012 at 5:00 PM, Eric Scott Bullington <erictra...@gmail.com> wrote:



--
Ian Johnson

Eric Scott Bullington

unread,
Feb 5, 2012, 9:23:42 AM2/5/12
to d3-js
Thank you for your response, Kai. I think a line connecting each
iteration might work well. I'll try it out and see. I don't want
them to change colors since I want to be able to have a key that
indicates that the centroids are a certain color/shape to be easily
recognizable.

Eric Scott Bullington

unread,
Feb 5, 2012, 10:22:20 AM2/5/12
to d3-js
Thanks so much for the detailed feedback and encouragement, Ian. You
made some excellent suggestions:

> 1) I would suggest you have an animation for each node that changes
> membership like so: if a circle changes from blue to green you have the
> circle grow and transition the color then shrink back down, maybe have a
> semi-transparent ring (of the original color) grow out of it and fade away.
> This will visually emphasize the change while giving the viewer time to
> process the change (an instant color flip is hard to follow) Also when
> looking at the visualization at a distance you should see how the centroids
> moving affects the groupings.

This would be really cool. I'm not completely sure I know how to do
this in d3 yet. Do you know of any d3 examples that do something
similar?

What do you think about animating each cluster so that the cluster
members converge into the new centroid, thus representing the
averaging? Again, I'd need to think about how to pull this off in d3,
but do you think that would work?

>
> 2) I think what you want to do here is good, and you should be able to
> accomplish it by "manually" keeping track of a delay. Instead of
> setInterval I'd recommend using d3.timer. Whenever a centroid movement
> happens, you save the current time in a variable, then when the difference
> between the saved time and the current time is greater than half a second
> (or whatever your threshold) you trigger your other events.

I'm not sure that d3.timer can help me here, for reasons that will
become clear when you look at my code. Because I wanted to make a
modular visualization that will work with different k-means
implementations, I'm running the k-means algorithm, recording the
output on each iteration, and then creating a complex nested array. I
then pass that array to d3 for the visualization. So I need a way to
insert delays into my d3 script, and I don't think d3.timer can help
me with that (unless I'm missing something). I *could* modify the
visualization to make it dynamic, displaying the results as the
algorithm computes them, but I'd still need to use setInterval() or
setTimeout() to insert delays. But if I did that, your suggestion
would be a good way to synchronize the visualization with the
algorithm.

On retrospect, changing the visualization to run as the algorithm is
computed may be the best path to take. I'll think about it.

> As to your question about keeping track of which nodes are changing, I
> would do it the d3 way, bind your data to each node along with it's current
> grouping. Each time your centroids are updated selectAll your nodes and
> have each one check against the other centroids to obtain its updated
> grouping. If the new grouping is different from the old grouping you can
> trigger the color change animation.

This is exactly what I'd like to do, but I haven't yet figured out how
to do it in d3. Do you think you could please provide a little
(pseudo)code, or point me to an example? The data I'm binding to the
nodes includes the current grouping. But I don't know how to check
against the old grouping with each iteration the "d3" way. Is there
an easy way to do this with d3? Like I said, I could put the redraw()
function in a loop with setTimeout in it and cache the data passed in
accordingly. But I'm not sure how to do this with d3.

> It's a little on the late side of a saturday night for me to get into the
> code, but I'll check it out soon. very cool, please keep everyone posted!

I think once you see the code, you'll better understand the issues I'm
running up against. Given your input and my own thoughts, I am now re-
considering my decision to first run the algorithm, and then the
visualization. In light of what I'm trying to do, it may be easier to
run the visualization dynamically. So thanks again for your excellent
input!

Mike Bostock

unread,
Feb 5, 2012, 12:31:27 PM2/5/12
to d3...@googlegroups.com
You could overlay a Voronoi diagram to show the boundaries of each
cluster more distinctly:

http://www.youtube.com/watch?v=gSt4_kcZPxE

Mike

Eric Scott Bullington

unread,
Feb 5, 2012, 10:04:31 PM2/5/12
to d3-js
Thanks, Mike. That's given me several ideas.
Reply all
Reply to author
Forward
0 new messages