
Kmeans with custom distance calculation


Luis Ramada Pereira
Dec 18, 2015, 2:53:12 PM
MATLAB noob here... so bear with me please.

I need to cluster line segments instead of points, where the distance between two segments is the sum of the distances between their endpoints, and the centroid segment minimises the sum of the total distances between all start and all end points in a cluster.

Is it possible to overload the MATLAB functions that compute these? How would I go about it?

TIA
Luis

someone
Dec 18, 2015, 3:07:12 PM
"Luis Ramada Pereira" <m4...@iscte.pt> wrote in message <n51o6v$4ni$1...@newscl01ah.mathworks.com>...
> MATLAB noob here... so bear with me please.
>
> I need to cluster line segments instead of points, where the distance between two segments is the sum of the distances between their endpoints, and the centroid segment minimises the sum of the total distances between all start and all end points in a cluster.

WOW!!! That's a mouthful. Can you provide an example? How many line segments are you talking about? 1D, 2D, or 3D?
>
> Is it possible to overload the matlab functions that compute these?

Which functions?

> How would I go about it?

That would be a bad idea in general (especially if you are a "MATLAB noob"). You would be better off writing your own function (e.g. myplot) that, if need be, in turn calls the MATLAB function (e.g. plot). That way, if you still need to use plot for something else, you can.
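For instance, a minimal sketch of that kind of wrapper (myplot is just a made-up name; it assumes each segment is stored as a row [ax ay bx by]):

function h = myplot(segments, varargin)
% Hypothetical wrapper around the built-in PLOT: draws each 2D segment
% stored as a row [ax ay bx by], passing any extra arguments through to PLOT
% instead of shadowing or overloading it.
h = plot([segments(:,1) segments(:,3)]', [segments(:,2) segments(:,4)]', varargin{:});
end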
>
> TIA
> Luis

Luis Ramada Pereira
Dec 18, 2015, 8:19:14 PM
I don't understand why the number of line segments or dimensions matter. But we're talking 2D and hundreds of thousands of line segments.

K-means works by computing the Euclidean distance between points. What I want is to replace points with line segments, calculating the distance between two segments as the sum of the distances between their start points and between their end points. I.e., if a line segment is ((1,1),(2,2)) and the centroid is ((1,2),(2,3)), the distance to the centroid would be 2. The centroid would be the segment that starts at the average of all start points in the cluster and ends at the average of all end points of the cluster.
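For concreteness, a rough sketch of that distance and centroid (purely illustrative; it assumes each segment is stored as a row [ax ay bx by]):

% Each segment is a row [ax ay bx by]: start point (ax,ay), end point (bx,by).
segDist = @(s, t) norm(s(1:2) - t(1:2)) + norm(s(3:4) - t(3:4));

% Centroid of a cluster S (one segment per row): average of the start
% points and average of the end points.
segCentroid = @(S) mean(S, 1);

% Example from above: distance from ((1,1),(2,2)) to ((1,2),(2,3)) is 2.
d = segDist([1 1 2 2], [1 2 2 3])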

K-means is not that hard to implement, but with a large data set I'd rather use as much of the native functions as possible.

Thanks.
Luis

Peter Perkins
Dec 22, 2015, 11:16:53 AM
KMEANS (the function) _doesn't_ work in just that way. As described in
the doc, it works with several different distance measures. The question
sometimes comes up, "Why doesn't it work with more? Why can't it work
with this particular one?"

The answer is that the K-Means algorithm does two things that involve
distances:

1) compute the distances from each point to each centroid
2) compute cluster centroids as the point that minimizes the sum of
point-to-centroid distances within each cluster

"The point that minimizes the ... distances" is simple for some distance
measures, but very hard for others. For _squared_ Euclidean distance,
it's the component-wise arithmetic mean. But even for some distances as
apparently equally simple, e.g. _unsquared_ Euclidean distance, it can
be a very expensive computation.

It's not optimal to compute the centroids as the component-wise
arithmetic mean if you're not using squared Euclidean distance, because
that's not the point that minimizes the sum of distances. There are a
handful of other distances for which the appropriate centroid
calculation is simple. For example, for city-block distance, it's the
component-wise median. So KMEANS (the function) works with those distances.

It may be that the component-wise arithmetic mean _is_ the appropriate
centroid for your distance. Or pragmatically, it may be "good enough".
You can write a simple version of K-Means that does that in only a
couple dozen lines of code. Or you can use the KMEANS function as a
starting point.
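Something along these lines (an untested, illustrative sketch only; simple_kmeans is a made-up name, and it does not handle empty clusters):

function [idx, C] = simple_kmeans(X, k, maxIter)
% Minimal K-means sketch: squared Euclidean distance, mean-based centroids.
% X is n-by-p; k is the number of clusters.
if nargin < 3, maxIter = 100; end
n = size(X, 1);
C = X(randperm(n, k), :);                 % initialize centroids from k random points
idx = zeros(n, 1);
for iter = 1:maxIter
    % Assignment step: squared Euclidean distance to each centroid.
    D = zeros(n, k);
    for i = 1:k
        D(:, i) = sum(bsxfun(@minus, X, C(i, :)).^2, 2);
    end
    [~, newIdx] = min(D, [], 2);
    if isequal(newIdx, idx), break; end   % no reassignments -> converged
    idx = newIdx;
    % Update step: component-wise mean of each cluster's members.
    for i = 1:k
        C(i, :) = mean(X(idx == i, :), 1);
    end
end
end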

Hope this helps.

Luis Ramada Pereira
Dec 25, 2015, 7:48:11 AM
Thanks for all the answers.

I think the solution to my problem (how to cluster line segments) was staring me in the face.
If I cluster points in 4D space, where the coordinates of each point (a,b,x,y) are the coordinates of the endpoints (a,b) and (x,y) of a 2D line segment, I get what I need. After clustering I just need to transform the 4D points back to 2D line segments. I made a few tests and it seems to be working. Any mathematicians out there to fault this logic?
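A rough sketch of that idea (P1 and P2 are illustrative names for nSeg-by-2 matrices of start and end points):

% Stack each 2D segment (a,b)-(x,y) as one 4D point [a b x y].
X = [P1 P2];                      % nSeg-by-4

% Cluster in 4D with the stock KMEANS (squared Euclidean by default).
[idx, C] = kmeans(X, k);

% Map each centroid row back to a 2D segment.
centroidStart = C(:, 1:2);
centroidEnd   = C(:, 3:4);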

Luis

Luis Ramada Pereira
Jan 20, 2016, 1:27:09 AM
Well, I was wrong.

In the end I do have to change the distance calculation in the K-means function.

I have found two K-means functions in the MATLAB libraries, one in toolbox/stats/stats/, the other in toolbox/stats/eml. I have used the one in stats/stats. Which one should I use as a basis?

With p=4, changing the distance calculation from the standard sqeuclidean:

D(:,i) = (X(:,1) - C(i,1)).^2;
for j = 2:p
    D(:,i) = D(:,i) + (X(:,j) - C(i,j)).^2;
end

to

D(:,i) = sqrt((X(:,1) - C(i,1)).^2 + (X(:,2) - C(i,2)).^2) + sqrt((X(:,3) - C(i,3)).^2 + (X(:,4) - C(i,4)).^2);

(and doing the same in cluster membership reassignment code)

it's taking forever to compute... I have got a million points and it seems to be going for quite a while now... With the standard function it was slow, but it completed in a reasonable time.

Thanks for your help.
Luis

Luis Ramada Pereira
Jan 20, 2016, 3:10:08 AM
I researched a little bit, and for some reason it's not converging. The sum of distances is decreasing, but points keep on getting moved around in phase 2, up to 1000 iterations.

To be more precise, I've made 2 changes:

In the distfun function I changed:

D(:,i) = (X(:,1) - C(i,1)).^2;
for j = 2:p
    D(:,i) = D(:,i) + (X(:,j) - C(i,j)).^2;
end

to

D(:,i) = sqrt((X(:,1) - C(i,1)).^2 + (X(:,2) - C(i,2)).^2) + sqrt((X(:,3) - C(i,3)).^2 + (X(:,4) - C(i,4)).^2);

and at the beginning of phase 2 in calculating distances to each cluster from each point I changed:

Del(:,i) = (m(i) ./ (m(i) + sgn)) .* sum((bsxfun(@minus, X, C(i,:))).^2, 2);

to:

Del(:,i) = (m(i) ./ (m(i) + sgn)) .* (sqrt(sum((bsxfun(@minus, X(:,1:2), C(i,1:2))).^2,2))+sqrt(sum((bsxfun(@minus, X(:,3:4), C(i,3:4))).^2,2)));

My points have 4 dimensions, and if one point is (x1,y1,w1,z1) and another is (x2,y2,w2,z2), then the distance that I want to use is sqrt((x1-x2)^2+(y1-y2)^2) + sqrt((w1-w2)^2+(z1-z2)^2).

Would I need to change the K-means function somewhere else for this to work?

I would really appreciate the help of anyone familiar with the built-in k-means function, as I am stuck.

Luis Ramada Pereira
Jan 20, 2016, 2:47:11 PM
I think what I am doing wrong is how I update the centroid location as members move in and out of clusters.

Peter (poster above) was right. This may not be trivial.

So my problem now boils down to: how do you calculate the centroid of a cluster where distances are measured as non-squared Euclidean distances?

Luis
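(For reference: the point that minimizes a sum of unsquared Euclidean distances is the geometric median, which generally has to be found iteratively, e.g. with Weiszfeld's algorithm. Since the segment distance above is the sum of two independent 2D Euclidean terms, the minimizing segment is the geometric median of the start points paired with the geometric median of the end points. A rough, illustrative sketch only:)

function m = geometric_median(X, tol, maxIter)
% Weiszfeld iteration for the geometric median of the rows of X.
% Illustrative sketch; it does not handle the iterate landing exactly
% on a data point.
if nargin < 2, tol = 1e-8; end
if nargin < 3, maxIter = 200; end
m = mean(X, 1);                                   % start from the arithmetic mean
for iter = 1:maxIter
    d = sqrt(sum(bsxfun(@minus, X, m).^2, 2));    % distances to current estimate
    w = 1 ./ max(d, eps);                         % inverse-distance weights
    mNew = sum(bsxfun(@times, X, w), 1) / sum(w); % weighted average of the points
    if norm(mNew - m) < tol, break; end
    m = mNew;
end
end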

Peter Perkins
Jan 20, 2016, 9:19:50 PM

Luis Ramada Pereira
Jan 21, 2016, 9:01:11 PM
Thanks Peter... I see this will make running K-means an expensive proposition. I'll go with cityblock instead. The distance between two 2D line segments (defined as the sum of the city-block distances between their endpoints) is the same as the city-block distance between two 4D points, i.e. sum(abs(xn-yn))... so this will do, and it performs fast.
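A sketch of what that looks like (X being the nSeg-by-4 matrix of stacked endpoints as before; names are illustrative):

% City-block (L1) distance in 4D; KMEANS then uses the component-wise
% median as the centroid.
[idx, C] = kmeans(X, k, 'Distance', 'cityblock', 'Replicates', 5);

% Map centroid rows back to 2D segments.
centroidStart = C(:, 1:2);
centroidEnd   = C(:, 3:4);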

Luis