k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. Better Euclidean solutions can be found using, for example, k-medians and k-medoids.
The unsupervised k-means algorithm has a loose relationship to the k-nearest neighbor classifier, a popular supervised machine learning technique for classification that is often confused with k-means because of the name. Applying the 1-nearest neighbor classifier to the cluster centers obtained by k-means classifies new data into the existing clusters. This is known as the nearest centroid classifier or Rocchio algorithm.
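As a small illustration (a sketch with illustrative names, not any particular library's API), classifying new points then amounts to an argmin over distances to the stored centroids:

import numpy as np

def nearest_centroid_classify(X_new, centroids):
    """Assign each row of X_new to the index of its nearest centroid.

    This is the nearest centroid (Rocchio) rule applied to centers
    previously produced by k-means."""
    # Pairwise squared Euclidean distances, shape (n_points, k).
    dists = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(dists, axis=1)

# Example: two centroids in 2-D, three new points.
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
X_new = np.array([[0.5, -0.2], [4.8, 5.1], [3.0, 3.0]])
print(nearest_centroid_classify(X_new, centroids))  # -> [0 1 1]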
The most common algorithm uses an iterative refinement technique. Due to its ubiquity, it is often called "the k-means algorithm"; it is also referred to as Lloyd's algorithm, particularly in the computer science community. It is sometimes also referred to as "naïve k-means", because there exist much faster alternatives.[6]
The objective function in k-means is the WCSS (within-cluster sum of squares). After each iteration the WCSS decreases, so the iterations produce a nonnegative, monotonically decreasing sequence. This guarantees that k-means always converges, but not necessarily to the global optimum.
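Given observations partitioned into clusters S = {S_1, …, S_k} with means μ_i, the objective can be written as

    arg min_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ‖x − μ_i‖²,

i.e., the sum of squared distances of each point to the mean of its own cluster. The assignment step and the update step can each only lower (or preserve) this sum, which is why the sequence of WCSS values cannot increase.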
The algorithm is often presented as assigning objects to the nearest cluster by distance. Using a distance function other than (squared) Euclidean distance may prevent the algorithm from converging. Various modifications of k-means, such as spherical k-means and k-medoids, have been proposed to allow using other distance measures.
The sketch below outlines the standard k-means clustering algorithm. Initialization of centroids, the distance metric between points and centroids, and the calculation of new centroids are design choices and will vary between implementations; in this example, argmin is used to find the index of the minimum value.
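A minimal Python realization of this outline (an illustrative sketch rather than a reference implementation; Forgy initialization and squared Euclidean distance are assumed, with np.argmin playing the role of argmin):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Standard (Lloyd's) k-means on an (n, d) data matrix X."""
    rng = np.random.default_rng(seed)
    # Forgy initialization: k distinct observations serve as initial means.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # (argmin over squared Euclidean distances).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each centroid becomes the mean of its assigned
        # points; empty clusters keep their previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: WCSS can no longer decrease
        centroids = new_centroids
    return centroids, labels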
Commonly used initialization methods are Forgy and Random Partition.[10] The Forgy method randomly chooses k observations from the dataset and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's randomly assigned points. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set. According to Hamerly et al.,[10] the Random Partition method is generally preferable for algorithms such as the k-harmonic means and fuzzy k-means. For expectation maximization and standard k-means algorithms, the Forgy method of initialization is preferable. A comprehensive study by Celebi et al.,[11] however, found that popular initialization methods such as Forgy, Random Partition, and Maximin often perform poorly, whereas Bradley and Fayyad's approach[12] performs "consistently" in "the best group" and k-means++ performs "generally well".
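Both schemes can be sketched in a few lines (illustrative Python; the Random Partition variant assumes every cluster receives at least one point, which is very likely when n is much larger than k):

import numpy as np

def forgy_init(X, k, rng):
    # Forgy: use k randomly chosen observations as the initial means.
    return X[rng.choice(len(X), size=k, replace=False)]

def random_partition_init(X, k, rng):
    # Random Partition: randomly assign each point to a cluster, then
    # take the centroid of each random group as the initial mean.
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])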
The algorithm does not guarantee convergence to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually fast, it is common to run it multiple times with different starting conditions. However, worst-case performance can be slow: in particular, certain point sets, even in two dimensions, can take exponential time, 2^Ω(n), to converge.[13] These point sets do not seem to arise in practice; this is corroborated by the fact that the smoothed running time of k-means is polynomial.[14]
On data that does have a clustering structure, the number of iterations until convergence is often small, and results only improve slightly after the first dozen iterations. Lloyd's algorithm is therefore often considered to be of "linear" complexity in practice, although it is in the worst case superpolynomial when performed until convergence.[20]
Lloyd's algorithm is the standard approach for this problem. However, it spends a lot of processing time computing the distances between each of the k cluster centers and the n data points. Since points usually stay in the same clusters after a few iterations, much of this work is unnecessary, making the naïve implementation very inefficient. Some implementations use caching and the triangle inequality in order to create bounds and accelerate Lloyd's algorithm.[9][22][23][24][25]
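One such bound follows directly from the triangle inequality: a point whose distance to its own center is at most half that center's distance to every other center cannot be closer to any other center, so all of its distance computations can be skipped in that iteration. A sketch of this pruning test (the idea only, not any specific published implementation):

import numpy as np

def prune_stable_points(X, centroids, labels):
    """Boolean mask of points whose assignment provably cannot change.

    If d(x, c) <= 0.5 * min_{c' != c} d(c, c'), then for any other
    center c': d(x, c') >= d(c, c') - d(x, c) >= d(x, c)."""
    # Pairwise center-to-center distances.
    cc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(cc, np.inf)
    s = 0.5 * cc.min(axis=1)      # half-distance to the nearest other center
    d_own = np.linalg.norm(X - centroids[labels], axis=1)
    return d_own <= s[labels]     # True -> skip reassigning this point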
Finding the optimal number of clusters (k) for k-means clustering is a crucial step to ensure that the clustering results are meaningful and useful.[26] Several techniques are available to determine a suitable number of clusters, among them the elbow method, which plots the WCSS against k and looks for a bend in the curve, and silhouette analysis, which measures how well each point fits its assigned cluster relative to neighboring clusters; a sketch of the former follows below.
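As an illustration of the elbow method, one can run k-means for several values of k and inspect how the WCSS decreases (a minimal sketch reusing the kmeans function sketched earlier; the "elbow" is then read off the curve by eye):

import numpy as np

def wcss(X, centroids, labels):
    # Within-cluster sum of squares of a given clustering.
    return float(((X - centroids[labels]) ** 2).sum())

def elbow_curve(X, k_values, restarts=10):
    """Best WCSS over several random restarts, for each candidate k.

    Plotting k against these values and looking for a bend in the
    curve is the elbow method."""
    return [
        min(wcss(X, *kmeans(X, k, seed=s)) for s in range(restarts))
        for k in k_values
    ]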
Hartigan and Wong's method[9] provides a variation of the k-means algorithm which progresses towards a local minimum of the minimum sum-of-squares problem with different solution updates. The method is a local search that iteratively attempts to relocate a sample into a different cluster as long as this process improves the objective function. When no sample can be relocated into a different cluster with an improvement of the objective, the method stops (in a local minimum). As with classical k-means, the approach remains a heuristic since it does not guarantee that the final solution is globally optimal.
Different move acceptance strategies can be used. In a first-improvement strategy, any improving relocation can be applied, whereas in a best-improvement strategy, all possible relocations are iteratively tested and only the best is applied at each iteration. The former approach favors speed, whereas the latter generally favors solution quality at the expense of additional computational time. The function Δ used to calculate the result of a relocation can also be efficiently evaluated by using the equality[44]

    Δ(x, m, n) = |S_n| / (|S_n| + 1) · ‖a_n − x‖² − |S_m| / (|S_m| − 1) · ‖a_m − x‖²,

where a_m, a_n are the centroids and |S_m|, |S_n| the sizes of the source and destination clusters before moving the sample x; the relocation improves the objective exactly when Δ < 0.
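A minimal sketch of evaluating this quantity (assuming cluster sizes and centroids are maintained incrementally; the names are illustrative):

import numpy as np

def relocation_delta(x, src_centroid, src_size, dst_centroid, dst_size):
    """Change in WCSS caused by moving x from its source cluster to a
    destination cluster (sizes taken before the move; src_size > 1).
    Negative values mean the relocation improves the objective."""
    gain = dst_size / (dst_size + 1.0) * np.sum((dst_centroid - x) ** 2)
    loss = src_size / (src_size - 1.0) * np.sum((src_centroid - x) ** 2)
    return gain - loss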
A key limitation of k-means is its cluster model. The concept is based on spherical clusters that are separable so that the mean converges towards the cluster center. The clusters are expected to be of similar size, so that the assignment to the nearest cluster center is the correct assignment. When, for example, applying k-means with k = 3 to the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data set. With k = 2, the two visible clusters (one containing two species) will be discovered, whereas with k = 3 one of the two clusters will be split into two even parts. In fact, k = 2 is more appropriate for this data set, despite the data set containing 3 classes. As with any other clustering algorithm, the k-means result makes assumptions that the data satisfy certain criteria. It works well on some data sets and fails on others.
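This behavior can be checked with a short experiment (a sketch assuming scikit-learn is available; exact scores depend on the random seed):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Agreement with the 3 true species labels (1.0 = perfect match).
    print(k, adjusted_rand_score(y, labels))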
k-means clustering is rather easy to apply to even large data sets, particularly when using heuristics such as Lloyd's algorithm. It has been successfully used in market segmentation, computer vision, and astronomy, among many other domains. It is often used as a preprocessing step for other algorithms, for example to find a starting configuration.
Vector quantization, a technique commonly used in signal processing and computer graphics, can be realized with k-means clustering. A typical example is reducing the color palette of an image to a fixed number of colors k: k-means is applied to the pixels in the image's color space to partition them into k clusters, with each cluster centroid representing one color of the reduced palette. This is also useful in image segmentation tasks, where it helps identify and group similar colors together.
Example: In the field of computer graphics, k-means clustering is often employed for color quantization in image compression. By reducing the number of colors used to represent an image, file sizes can be reduced substantially with little loss of visual quality. For instance, consider an image with millions of colors. By applying k-means clustering with k set to a smaller number, the image can be represented using a more limited color palette, resulting in a compressed version that consumes less storage space and bandwidth. Other uses of vector quantization include non-random sampling, as k-means can easily be used to choose k different but prototypical objects from a large data set for further analysis.
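A compact sketch of color quantization along these lines, assuming scikit-learn and Pillow are available ("photo.png" is a placeholder file name):

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=np.float64)
pixels = img.reshape(-1, 3)                      # one row per pixel

k = 16                                           # target palette size
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)

# Replace every pixel with its cluster's centroid color.
quantized = km.cluster_centers_[km.labels_].reshape(img.shape)
Image.fromarray(quantized.astype(np.uint8)).save("photo_16_colors.png")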
Cluster analysis, a fundamental task in data mining and machine learning, involves grouping a set of data points into clusters based on their similarity. k-means clustering is a popular algorithm used for partitioning data into k clusters, where each cluster is represented by its centroid.
However, the pure k-means algorithm is not very flexible, and as such is of limited use (except when vector quantization as above is actually the desired use case). In particular, the parameter k is known to be hard to choose (as discussed above) when not given by external constraints. Another limitation is that it cannot be used with arbitrary distance functions or on non-numerical data. For these use cases, many other algorithms are superior.