Wednesday, 20 December 2017

What is K-Means in Clustering in Machine Learning?

We are going to explain the basic concept of k-means clustering and the k-means clustering algorithm.

What is k-Means Clustering
It is a clustering method that tends to partition your data into partitions called clusters. Let's say you have you have n data points, the k-means algorithms would assign each of the data point to the cluster with the nearest mean.

Note: the k in k-means represents the number of clusters, that is k clusters.

How it works
Suppose we have a set of observations {x1, x2,... xn} which consists in a set of N random variable x (x is a D d-dimensional real vector). The goal is to partition the data set  into some number K of clusters, where the value of K is known.
A cluster is a group of data points whose inter-point distances are minimal when compare with distance to points outside the cluster.
The first step is to find the mk, for k = 1,..., K, in which mk is the mean associated to the kth cluster.
We now assign each of the data points to clusters, such that the sum of squares of the distances of each data  point to its closest mean mk is  minimum.
Steps of the k-Means Algorithm
Step 1: For  each unit, x, in the input space, place it in the cluster whose current centroid it is nearest to.
Step 2: After all points have been assigned (x1,...xn), adjust the locations of the centroids of the k clusters
Step 3: Reassign all the points to their nearest centroid
Repeat steps 2 and 3 until convergence
Convergence occurs when the points not move between clusters and the centroids are stabilized

This algorithm would become clearer when we consider an example later in this article.
Let's now look at a little more details on how the clusters are assigned

Assignment of Clusters
 (based on "Pattern Recognition and Machine Learning" by Christopher M. Bishop)
For each data point xn, we would use a corresponding set of binary variables we denoted as rnk ={0,1} where k =1,...,K.
rnk describes which of the K clusters the data point xn is assigned to. If rnk is assigned to k, then rnk = 1 and rnj = 0 for j not equal to k.

Note the the two superscript in   indicates that there would be two loops:
Loop 1(1-N): Iterate through all the data points
Loop 2(1-K): Iterate through all the cluster

Now, for each iteration, you need to calculate a value for J which is called the distortion measure and is given by the sum of squares function:

Lets try to understand this formula
First note that this formula is sometimes called distortion measure since it represent how much each data point say xk is separated from the mean mk of that cluster.
In other words, J represents the sum of squares of the distance of each data point to its assigned vector mk.
  • N is to total number of data points,
  • K is the number of clusters
  • xn is the vector of measurement n
  • mk is the mean for cluster k
  • rnk is an indicator variable that indicates whether to assign xn to k
We need to determine the value of {rnk} and mk that gives the least value of J.
This is achieved by the use of k-Means Algorithm

More Explanation of the k=Means Algorithm
This algorithm is an iterative procedure involving steps:
Input set of points x1,...xn
Choose a value for K
Place m1,..., mk(let's call this centroid) are random locations within your data space
Repeat the following steps until convergence
for each point x
  • find the nearest centroid, cj (compute the distance  between xi, compute the distance between xi and mj for every centroid mj, and pick the cluster with the minimum distance)
  • find the nearest centroid
  • assign this point xj to the cluster of the nearest centroid
For each cluster  j = 1,..., K, and recompute the centroid position
  • Take all the data points that fall into this cluster
  • new centroid cj = mean of all points x
  • assign to cluster j in the previous step
Stop when none of the cluster assignments change

 Example of k-Means Clustering
The example below is based on
K = 2 and N = 14
Take some time to examine the progression through the steps and make sure you understand how it works