K-means Clustering Algorithm

Clustering is the grouping together of observations that share common features. The distinction between classification and clustering is that in classification we have a response variable \(y\): we already know the class to which each observation in our training set \(X\) belongs. We then extract information from the features/predictors \(X\) to understand why the observations are classified as they are in \(y\), and from this learned model we predict the class of a new observation \(X_{new}\).

For clustering, we do not have the response variable \(y\). We only have data that appear to show some grouping structure. We then try to learn this structure from the data, so that we can assign a new observation to a group.

Suppose we have the data as shown in the image below

From the image on the left, we can see that there is some grouping in the data. To the human eye, there are clearly 2 clusters. Given a new observation, how do we determine which cluster to put it in? And how can we tell the computer to do the same?

We were able to claim the presence of two clusters because of how the points are spaced: points close together are grouped together. Naturally, we pick two points that we consider to be the centers of the clusters, and each point is then clustered according to its distance to these centers. A point \((x,y)\) is assigned to cluster 1 if its distance to the center of cluster 1 is smaller than its distance to the center of cluster 2. This is shown in the graph below:

The question remains how to obtain the optimal centers. Note that we assign each point to its nearest center, so under the optimal clustering each of these distances is as small as possible, and their sum is minimal.

Any centers other than the optimal ones will give larger distances, and hence a larger sum.
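As a tiny illustration of this assignment rule (the two centers and the new point below are made up purely for demonstration):

# illustrative only: two made-up centers and one made-up new point
c1 <- c(1, 1); c2 <- c(5, 5)
p  <- c(2, 1.5)
d1 <- sum((p - c1)^2)    # squared distance to center 1: 1.25
d2 <- sum((p - c2)^2)    # squared distance to center 2: 21.25
which.min(c(d1, d2))     # the point is assigned to cluster 1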

Thus, to obtain the optimal grouping into \(k\) clusters, we start from any \(k\) random centers and compute the distances. Each point is assigned to its closest center, and we then compute the mean of the points in each cluster. These means become the new centers. We recompute the distances, re-assign the points to the nearest centers, and compute the means again. The process is repeated until the means no longer change, at which point the grouping has converged.

Let's take a look at an example using the faithful dataset, assuming 2 clusters.

First, we write a function that computes the squared distances from one point to all rows of the data, making use of R's vectorization.

Example 1

distance <- function(a, b){
  # squared Euclidean distance from point `a` to every row of `b`
  if(is.null(dim(b))) b <- t(b)   # a single point passed as a vector becomes a 1-row matrix
  colSums((t(b) - c(a))^2)
}
set.seed(1)
# Scale your data
data <- scale(faithful)

# Pick two  points randomly to form the two centroids:
centers <- data[sample(nrow(data), 2),]
i <- 0
repeat{
  i<- i + 1
  #compute the distances between the points and the centers:
  dist <- apply(centers, 1, distance, data)
  # cluster the points and obtain new means
  grp <- max.col(-dist)
  new_centers <- apply(data, 2, tapply, grp, mean)
  
  # Check if the current centers are the same as the previous centers;
  # if so, stop iterating
  if(sum(abs(new_centers - centers)) < 1e-4) break
  
  #set the previous centers to be current centers and repeat
  centers <- new_centers
}

library(ggplot2)   # for the plots below

data |>
  cbind(data.frame(cluster = factor(grp))) |>
  ggplot(aes(eruptions, waiting, col = cluster)) + 
  geom_point() + 
  coord_fixed() + 
  geom_point(data = data.frame(new_centers), size = 5, color = 'blue')   # final centers
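As an optional sanity check (not part of the algorithm itself), we can compare the partition found above with R's built-in kmeans(), started from the centers we converged to; the labels should then agree:

# optional check: kmeans() started from our converged centers
fit <- kmeans(data, centers = new_centers)
table(ours = grp, kmeans = fit$cluster)   # points should fall on the diagonal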

We can trace the clustering:

So far we have used intuition to guide the clustering of points. Can we give a mathematical formulation?

We note that this is an optimization problem, with the loss function being the total within-cluster sum of squares:

\[ \mathcal{L} = \sum_{k=1}^K\sum_{i=1}^n (x_i - \bar x_k)^2\Large{1}_{x_i\in k} \tag{1}\]
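A direct way to evaluate this loss in R is sketched below, assuming the scaled data and the cluster labels grp from Example 1 are still in the workspace (tot_withinss is just an illustrative name):

# minimal sketch of Equation 1: total within-cluster sum of squares
tot_withinss <- function(X, grp){
  centers <- apply(X, 2, tapply, grp, mean)    # cluster means (one row per cluster)
  sum((X - centers[grp, , drop = FALSE])^2)    # squared deviation of each point from its own mean
}
tot_withinss(data, grp)    # for the same partition this matches kmeans()$tot.withinss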

The explanation below has been adapted from Cross Validated.

Given an \(n\) by \(p\) matrix \(X\), the algorithm seeks to group its \(n\) observations, thought of as \(p\)-vectors, into a specified number of groups, \(k\). This can be represented by an \(n\) by \(k\) matrix \(Z\) having entries in \(\{0,1\}\) and one column for each of the \(k\) groups. Column \(j\) indicates which vectors in \(X\) belong to group \(j\); that is, \(z_{ij} = 1\) if and only if observation \(i\) of \(X\) is assigned to group \(j\).

Let \(1_k\) be the column vector of \(k\) 1’s and \(1_n\) the column vector of \(n\) 1’s. \(Z\) is constrained to satisfy \(Z\ 1_k = 1_n\) reflecting the assignment of each observation of \(X\) to exactly one group.

The \(k\) by \(p\) matrix whose rows are the group centroids can be constructed as

\[{\Large{\mu}} = (Z^{\top}\ Z)^{-1} Z^\top\ X \tag{2}\]
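We can verify Equation 2 numerically with the labels from Example 1 (Z and mu below are illustrative names, and data is the scaled faithful data):

# illustrative check of Equation 2
Z  <- diag(2)[grp, ]                           # n x k membership matrix
mu <- solve(crossprod(Z)) %*% t(Z) %*% data    # (Z'Z)^{-1} Z'X
mu                                             # same values as the per-cluster column means
apply(data, 2, tapply, grp, mean)              # ...computed directly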

The differences between the rows of \(X\) and their associated centroids \(Z{\Large\mu}\) are

\[W = X - Z{\Large\mu} \tag{3}\]

also an \(n\) by \(p\) matrix, whence the objective function can be expressed as the number

\[\mathcal{L} = tr(W^{\top}\ W) \tag{4}\]

(which is the sum of squares of the entries of \(W\)).

Note that in this case, \(tr(W^{\top} W)\) is equivalent to \({\large{1}}_n^\top W^{\odot2}{\large{1}}_p\), where \(\odot\) denotes the element-wise (Hadamard) power. This is the loss function that needs to be minimized. We do not have to compute it explicitly, since we can simply stop once no point is reassigned to a different cluster.

The squared distance from each point to each of the centroids can be computed as:

\[ \begin{aligned} d_{i,k} &= (x_i - \mu_k)^\top(x_i - \mu_k)\\ &= x_i^\top x_i - 2x_i^\top\mu_k + \mu_k^\top\mu_k \end{aligned} \tag{5}\]
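A quick numerical check of this expansion, with arbitrary vectors u and v standing in for \(x_i\) and \(\mu_k\):

# illustrative check of Equation 5
u <- c(1, 2); v <- c(0.5, -1)
c(sum((u - v)^2), sum(u * u) - 2 * sum(u * v) + sum(v * v))   # both equal 9.25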

We can write this in matrix notation, for all points and centroids at once, as:

\[ D = X^{\odot2}{\large1}_p{\large1}^\top_k - 2X{\Large\mu}^\top + {\large 1}_n{\large1}_p^\top\left({\Large\mu}^{\odot2}\right)^\top \tag{6}\]
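To make the notation concrete, here is an illustrative check of Equation 6 against the distance() helper from Example 1 (again assuming data and grp are in the workspace; mu, Jp, Jk2 and Jn2 are temporary names):

# illustrative check of Equation 6
mu  <- apply(data, 2, tapply, grp, mean)        # k x p matrix of centroids
Jp  <- rep(1, ncol(data))
Jk2 <- rep(1, nrow(mu))
Jn2 <- rep(1, nrow(data))
D6  <- (data^2 %*% Jp) %*% t(Jk2) -             # squared norms of the points
       2 * data %*% t(mu) +                     # cross terms
       Jn2 %*% t(mu^2 %*% Jp)                   # squared norms of the centroids
max(abs(D6 - apply(mu, 1, distance, data)))     # effectively zero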

Since \({\Large\mu}\) is a function of \(X\) and \(Z\), we can substitute its formula. Defining \(K = XX^\top\) and \(M = (Z^\top Z)^{-1}\), we get:

\[ D = \textrm{diag}(K){\large1}^\top_k - 2KZM + {\large{1}}_n\,\textrm{diag}(Z^\top KZMM)^\top \tag{7}\]

Code to illustrate this:

Example 2

X <- scale(faithful)
k <- 2
n <- nrow(X)
Jk <-rep(1,k)
Jn <- rep(1,n)
K <- tcrossprod(X)
grp <- sample(k, n, TRUE) #start with random groupings

repeat{
  Z <- diag(k)[grp,]               # n x k cluster membership matrix
  M <- solve(crossprod(Z))         # (Z'Z)^{-1}, diagonal of inverse cluster sizes
  A <- K%*%Z%*%M                   # K Z M, used twice in Equation 7
  D <- diag(K)%*%t(Jk) - 2*A + Jn%*%t(diag(t(Z)%*%A%*%M))  # Equation 7
  new_grp <- max.col(-D)           # assign each point to its nearest centroid
  if(mean(new_grp == grp) == 1) break   # stop when no point changes cluster
  grp <- new_grp
}
plot(X, col = new_grp)

Writing the distances in matrix notation makes it easy to use different functions of \(X\).

We just change \(K\) and still use the same formula.

Example 3:

Assume we have data like that shown below:

n <- 50
k <- 2
X <- cbind(runif(n,-4,4), sample(c(-1,1), n, TRUE))
ggplot(data.frame(X, col = factor(kmeans(X,k)$cluster)))+
  geom_point(aes(X1,X2, col=col))

The native K-means is unable to cluster the data into the two obvious groups (the two horizontal lines).

It is not obvious how to separate this data. Suppose we transform the data above to 3D, i.e. add one dimension. We get the following:

r <- runif(nrow(X))
X1 <- cbind(r, X)
#initial design with added dimension:
scatterplot3d::scatterplot3d(X1, pch=16)

We can clearly see that the data points lie on two planes, so they should form two distinct clusters. However, if the centroids were at the centers of the planes, the distance from a centroid to a point directly below/above it (on the other plane) can be smaller than the distance to a far-away point on its own plane. This would make points from different planes end up grouped together. To avoid this, we can stretch the vertical separation by a factor equal to the maximum of the horizontal spread, i.e. 4. This ensures that points on different planes are farther from each other than points on the same plane. We thus get the plot below:

X2 <- cbind(r, X[,1], 4*X[,2]) # [r, x1, 4*x2]
scatterplot3d::scatterplot3d(X2, pch=16, color=kmeans(X2, 2)$cluster)

Notice how, on the transformed data, we were able to separate the points into their respective clusters. We still used the native k-means algorithm, only after first transforming the points. This notion of transformation is what allows us to use kernels, and thereby separate data that at first seems inseparable. Think of a kernel as a function that maps two inputs (scalars or vectors) to a single value, via an inner product of their transformed versions. For instance, in the example above, the kernel can be thought of as:

\[ \begin{aligned} K(x,y) &= (Ax)^\top(Ay) = [x_1,\ 4x_2]\,[y_1,\ 4y_2]^\top\\ &= x_1y_1 + 16\,x_2y_2\\ &= x^\top A^\top A\, y = x^\top B\, y \qquad\text{where}\quad A = \begin{pmatrix}1&0\\0&4\end{pmatrix}\text{ and } B=\begin{pmatrix}1&0\\0&16\end{pmatrix}\\[1ex] \implies K(X) &= XBX^\top \end{aligned} \tag{8}\]
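A quick numerical check of Equation 8 (purely illustrative): the kernel matrix \(XBX^\top\) coincides with the plain inner-product kernel of the transformed points used above.

# illustrative check of Equation 8 on the Example 3 data
A  <- diag(c(1, 4))          # the transformation x -> Ax
B  <- crossprod(A)           # A'A = diag(1, 16)
K1 <- X %*% B %*% t(X)       # kernel computed directly from B
K2 <- tcrossprod(X %*% A)    # inner products of the transformed points
max(abs(K1 - K2))            # zero, up to floating point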

All kernels are inner products of transformed values/vectors. In the example above, the transformation was \(f(x) = Ax\). By transforming \(X\) with a suitable function we can separate the data. Let \(\phi(x)\) be the transformation function; then

\[ \begin{aligned} K(x_i,x_j) &= \left\langle \phi(x_i), \phi(x_j)\right\rangle,\\ \text{thus, for the whole dataset:}\quad K &= \phi(X)\phi(X)^\top \end{aligned} \tag{9}\]

Since the kernel is an inner product, we do not need to know the explicit form of the transformation. We just need the kernel values, and with these we can cluster the data.

Back to the example: using the kernel \(K = XBX^\top\) in place of \(XX^\top\) and proceeding exactly as before, we get the following:

Example 3 (continued)

k <- 2
n <- nrow(X)
Jk <- rep(1,k)
Jn <- rep(1,n)
B <- matrix(c(1,0,0,16), 2)
K <- X %*% B%*% t(X)
grp <- sample(k, n, TRUE) #start with random groupings

repeat{
  Z <- diag(k)[grp,]
  M <- solve(crossprod(Z))
  A <- K%*%Z%*%M
  D <- diag(K)%*%t(Jk) - 2*A+ Jn%*%t(diag(t(Z)%*%A%*%M))
  new_grp <- max.col(-D)
  if(mean(new_grp == grp) == 1)break
  grp <- new_grp
}
ggplot(data.frame(X, col = factor(new_grp)))+
  geom_point(aes(X1,X2, col=col))

Note that kernels describe the relationship between one point and another; thus they are also referred to as correlation functions. Some common kernels include:

  • Gaussian \(\exp\left(-(x_i-x_j)^2\right)\)

  • Matérn,

  • power-exponential family \(\exp\left(-\big|x_i-x_j\big|^p\right)\). If \(p=2\), we get the Gaussian kernel (a small helper for computing it as a matrix is sketched after this list).
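For instance, a Gaussian kernel matrix can be built from pairwise distances (illustrative only; the bandwidth sigma is an extra parameter not used elsewhere in this post, and sigma = 1 reproduces the \(\exp\left(-(x_i-x_j)^2\right)\) form above):

# illustrative helper: Gaussian kernel matrix with an optional bandwidth sigma
gaussian_kernel <- function(X, sigma = 1){
  exp(-as.matrix(dist(X))^2 / sigma^2)   # n x n symmetric matrix, diagonal exactly 1
}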

The Gaussian kernel is used most often. We could write a function that does kernelized k-means:

kernelized_kmeans <- function(K, k){
  n <- nrow(K)
  Jk <- rep(1, k)
  Jn <- rep(1, n)
  grp <- sample(k, n, TRUE) #start with random groupings
  
  repeat{
    Z <- diag(k)[grp,]
    M <- solve(crossprod(Z))
    A <- K%*%Z%*%M
    D <- diag(K)%*%t(Jk) - 2*A + Jn%*%t(diag(t(Z)%*%A%*%M))
    new_grp <- max.col(-D)
    if(mean(new_grp == grp) == 1)break
    grp <- new_grp
  }
  new_grp
}

All we have to do is pass in the kernel and the number of clusters.
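For instance (an illustrative check), passing the plain linear kernel \(XX^\top\) of the scaled faithful data should recover essentially the same two clusters as Example 2, up to label switching (Xf and grp_lin are just temporary names):

# linear kernel reproduces native K-means on the scaled faithful data
Xf      <- scale(faithful)
grp_lin <- kernelized_kmeans(tcrossprod(Xf), 2)
table(grp_lin, kmeans(Xf, 2)$cluster)    # the two partitions should agree up to label switching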

Example 4.

Using the Gaussian kernel, cluster the data below:

library(ggplot2)
n <- 500
theta <- runif(n, 0,2*pi)
r2 <- runif(n, 3,4)
r1 <- runif(n, 0,1)
x1 <- r1*cos(theta)
y1 <- r1*sin(theta)
x2 <- r2*cos(theta)
y2 <- r2*sin(theta)
X <- cbind(x = c(x1,x2),y = c(y1,y2))
d <- data.frame(cbind(X, X), 
      col = c(factor(numeric(n*2)), factor(kmeans(X, 2)$cluster)),
      ty =c(' Unclustered', 'Clustered using Kmeans')[gl(2,1000)])

ggplot(d, aes(x, y, col=col))+
  geom_point() +coord_fixed()+
  scale_color_manual(values=c("grey40", scales::hue_pal()(2)))+
  facet_wrap(~ty)+
  theme(legend.position = "none")

Results using kernelized k-means

# build the Gaussian kernel matrix; squaring as.matrix(dist(X)) keeps the diagonal at exp(0) = 1
grp <- kernelized_kmeans(exp(-as.matrix(dist(X))^2), 2)
ggplot(data.frame(X, cluster = factor(grp))) + 
  geom_point(aes(x, y, col = cluster)) +coord_fixed()

Summary:

To perform the native K-means, we use the linear kernel \(K = XX^\top\), as shown in Example 2.

Exercises:
