Clustering (Summary)
1. What is Clustering?
- Unsupervised learning
- Requires only data, no labels
- Detect patterns
- Useful when you don't know what you're looking for
- Group together "similar" instances
2. Partition algorithms
- K-means
- Mixture of Gaussian
- Spectral clustering
3. Hierarchical algorithms
- Bottom up : Agglomerative
- Top-down : Divisive
4. K-means Clustering
- Objective : $$ \min_{\mu} \min_C \sum_{i=1}^k \sum_{x\in C_i} \|x - \mu_i\|^2$$
- Initialize : Choose K points as cluster centers
- Alternate : (EM algorithm)
- Assignment step (Expectation) : Assign each data point to the closest cluster center $$\text{Fix } \mu,\; \text{optimize } C: \quad \min_C \sum_{i=1}^k \sum_{x\in C_i} \|x - \mu_i\|^2 = \min_C \sum_{n=1}^N \|x_n - \mu_{c(x_n)}\|^2$$ where $c(x_n)$ is the cluster currently assigned to $x_n$
- Refitting step (Maximization) : Change each cluster center to the average of its assigned points $$\text{Fix } C,\; \text{optimize } \mu: \quad \min_{\mu} \sum_{i=1}^k \sum_{x\in C_i} \|x - \mu_i\|^2 \;\Rightarrow\; \mu_i = \frac{1}{|C_i|}\sum_{x\in C_i} x$$
- Stop : when no point assignments change (see the sketch below)
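A minimal NumPy sketch of the loop above, assuming `X` is a float array of shape (N, D) and `k` is chosen in advance (function name and defaults are illustrative, not from the original notes):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: alternate the assignment (E) and refitting (M) steps."""
    rng = np.random.default_rng(seed)
    # Initialize: choose K data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = None
    for _ in range(n_iters):
        # Assignment step: assign each point to its closest center (O(KN) distances)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, k)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # Stop: no point assignment changed
        assign = new_assign
        # Refitting step: move each center to the average of its assigned points (O(N))
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign
```

Usage would look like `centers, labels = kmeans(X, k=3)`.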
5. K-means Clustering (Properties)
- Guaranteed to converge in a finite number of iterations (K-means is an alternating optimization; each step is guaranteed to decrease the objective)
- Running Time :
- Assign data points to closest cluster center : O(KN)
- Change each cluster center to the average of its assigned points : O(N)
- K-means is a heuristic : requires an initial K and initial means
- Local minima
- There is nothing to prevent K-means from getting stuck at local minima
- We could try many random starting points
- We could try Non-local split-and-merge moves
- Simultaneously merge two nearby clusters
- Split a big cluster
- Soft K-means
- Instead of making hard assignments, make soft assignments
- Allows a cluster to use more information about the data in the Maximization step (see the sketch below)
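A possible sketch of one soft K-means update, assuming a stiffness parameter `beta` (not in the original notes) that controls how soft the assignments are:

```python
import numpy as np

def soft_kmeans_step(X, centers, beta=1.0):
    """One soft K-means iteration: soft assignments, then responsibility-weighted refitting."""
    # Soft assignment: responsibility of center j for point n ~ exp(-beta * squared distance)
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, k)
    resp = np.exp(-beta * dists)
    resp /= resp.sum(axis=1, keepdims=True)                           # each row sums to 1
    # Refitting: every center uses all points, weighted by its responsibilities
    new_centers = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return new_centers, resp
```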
6. Generative view of clustering
- It is hard to judge whether a clustering result is good
- By turning the clustering result into a generative model, the result can be evaluated
7. Mixture of Gaussian
- Generative model
- Find the Gaussians that best describe the distribution of the data
- Then new data similar to the original data can be generated from them
- EM Algorithm
- Expectation : compute each data point's proportion (fraction, i.e. responsibility) for each Gaussian
- Maximization : update each Gaussian's mean and variance (standard deviation); see the sketch below
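A hedged sketch using scikit-learn's `GaussianMixture` (assuming scikit-learn is available; the toy data is illustrative): EM fits the means, covariances, and mixing proportions, and `sample` draws new data similar to the original, illustrating the generative view.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data from two blobs (stand-in for the original data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.5, size=(200, 2))])

# EM: the E-step computes each point's responsibility for each Gaussian,
# the M-step updates each Gaussian's mean, covariance, and mixing proportion
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

resp = gmm.predict_proba(X)     # per-point responsibilities (the E-step quantities)
X_new, comp = gmm.sample(100)   # generate new data similar to the original
print(gmm.means_, gmm.weights_)
```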
8. Spectral Clustering
- Treat the data as a graph and cut the weak edges
- A Gaussian kernel is typically used as the similarity $$W(i,j) = \exp\left(\frac{-\|x_i-x_j\|^2}{\sigma^2}\right)$$
- Fully connected graph or K-nearest neighbor graph
- Procedure
- Compute the similarity (affinity) matrix $W$ with entries $W_{i,j}$ (e.g. the Gaussian kernel above)
- Compute degree matrix : $D(i,i) = \sum_j W(i,j) $
- Compute Laplacian matrix (graph Laplacian) : $L = D-W$
- Compute the eigenvectors of L
- Cluster the data using the eigenvectors of L that correspond to the "small" eigenvalues (the connections whose eigenvalues are small); see the sketch below
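A minimal NumPy/scikit-learn sketch that follows the procedure above literally (fully connected graph, unnormalized Laplacian); `sigma` and the use of scikit-learn's `KMeans` for the final step are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Unnormalized spectral clustering: W -> D -> L = D - W -> small-eigenvalue embedding -> K-means."""
    # Similarity (affinity) matrix: Gaussian kernel on a fully connected graph
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / sigma ** 2)
    # Degree matrix and graph Laplacian
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Eigenvectors of L; the k eigenvectors with the smallest eigenvalues form the embedding
    eigvals, eigvecs = np.linalg.eigh(L)          # eigh returns eigenvalues in ascending order
    embed = eigvecs[:, :k]                        # (N, k): row n is the embedding of point n
    # Cluster the embedded points (here with scikit-learn's K-means)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embed)
```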
9. Cases that cannot be clustered properly
10. Hierarchical Clustering
- Agglomerative (bottom up)
- Start with the points as individual clusters
- At each step, merge the closest pair of clusters until only one cluster (or K clusters) is left
- Divisive (top down)
- Start with one, all-inclusive cluster
- At each step, split a cluster until each cluster contains a single point (or there are K clusters)
- Pros
- Any desired number of clusters can be obtained (using the dendrogram) (vs. K-means)
- May correspond to meaningful taxonomies
- Complexity
- At least quadratic in the number of data points (building the distance matrix)
- Not usable for large databases (con)
11. Agglomerative Clustering
- Merge similar(close) clusters
- Incrementally build larger clusters out of smaller clusters
- Dendrogram : the family of clusters can be represented
- Using the dendrogram, a good K can be found by looking for the longest duration over which no cluster merge occurs
- Define "close" cluster
- Closest Pair (single-link clustering) $$D_{sl}(C_i,C_j) = \min_{x,y}\{d(x,y)|x\in C_i, y\in C_j \} $$ - Sensitive to noise and outliers
- Farthest Pair (complete-link clustering) $$D_{cl}(C_i,C_j) = \max_{x,y}\{d(x,y)|x\in C_i, y\in C_j \} $$ - Tends to break large clusters (and tends to make all clusters roughly the same size)
- Average of all pairs $$D_{avg}(C_i,C_j) = \frac{1}{|C_i|\times |C_j|} \sum_{x\in C_i, y\in C_j} d(x,y) $$ - Tends to produce globular (sphere-shaped) clusters; see the SciPy sketch below
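A short SciPy sketch of agglomerative clustering with the linkage choices above (the toy data and K=2 are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])  # toy data

# 'single' = closest pair, 'complete' = farthest pair, 'average' = average of all pairs
Z = linkage(X, method="average")

# Dendrogram: the whole family of nested clusters; a long gap with no merges suggests a good K
dendrogram(Z)
plt.show()

# Cut the tree into K clusters
labels = fcluster(Z, t=2, criterion="maxclust")
```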
12. Divisive hierarchical clustering
- Monothetic divisive : split clusters using one variable/dimension at a time
- Polythetic divisive : splits on the basis of all variables together
- Computationally intensive, less widely used than agglomerative methods
- Check the sum of squared errors of each cluster and choose the one with the largest value to split next (see the sketch below)
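A rough polythetic-divisive sketch in the bisecting-K-means style, assuming scikit-learn's `KMeans` as the splitter (the function name and structure are illustrative, not from the original notes):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, k):
    """Top-down clustering: repeatedly split the cluster with the largest SSE until K clusters remain."""
    clusters = [np.arange(len(X))]                 # start with one all-inclusive cluster
    while len(clusters) < k:
        # Sum of squared errors of each cluster around its own mean
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))    # choose the cluster with the largest SSE
        # Polythetic split: 2-means on all variables of that cluster
        half = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters += [idx[half == 0], idx[half == 1]]
    return clusters
```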