Clustering (Summary)
1. What is Clustering?
- Unsupervised learning
- Requires only data, no labels
- Detect patterns
- Useful when you don't know what you're looking for
- Group together "similar" instances
2. Partition algorithms
- K-means
- Mixture of Gaussian
- Spectral clustering
3. Hierarchical algorithms
- Bottom up : Agglomerative
- Top-down : Divisive
4. K-means Clustering
- Objective : $$ \min_{\mu} \min_C \sum_{i=1}^k \sum_{x\in C_i} \|x - \mu_i\|^2$$
- Initialize : Choose K points as cluster centers
- Alternate : (EM algorithm)
- Assignment step (Expectation) : Assign each data point to the closest cluster center $$\text{Fix } \mu,\; \text{optimize } C: \quad \min_C \sum_{i=1}^k \sum_{x\in C_i} \|x - \mu_i\|^2 = \min_C \sum_{n=1}^N \|x_n - \mu_{c(x_n)}\|^2$$ where $c(x_n)$ is the cluster currently assigned to $x_n$
- Refitting step (Maximization) : Change each cluster center to the average of its assigned points $$\text{Fix } C,\; \text{optimize } \mu: \quad \min_{\mu} \sum_{i=1}^k \sum_{x\in C_i} \|x - \mu_i\|^2 \;\Rightarrow\; \mu_i = \frac{1}{|C_i|}\sum_{x\in C_i} x$$
- Stop : when no point assignments change (see the sketch below)
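A minimal NumPy sketch of the loop above, assuming `X` is a float array of shape (N, D) and `k` is chosen in advance (function name and defaults are illustrative, not from the original notes):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: alternate the assignment (E) and refitting (M) steps."""
    rng = np.random.default_rng(seed)
    # Initialize: choose K data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = None
    for _ in range(n_iters):
        # Assignment step: assign each point to its closest center (O(KN) distances)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, k)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # Stop: no point assignment changed
        assign = new_assign
        # Refitting step: move each center to the average of its assigned points (O(N))
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign
```

Usage would look like `centers, labels = kmeans(X, k=3)`.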
5. K-means Clustering (Properties)
- Guaranteed to converge in a finite number of iterations (K-means is an alternating optimization; each step is guaranteed to decrease the objective)
- Running Time :
- Assign data points to closest cluster center : O(KN)
- Change each cluster center to the average of its assigned points : O(N)
- K-means is a heuristic : requires an initial K and initial means
- Local minima
- There is nothing to prevent K-means from getting stuck at local minima
- We could try many random starting points
- We could try Non-local split-and-merge moves
- Simultaneously merge two nearby clusters
- Split a big cluster
- Soft K-means
- Instead of making hard assignments, make soft assignments
- Allows a cluster to use more information about the data in the Maximization step (see the sketch below)
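A possible sketch of one soft K-means update, assuming a stiffness parameter `beta` (not in the original notes) that controls how soft the assignments are:

```python
import numpy as np

def soft_kmeans_step(X, centers, beta=1.0):
    """One soft K-means iteration: soft assignments, then responsibility-weighted refitting."""
    # Soft assignment: responsibility of center j for point n ~ exp(-beta * squared distance)
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, k)
    resp = np.exp(-beta * dists)
    resp /= resp.sum(axis=1, keepdims=True)                           # each row sums to 1
    # Refitting: every center uses all points, weighted by its responsibilities
    new_centers = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return new_centers, resp
```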
6. Generative view of clustering
- It is hard to judge whether a clustering result is good
- By turning the clustering result into a generative model, the result can be evaluated
7. Mixture of Gaussian
- Generative model
- Find the Gaussians that best describe the distribution of the data
- Then new data similar to the original data can be generated from them
- EM Algorithm
- Expectation : compute each data point's proportion (fraction, i.e. responsibility) for each Gaussian
- Maximization : update each Gaussian's mean and variance (standard deviation); see the sketch below
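A hedged sketch using scikit-learn's `GaussianMixture` (assuming scikit-learn is available; the toy data is illustrative): EM fits the means, covariances, and mixing proportions, and `sample` draws new data similar to the original, illustrating the generative view.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data from two blobs (stand-in for the original data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.5, size=(200, 2))])

# EM: the E-step computes each point's responsibility for each Gaussian,
# the M-step updates each Gaussian's mean, covariance, and mixing proportion
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

resp = gmm.predict_proba(X)     # per-point responsibilities (the E-step quantities)
X_new, comp = gmm.sample(100)   # generate new data similar to the original
print(gmm.means_, gmm.weights_)
```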
8. Spectral Clustering
- Treat the data as a graph and cut the weak edges
- A Gaussian kernel is typically used as the similarity $$W(i,j) = \exp\left(\frac{-\|x_i-x_j\|^2}{\sigma^2}\right)$$
- Fully connected graph or K-nearest neighbor graph
- Procedure
- Compute the similarity (affinity) matrix $W$ with entries $W_{i,j}$ (e.g. the Gaussian kernel above)
- Compute degree matrix : $D(i,i) = \sum_j W(i,j) $
- Compute Laplacian matrix (graph Laplacian) : $L = D-W$
- Compute the eigenvectors of L
- Cluster the data using the eigenvectors of L that correspond to the "small" eigenvalues (the connections whose eigenvalues are small); see the sketch below
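A minimal NumPy/scikit-learn sketch that follows the procedure above literally (fully connected graph, unnormalized Laplacian); `sigma` and the use of scikit-learn's `KMeans` for the final step are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Unnormalized spectral clustering: W -> D -> L = D - W -> small-eigenvalue embedding -> K-means."""
    # Similarity (affinity) matrix: Gaussian kernel on a fully connected graph
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / sigma ** 2)
    # Degree matrix and graph Laplacian
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Eigenvectors of L; the k eigenvectors with the smallest eigenvalues form the embedding
    eigvals, eigvecs = np.linalg.eigh(L)          # eigh returns eigenvalues in ascending order
    embed = eigvecs[:, :k]                        # (N, k): row n is the embedding of point n
    # Cluster the embedded points (here with scikit-learn's K-means)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embed)
```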
9. Cases that cannot be clustered properly
10. Hierarchical Clustering
- Agglomerative (bottom up)
- Start with the points as individual clusters
- At each step, merge the closest pair of clusters until only one cluster (or K clusters) is left
- Divisive (top down)
- Start with one, all-inclusive cluster
- At each step, split a cluster until each cluster contains a single point (or there are K clusters)
- Pros
- Any desired number of clusters can be obtained (using the dendrogram) (vs. K-means)
- May correspond to meaningful taxonomies
- Complexity
- At least quadratic in the number of data points (building the distance matrix)
- Not usable for large databases (con)
11. Agglomerative Clustering
- Merge similar(close) clusters
- Incrementally build larger clusters out of smaller clusters
- Dendrogram : the family of clusters can be represented
- Using the dendrogram, a good K can be found by looking for the longest duration over which no cluster merge occurs
- Define "close" cluster
- Closest Pair (single-link clustering) $$D_{sl}(C_i,C_j) = \min_{x,y}\{d(x,y)|x\in C_i, y\in C_j \} $$ - Sensitive to noise and outliers
- Farthest Pair (complete-link clustering) $$D_{cl}(C_i,C_j) = \max_{x,y}\{d(x,y)|x\in C_i, y\in C_j \} $$ - Tends to break large clusters (and tends to make all clusters roughly the same size)
- Average of all pairs $$D_{avg}(C_i,C_j) = \frac{1}{|C_i|\times |C_j|} \sum_{x\in C_i, y\in C_j} d(x,y) $$ - Tends to produce globular (sphere-shaped) clusters; see the SciPy sketch below
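A short SciPy sketch of agglomerative clustering with the linkage choices above (the toy data and K=2 are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])  # toy data

# 'single' = closest pair, 'complete' = farthest pair, 'average' = average of all pairs
Z = linkage(X, method="average")

# Dendrogram: the whole family of nested clusters; a long gap with no merges suggests a good K
dendrogram(Z)
plt.show()

# Cut the tree into K clusters
labels = fcluster(Z, t=2, criterion="maxclust")
```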
12. Divisive hierarchical clustering
- Monothetic divisive : split clusters using one variable/dimension at a time
- Polythetic divisive : splits on the basis of all variables together
- Computationally intensive, less widely used than agglomerative methods
- Check the sum of squared errors of each cluster and choose the one with the largest value to split next (see the sketch below)
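A rough polythetic-divisive sketch in the bisecting-K-means style, assuming scikit-learn's `KMeans` as the splitter (the function name and structure are illustrative, not from the original notes):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, k):
    """Top-down clustering: repeatedly split the cluster with the largest SSE until K clusters remain."""
    clusters = [np.arange(len(X))]                 # start with one all-inclusive cluster
    while len(clusters) < k:
        # Sum of squared errors of each cluster around its own mean
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))    # choose the cluster with the largest SSE
        # Polythetic split: 2-means on all variables of that cluster
        half = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters += [idx[half == 0], idx[half == 1]]
    return clusters
```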