Clustering (Summary)

1. What is Clustering?
  • Unsupervised learning
  • Requires only data, no labels
  • Detect patterns
  • Useful when you don't know what you're looking for
  • Group together "similar" instances
2. Partition algorithms
  • K-means
  • Mixture of Gaussians
  • Spectral clustering
3. Hierarchical algorithms
  • Bottom-up : Agglomerative
  • Top-down : Divisive

4. K-means Clustering
  • Objective : $$ \min_{\mu} \min_C \sum_{i=1}^{K} \sum_{x\in C_i} \|x - \mu_i\|^2$$
  • Initialize : Choose K points as cluster centers
  • Alternate : (EM algorithm)
    1. Assignment step (Expectation) : Assign each data point to the closest cluster center $$\text{Fix } \mu,\ \text{optimize } C: \\ \min_C \sum_{i=1}^{K} \sum_{x\in C_i} \|x - \mu_i\|^2 = \min_C \sum_{n=1}^{N} \|x_n - \mu_{c(x_n)}\|^2$$ where $\mu_{c(x_n)}$ is the center of the cluster that $x_n$ is currently assigned to
    2. Refitting step (Maximization) : Change each cluster center to the average of its assigned points $$\text{Fix } C,\ \text{optimize } \mu: \\ \min_{\mu} \sum_{i=1}^{K} \sum_{x\in C_i} \|x - \mu_i\|^2 \;\Rightarrow\; \mu_i = \frac{1}{|C_i|}\sum_{x\in C_i} x$$
  • Stop : when no point assignments change (a minimal sketch of the full loop follows below)
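
A minimal NumPy sketch of this loop (the function and variable names are my own, not from the original notes):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: choose K points as cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step (Expectation): assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Refitting step (Maximization): move each center to the mean of its points.
        # (Assumes no cluster becomes empty; a robust version would re-seed empties.)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centers (hence the assignments) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```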

5. K-means Clustering (Properties)
  • Guaranteed to converge in a finite number of iterations (K-means is an alternating optimization; each step is guaranteed to decrease the objective)
  • Running Time (per iteration) :
    1. Assign data points to the closest cluster center : O(KN)
    2. Change the cluster centers to the averages of their groups : O(N)
  • K-means is a heuristic : requires choosing K and the initial means
  • Local minima
    • There is nothing to prevent K-means from getting stuck at local minima
    • We could try many random starting points
    • We could try Non-local split-and-merge moves
      • Simultaneously merge two nearby clusters
      • Split a big cluster
  • Soft K-means
    • Instead of making hard assignments, make soft assignments (see the sketch after this section)
    • Allows a cluster to use more information about the data in the Maximization step
  • Generative view of clustering
    • It is hard to judge whether a clustering is good
    • By turning the clustering result into a generative model, the result can be evaluated
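
A sketch of the soft-assignment idea (the stiffness parameter `beta` is an assumed name; as `beta` grows, the soft assignments approach standard K-means' hard assignments):

```python
import numpy as np

def soft_kmeans_step(X, centers, beta=1.0):
    # Squared distance of every point to every center: shape (N, K).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Soft assignment: each point spreads its responsibility over all centers.
    r = np.exp(-beta * d2)
    r /= r.sum(axis=1, keepdims=True)
    # Maximization: each center is a responsibility-weighted average, so it
    # uses information from every data point, not just its hard members.
    new_centers = (r[:, :, None] * X[:, None, :]).sum(axis=0) / r.sum(axis=0)[:, None]
    return r, new_centers
```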


6. Mixture of Gaussians
  • Generative model
    • Find the Gaussians that best represent the distribution of the data
    • These can then be used to generate new data similar to the original data (see the sketch after this list)
  • EM Algorithm
    • Expectation : compute each data point's proportion (responsibility) under each Gaussian
    • Maximization : update each Gaussian's mean and variance (standard deviation)
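
A sketch of this generative view using scikit-learn's GaussianMixture (the library and the toy data are my own choices; `fit` runs the EM iterations described above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM under the hood
resp = gmm.predict_proba(X)   # E-step responsibilities (each point's proportions)
labels = gmm.predict(X)       # hard cluster assignments
X_new, _ = gmm.sample(50)     # generate new data similar to the original
```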

7. Spectral Clustering
  • Treat the data as a graph and cut the weak edges
  • A Gaussian kernel is typically used as the similarity $$W(i,j) = \exp\left(-\frac{\|x_i-x_j\|^2}{\sigma^2}\right)$$
  • Build a fully connected graph or a K-nearest-neighbor graph
  • Procedure
    • Compute similarity (affinity) matrix : $W(i,j) = W_{i,j}$
    • Compute degree matrix : $D(i,i) = \sum_j W(i,j) $
    • Compute Laplacian matrix (graph Laplacian) : $L = D-W$
    • Compute eigenvector of L
    • Cluster using the eigenvectors of L that correspond to "small" eigenvalues (these capture the weakest connections, i.e., where the graph should be cut; see the sketch after this list)
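
A sketch of this procedure in NumPy, assuming the fully connected Gaussian-kernel graph from above (σ, the unnormalized Laplacian, and the final K-means step on the embedding are my choices for the sketch):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Similarity (affinity) matrix with the Gaussian kernel.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / sigma**2)
    # Degree matrix and graph Laplacian L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Eigenvectors of L; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(L)
    # Embed each point as a row over the k "smallest" eigenvectors, then cluster.
    U = eigvecs[:, :k]
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```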


8. Not able to cluster properly
Changing the features (the distance function) can help, as in the sketch below.
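
A sketch on a two-rings toy dataset (the dataset and the radius feature are my own illustration): K-means on raw coordinates cuts the rings in half, but a single radius feature separates them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

raw = KMeans(n_clusters=2, n_init=10).fit_predict(X)    # fails: splits the rings in half
r = np.linalg.norm(X, axis=1).reshape(-1, 1)            # new feature: distance from origin
polar = KMeans(n_clusters=2, n_init=10).fit_predict(r)  # separates inner and outer ring
```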



9. Hierarchical Clustering
  • Agglomerative (bottom up)
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one cluster (or K clusters) is left
  • Divisive (top down)
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a single point (or there are K clusters)
  • Pros
    • Any desired number of clusters can be obtained using the dendrogram (unlike K-means, where K is fixed in advance)
    • May correspond to meaningful taxonomies
  • Complexity
    • At least quadratic in the number of data points (building the distance matrix)
    • Not usable for large datasets (con)

10. Agglomerative Clustering
  • Merge similar(close) clusters
  • Incrementally build larger clusters out of smaller clusters
  • Dendrogram : represents the whole family of clusters
  • Using the dendrogram, a good K can be chosen by finding the longest duration over which no cluster merge occurs
  • Defining a "close" cluster (see the sketch after this list)
    • Closest Pair (single-link clustering) $$D_{sl}(C_i,C_j) = \min_{x,y}\{d(x,y)|x\in C_i, y\in C_j \} $$ - Sensitive to noise and outliers
    • Farthest Pair (complete-link clustering) $$D_{cl}(C_i,C_j) = \max_{x,y}\{d(x,y)|x\in C_i, y\in C_j \} $$ - Tends to break large clusters (and tends to force all clusters to a similar size)
    • Average of all pairs $$D_{avg}(C_i,C_j) = \frac{1}{|C_i|\times |C_j|} \sum_{x\in C_i, y\in C_j} d(x,y) $$ - Tends to produce globular (sphere-shaped) clusters
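
A sketch of the three linkage rules with SciPy (the toy data and K = 2 are placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

Z_single   = linkage(X, method="single")    # closest pair
Z_complete = linkage(X, method="complete")  # farthest pair
Z_average  = linkage(X, method="average")   # average of all pairs

# Cut the dendrogram so that K = 2 clusters remain.
labels = fcluster(Z_average, t=2, criterion="maxclust")
# dendrogram(Z_average)  # visualize the merge history to choose K
```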

11. Divisive Hierarchical Clustering
  • Monothetic divisive : split clusters using one variable/dimension at a time
  • Polythetic divisive : split on the basis of all variables together
  • Computationally intensive, less widely used than agglomerative methods
  • Compute the sum of squared errors of each cluster and split the one with the largest value (sketch below)
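
A rough polythetic-divisive sketch in the bisecting-K-means style (my own illustration; the helper name `divisive` is assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, K):
    clusters = [X]  # start with one all-inclusive cluster
    while len(clusters) < K:
        # Pick the cluster with the largest sum of squared errors around its mean.
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        worst = clusters.pop(int(np.argmax(sse)))
        # Split it in two using 2-means (a polythetic split: all variables at once).
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(worst)
        clusters += [worst[labels == 0], worst[labels == 1]]
    return clusters
```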
