Let's review Unsupervised Learning!
1. Types of Clustering
1.1 Exclusive vs. overlapping clustering
• Can an item be in more than one cluster?
• Deterministic vs. probabilistic clustering (Hard vs. soft clustering)
• Can an item be partially or weakly in a cluster?
Probabilistic clustering: useful when some instances lie on the boundary and do not clearly belong to one cluster or another; each instance receives a degree (probability) of membership in every cluster.
1.2 Hierarchical vs. partitioning clustering
• Do the clusters have subset relationships between them? e.g. nested in a tree?
Hierarchical:
- useful when the data has a taxonomy or nested semantic structure;
- the number of clusters does not need to be known in advance;
- gives a compact summary of which objects are most similar to each other.
Partitioning: the number of clusters must be known (or chosen) in advance.
1.3 Partial vs. complete
• In some cases, we only want to cluster some of the data
Partial: useful when the data contains noise, so only the meaningful portion is clustered.
1.4 Heterogeneous vs. homogeneous
• Clusters of widely different sizes, shapes, and densities
1.5 Batch vs. incremental clustering
• Is the whole set of items clustered in one go?
Batch: the entire dataset is available up front and clustered at once.
Incremental:
1. Dynamic datasets: time series and data streams.
2. Only the portion of the data observed so far is available; as new instances arrive, the centroids (or the clustering algorithm's state) are updated incrementally, either placing the new instances into an existing cluster or creating a new cluster for them.
2. Desiderata
• Scalability; high dimensionality
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Able to deal with noise and outliers
• Insensitive to order of input records
3. Types of Evaluation
3.1 Unsupervised Evaluation
Def: Measures the goodness of a clustering structure without respect to external information. Includes measures of cluster cohesion (compactness, tightness), and measures of cluster separation (isolation, distinctiveness).
3.1.1 What makes a good cluster?
A good clustering has high intra-cluster cohesion (points within a cluster are close together) and high inter-cluster separation (different clusters are well apart).
3.1.2 SSE (Sum of Squared Error)
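The standard cohesion measure here is the sum of squared errors: SSE = Σ_{i=1..k} Σ_{x ∈ C_i} dist(c_i, x)², where c_i is the centroid of cluster C_i; lower SSE means more compact clusters. A minimal NumPy sketch of this computation (array names are illustrative, not from the lecture):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid.

    X: (n, d) data array; labels: (n,) integer array of cluster indices;
    centroids: (k, d) array of cluster centroids.
    """
    total = 0.0
    for i, c in enumerate(centroids):
        members = X[labels == i]            # points assigned to cluster i
        total += np.sum((members - c) ** 2)
    return total
```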
3.2 Supervised Evaluation
Def: Measures the extent to which the clustering structure discovered by a clustering algorithm matches some external structure.
For instance, entropy can measure how well cluster labels match externally supplied class labels.
A good cluster should have high purity and low entropy, since both measures reflect how consistent the externally supplied class labels are within the cluster.
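As an illustration of these two measures, a small sketch that computes per-cluster purity and entropy from cluster assignments and external class labels (function and variable names are mine, not from the lecture):

```python
import numpy as np
from collections import Counter

def purity_and_entropy(cluster_labels, class_labels):
    """Per cluster: purity = largest class fraction; entropy = -sum p*log2(p)."""
    results = {}
    for c in set(cluster_labels):
        classes = [cls for cl, cls in zip(cluster_labels, class_labels) if cl == c]
        p = np.array(list(Counter(classes).values()), dtype=float) / len(classes)
        results[c] = (p.max(), float(-np.sum(p * np.log2(p))))
    return results

# Cluster 0 is pure (purity 1.0, entropy 0.0); cluster 1 is mixed (0.5, 1.0).
print(purity_and_entropy([0, 0, 1, 1], ['a', 'a', 'a', 'b']))
```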
4. Methods
• A key component of any clustering algorithm is a measure of the distance between points. (Similarity / Proximity / Closeness)
4.1 How to measure similarity / proximity / closeness (a small code sketch follows at the end of this subsection)
4.1.1 Data points in Euclidean space
• Euclidean distance
• Manhattan (L1) distance
4.1.2 Discrete values
• Hamming distance (discrepancy between bit strings): for two bit strings, the number of positions at which the corresponding symbols differ
4.1.3 Documents
• Cosine similarity
• Jaccard measure
4.1.4 Other measures
• Correlation
• Graph-based measures
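A short sketch of the measures listed above, for points in Euclidean space, bit strings, and documents (function names are illustrative):

```python
import numpy as np

def euclidean(x, y):          # L2 distance between two numeric vectors
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):          # L1 (city-block) distance
    return float(np.sum(np.abs(x - y)))

def hamming(a, b):            # positions at which two equal-length strings differ
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def cosine_sim(x, y):         # angle-based similarity, common for term-count vectors
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def jaccard(s, t):            # set overlap, e.g. sets of words in two documents
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)
```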
4.2 k-means Clustering
4.2.1 Steps
1. Select k points as the initial centroids.
2. Assign each point to its nearest centroid, forming k clusters.
3. Recompute the centroid of each cluster as the mean of its points.
4. Repeat steps 2-3 until the centroids (or cluster assignments) no longer change.
4.2.2 Details
• Initial centroids: chosen randomly.
• The centroid: the mean of the points in the cluster.
• ‘Nearest’ is based on proximity/similarity/etc. metric.
• K-means will converge for common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
• Often the stopping condition: ‘Until relatively few points change clusters’ (this way the stopping criterion will not depend on the type of similarity or dimensionality)
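A minimal NumPy sketch of the loop described in 4.2.1-4.2.2, using Euclidean distance and randomly chosen initial centroids (names and structure are illustrative; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```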
4.2.3 pros and cons
Strengths: • easy to implement
• relatively efficient
• O(ndki), where n is no. instances, d is no. attributes, k is no. clusters, and i is no. iterations; normally k, i << n
• Unfortunately, we cannot a priori know the value of i!
Weaknesses:
• tends to converge to a local minimum; sensitive to the seed instances (mitigate by running multiple times with different seeds; see the example below)
• need to specify k in advance
• not able to handle non-convex clusters, or clusters of differing densities or sizes
• “mean” is ill-defined for nominal or categorical attributes
• may not work well when the data contains outliers
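In practice, sensitivity to the initial seeds is usually reduced by running k-means several times with different random initialisations and keeping the lowest-SSE solution, e.g. with scikit-learn (a hedged example on a toy dataset, not part of the lecture):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((100, 2))   # toy (n, d) feature array

# n_init re-runs k-means with different random initial centroids and keeps
# the run with the lowest inertia (SSE); k still has to be chosen up front.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_, km.labels_[:10])
```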
5. Hierarchical Clustering
In contrast to k-means clustering, hierarchical clustering only requires a measure of similarity between groups of data points (no seeds, no k value).
5.1 Bottom-up (= agglomerative) clustering
• Start with single-instance clusters
• At each step, merge the two closest clusters (closeness measured e.g. by the gap between the clusters, the distance between their means, ...)
• Repeat until only one cluster remains
5.2 Top-down (= divisive) clustering
• Start with one universal cluster
• Find two partitioning clusters
• Proceed recursively on each subset
• Can be very fast
5.3 Graph-based measures of cluster proximity
Updating the proximity matrix:
• Single link: minimum distance between any two points in the two clusters (most similar members).
• Complete link: maximum distance between any two points in the two clusters (most dissimilar members).
• Group average: average distance over all pairs of points, one from each cluster.
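A brief agglomerative example with SciPy, where method selects one of the linkage criteria above (a sketch on a toy 2-D dataset; variable names are mine):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])

# Build the tree bottom-up; 'single', 'complete', and 'average'
# correspond to the three proximity updates listed above.
Z = linkage(X, method='average', metric='euclidean')

# Cut the tree into 2 flat clusters: the first two and last two points group together.
print(fcluster(Z, t=2, criterion='maxclust'))
```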
Summary
• What basic contrasts are there between different clustering methods?
• How does k-means operate, and what are its strengths and weaknesses?
• What is hierarchical clustering, and how does it differ from partitioning clustering?
• What are some challenges we face when clustering data?
References
COMP90049 Introduction to Machine Learning. The University of Melbourne. Semester 2, 2020.