Welcome to the Clustering section! This folder provides an introduction to clustering techniques, which are essential for unsupervised learning tasks. Clustering helps you group similar data points together, making it easier to identify patterns or categories within your data.
Note: The notebooks here are designed for beginners. They introduce foundational concepts but do not cover all available clustering methods or advanced techniques. For a more comprehensive understanding, please refer to the recommended resources provided below.
This folder currently includes:
- Agglomerative Clustering: A hierarchical clustering technique based on merging clusters.
- DBSCAN: A density-based clustering method, useful for identifying clusters of arbitrary shape.
- K-Means: A popular clustering method that partitions data into a specified number of clusters.
- K-Medoids: A robust clustering algorithm that partitions data into clusters by selecting actual data points (medoids) as cluster centers. It minimizes the sum of dissimilarities between points and their assigned medoid, making it more resilient to outliers than K-Means.
- Spectral Clustering: A technique for clustering data that is not linearly separable, using eigenvalues of a similarity matrix.
- Clustering Quality Evaluation: Metrics for assessing how well a clustering algorithm performs.
- Comparison of Various Clustering Algorithms: Compare several clustering algorithms to determine which one best suits your data.
Each section includes assignments to help reinforce your understanding, along with solutions for self-assessment.
Follow these steps to build a strong foundation in clustering techniques:
- Purpose: This hierarchical clustering method begins by treating each data point as an individual cluster, then successively merges the closest clusters.
- Topics to Cover:
- Basics of hierarchical clustering
- Linkage criteria (single, complete, average)
- Dendrograms for visualizing hierarchical clusters
- Resources:
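As a starting point, the merging behavior and the linkage matrix behind a dendrogram can be sketched with scikit-learn and SciPy. The synthetic blob data and the `average` linkage choice below are illustrative assumptions, not part of any assignment:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Illustrative synthetic data: three well-separated blobs.
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.6, random_state=42)

# Merge clusters bottom-up until 3 remain, using average linkage.
model = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = model.fit_predict(X)

# Linkage matrix for a dendrogram: each of the n-1 rows records one merge
# (the two clusters merged, their distance, and the new cluster's size).
Z = linkage(X, method="average")
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the merge tree, which is the usual way to visualize where to "cut" the hierarchy.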
- Purpose: DBSCAN identifies clusters based on data density, making it useful for detecting clusters with arbitrary shapes.
- Topics to Cover:
- Core points, border points, and noise points
- Selecting parameters like epsilon and minimum samples
- DBSCAN’s advantages for handling noise and non-linear shapes
- Resources:
- DBSCAN (Sklearn Documentation)
- DBSCAN Tutorial (Towards Data Science)
- StatQuest's DBSCAN (video)
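To see density-based clustering on a shape K-Means cannot handle, a minimal sketch on two interleaving half-moons follows; the `eps` and `min_samples` values are assumptions chosen for this scaled synthetic data, not universal defaults:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons: clusters with arbitrary (non-convex) shape.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)  # eps below assumes scaled features

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # label -1 marks noise points

# Number of clusters found, excluding noise.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
```

Note that DBSCAN infers the number of clusters from density; only `eps` (neighborhood radius) and `min_samples` (density threshold) are supplied.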
- Purpose: K-Means is a partitioning clustering algorithm that aims to split data into a predefined number of clusters.
- Topics to Cover:
- Centroid calculation and cluster assignment
- Elbow method for choosing the optimal number of clusters
- Limitations of K-Means (e.g., sensitivity to initial centroids)
- Resources:
- K-Means (Sklearn Documentation)
- K-Means Clustering Guide (Kaggle tutorial)
- K-Means Clustering Algorithm (video)
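The elbow method described above can be sketched by tracking inertia (within-cluster sum of squares) over a range of `k`; the blob data and the range `1..7` are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=7)

# Fit K-Means for k = 1..7 and record inertia; the "elbow" where the
# curve stops dropping sharply (here around k=4) suggests a good k.
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    inertias.append(km.inertia_)
```

Plotting `inertias` against `k` makes the elbow visible; `n_init=10` reruns the algorithm from multiple random centroids to reduce sensitivity to initialization.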
- Purpose: K-Medoids is a partitioning clustering algorithm that selects actual data points (medoids) as cluster centers, making it more robust to outliers than K-Means.
- Topics to Cover:
- Medoid selection and cluster assignment
- Differences between K-Medoids and K-Means (e.g., robustness to noise and outliers)
- Resources:
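Since K-Medoids is not in core scikit-learn, a minimal from-scratch sketch of the alternating (Voronoi-iteration) variant is shown below; it is a simplification of the full PAM algorithm, and the helper name `k_medoids` and the blob data are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs

def k_medoids(X, k, n_iter=100, seed=0):
    """Minimal k-medoids via Voronoi iteration (a simplified PAM sketch)."""
    rng = np.random.default_rng(seed)
    # Precompute all pairwise Euclidean distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assign each point to its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        # New medoid of each cluster: the member with the smallest total
        # dissimilarity to all other members (assumes no cluster empties,
        # which holds here since each medoid stays in its own cluster).
        new_medoids = np.array([
            np.where(labels == j)[0][
                np.argmin(D[np.ix_(labels == j, labels == j)].sum(axis=0))
            ]
            for j in range(k)
        ])
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    # Final assignment against the converged medoids.
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.7, random_state=4)
medoids, labels = k_medoids(X, k=3)
```

Because the centers are actual data points, a single extreme outlier cannot drag a center away the way it shifts a K-Means centroid.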
- Purpose: Spectral Clustering is a technique for clustering data that is not linearly separable. It uses the eigenvalues of a similarity matrix to perform dimensionality reduction before clustering in the lower-dimensional space.
- Topics to Cover:
- Affinity matrix and graph Laplacian
- Eigenvalue decomposition and its role in clustering
- Parameter selection (e.g., `n_clusters`, `affinity`, `gamma`)
- Resources:
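A quick sketch of Spectral Clustering on data that defeats linear separation is shown below; the concentric-circles dataset and the `nearest_neighbors` affinity (which builds the similarity graph from k-nearest neighbors) are assumptions chosen for illustration:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric circles: not linearly separable.
X, y_true = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

# Build a k-nearest-neighbors affinity graph, embed the data via the
# graph Laplacian's eigenvectors, then cluster in that low-dimensional space.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)

# Since this synthetic data has ground truth, ARI checks recovery quality.
ari = adjusted_rand_score(y_true, labels)
```

With an RBF affinity instead, the `gamma` parameter controls how quickly similarity decays with distance and typically needs tuning for ring-shaped data.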
- Purpose: Clustering evaluation metrics help assess the performance of clustering algorithms.
- Topics to Cover:
- Silhouette Score
- Davies-Bouldin Index
- Adjusted Rand Index (ARI)
- Resources:
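The three metrics above can be computed directly from scikit-learn; the K-Means run on synthetic blobs below is an illustrative assumption (any clustering result could be scored the same way):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             silhouette_score)

X, y_true = make_blobs(n_samples=300, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

sil = silhouette_score(X, labels)          # in [-1, 1]; higher is better
dbi = davies_bouldin_score(X, labels)      # >= 0; lower is better
ari = adjusted_rand_score(y_true, labels)  # needs ground truth; 1.0 is a perfect match
```

Note the split: Silhouette and Davies-Bouldin are internal metrics (no labels needed), while ARI is external and only applies when ground-truth labels exist.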
- Purpose: Compare different clustering algorithms on the same data to see how their results and runtimes differ.
- Topics to Cover:
- Compare K-Means, Hierarchical Clustering, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
- Compare strengths and weaknesses of each.
- Compare the runtime complexity of each.
- Resources:
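One way to structure such a comparison is to run each algorithm on the same dataset and record cluster count, silhouette score, and wall-clock time; the dataset and the DBSCAN parameters below are assumptions for illustration:

```python
import time
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=3)

algorithms = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=3),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),  # eps assumed for this scale
}

results = {}
for name, algo in algorithms.items():
    start = time.perf_counter()
    labels = algo.fit_predict(X)
    elapsed = time.perf_counter() - start
    # Silhouette is undefined for fewer than 2 clusters (a DBSCAN edge case).
    n_found = len(set(labels)) - (1 if -1 in labels else 0)
    score = silhouette_score(X, labels) if n_found > 1 else float("nan")
    results[name] = {"clusters": n_found, "silhouette": score, "seconds": elapsed}
```

Timing a single run is only a rough proxy for the asymptotic costs discussed above (roughly O(n·k·i) per K-Means run versus O(n²) or worse for hierarchical clustering), but it makes the differences concrete on real data sizes.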
Each clustering method comes with assignments designed to help you apply the concepts you've learned. Solutions are provided for self-evaluation. Try to complete the assignments independently before checking the solutions for the best learning experience.
- Begin with Agglomerative Clustering: Start by understanding how hierarchical clustering builds clusters step-by-step.
- Explore DBSCAN: Learn how DBSCAN groups data based on density, making it robust for non-linear data.
- Try K-Means: Experiment with partitioning data into clusters, focusing on selecting the optimal number of clusters.
- Try K-Medoids: Experiment with partitioning data into clusters while minimizing dissimilarity within each cluster. Unlike K-Means, K-Medoids selects actual data points as cluster centers, making it more robust to outliers.
- Dive into Spectral Clustering: Understand how Spectral Clustering handles non-linearly separable data using eigenvalues and similarity matrices.
- Evaluate Performance: Assess the performance of the above-mentioned algorithms and find the most suitable one for your data.
- Explore various clustering algorithms: Experiment with various other clustering algorithms to compare and see which one is the best for your data.
Happy clustering! Developing these skills will enable you to analyze data and identify patterns effectively. For further learning, refer to the documentation and tutorials linked above.