isme_nanobi_internship

OPTICS clustering

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm, similar to DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Unlike DBSCAN, which applies a single density threshold to the whole dataset, OPTICS can extract clusters of varying densities. This makes it useful for identifying clusters of different densities and shapes in large, high-dimensional datasets.

The main idea behind OPTICS is to extract the clustering structure of a dataset by identifying the density-connected points. The algorithm builds a density-based representation of the data by creating an ordered list of points, which is visualized as the reachability plot. Each point in the list is associated with a reachability distance, which is a measure of how easy it is to reach that point from the points processed before it. Neighbouring points in the ordering with similar, low reachability distances are likely to be in the same cluster.
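To make the idea of a reachability plot more concrete, here is a minimal sketch that draws one as text bars. The reachability values are made up for illustration, and the helper name text_reachability_plot is hypothetical (it is not part of any library).

```python
import math

def text_reachability_plot(reach_in_order, width=40):
    """Render reachability distances (already in cluster order) as text bars."""
    finite = [r for r in reach_in_order if math.isfinite(r)]
    top = max(finite, default=1.0) or 1.0  # avoid dividing by zero
    lines = []
    for pos, r in enumerate(reach_in_order):
        if math.isfinite(r):
            bar = "#" * max(1, round(width * r / top))
            lines.append(f"{pos:>3} {r:>6.2f} {bar}")
        else:
            # Undefined (infinite) reachability: full-width bar with a '+' marker.
            lines.append(f"{pos:>3} {'inf':>6} {'#' * width}+")
    return "\n".join(lines)

# Made-up values: two valleys (positions 1-3 and 5-7) separated by a peak at 4.
example = [math.inf, 0.4, 0.5, 0.4, 3.0, 1.1, 1.0, 1.2]
print(text_reachability_plot(example))
```

In the printed bars, the short runs are the valleys, which is where clusters would be expected; the lower valley would be the denser cluster.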

The OPTICS algorithm follows these main steps:

· Choose the minimum number of neighbours, MinPts (min_samples in scikit-learn), and optionally a maximum neighbourhood radius, Eps, which bounds how far the search for neighbours extends.

· For each point in the dataset, compute its core distance from the distances to its nearest neighbours.

· Starting with an arbitrary unprocessed point, visit its neighbours and update their reachability distances. Always process next the unprocessed point with the smallest reachability distance.

· Record the points in the order in which they are processed, together with their reachability distances. Plotting these values gives the reachability plot.

· Extract clusters from the reachability plot by grouping consecutive points that have similar, low reachability distances (the valleys of the plot).

One of the main advantages of OPTICS over DBSCAN is that it does not require a single density threshold to be fixed in advance. Instead, it extracts the clustering structure of the data and produces the reachability plot. This gives the user more flexibility in selecting the clusters, for example by cutting the reachability plot at a chosen level. Also, unlike DBSCAN, which applies one global threshold, it can handle clusters of different densities and can identify hierarchical structure.
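The ordering steps listed above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the scikit-learn implementation: it uses brute-force Euclidean distances, counts a point as its own neighbour when checking MinPts, and the function name optics_order is hypothetical.

```python
import heapq
import math

def optics_order(points, min_pts, max_eps=math.inf):
    """Return (order, reach): the processing order and each point's reachability."""
    n = len(points)
    reach = [math.inf] * n      # reachability distance per point index
    processed = [False] * n
    order = []

    def neighbours(i):
        # All points within max_eps of point i (including i itself), with distances.
        pairs = ((math.dist(points[i], points[j]), j) for j in range(n))
        return [(d, j) for d, j in pairs if d <= max_eps]

    for start in range(n):
        if processed[start]:
            continue
        seeds = [(math.inf, start)]          # priority queue keyed by reachability
        while seeds:
            _, i = heapq.heappop(seeds)
            if processed[i]:
                continue                     # stale queue entry
            processed[i] = True
            order.append(i)
            nbrs = neighbours(i)
            if len(nbrs) < min_pts:
                continue                     # not a core point: do not expand
            core = sorted(d for d, _ in nbrs)[min_pts - 1]
            for d, j in nbrs:
                new_reach = max(core, d)
                if not processed[j] and new_reach < reach[j]:
                    reach[j] = new_reach
                    heapq.heappush(seeds, (new_reach, j))
    return order, reach

# Two small groups of points, interleaved in the input on purpose.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10), (1, 1)]
order, reach = optics_order(points, min_pts=3)
print(order)                                 # [0, 2, 4, 6, 1, 3, 5]
print([round(reach[i], 2) for i in order])   # [inf, 1.0, 1.0, 1.0, 12.73, 1.0, 1.0]
```

The ordering visits one group completely before jumping to the other, and the jump shows up as a single large reachability value between two runs of small ones.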

OPTICS is implemented in Python as the sklearn.cluster.OPTICS class in the scikit-learn library. It takes several parameters, including the number of nearest neighbours to consider (min_samples), the maximum neighbourhood radius (max_eps), and the minimum steepness on the reachability plot that marks a cluster boundary (xi).
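As a minimal sketch of how these parameters are passed, the snippet below fits sklearn.cluster.OPTICS on synthetic data. The data, the random seed, and the parameter values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Synthetic data (an assumption for this sketch): one tight and one loose blob.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=1.0, size=(50, 2)),
])

# min_samples: neighbours needed for a core point
# max_eps: largest neighbourhood radius considered (infinite by default)
# xi: minimum steepness on the reachability plot that marks a cluster boundary
model = OPTICS(min_samples=5, max_eps=np.inf, xi=0.05)
model.fit(X)

labels = model.labels_                                   # -1 marks noise
reachability_in_order = model.reachability_[model.ordering_]
print("cluster labels found:", np.unique(labels))
print("first reachability values:", reachability_in_order[:5])
```

The fitted attributes ordering_, reachability_, and core_distances_ hold the cluster ordering and the two distances that OPTICS is built on, so they can be plotted or reused for a different cluster extraction without refitting.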

OPTICS adds two terms to the concepts used in DBSCAN:

· Core Distance: It is the minimum value of radius required to classify a given point as a core point, that is, the distance to its MinPts-th nearest neighbour. If the given point is not a core point, then its core distance is undefined.

· Reachability Distance: It is defined with respect to another data point. The reachability distance of a point q with respect to a point p is the maximum of the core distance of p and the Euclidean distance (or some other distance metric) between p and q. Note that the reachability distance is not defined if p is not a core point.
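The two definitions above can be written out directly. The following is a small sketch in plain Python, assuming Euclidean distance and assuming that a point counts as its own neighbour when checking MinPts; the function names are hypothetical and None stands for "undefined".

```python
import math

def core_distance(points, p_idx, min_pts, max_eps=math.inf):
    """Distance from points[p_idx] to its min_pts-th nearest neighbour
    (the point itself included). None if p is not a core point within max_eps."""
    dists = sorted(math.dist(points[p_idx], q) for q in points)
    if len(dists) < min_pts:
        return None
    d = dists[min_pts - 1]
    return d if d <= max_eps else None

def reachability_distance(points, q_idx, p_idx, min_pts, max_eps=math.inf):
    """Reachability of q with respect to p: max(core distance of p, dist(p, q)).
    None if p is not a core point."""
    cd = core_distance(points, p_idx, min_pts, max_eps)
    if cd is None:
        return None
    return max(cd, math.dist(points[p_idx], points[q_idx]))

points = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(core_distance(points, 0, min_pts=3))               # 1.0
print(reachability_distance(points, 1, 0, min_pts=3))    # 1.0
print(core_distance(points, 3, min_pts=3, max_eps=2.0))  # None
```

The last call shows an isolated point: within a radius of 2.0 it does not have enough neighbours, so its core distance is undefined and no reachability distance can be measured from it.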

This clustering technique is different from other clustering techniques in the sense that it does not explicitly segment the data into clusters. Instead, it produces an ordering of the points with their reachability distances, and the clusters are then read off from the visualization of these distances.
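One simple way to turn the reachability values into clusters is to cut them at a fixed level, which gives a DBSCAN-like result. The sketch below assumes that the ordering, reachability distances, and core distances are already available; the values used here are illustrative and the function name cut_reachability is hypothetical. scikit-learn offers a comparable extraction as cluster_optics_dbscan.

```python
import math

def cut_reachability(order, reach, core, eps):
    """Label points by cutting the reachability values at level eps.
    Returns one label per point index; -1 means noise."""
    labels = [-1] * len(order)
    cluster = -1
    for i in order:
        if reach[i] > eps:
            # Not reachable at this level: start a new cluster only if
            # the point is itself a core point at radius eps.
            if core[i] is not None and core[i] <= eps:
                cluster += 1
                labels[i] = cluster
        else:
            labels[i] = cluster   # reachable: join the current cluster
    return labels

# Illustrative values, indexed by point (not by position in the ordering).
order = [0, 2, 4, 6, 1, 3, 5]
reach = [math.inf, 12.73, 1.0, 1.0, 1.0, 1.0, 1.0]
core = [1.0, 1.0, 1.0, 1.41, 1.0, 1.41, 1.0]

print(cut_reachability(order, reach, core, eps=2.0))   # [0, 1, 0, 1, 0, 1, 0]
print(cut_reachability(order, reach, core, eps=20.0))  # [0, 0, 0, 0, 0, 0, 0]
print(cut_reachability(order, reach, core, eps=0.5))   # [-1, -1, -1, -1, -1, -1, -1]
```

The same ordering yields two clusters, one cluster, or only noise depending on where the cut is placed, which is the flexibility that a single DBSCAN run does not give.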

Sample OPTICS clustering:
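The following is a minimal sample, assuming scikit-learn and NumPy are installed. The synthetic data, the seed, and the parameter values are assumptions made for this sketch. It fits OPTICS once and then extracts DBSCAN-style clusterings at two radii from the same ordering.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Synthetic data (assumed for illustration): a dense blob and a sparse blob.
rng = np.random.default_rng(42)
dense = rng.normal(loc=(0, 0), scale=0.2, size=(60, 2))
sparse = rng.normal(loc=(6, 6), scale=1.0, size=(60, 2))
X = np.vstack([dense, sparse])

optics = OPTICS(min_samples=10).fit(X)

# Default extraction based on the steepness parameter xi.
print("xi-based labels:", np.unique(optics.labels_))

# DBSCAN-style cuts at two radii, reusing the same fitted ordering.
noise_counts = {}
for eps in (0.3, 1.5):
    labels = cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )
    n_clusters = len(set(labels.tolist()) - {-1})
    noise_counts[eps] = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} cluster(s), {noise_counts[eps]} noise point(s)")
```

A small radius tends to keep only the dense region as a cluster and mark sparse points as noise, while a larger radius admits the sparser region as well.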


How OPTICS clustering is different from other clustering algorithms:

OPTICS differs from other clustering algorithms, such as k-means or hierarchical clustering, in several ways:

· Density-based: OPTICS is a density-based clustering algorithm, which means it identifies clusters based on the density of data points in the feature space. It does not assume that clusters have a specific shape, size, or number of points.

· Reachability Plot: OPTICS introduces the concept of a reachability plot, which provides a visualization of the clustering structure. The plot shows the reachability distance of each point, with the points arranged in the order in which the algorithm processed them. Valleys in the plot correspond to clusters, and deeper valleys indicate denser clusters, which helps in identifying clusters of varying densities and determining an appropriate clustering threshold.

· Variable Cluster Size: OPTICS can identify clusters of varying sizes and densities, including clusters with irregular shapes. It does not assume that clusters have a uniform size or contain a specific number of points.

· No Need for Predefined Number of Clusters: Unlike k-means, OPTICS does not require specifying the number of clusters beforehand. The number of clusters follows from the density and connectivity of the data points and from how the reachability plot is cut.

· Robust to Noise and Outliers: OPTICS is robust to noise and outliers because it considers the density of data points. Points in low-density regions are not assigned to any cluster and are labelled as noise (label -1 in scikit-learn) instead of being forced into the nearest cluster.

· Flexibility in Clustering Parameters: OPTICS provides flexibility in setting clustering parameters such as the number of points required in a neighbourhood for a point to count as a core point (min_samples) and the maximum distance between points to be considered in the same neighbourhood (max_eps in scikit-learn). These parameters can be adjusted to control the granularity and sensitivity of the clustering.

Summary:

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm that offers several advantages over related methods like DBSCAN. It can identify clusters of varying densities and shapes in large, high-dimensional datasets, and its reachability plot can reveal hierarchical relationships within the data, giving a fuller picture of the underlying structure. With its implementation in Python through scikit-learn, OPTICS is easily accessible to data scientists and researchers. It is a valuable tool for uncovering hidden structures and patterns in data, which makes it an important technique in exploratory data analysis, machine learning, and data mining applications.

Reference: https://www.geeksforgeeks.org/ml-optics-clustering-explanation/

ISME student doing an internship with Hunnarvi Technologies Pvt Ltd under the guidance of Nanobi Data and Analytics. Views are personal.
