OPTICS clustering
OPTICS (Ordering Points To Identify the
Clustering Structure) is a density-based clustering algorithm, similar to
DBSCAN (Density-Based Spatial Clustering of Applications with Noise), but it
can extract clusters of varying densities and shapes. It is useful for
identifying clusters of different densities in large, high-dimensional
datasets.
The main idea behind OPTICS is to extract the clustering structure of a dataset by identifying the density-connected points. The algorithm builds a density-based representation of the data by creating an ordered list of points, the cluster ordering, which is visualized as the reachability plot. Each point in the list is associated with a reachability distance, which is a measure of how easy it is to reach that point from the points processed before it. Consecutive points in the ordering with similarly low reachability distances are likely to be in the same cluster.
The OPTICS algorithm follows these main steps:
· Define the density parameters: a neighbourhood radius, Eps, and a minimum number of points, MinPts. Together they set the lowest density that can still count as a cluster.
· For each point in the dataset, calculate the distance to its k-nearest neighbours.
· Starting with an arbitrary point, calculate the reachability distance of each point in the dataset, based on the density of its neighbours, always moving next to the unprocessed point with the smallest reachability distance.
· Order the points by the sequence in which they were processed and plot their reachability distances in that order to create the reachability plot.
· Extract clusters from the reachability plot by grouping points that are close to each other in the ordering and have similar reachability distances.
One of the main advantages of OPTICS over DBSCAN is that it does not require a single density threshold to be fixed in advance. Instead, it extracts the clustering structure of the data and produces the reachability plot. This gives the user more flexibility in selecting the clusters, by cutting the reachability plot at a chosen level. Also, unlike DBSCAN, it can handle clusters of different densities and can identify hierarchical structure.
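As a rough illustration of the steps above, here is a minimal pure-Python sketch of the ordering loop. It is a simplified, quadratic-time version written for readability, not the scikit-learn implementation. The function name optics_ordering and its arguments are hypothetical, and the sketch assumes that a point counts as its own first neighbour.

```python
import heapq
import math


def optics_ordering(points, min_pts, max_eps=math.inf):
    """Return (ordering, reachability) for a list of coordinate tuples.

    Hypothetical helper for illustration only; brute-force distances.
    """
    n = len(points)
    reach = [math.inf] * n      # reachability distance of each point
    processed = [False] * n
    ordering = []

    def core_dist(i):
        # Distance to the min_pts-th nearest neighbour (self included).
        d = sorted(math.dist(points[i], p) for p in points)
        if len(d) < min_pts or d[min_pts - 1] > max_eps:
            return None         # not a core point
        return d[min_pts - 1]

    for start in range(n):
        if processed[start]:
            continue
        seeds = [(math.inf, start)]          # min-heap of (reachability, index)
        while seeds:
            _, i = heapq.heappop(seeds)
            if processed[i]:
                continue                     # stale heap entry, skip it
            processed[i] = True
            ordering.append(i)
            cd = core_dist(i)
            if cd is None:
                continue
            for j in range(n):
                dist = math.dist(points[i], points[j])
                if processed[j] or dist > max_eps:
                    continue
                new_reach = max(cd, dist)
                if new_reach < reach[j]:     # found an easier way to reach j
                    reach[j] = new_reach
                    heapq.heappush(seeds, (new_reach, j))
    return ordering, reach


# Two tight groups on a line, far apart from each other.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
ordering, reach = optics_ordering(points, min_pts=2)
print(ordering)
print(reach)
```

In this toy input the reachability distance jumps at the first point of the second group, which is the kind of peak that separates two valleys in a reachability plot.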
OPTICS is implemented in Python as the sklearn.cluster.OPTICS class in the scikit-learn library. It takes several parameters, including the maximum neighbourhood radius (max_eps), the number of nearest neighbours to consider (min_samples), and the minimum steepness on the reachability plot that marks a cluster boundary (xi).
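A minimal sketch of how these parameters might be passed to the class is shown below. The toy data and the specific parameter values are assumptions chosen only for illustration; suitable values depend on the dataset.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Assumed toy data: two Gaussian groups in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
])

model = OPTICS(
    min_samples=5,    # neighbours needed for a point to be a core point
    max_eps=np.inf,   # upper bound on the neighbourhood radius
    xi=0.05,          # minimum steepness that marks a cluster boundary
)
model.fit(X)

# Fitted attributes: one label, reachability value and core distance per point.
print(model.labels_[:10])
print(model.reachability_[:10])
print(model.core_distances_[:10])
```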
Two terms are central to OPTICS. They are:
· Core Distance: It is the minimum value of radius required to classify a given point as a core point. If the given point is not a core point, then its Core Distance is undefined.
· Reachability Distance: It is defined with respect to another data point q. The Reachability Distance of a point p from q is the maximum of the Core Distance of q and the Euclidean distance (or some other distance metric) between p and q. Note that the Reachability Distance is not defined if q is not a core point.
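The two definitions can be written out directly. The sketch below uses only the Python standard library; the helper names core_distance and reachability_distance are hypothetical, and it assumes Euclidean distance and that a point counts as its own first neighbour.

```python
import math


def core_distance(points, i, min_pts, max_eps=math.inf):
    """Distance from points[i] to its min_pts-th nearest neighbour.

    Returns None (undefined) when points[i] is not a core point.
    """
    dists = sorted(math.dist(points[i], other) for other in points)
    if len(dists) < min_pts or dists[min_pts - 1] > max_eps:
        return None
    return dists[min_pts - 1]


def reachability_distance(points, p, q, min_pts, max_eps=math.inf):
    """Reachability distance of points[p] with respect to points[q]."""
    core_q = core_distance(points, q, min_pts, max_eps)
    if core_q is None:
        return None  # undefined when q is not a core point
    return max(core_q, math.dist(points[p], points[q]))


points = [(0, 0), (1, 0), (0, 1), (5, 5)]

cd0 = core_distance(points, 0, min_pts=3)             # 1.0
rd_near = reachability_distance(points, 1, 0, 3)      # max(1.0, 1.0) = 1.0
rd_far = reachability_distance(points, 3, 0, 3)       # max(1.0, sqrt(50))
cd_isolated = core_distance(points, 3, 3, max_eps=2)  # None: too isolated
print(cd0, rd_near, rd_far, cd_isolated)
```

For a nearby point the core distance of q dominates, so every point inside a dense neighbourhood gets the same small reachability distance; for a distant point the actual distance dominates.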
This clustering technique is different
from other clustering techniques in the sense that this technique does not
explicitly segment the data into clusters. Instead, it produces a visualization
of Reachability distances and uses this visualization to cluster the data.
Sample OPTICS clustering:
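A small end-to-end run with scikit-learn might look like the following sketch. The dataset (three blobs with different spreads) and the parameter values are assumptions made for illustration. The reachability values are printed rather than plotted so that the example needs no plotting library.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Assumed toy data: three blobs with deliberately different densities.
X, _ = make_blobs(
    n_samples=300,
    centers=np.array([[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]]),
    cluster_std=[0.3, 0.8, 1.5],
    random_state=42,
)

model = OPTICS(min_samples=10, xi=0.05).fit(X)

# Reachability distances arranged in cluster order: this is the data
# behind the reachability plot (valleys correspond to clusters).
order = model.ordering_
reachability = model.reachability_[order]

labels = model.labels_
n_clusters = len(set(labels.tolist()) - {-1})  # -1 marks noise
n_noise = int(np.sum(labels == -1))

print("clusters found:", n_clusters)
print("noise points:", n_noise)
print("first reachability values in cluster order:", reachability[:5])
```

Passing reachability to a bar or line plot, with the position in the ordering on the horizontal axis, gives the reachability plot described above.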
How OPTICS clustering is different from other clustering algorithms:
OPTICS (Ordering Points to Identify the
Clustering Structure) is a density-based clustering algorithm that differs from
other clustering algorithms, such as k-means or hierarchical clustering, in
several ways:
· Density-based: OPTICS is a density-based clustering algorithm, which means it identifies clusters based on the density of data points in the feature space. It does not assume that clusters have a specific shape, size, or number of points.
· Reachability Plot: OPTICS introduces the concept of a reachability plot, which provides a visualization of the clustering structure. The reachability plot shows, for each point in the cluster ordering, the reachability distance at which it was reached from the points processed before it. Valleys in the plot correspond to clusters, so it helps in identifying clusters of varying densities and determining an appropriate clustering threshold.
· Variable Cluster Size: OPTICS can identify clusters of varying sizes and densities, including clusters with irregular shapes. It does not assume that clusters have a uniform size or contain a specific number of points.
· No Need for Predefined Number of Clusters: Unlike k-means (and hierarchical clustering when it is cut to a fixed number of clusters), OPTICS does not require specifying the number of clusters beforehand. The number of clusters follows from the density and connectivity of the data points.
· Robust to Noise and Outliers: OPTICS is robust to noise and outliers because it considers the density of data points. Points in low-density regions are not assigned to any cluster and are instead marked as noise (label -1 in scikit-learn).
· Flexibility in Clustering Parameters: OPTICS provides flexibility in setting clustering parameters such as the number of points in a neighbourhood required for a point to be a core point (min_samples) and the maximum distance between points to be considered in the same neighbourhood (max_eps in scikit-learn). These parameters can be adjusted to control the granularity and sensitivity of the clustering.
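To make the points about the clustering threshold and noise more concrete, the sketch below fits OPTICS once and then cuts the same reachability information at two different levels using scikit-learn's cluster_optics_dbscan function. The data and the two eps values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Assumed toy data: a dense group, a sparser group, and scattered points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0, 0), 0.2, size=(60, 2)),
    rng.normal((5, 5), 0.8, size=(60, 2)),
    rng.uniform(-3, 9, size=(10, 2)),
])

# Fit once; the ordering and reachability values are reused for every cut.
model = OPTICS(min_samples=5).fit(X)

noise_counts = {}
for eps in (0.3, 1.0):
    labels = cluster_optics_dbscan(
        reachability=model.reachability_,
        core_distances=model.core_distances_,
        ordering=model.ordering_,
        eps=eps,
    )
    n_clusters = len(set(labels.tolist()) - {-1})
    noise_counts[eps] = int(np.sum(labels == -1))  # -1 marks noise
    print(f"eps={eps}: {n_clusters} clusters, {noise_counts[eps]} noise points")
```

A lower cut keeps only the densest regions and labels more points as noise, while a higher cut absorbs sparser regions into clusters. Because the ordering is computed only once, trying several thresholds is cheap.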
Summary:
OPTICS (Ordering Points to Identify the Clustering Structure) is a density-based clustering algorithm that offers several advantages over related methods like DBSCAN. It can identify clusters of varying densities and shapes in large, high-dimensional datasets, which makes it particularly useful in analysing complex data. It can also reveal hierarchical relationships within the data, further enhancing the insights gained from clustering analysis. With its implementation in Python through libraries like scikit-learn, OPTICS is easily accessible to data scientists and researchers. It serves as a valuable tool for uncovering hidden structures and patterns in data, making it an important technique in exploratory data analysis, machine learning, and data mining applications.
Reference: https://www.geeksforgeeks.org/ml-optics-clustering-explanation/
ISME Student Doing internship with
Hunnarvi Technologies Pvt Ltd under guidance of Nanobi data and analytics.
Views are personal.