Clustering is used to make sense of large volumes of data, structured or unstructured, by dividing the data into groups. The members of a group are ‘’similar’’ between them and ‘’dissimilar’’ to objects in other groups. The similarity is based on characteristics such as equal distances from a point or people who read the same genre of book. These groups with similar members are called clusters. The various methods of clustering, which we shall be discussing subsequently, help break up data into logical groupings before analyzing the data more deeply.
If a CEO of a company presents a broad question like- ‘’ Help me understand our customers better so that we can improve marketing strategies’’, then the first thing analysts need to do is use clustering methods to the classify customers. Clustering has plenty of application in our daily lives. Some of the domains where clustering is used are:
Clustering can be performed by various methods, as shown in the diagram below:
The two major techniques used to perform clustering are:
Fig 2: Agglomerative clustering follows a bottom-up approach while divisive clustering follows a top-down approach.
The most popular metric used for forming clusters or deciding the closeness of clusters is distance. There are various distance measures. All observations are measured using one particular distance measure and the observation having the minimum distance from a cluster is assigned to it. The different distance measures are:
Distance((x, y), (a, b)) = √(x – a)² + (y – b)²
For example, the Euclidean distance between points (2, -1) and (-2, 2) is found to be
Distance((2, -1), (-2, 2))
This gives the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), Manhattan distance is |x1 – x2| + |y1 – y2|.
Hamming distance between two vectors is the number of bits we must change to convert one into the other. For example, to find the distance between vectors 01101010 and 11011011, we observe that they differ in 4 places. So, the Hamming distance d(01101010, 11011011) = 4
The Minkowski distance between two variables X and Y is defined as
The case where p = 1 is equivalent to the Manhattan distance and the case where p = 2 is equivalent to the Euclidean distance.
These distance measures are used to measure the closeness of clusters in hierarchical clustering.
In the next blogs, we will discuss the different methods of clustering in more details, so make sure you follow DexLab Analytics– we provide the best big data Hadoop certification in Gurgaon. Do check our data analyst courses in Gurgaon.
Interested in a career in Data Analyst?
To learn more about Machine Learning Using Python and Spark – Enrol Now.
To learn more about Data Analyst with SAS Course – Enrol Now.
To learn more about Data Analyst with Apache Spark Course – Enrol Now.
To learn more about Data Analyst with Market Risk Analytics and Modelling Course – Enrol Now.