Unsupervised Learning: Clustering

Anuradha Mohanty
4 min read · May 25, 2021

Clustering is an unsupervised learning technique where we try to find patterns based on similarities in the data.

The two most commonly used types of clustering algorithms are K-Means Clustering and Hierarchical Clustering.

In unsupervised learning, we are not interested in prediction because we do not have a target or outcome variable. The objective is to discover interesting patterns in the data, for example whether there are any subgroups or ‘clusters’ among retail customers.

Practical applications of clustering can be seen in:

  • Retail: Cluster analysis can help a retail chain gain insights into customer demographics, purchase behavior, and demand patterns across locations.
  • Marketing: Cluster analysis can help in market segmentation and positioning, and in identifying test markets for new product development.
  • Social Media: Cluster analysis is used to identify similar communities within larger groups.
  • Medical: Cluster analysis has also been widely used in biology and medical science, for example in human genetic clustering, grouping sequences into gene families, building groups of genes, and clustering organisms at the species level.

Customer segmentation for targeted marketing is one of the most important applications of clustering. The task is to look at patterns in customer data and find segments, and this is where clustering techniques help. They use raw data to form clusters based on common factors among the data points; e.g. in retail, people or products are grouped together on the basis of the similarities and differences between them.

For successful segmentation, the segments formed must be stable, i.e. the same person should not fall under different segments when the data is segmented on the same criteria.

Segments should also have intra-segment homogeneity and inter-segment heterogeneity: members of a segment should be similar to each other and different from members of other segments.

Three main types of segmentation are used for customer segmentation:

  • Behavioral segmentation: Segmentation is based on the actual patterns displayed by the consumer
  • Attitudinal segmentation: Segmentation is based on the beliefs or the intents of people, which may not translate into similar action
  • Demographic segmentation: Segmentation is based on a person’s profile and uses information such as age, gender, residence locality, income, etc.

K-Means Clustering

A clustering algorithm needs to find data points whose values are similar to each other, so that these points end up in the same cluster. Clustering algorithms do this by means of a “distance measure”. The distance measure used in K-means clustering is the Euclidean distance.

Euclidean distance is the length of the straight line joining two points. For two points in two dimensions, X = (X1, X2) and Y = (Y1, Y2), it is given by:

$$D = \sqrt{(X_1 - Y_1)^2 + (X_2 - Y_2)^2}$$

More generally, if X = (X1, ..., Xn) and Y = (Y1, ..., Yn) are two points in n dimensions, the Euclidean distance D is given by:

$$D = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}$$

Essentially, the observations which are closer or more similar to each other would have a low Euclidean distance and the observations which are farther or less similar to each other would have a higher Euclidean distance.
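As a minimal sketch of the distance measure described above, the following Python snippet computes the Euclidean distance between points of any dimension. The function name `euclidean_distance` and the sample points are illustrative assumptions, not part of the original material.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    # Square the difference along each dimension, add them up, take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical observations described by (height in cm, weight in kg).
a = (175, 83)
b = (172, 80)
c = (183, 98)

d_ab = euclidean_distance(a, b)  # a and b are close together
d_ac = euclidean_distance(a, c)  # a and c are farther apart

print(d_ab, d_ac)
```

Here `d_ab` comes out smaller than `d_ac`, which matches the intuition that `a` is more similar to `b` than to `c`.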

A new concept crucial to clustering is the centroid. Centroids are the centre points of the clusters that are being formed.

Data points clustered into 4 clusters without centroid

To compare two clusters, we cannot say by how many units, on average, one cluster differs from another just by looking at the above visualisation. This is where the concept of centroids comes in handy.

Sheet showing centroids

The centroid is calculated by computing the mean of every column (dimension) in the data and then listing the means in the same order as the columns above.

Therefore, Height-mean = (175 + 165 + 183 + 172) / 4 = 173.75
Weight-mean = (83 + 74 + 98 + 80) / 4 = 83.75
Age-mean = (22 + 25 + 24 + 24) / 4 = 23.75

Thus the centroid of the above group of observations is (173.75, 83.75, 23.75).
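The same calculation can be sketched in a few lines of Python. The helper name `centroid` is hypothetical, and the way the values are paired into rows is an assumption, since the original sheet is not reproduced here. The column means do not depend on that pairing.

```python
def centroid(points):
    """Return the column-wise mean of a list of equal-length observations."""
    n = len(points)
    # zip(*points) regroups the rows into columns (one tuple per dimension).
    return tuple(sum(column) / n for column in zip(*points))


# Observations as (height, weight, age); the row pairing is assumed.
observations = [
    (175, 83, 22),
    (165, 74, 25),
    (183, 98, 24),
    (172, 80, 24),
]

cluster_centroid = centroid(observations)
print(cluster_centroid)
```

Two clusters can then be compared dimension by dimension through their centroids, for example by looking at the difference in mean height.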

Reference: MSc Curriculum, LJMU
