Hierarchical Clustering for Location based Strategy using R for E-Commerce

[This article was first published on R – R vs Shubham, and kindly contributed to R-bloggers.]

Hi Folks! This is my first blog post, and I am super excited to share how I used R to work on a location-based strategy at my e-commerce organization.

Please check out r-bloggers.com for more exciting stuff on R

A brief look at the problem statement

I work for an India-based e-commerce organization, an online travel platform for booking hotels and flights. This problem concerns the hotel department.

Each locality in a city behaves differently depending on its features; e.g. the airport zone of a city behaves differently from a central zone near a famous historical site. Separate strategies are therefore required for different areas to monitor and control parameters such as inventory, production and demand.

The data had the latitude and longitude of each hotel, and the task was to identify clusters of these hotels, or what we call hyperlocations.

Let’s Get Started

This is how the data looks

Let’s look at it Visually (I am using Power BI here)

Set of Hotels from Delhi

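Outlier Removal

In the image above, we can see certain hotels outside the city that can create problems while forming clusters. Let’s remove these outliers statistically: compute each hotel’s distance from the city’s mean latitude and longitude, then flag any hotel whose distance exceeds the third quartile plus 1.5 times the interquartile range of those distances.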

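library(geosphere)
library(dplyr)  # needed for filter() below

# Mean latitude and longitude of all hotels in the city
MeanLat <- mean(HotelsCity$latitude, na.rm = TRUE)
MeanLon <- mean(HotelsCity$longitude, na.rm = TRUE)

# Distance (in metres) of every hotel from the mean point.
# distm() expects (longitude, latitude) order, hence the [2:1] reordering.
HotelsLatLon <- HotelsCity[, c("latitude", "longitude")]
MeanLatLon <- data.frame(MeanLat, MeanLon)
Distance_Mat <- distm(HotelsLatLon[2:1], MeanLatLon[2:1], fun = distHaversine)
Distance_Mat <- as.data.frame(Distance_Mat)  # single column, named V1

# Cutoff distance for outlier removal: Q3 + 1.5 * IQR.
# Stored as Dist_IQR so the IQR() function is not shadowed.
Dist_IQR <- IQR(Distance_Mat$V1, na.rm = TRUE)
Cutoff <- as.numeric(quantile(Distance_Mat$V1, 0.75, na.rm = TRUE) + 1.5 * Dist_IQR)

# Assumption: HotelDetail is the hotel data with the distance column attached;
# the original post does not show how it was built.
HotelDetail <- cbind(HotelsCity, Distance_Mat)
HotelDetail$Flag <- ifelse(HotelDetail$V1 > Cutoff, "Incorrect", "Correct")
Outliers_Final <- filter(HotelDetail, Flag == "Incorrect")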
This is how the outliers look when plotted
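
To make the outlier rule concrete, here is a minimal, self-contained sketch on made-up coordinates. It is not the production code: the `hotels` data frame, the `hotel_id` and `dist_m` columns, and the `haversine_m()` helper are all hypothetical, and the helper is written in base R only so that the example runs without any packages (in the actual workflow `geosphere::distHaversine` does this job).

```r
# Haversine distance in metres between points given in decimal degrees.
# Hypothetical helper, written in base R so the sketch needs no packages.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Made-up coordinates: eight hotels close together and one far outside the city
hotels <- data.frame(
  hotel_id  = 1:9,
  latitude  = c(28.61, 28.62, 28.63, 28.64, 28.60, 28.65, 28.63, 28.62, 29.40),
  longitude = c(77.20, 77.21, 77.22, 77.23, 77.21, 77.20, 77.19, 77.23, 78.10)
)

# Distance of every hotel from the mean latitude/longitude
hotels$dist_m <- haversine_m(
  hotels$latitude, hotels$longitude,
  mean(hotels$latitude), mean(hotels$longitude)
)

# Cutoff: third quartile plus 1.5 times the interquartile range
cutoff <- unname(quantile(hotels$dist_m, 0.75)) + 1.5 * IQR(hotels$dist_m)

hotels$Flag <- ifelse(hotels$dist_m > cutoff, "Incorrect", "Correct")
hotels_valid <- hotels[hotels$Flag == "Correct", ]
```

One thing to keep in mind with this rule is that the mean point is itself pulled towards the outliers, so for heavily skewed data a median-based centre may be worth trying instead.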

Clustering

After cleaning the data (outlier removal), let’s create a distance matrix, i.e. the distance of each hotel from every other hotel. I am doing this using the geosphere library in R.

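# Distance matrix for the city, built only from hotels that passed the outlier check.
# Assumption: HotelCity_Valid is the "Correct" subset; the post does not show this step.
HotelCity_Valid <- filter(HotelDetail, Flag == "Correct")
HotelsLatLon <- HotelCity_Valid[, c("latitude", "longitude")]
Distance_Mat <- distm(HotelsLatLon[2:1], HotelsLatLon[2:1], fun = distHaversine)
Distance_Mat[is.na(Distance_Mat)] <- 0  # missing distances are treated as 0
DMat <- as.dist(Distance_Mat)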
This is what the distance matrix looks like.

Let’s Create Clusters now.

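# Hierarchical clustering with complete linkage
hc <- hclust(DMat, method = "complete")
# AvgDist2 is the city-specific cutoff height, in metres (defined elsewhere)
HotelCity_Valid$Clusters <- cutree(hc, h = AvgDist2)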

In the above code snippet, the cutree function uses a cutoff distance that I set differently for each city. How I arrived at that distance is a different science altogether. In this case the cutoff is around 2 km (the distances are in metres, so h is about 2000). Because the clustering uses complete linkage, no two hotels in the same cluster are more than 2 km apart, which means each cluster has a diameter of at most roughly 2 km.
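
The relationship between the cut height and the cluster diameter can be checked on a toy example. This is only a sketch under stated assumptions: the coordinates are made up, `cutoff_m` is a hypothetical stand-in for the per-city cutoff, and `haversine_m()` is a hypothetical base R helper used so the example runs without geosphere.

```r
# Haversine distance in metres (hypothetical base R helper)
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Made-up hotels: two tight groups a few kilometres apart
pts <- data.frame(
  latitude  = c(28.6130, 28.6150, 28.6170, 28.6560, 28.6580, 28.6600),
  longitude = c(77.2290, 77.2300, 77.2310, 77.2410, 77.2420, 77.2400)
)

# Pairwise distance matrix: every hotel against every other hotel
idx <- seq_len(nrow(pts))
dist_mat <- outer(idx, idx, function(i, j) {
  haversine_m(pts$latitude[i], pts$longitude[i],
              pts$latitude[j], pts$longitude[j])
})

# Complete-linkage clustering, cut at a fixed height
hc <- hclust(as.dist(dist_mat), method = "complete")
cutoff_m <- 2000  # hypothetical cutoff height in metres
pts$Clusters <- cutree(hc, h = cutoff_m)

# Largest pairwise distance inside each cluster
cluster_diameter <- sapply(split(idx, pts$Clusters), function(members) {
  max(dist_mat[members, members])
})
```

With complete linkage, `cluster_diameter` never exceeds `cutoff_m`. That guarantee does not hold for other linkage methods such as single or average, so the choice of method matters if the cutoff is meant to be read as a maximum cluster size.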

This is what these clusters look like when plotted.

How did I name these localities? A system name was tagged to each hotel’s locality, and I used the most frequent name in each cluster as the cluster name.
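
The naming step could look something like the sketch below. The data is made up, and the `locality` and `ClusterName` column names are assumptions for illustration; only `Clusters` comes from the code above.

```r
# Made-up data: cluster labels and the system locality name of each hotel
hotels <- data.frame(
  Clusters = c(1, 1, 1, 2, 2, 3),
  locality = c("Paharganj", "Paharganj", "Karol Bagh",
               "Aerocity", "Aerocity", "Connaught Place"),
  stringsAsFactors = FALSE
)

# Most frequent locality name per cluster.
# table() sorts names alphabetically and which.max() takes the first maximum,
# so ties go to the alphabetically first name.
cluster_names <- tapply(hotels$locality, hotels$Clusters, function(x) {
  names(which.max(table(x)))
})

# Attach the chosen name to every hotel in the cluster
hotels$ClusterName <- as.vector(cluster_names[as.character(hotels$Clusters)])
```

Two clusters can end up with the same most frequent name, so it may be worth checking `cluster_names` for duplicates before using it as a label.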

Please reach out to me at [email protected] for any kind of queries regarding this.
