# Hierarchical Clustering for Location based Strategy using R for E-Commerce

This article was first published on **R – R vs Shubham** and kindly contributed to R-bloggers.


Hi folks! This is my first blog post, and I am super excited to share how I used R to work on a location-based strategy at my e-commerce organization.

Please check out r-bloggers.com for more exciting stuff on R

## **Just a little brief about the problem statement**

I work for an India-based e-commerce organization (an online travel platform) for booking hotels and flights. This problem concerns the Hotel department.

Each locality in a city behaves differently depending on its features. For example, the Airport Zone of a city behaves differently from a Central Zone in the vicinity of a famous historical site. Separate strategies are therefore required for different areas to monitor and control parameters such as Inventory, Production and Demand.

The data had the latitude and longitude of each hotel, and the task was to identify clusters of these hotels, or what we call hyperlocations.

**Let’s Get Started**

This is how the data looks

Let’s look at it visually (I am using Power BI here)

**Outlier Removal**

In the image above, we can see that certain hotels lie outside the city and can create problems while forming clusters. Let’s remove these outliers statistically.

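```r
library(geosphere)
library(dplyr)  # provides filter()

# Mean latitude and longitude of all hotels in the city
MeanLat <- mean(HotelsCity$latitude, na.rm = TRUE)
MeanLon <- mean(HotelsCity$longitude, na.rm = TRUE)

# Distance (in metres) of every hotel from the mean lat/lon.
# Columns 4 and 5 are taken to be latitude and longitude; [2:1] swaps them
# because distm() expects (longitude, latitude).
HotelsLatLon <- HotelsCity[, c(4, 5)]
MeanLatLon <- data.frame(MeanLat, MeanLon)
Distance_Mat <- distm(HotelsLatLon[2:1], MeanLatLon[2:1], fun = distHaversine)
Distance_Mat <- as.data.frame(Distance_Mat)

# Cutoff distance for outlier removal: Q3 + 1.5 * IQR
# (stored as DistIQR so the IQR() function is not shadowed)
DistIQR <- IQR(as.numeric(Distance_Mat[, 1]), na.rm = TRUE)
Cutoff <- as.numeric(quantile(Distance_Mat$V1, 0.75, na.rm = TRUE) + DistIQR * 1.5)

# HotelDetail is not defined in the original snippet; it is assumed here
# to be the hotel data with the distance column V1 attached.
HotelDetail <- cbind(HotelsCity, Distance_Mat)
HotelDetail$Flag <- ifelse(HotelDetail$V1 > Cutoff, "Incorrect", "Correct")
Outliers_Final <- filter(HotelDetail, Flag == "Incorrect")
```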
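To see the cutoff rule in isolation, here is a minimal base-R sketch on made-up coordinates. The `hotels` data frame, its values, and the helper `haversine_m` are all hypothetical. The helper is a hand-written stand-in for `geosphere::distHaversine` so that the example runs without extra packages.

```r
# Haversine distance in metres between (lon, lat) points.
# Base-R stand-in for geosphere::distHaversine (same default radius, 6378137 m).
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical data: 50 hotels scattered around a made-up city centre,
# plus two hotels placed far outside it (rows 51 and 52).
set.seed(42)
hotels <- data.frame(
  latitude  = c(28.60 + rnorm(50, sd = 0.02), 29.40, 27.90),
  longitude = c(77.20 + rnorm(50, sd = 0.02), 77.20, 78.10)
)

# Distance of every hotel from the mean coordinate
hotels$dist_m <- haversine_m(hotels$longitude, hotels$latitude,
                             mean(hotels$longitude), mean(hotels$latitude))

# Upper fence: third quartile plus 1.5 times the interquartile range
cutoff <- unname(quantile(hotels$dist_m, 0.75)) + 1.5 * IQR(hotels$dist_m)
hotels$flag <- ifelse(hotels$dist_m > cutoff, "Incorrect", "Correct")

table(hotels$flag)
```

Only the upper fence is used because a hotel can only be an outlier by being too far from the centre, never by being too close to it.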

## Clustering

After cleaning the data (outlier removal), let’s create a distance matrix, i.e. the distance of each hotel from every other hotel. I am doing this using the geosphere library in R.

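```r
# Distance matrix for the city: distance (in metres) of each hotel from every other hotel
Distance_Mat <- distm(HotelsLatLon[2:1], HotelsLatLon[2:1], fun = distHaversine)
Distance_Mat <- as.data.frame(Distance_Mat)

# Treat missing distances as 0, then convert to a "dist" object for hclust()
Distance_Mat[is.na(Distance_Mat)] <- 0
DMat <- as.dist(Distance_Mat)
```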

**Let’s Create Clusters now.**

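```r
# Hierarchical clustering with complete linkage
hc <- hclust(DMat, method = "complete")

# Cut the tree at the city-specific cutoff distance AvgDist2 (in metres)
HotelCity_Valid$Clusters <- cutree(hc, h = AvgDist2)
```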

In the cutree call above, I use a different cutoff distance for each city. How I arrived at that distance is a different science altogether; in this case the cutoff is around 2 km. Because the clustering uses complete linkage, no two hotels in the same cluster are farther apart than the cutoff, so each cluster has a diameter of at most roughly 2 km.
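The link between the cutoff height and cluster diameter can be checked on toy data. The sketch below is only an illustration under assumptions: the points are random planar coordinates in metres, Euclidean `dist()` stands in for the Haversine matrix, and `h_cut` is a made-up 2 km cutoff.

```r
set.seed(1)

# Hypothetical planar coordinates in metres (not real hotel data)
pts <- cbind(x = runif(40, 0, 10000), y = runif(40, 0, 10000))

d  <- dist(pts)                          # Euclidean stand-in for the Haversine matrix
hc <- hclust(d, method = "complete")     # complete linkage, as in the post

h_cut <- 2000                            # assumed cutoff of 2 km, in metres
clusters <- cutree(hc, h = h_cut)

# Largest pairwise distance inside each cluster (0 for single-point clusters)
dm <- as.matrix(d)
diameters <- sapply(split(seq_len(nrow(pts)), clusters),
                    function(idx) max(dm[idx, idx]))

# With complete linkage, no cluster is wider than the cut height
max(diameters) <= h_cut
```

Other linkage methods such as `"single"` or `"average"` do not give this guarantee, which is one reason complete linkage suits a "clusters of about this size" requirement.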

This is what the different clusters look like when plotted

How did I name these localities? Each hotel’s locality had a system name tagged to it, and I used the most frequent name within a cluster as the cluster name.
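The naming step can be sketched in a few lines of base R. The data frame, the `locality` column, and the helper `most_frequent` below are hypothetical and exist only to illustrate picking the most common name per cluster.

```r
# Hypothetical hotels with a cluster id and a system locality name
hotels <- data.frame(
  Clusters = c(1, 1, 1, 2, 2, 2, 2, 3),
  locality = c("Airport Zone", "Airport Zone", "Aerocity",
               "Old Town", "Fort Area", "Fort Area", "Fort Area",
               "Lake Side"),
  stringsAsFactors = FALSE
)

# Most frequent value in a character vector.
# table() sorts names alphabetically, so ties go to the alphabetically first name.
most_frequent <- function(x) names(which.max(table(x)))

# One name per cluster, as a character vector named by cluster id
cluster_names <- vapply(split(hotels$locality, hotels$Clusters),
                        most_frequent, character(1))

# Attach the cluster name back to every hotel
hotels$ClusterName <- unname(cluster_names[as.character(hotels$Clusters)])

hotels
```

How ties are broken is a design choice; a real pipeline might prefer a different rule when two names are equally common.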

Please reach out to me at [email protected] for any kind of queries regarding this.
