Hello, today I am going to do an EDA (exploratory data analysis) on AirBnB listings in the New York area. The data set is available on Kaggle:

https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

Let’s read the data into R and take a look at it.
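As a minimal sketch of this step, something like the following would read the CSV and report its size. The helper name `read_listings` and the file name `AB_NYC_2019.csv` are assumptions for illustration; point `csv_path` at wherever the Kaggle download was saved.

```r
# Hypothetical helper: read the listings CSV and report how big it is.
read_listings <- function(path) {
  listings <- read.csv(path, stringsAsFactors = FALSE)
  message(nrow(listings), " rows, ", ncol(listings), " columns")
  listings
}

# Assumed file name for the Kaggle download; adjust as needed.
csv_path <- "AB_NYC_2019.csv"
if (file.exists(csv_path)) {
  airbnb <- read_listings(csv_path)
  str(airbnb)  # column names and types
}
```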

So I can see there are 17 columns and over 48,000 records, with information covering the price and location of each AirBnB. Looking at the location column, let’s see where most of the AirBnBs in New York are.
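A rough sketch of counting listings per neighbourhood is below. It uses a tiny made-up table in place of the real data, and the column name `neighbourhood` is an assumption about the CSV.

```r
# Toy stand-in for the listings table (made-up rows, assumed column name).
listings <- data.frame(
  neighbourhood = c("Williamsburg", "Williamsburg", "Bedford-Stuyvesant",
                    "Harlem", "Williamsburg", "Harlem"),
  stringsAsFactors = FALSE
)

# Count listings per neighbourhood and keep the 30 most common.
counts <- sort(table(listings$neighbourhood), decreasing = TRUE)
top_counts <- head(counts, 30)
print(top_counts)

# Horizontal bar chart, drawn only in an interactive session.
if (interactive()) {
  barplot(rev(top_counts), horiz = TRUE, las = 1,
          xlab = "Number of listings")
}
```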

The plot of the top 30 neighbourhoods shows that most of the AirBnBs are located in the Williamsburg and Bedford-Stuyvesant areas of Brooklyn. Manhattan areas are also very prevalent in the top 30. Now let’s look at how the price of an AirBnB changes by area.

Looking at how the price varies by location doesn’t really show much at the moment, and it suggests the most expensive AirBnB is somewhere in Brooklyn. Intuitively this doesn’t seem correct to me.

A quick histogram of the price shows there may be some questionable data within the AirBnB data set. I would say it looks like 90% of the AirBnBs in the New York area are less than $1,000 per night. There are some that cost more, but because they are so much more expensive than the others in the data set, I suspect this is bad data. In fact, if we look at the most expensive listings, some are shared rooms. No one is going to pay $10,000 for that. This is bad data and must be removed.
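The checks described above can be sketched roughly as follows. The data here is a made-up toy table, and the column names `price` and `room_type` are assumptions about the CSV.

```r
# Toy stand-in for the listings table (made-up rows, assumed column names).
listings <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Shared room",
                "Private room", "Entire home/apt"),
  price = c(150, 80, 10000, 60, 225),
  stringsAsFactors = FALSE
)

# Share of listings priced below $1,000 per night.
share_below <- mean(listings$price < 1000)
print(share_below)

# Most expensive listings first, to see which room types they are.
top_priced <- listings[order(-listings$price), ]
print(head(top_priced, 3))

# Histogram of price, drawn only in an interactive session.
if (interactive()) {
  hist(listings$price, breaks = 50, main = "Price per night", xlab = "Price ($)")
}
```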

After removing the bad data and redoing the plot, there is a much clearer picture. There seems to be a high density of expensive AirBnBs in upper Manhattan. This shows all listings, however, and the price probably differs by room type. It’s not so surprising, though, as the higher prices seem to be in the tourist hot spots. Moving on, let’s start building a model to predict the price.
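One way the filtering could look is sketched below. The $1,000 cutoff is my assumption (the post doesn’t state the exact rule used), and the data and column names are made up for illustration.

```r
# Toy stand-in for the listings table (made-up rows, assumed column names).
listings <- data.frame(
  neighbourhood_group = c("Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"),
  price = c(250, 9999, 120, 10000, 80),
  stringsAsFactors = FALSE
)

# Assumed cutoff: treat anything at or above $1,000 per night as bad data.
price_cap <- 1000
clean <- listings[listings$price < price_cap, ]

# Median price per area on the cleaned data.
median_by_area <- aggregate(price ~ neighbourhood_group, data = clean, FUN = median)
print(median_by_area)
```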

I am just going to use simple linear modelling, but I have split the data set into training and testing sets for the model. The first variable I am going to use is room type: whether the listing is a shared room, an entire dwelling or a private room.
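A minimal sketch of the split and the first model is below. It runs on simulated prices rather than the real data, and the 80/20 split and the seed are assumptions, since the post doesn’t say what proportions were used.

```r
set.seed(42)  # assumed seed, only so the toy example is reproducible

# Simulated listings: price depends on room type plus noise.
n <- 300
room_type <- factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                           n, replace = TRUE))
base_price <- c("Entire home/apt" = 200, "Private room" = 90, "Shared room" = 70)
price <- unname(base_price[as.character(room_type)]) + rnorm(n, sd = 40)
listings <- data.frame(room_type, price)

# Assumed 80/20 split into training and testing sets.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- listings[train_idx, ]
test <- listings[-train_idx, ]

# Linear model of price on room type; "Entire home/apt" is the baseline level.
fit <- lm(price ~ room_type, data = train)
print(coef(fit))            # differences relative to an entire home
print(summary(fit)$sigma)   # residual standard error, in dollars
```

The coefficients for the other room types are read as differences from the baseline level, which is how the "cheaper than an entire home" figures come about.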

The summary of the first fitted model shows that, compared to an entire home, private rooms are about $114 cheaper per night and a shared room is about $130 cheaper per night. The bad part is that the residual error is pretty poor at 128, which means the model’s predictions are typically about $128 off.

I re-ran the model with neighbourhood, number of reviews and reviews per month included. The summary of that model shows the variables that affect the price the most, in a negative and a positive way. Room type seems to have the biggest negative effect. Tribeca and Flatiron seem to be the most sought-after locations. The residual error is down to 105.7.

In order to improve it I will need to do some feature engineering. The first thing I am going to look at is the listing description, which is the name variable in the data. I wonder if there are some words in there that highlight more expensive homes. One word I can think of is luxury. Anything described as luxury sounds more expensive than something that isn’t. I have used the string detect function on the name variable in order to find properties with luxury in their name. Adding that to the linear model shows that properties with luxury in the name are often $59 higher per night than others.
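The luxury flag can be sketched like this. I use base R’s `grepl` here so the snippet runs without extra packages; it stands in for the string detect function mentioned above. The example names and the column names in the formula are assumptions.

```r
# Toy listing names (made up); NA mimics a listing with no name.
listings <- data.frame(
  name = c("Luxury loft in Tribeca", "Cozy room near park",
           "LUXURY 2BR with view", NA, "Spacious studio"),
  stringsAsFactors = FALSE
)

# TRUE when the name contains "luxury" in any case; grepl gives FALSE for NA.
listings$luxury <- grepl("luxury", listings$name, ignore.case = TRUE)
print(listings)

# Assumed shape of the extended model formula, with the new flag added.
price_formula <- price ~ room_type + neighbourhood + number_of_reviews +
  reviews_per_month + luxury
print(price_formula)
```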

What’s clear from counting the words in the listing names is that a lot of people like highlighting where in New York the place is. Other words that can be used for the model are cozy, spacious, beautiful and large. I’m going to add clean to the list as well.
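Counting words in the names and turning the chosen ones into flags might look roughly like this. The names are made up, and splitting on anything that isn’t a letter is a simplification of whatever tokenising the original analysis used.

```r
# Toy listing names (made up).
listing_names <- c("Cozy room in Williamsburg",
                   "Spacious cozy loft",
                   "Beautiful large apartment in Brooklyn")

# Lower-case, split on non-letters, and count how often each word appears.
words <- unlist(strsplit(tolower(listing_names), "[^a-z]+"))
words <- words[nzchar(words)]
word_counts <- sort(table(words), decreasing = TRUE)
print(head(word_counts, 10))

# One TRUE/FALSE column per keyword, ready to add to the model data.
keywords <- c("cozy", "spacious", "beautiful", "large", "clean")
flags <- sapply(keywords, function(w) grepl(w, listing_names, ignore.case = TRUE))
print(flags)
```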

Now I can see that, apart from luxury, none of the other words in the descriptions have much effect on the price. Cozy is maybe a euphemism for the property being small, which would explain why it has a negative effect on the price. Now that I have my model, let’s apply it to the unseen data and see how it does.

Running the model on the unseen data gave some surprising results. There are a few places with negative predicted prices. This clearly isn’t correct, so I may have to model the log of the actual price instead. Taking the log dramatically improves the predictive power of the model: the residual error is now 0.614, with an R-squared of 0.52. Bear in mind that 0.614 is on the log scale, so it isn’t directly comparable with the earlier dollar figures.
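Here is a small sketch of the log-price approach on simulated data (the numbers are made up and only room type is used as a predictor). Because predictions are exponentiated back to dollars, they can never be negative.

```r
set.seed(1)  # assumed seed, only so the toy example is reproducible

# Simulated listings with right-skewed prices.
n <- 200
room_type <- factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                           n, replace = TRUE))
log_mean <- c(5.3, 4.4, 4.0)[as.integer(room_type)]
price <- exp(log_mean + rnorm(n, sd = 0.5))
listings <- data.frame(room_type, price)

# Fit on log(price) instead of price.
fit_log <- lm(log(price) ~ room_type, data = listings)
print(summary(fit_log)$sigma)      # residual standard error, on the log scale
print(summary(fit_log)$r.squared)

# Predict for new listings and back-transform to dollars with exp().
new_listings <- data.frame(
  room_type = factor(c("Private room", "Shared room"), levels = levels(room_type))
)
pred <- exp(predict(fit_log, newdata = new_listings))
print(pred)
```

One caveat with this design: exponentiating a prediction made on the log scale gives something closer to a typical (median) price than an average price, so very expensive listings will still tend to be under-predicted.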

The results of the log-transformed model show there are still some higher-priced rentals that the model doesn’t really predict. However, let’s use it to find the most over- and under-priced listings (according to the model).
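Ranking listings by how far the actual price sits from the predicted price could be done along these lines. The table, the column names `predicted` and `diff`, and the numbers are all made up for illustration.

```r
# Toy table of actual and predicted prices (made-up numbers).
listings <- data.frame(
  id = 1:5,
  price = c(100, 900, 150, 60, 300),
  predicted = c(120, 200, 140, 180, 310)
)

# Positive diff: priced above the model's prediction; negative: below it.
listings$diff <- listings$price - listings$predicted

overpriced <- listings[order(-listings$diff), ]   # most overpriced first
underpriced <- listings[order(listings$diff), ]   # most underpriced first

print(head(overpriced, 10))
print(head(underpriced, 10))
```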

Looking at the listing that is most expensive compared to its predicted price, and following the link to it on AirBnB, all I can say is wow. It looks incredible. Pictures are something that could never be in the model.

The 10 most undervalued listings are all located in Manhattan and seem to be split between the Tribeca district and Midtown. That’s it for today’s blog. It was slightly longer than normal, but I hope you enjoyed it.