POWER CHROME AMAZING
Johnna Ayres, Bill Best, Mark Fridson, Trent Jerde, Marshall Yi
Ever searched for an apartment and enjoyed it? Probably not, especially in New York City. Would it be helpful if more information were available about specific apartment listings? Hell yeah.
Kaggle is a platform for data science competitions. Our team, Power Chrome Amazing, entered the Kaggle competition hosted by Two Sigma and RentHop. RentHop uses math and technology to sort apartment listings by quality. They face a challenge, however: how can they predict the Interest Level of apartment listings?
In this Kaggle competition, the training dataset included details of some 49,000 apartment listings in the New York City area, along with ratings of whether each listing generated a High, Medium, or Low Interest Level. The Interest Level was based on the number of inquiries for a particular listing.
The testing dataset included the details of some 75,000 apartment listings, but without the ratings for Interest Level. The goal of the project was to use the information provided in the training dataset to predict the probabilities that a given listing in the testing dataset would receive a High, Medium, or Low Interest Level. Predictions of this kind could help RentHop anticipate how much interest a new listing will attract.
Exploratory Data Analysis
We first examined the data using exploratory data analysis (EDA) to answer questions like the following:
- In the training dataset, what proportion of the listings fell under High, Medium, and Low Interest Levels?
- What variables are used to describe the listings, such as Price, Features, and Photos; and what do the distributions of the variables look like?
- Are there outliers in the data, and if so, how might we account for them in our analysis?
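As a rough illustration of the first and third questions, here is a minimal Python sketch that computes the share of listings in each Interest Level and flags price outliers. The toy data, the function names, and the choice of Tukey's 1.5 x IQR rule are all assumptions made for the example; they are not taken from our actual analysis.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical toy data; the real training set has some 49,000 listings.
interest_levels = ["low", "low", "medium", "high", "low", "medium", "low", "low"]
prices = [2400, 3100, 2950, 1800, 3500, 2700, 4200, 1150000]

def class_proportions(labels):
    """Return the share of listings in each Interest Level."""
    counts = Counter(labels)
    total = len(labels)
    return {level: count / total for level, count in counts.items()}

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

proportions = class_proportions(interest_levels)
outliers = iqr_outliers(prices)
```

In this toy example the one implausible rent is flagged; whether to drop, cap, or keep such values is a separate modeling decision.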
Below are a few of the topics we considered.
Locations of Apartment Units
In real estate, location matters. Figures 1-3 below show the locations of the apartment units having High Interest Level (top, green), Medium Interest Level (middle, blue), and Low Interest Level (bottom, red).
Figure 1. Locations of High Interest Level units (green). Note that these listings are especially prevalent in lower Manhattan.
Figure 2. Locations of Medium Interest Level units (blue).
Figure 3. Locations of Low Interest Level units (red).
What do you look for in an apartment? Landlords usually provide descriptions of an apartment for rent. In the Kaggle datasets, each apartment comes with a list of Features that describes it. We used decision trees on the training dataset to determine the 69 most frequent attributes and clustered them into 19 predictors. The three most important Features were (i) Fee: No Fee or Low Fee; (ii) Prewar; and (iii) Pets. These Features are illustrated in Figure 4 below. The two least important Features were Playroom and Proximity to Subway.
Figure 4. Important Features derived from decision trees. The doggy belongs to one of the authors.
A typical apartment listing includes a variety of written descriptions about the unit. For example, is there a doorman? Are subways nearby? Are laundry facilities on the property? We used a technique called term frequency – inverse document frequency (tf-idf) to explore the word counts within the Features column of the training dataset. The tf-idf technique provides a numerical statistic that indicates the importance of a word in a document. Below, we show the top 50 words that appeared in apartment listings with High (top), Medium (middle) and Low (bottom) Interest Levels. Words in larger font received a higher score for importance.
Figure 5. In the High Interest Level group (above), words such as “diplomats,” “delivery,” and “rentable” are important.
Figure 6. In the Medium Interest Level units (above), words such as “health,” “filtration,” and “speaker” have higher scores.
Figure 7. In the Low Interest Level group (above), “lobby,” “assigned,” and “attended” are considered important.
Overall, these word plots are informative about trends in the descriptions of apartment units. In our machine learning models, therefore, we included the presence or absence of key Features.
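For readers unfamiliar with tf-idf, the following is a minimal sketch of the textbook formula, tf x log(N / df), in plain Python. The feature lists and the function name are hypothetical, and libraries typically use smoothed variants of the formula, so the scores here will not match the ones behind our word plots.

```python
import math
from collections import Counter

# Hypothetical feature lists, one "document" per listing.
listings = [
    ["doorman", "elevator", "laundry"],
    ["doorman", "no fee", "pets"],
    ["laundry", "prewar", "doorman"],
]

def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(listings)
```

A term that appears in every listing, such as "doorman" here, scores zero, while a term unique to one listing scores highest. That is why rare words can dominate plots like the ones above.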
Rental Price and Location
Of the variables that influence the Interest Level of an apartment listing, two stand out: Price and Location. Indeed, most people have prior knowledge and beliefs regarding how much they want to pay for rent and what neighborhoods they want to live in. In accordance with this intuition, our tree models consistently showed that the best scoring predictors, as indicated by the Gini Index, were Price, Latitude, and Longitude. Latitude and longitude, of course, reflect the location of a listing. Thus, Price and Location are two strong predictors of the Interest Level generated by a given apartment.
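To make the Gini Index concrete, here is a small sketch of how a tree scores a single split on Price. The prices, labels, threshold, and function names are hypothetical; real tree libraries search over many thresholds and sum these decreases across every split that uses a given predictor.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_decrease(values, labels, threshold):
    """Impurity drop from splitting the data on values <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    total = len(labels)
    weighted = ((len(left) / total) * gini_impurity(left)
                + (len(right) / total) * gini_impurity(right))
    return gini_impurity(labels) - weighted

# Hypothetical listings: monthly price and Interest Level.
prices = [1900, 2200, 2500, 3800, 4500, 5200]
levels = ["high", "high", "medium", "low", "low", "low"]

decrease = gini_decrease(prices, levels, threshold=3000)
```

A predictor that produces large decreases across many splits ranks as important, which is the sense in which Price, Latitude, and Longitude scored best.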
We initially tried to characterize the photos by image hue, saturation, and lightness, hoping that these image characteristics could help us predict Interest Levels. Figures are shown below. Don’t squint, since we didn’t find much of immediate value in this analysis.
Figure 8. Image hue. The different colors represent the three Interest Levels. There is not much difference.
Figure 9. Image saturation.
Figure 10. Image lightness.
Our preliminary analysis of the photos suggested that the images do not provide enough information to be useful predictors of Interest Level because they are too similar to each other across the listings. However, if given more time to work on the data, we feel that the photos probably do provide information that could be used to improve our models. This element of the project is a work in progress.
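For the curious, one simple way to reduce a photo to hue, saturation, and lightness is sketched below using only Python's standard library. The pixel list and the function name are hypothetical, and this is not necessarily how our own summaries were computed; in practice the pixels would come from an image library.

```python
import colorsys
import math

def summarize_photo(pixels):
    """Return mean hue, saturation, and lightness for (R, G, B) pixels in 0-255."""
    hue_x = hue_y = saturation = lightness = 0.0
    for r, g, b in pixels:
        h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
        # Hue is an angle, so accumulate it as a point on the unit circle.
        hue_x += math.cos(2 * math.pi * h)
        hue_y += math.sin(2 * math.pi * h)
        saturation += s
        lightness += l
    n = len(pixels)
    mean_hue = (math.atan2(hue_y, hue_x) / (2 * math.pi)) % 1.0
    return {"hue": mean_hue, "saturation": saturation / n, "lightness": lightness / n}

# Hypothetical 2x2 "photo": two pure red pixels, one white, one black.
photo = [(255, 0, 0), (255, 0, 0), (255, 255, 255), (0, 0, 0)]
summary = summarize_photo(photo)
```

One caveat of this sketch: gray, white, and black pixels have no real hue, and colorsys reports them as hue 0, so they pull the average toward red. Collapsing a whole photo into three numbers also discards most of what makes one apartment photo different from another.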
Machine Learning Algorithms
We trained several machine learning models on the predictors in the training dataset to predict the Interest Levels in the testing dataset. The results of these models were then combined into an ensemble model, which yielded the final prediction of Interest Level in the testing dataset. We deployed the following machine learning methods:
- K Nearest Neighbors (KNN), a non-parametric method for regression and classification.
- Decision Trees, a method of creating tree-like graphs that model the decision steps in a learning situation.
- K-means Clustering, an unsupervised learning method that we used to group values of latitude and longitude into clusters that generally reflect neighborhood boundaries in New York City.
- XGBoost (Extreme Gradient Boosting), a fast implementation of gradient-boosted decision trees that has dominated Kaggle competitions in recent years.
- Logistic Regression, a regression model where the dependent variable is categorical. In the current datasets, the categorical variable is Interest Level and has three levels: High, Medium, or Low.
- Support Vector Machine (SVM), a discriminative classifier that takes input data and creates a hyperplane that can be used to categorize new data.
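To show what clustering latitude and longitude looks like in practice, here is a bare-bones sketch of K-means (Lloyd's algorithm) in plain Python. The coordinates, the number of clusters, and the function name are hypothetical, and treating degrees of latitude and longitude as plain Euclidean coordinates is a simplification that is tolerable only over a small area such as one city.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments

# Hypothetical (latitude, longitude) pairs: three downtown, three uptown.
coords = [
    (40.7128, -74.0060), (40.7140, -74.0050), (40.7115, -74.0075),
    (40.7870, -73.9754), (40.7880, -73.9760), (40.7860, -73.9745),
]
centroids, labels = kmeans(coords, k=2)
```

The resulting cluster index can then be fed to the other models as a categorical "neighborhood" predictor.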
Ensemble Model Created with Support Vector Machine (SVM)
The machine learning models used as input to the ensemble were KNN, Decision Trees, XGBoost, and SVM.
The parameters were the following:
SVM Kernel: Radial; Cost: 1; Gamma: 0.08333333; Epsilon: 0.1
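To give a sense of what the Radial kernel and Gamma mean, the sketch below computes the radial basis function, exp(-gamma x squared distance), for two listings. Only the Gamma value comes from the parameters above (it is approximately 1/12); the input vectors, which stand in for stacked class probabilities from two base models, and the function name are hypothetical.

```python
import math

GAMMA = 0.08333333  # value reported above; approximately 1/12

def rbf_kernel(x, y, gamma=GAMMA):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Hypothetical stacked inputs: (High, Medium, Low) probabilities
# from two base models, concatenated into one vector per listing.
listing_a = [0.10, 0.30, 0.60, 0.15, 0.25, 0.60]
listing_b = [0.70, 0.20, 0.10, 0.65, 0.25, 0.10]

similarity_same = rbf_kernel(listing_a, listing_a)
similarity_diff = rbf_kernel(listing_a, listing_b)
```

The kernel equals 1 for identical inputs and shrinks toward 0 as two listings' inputs move apart. A larger Gamma makes that decay faster, so the SVM's decision boundary becomes more local.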
Overall, the ensemble model produced a good level of predictive accuracy.
Our analysis enabled us to predict the Interest Level that an apartment listing will generate given variables such as Price, Location, and Features. This data science approach will help companies like RentHop make smart decisions regarding how to efficiently market apartments, since they can now predict the expected Interest Level of an apartment. This information could also potentially aid people who are looking for an apartment, since they can use data-driven rental practices to evaluate rentals, rather than skimming through Craigslist and hoping not to get hornswoggled.
The post Predicting RentHop Apartment Listings in the Two Sigma Kaggle Competition appeared first on NYC Data Science Academy Blog.