In this post we’re going to model the prices of Airbnb apartments in London. In other words, the aim is to build our own price suggestion model. We will be using data from http://insideairbnb.com/ which we collected in April 2018. This work is inspired by the Airbnb price prediction model built by Dino Rodriguez, Chase Davis, and Ayomide Opeyemi. Normally we would be doing this in R, but we thought we’d try our hand at Python for a change.
We present a shortened version here, but the full version is available on our GitHub.
First, we import the listings from the gzipped CSV file.
import pandas as pd

listings_file_path = 'listings.csv.gz'
listings = pd.read_csv(listings_file_path, compression="gzip", low_memory=False)
listings.columns
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary', 'space', 'description', 'experiences_offered', 'neighborhood_overview', 'notes', 'transit', 'access', 'interaction', 'house_rules', 'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'street', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market', 'smart_location', 'country_code', 'country', 'latitude', 'longitude', 'is_location_exact', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet', 'price', 'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights', 'calendar_updated', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'calendar_last_scraped', 'number_of_reviews', 'first_review', 'last_review', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'requires_license', 'license', 'jurisdiction_names', 'instant_bookable', 'cancellation_policy', 'require_guest_profile_picture', 'require_guest_phone_verification', 'calculated_host_listings_count', 'reviews_per_month'], dtype='object')
The data has 95 columns or features. Our first step is to perform feature selection to reduce this number.
Selection on Missing Data
Features that have a high number of missing values aren’t useful for our model so we should remove them.
import matplotlib.pyplot as plt
%matplotlib inline

# Fraction of missing values per feature (shape[0] is the number of rows)
percentage_missing_data = listings.isnull().sum() / listings.shape[0]
ax = percentage_missing_data.plot(kind='bar', color='#E35A5C', figsize=(16, 5))
ax.set_xlabel('Feature')
ax.set_ylabel('Percent Empty / NaN')
ax.set_title('Feature Emptiness')
plt.show()
As we can see, some features, such as jurisdiction_names, consist mostly of missing values. Others, such as security_deposit, are more than 30% empty, which is too much in our opinion. The zipcode feature also has some missing values, but we can either remove those rows or impute the values with reasonable accuracy.
useless = ['neighbourhood', 'neighbourhood_group_cleansed', 'square_feet',
           'security_deposit', 'cleaning_fee', 'has_availability',
           'license', 'jurisdiction_names']
listings.drop(useless, axis=1, inplace=True)
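The 30% cut-off mentioned above can also be applied programmatically instead of being read off the chart. The following is a minimal sketch on a toy dataframe; the helper name `columns_above_missing_threshold` and the toy values are our own illustration and are not part of the original analysis, whose drop list was picked by hand.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(df, threshold=0.3):
    """Return the columns whose fraction of missing values exceeds threshold."""
    # isnull().mean() gives the fraction of missing values in each column
    fraction_missing = df.isnull().mean()
    return fraction_missing[fraction_missing > threshold].index.tolist()


# Toy data standing in for the listings dataframe (hypothetical values)
toy = pd.DataFrame({
    'price': [50, 80, 120, 65],
    'square_feet': [np.nan, np.nan, np.nan, 400],       # 75% missing
    'zipcode': ['KT1 1PE', None, 'E1 6AN', 'N1 9GU'],   # 25% missing
})

mostly_empty = columns_above_missing_threshold(toy, threshold=0.3)
print(mostly_empty)
```

On the real data, the returned list could be passed to `listings.drop(..., axis=1)` after checking it by eye, since a purely automatic rule may drop a feature we would rather impute.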
Selection on Sparse Categorical Features
Let’s have a look at the categorical data to see the number of unique values.
categories = listings.columns[listings.dtypes == 'object']
# Ratio of unique values to number of rows for each categorical feature
percentage_unique = listings[categories].nunique() / listings.shape[0]
ax = percentage_unique.plot(kind='bar', color='#E35A5C', figsize=(16, 5))
ax.set_xlabel('Feature')
ax.set_ylabel('Percent # Unique')
ax.set_title('Feature Uniqueness')
plt.show()
We can see that features such as street and amenities have a large number of unique values. It would require some natural language processing to properly wrangle these into useful features. We believe we have enough location information with zipcode, so we’ll remove street. We also remove the calendar_updated and calendar_last_scraped features, as these are too complicated to process for the moment.
to_drop = ['street', 'amenities', 'calendar_last_scraped', 'calendar_updated']
listings.drop(to_drop, axis=1, inplace=True)
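The same uniqueness ratio can be used to flag sparse categorical features automatically. This is only a sketch on toy data: the helper name `high_cardinality_columns` and the 0.5 cut-off are assumptions made for illustration, not values used in the original analysis.

```python
import pandas as pd


def high_cardinality_columns(df, max_unique_ratio=0.5):
    """Return non-numeric columns whose unique/rows ratio exceeds max_unique_ratio."""
    # Select the non-numeric (categorical or free-text) columns
    text_columns = df.select_dtypes(exclude='number').columns
    unique_ratio = df[text_columns].nunique() / df.shape[0]
    return unique_ratio[unique_ratio > max_unique_ratio].index.tolist()


# Toy data standing in for the listings dataframe (hypothetical values)
toy = pd.DataFrame({
    'street': ['A St', 'B St', 'C St', 'D St'],           # 4 unique / 4 rows
    'room_type': ['Private room', 'Entire home/apt',
                  'Private room', 'Private room'],        # 2 unique / 4 rows
    'accommodates': [2, 4, 1, 2],                         # numeric, ignored
})

sparse_columns = high_cardinality_columns(toy)
print(sparse_columns)
```

A flagged column is a candidate for removal or regrouping, not an automatic drop: zipcode would also be flagged on the real data, and we prefer to regroup it.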
Now, let’s have a look at the zipcode feature. The above visualisation shows us that there are lots of different postcodes, maybe too many?
print("Number of Zipcodes:", listings['zipcode'].nunique())
Number of Zipcodes: 24774
Indeed, there are too many zipcodes. If we leave this feature as is, it might cause overfitting. Instead, we can regroup the postcodes. At the moment, they are separated as in the following example: KT1 1PE. We’ll keep the first three characters of the zipcode (e.g. KT1), which roughly corresponds to its first part, and accept that this gives us less precise location information.
listings['zipcode'] = listings['zipcode'].str.slice(0, 3)
listings['zipcode'] = listings['zipcode'].fillna("OTHER")
print("Number of Zipcodes:", listings['zipcode'].nunique())
Number of Zipcodes: 461
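Slicing the first three characters is a simple approximation: the first part of a UK postcode can be two to four characters long, so `E1 6AN` becomes `E1 ` (with a trailing space) and `SW1A 1AA` becomes `SW1`. A possible alternative, sketched here with a hypothetical helper `outward_code` that is not part of the original analysis, is to split on the space instead. It assumes the postcode actually contains a space, which is not guaranteed for user-entered data.

```python
def outward_code(postcode):
    """Return the part of a postcode before the first space, or 'OTHER' if missing."""
    # Missing values (None, NaN) and empty strings fall into a catch-all group
    if not isinstance(postcode, str) or not postcode.strip():
        return 'OTHER'
    # Normalise case and keep only the first whitespace-separated token
    return postcode.strip().upper().split()[0]


examples = ['KT1 1PE', 'E1 6AN', 'SW1A 1AA', 'nw10 3ju', None]
codes = [outward_code(p) for p in examples]
print(codes)  # ['KT1', 'E1', 'SW1A', 'NW10', 'OTHER']
```

It could be applied with `listings['zipcode'].map(outward_code)`. The counts reported in this post come from the slice-based version, so they would differ with this variant.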
A lot of zipcodes contain fewer than 100 apartments, and a few zipcodes contain most of the apartments. Let’s keep only the zipcodes with more than 100 listings.
# Count listings per zipcode and keep the zipcodes with more than 100 listings
count_per_zipcode = listings['zipcode'].value_counts()
relevant_zipcodes = count_per_zipcode[count_per_zipcode > 100].index
listings_zip_filtered = listings[listings['zipcode'].isin(relevant_zipcodes)]

# Plot new zipcodes distribution
count_per_zipcode = listings_zip_filtered['zipcode'].value_counts()
ax = count_per_zipcode.plot(kind='bar', figsize=(22, 4), color='#E35A5C', alpha=0.85)
ax.set_title("Zipcodes by Number of Listings")
ax.set_xlabel("Zipcode")
ax.set_ylabel("# of Listings")
plt.show()

# shape[0] is the number of rows; subtracting the shape tuples directly would fail
print('Number of entries removed: ', listings.shape[0] - listings_zip_filtered.shape[0])
Number of entries removed: 5484
This distribution is much better, and we only removed 5484 rows, about 10% of a dataframe that contained about 53904 rows.
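If losing those rows were a concern, one alternative (not the approach taken in this post) would be to keep every listing and fold the rare zipcodes into the same "OTHER" group already used for missing values. A minimal sketch on toy data, where the threshold of 2 stands in for the 100 used on the real dataframe:

```python
import pandas as pd

# Toy data standing in for the listings dataframe (hypothetical values)
toy = pd.DataFrame({
    'zipcode': ['E1', 'E1', 'E1', 'N1', 'N1', 'KT1'],
    'price': [90, 110, 75, 130, 95, 60],
})
min_listings = 2  # the post uses 100 on the real data

count_per_zipcode = toy['zipcode'].value_counts()
relevant_zipcodes = count_per_zipcode[count_per_zipcode > min_listings].index

# Series.where keeps values where the condition holds and replaces the rest
is_relevant = toy['zipcode'].isin(relevant_zipcodes)
toy['zipcode_grouped'] = toy['zipcode'].where(is_relevant, 'OTHER')
grouped = toy['zipcode_grouped'].tolist()
print(grouped)
```

The trade-off is that "OTHER" then mixes unrelated areas, so it carries little location signal, but no rows are dropped.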