The Good Ol’ Titanic Kaggle Competition pt. 1
It’s been over two months since I finished the Data Science certificate program through the University of Washington. Since then I’ve been trying to figure out my next step. The annoying thing about the internet is that it probably gives you too many options. Every time I search “learning data science,” “how to become a data scientist,” or “what data science tools should I learn,” I get completely inundated with information. I can’t tell you how many times one article has led to several others until I can’t even remember where I started. In all of this noise, I’ve realized one thing: you just HAVE TO START SOMEWHERE. I’ve done Kaggle in the past and I’m pretty familiar with R, so I figured I would go back to the Titanic problem and see what happens. I won’t rehash the entire problem, but basically you are given a set of features about passengers on the Titanic, and you have to use them to build a model that predicts whether each passenger died or survived. I have to give a shoutout to Trevor Stevens and his blog for getting me started.

For my analysis, I started with some simple proportion tables to see what impact different categorical features had on survival. You can see my code on GitHub for all the details. Passenger Class and Sex were the most obvious features to test, since they have only 3 and 2 levels respectively and they seem likely to carry some signal about survival (unlike the Embarked feature). I found that 3rd-class passengers and males were the most likely to die. I created a few submissions based on sex and class; my females-only prediction is currently my best score at 0.76555.

After that I began playing around with logistic regression. So far, none of my attempts at logistic regression have improved my score, but I have some ideas for tomorrow (I’ve already reached my submission limit for today). I do realize now that I need a plan for my logistic regression models: I need to determine which features are most likely to provide signal instead of blindly plugging in different ones. The code for this portion is short.
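The proportion tables are straightforward in base R. A minimal sketch of the idea, with a tiny illustrative data frame standing in for Kaggle's real `train.csv` (same column names, made-up rows):

```r
# Illustrative stand-in for Kaggle's train.csv (the real set has 891 rows;
# only the column names here match the competition data).
train <- data.frame(
  Survived = c(1, 1, 0, 0, 0, 1, 0, 0),
  Pclass   = c(1, 1, 3, 3, 3, 2, 2, 3),
  Sex      = c("female", "female", "male", "male",
               "male", "female", "male", "female")
)

# Row-wise proportions: survival rate within each sex
sex_tab <- prop.table(table(train$Sex, train$Survived), margin = 1)
print(sex_tab)

# Same idea for passenger class
class_tab <- prop.table(table(train$Pclass, train$Survived), margin = 1)
print(class_tab)
```

With `margin = 1` each row sums to 1, so you read off the survival rate per group directly; with the real data, this is where the female and 1st-class survival advantage jumps out.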
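A minimal sketch of this kind of logistic regression baseline (illustrative data again stands in for `train.csv`, so the numbers are not the real submission):

```r
# Tiny illustrative data frame with the competition's column names.
train <- data.frame(
  Survived = c(1, 0, 0, 1, 0, 1, 0, 1, 0, 0),
  Pclass   = factor(c(1, 1, 3, 3, 3, 2, 2, 1, 3, 2)),
  Sex      = factor(c("female", "female", "male", "male", "male",
                      "female", "male", "female", "male", "male"))
)

# Binomial GLM with a logit link = logistic regression in base R
fit <- glm(Survived ~ Pclass + Sex, data = train, family = binomial)

# summary(fit) shows coefficient significance -- a first pass at deciding
# which features actually carry signal rather than plugging them in blindly.
summary(fit)

# Predicted survival probabilities, thresholded at 0.5 for a submission
pred <- ifelse(predict(fit, type = "response") > 0.5, 1, 0)
```

For an actual submission you would call `predict(fit, newdata = test, type = "response")` on the test set and write `pred` out alongside `PassengerId`.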
To leave a comment for the author, please follow the link and comment on their blog: numbr crunch - Blog.