For those of you who are into machine learning, here you can find a cool collection of databases to play around with your favorite algorithm. I chose one out of the 200 available and fit a logistic regression model. The idea is to see what kinds of properties are common among those who earn above 50K a year. The data are such that the “y” variable is binary: a value of 1 if the individual earns above 50K, and 0 if below. We know many things about each individual: years of education, age, marital status, country of origin, employment sector, weekly working hours, race, and more. We can fit a logistic regression, which is quite standard for a binary dependent variable, and see which variables are important.
A coefficient below zero (equivalently, an odds ratio between 0 and 1) means the variable has a negative influence on the probability of earning above 50K; the higher the coefficient, the more positive its influence. However, most of the coefficients are insignificant. For example:
| | Estimate | Std. Error | z value | P.val |
|---|---|---|---|---|
| race – Other | -0.6295 | 0.5792 | -1.09 | 0.2771 |
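To make the sign convention concrete, here is a small R sketch turning the “race – Other” estimate from the table above into an odds ratio; the coefficient is negative, so the odds ratio lands below 1:

```r
# Turn a logit coefficient into an odds ratio: a negative estimate
# maps to an odds ratio between 0 and 1 (a negative influence).
beta <- -0.6295          # the "race - Other" estimate from the table above
odds_ratio <- exp(beta)
round(odds_ratio, 2)     # about 0.53: the odds of earning >50K are roughly halved
```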
So… what is important?
- We can see that males have a better chance to earn more.
- Nice to see that race is not important, e.g. being black has no significant effect.
- Working for the government is a good thing.
- Being self-employed is a good thing.
- Being from Italy is a good thing.
- Working hard is a good thing; “whperw” is working hours per week. However, the value of the coefficient is not high, so don’t work too hard.
- Older is better; again, the coefficient, despite its significance, is not high.
- Being educated is important. “edyears” is years of schooling.
- Being married is important: if you are not married, there is a significant negative impact on the chance of earning more than 50K per year.
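Even a “not high” per-unit coefficient compounds across units, which is why the hours and age effects still matter. A hedged sketch (the value 0.03 is illustrative, not the fitted “whperw” estimate):

```r
# Illustrative only: 0.03 stands in for a small per-hour logit coefficient,
# not the actual fitted "whperw" estimate.
beta_whperw <- 0.03
exp(beta_whperw)          # odds multiplier per extra weekly hour, about 1.03
exp(beta_whperw * 10)     # ten extra hours compound to roughly a 35% rise in odds
```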
We have a serious endogeneity problem here, in more than one place. For example, you probably wait until you have some money saved before getting married in the first place. As another example, you probably incorporate your own company only after you earn enough that paying taxes as a company saves you money.
So we should interpret these results more as features commonly shared by the high-earning group, and less as causal effects. They can be used, for example, to slice the market into potential buyers (people with money) according to their characteristics, without needing to look at their bank statements. Thanks for reading, code and references below.
```r
t2 <- read.table("/incomedat.txt", sep = ",", header = F)

## Some bookkeeping: drop what we don't use, rename what we do.
head(t2, 4) ; dim(t2) ; names(t2) ; class(t2)
t2 <- t2[-NROW(t2), -c(3, 4, 8, 11, 12)]   # drop the last row and unused columns
summary(t2)
names(t2) <- c("age", "wclass", "edyears", "mstatus", "occ",
               "race", "gender", "whperw", "region", "y")

## Collapse marital status into married / not-married.
## Note: %in% (not ==) is needed here; == would recycle c(3, 4) element-wise.
mstatus <- ifelse(as.numeric(t2$mstatus) %in% c(3, 4), "married", "not-married")
head(mstatus)
t2$mstatus <- as.factor(mstatus)
levels(t2$mstatus)

## Recode y as a 0/1 factor (check levels(t2$y) to confirm the offset).
y <- as.factor(as.numeric(t2$y) - 2)
t2$y <- y ; levels(t2$y)

## Two-thirds / one-third split; we might want to forecast later.
train <- t2[1:round(NROW(t2) * (2/3)), ]
test  <- t2[(round(NROW(t2) * (2/3)) + 1):NROW(t2), ]
dim(train) ; dim(test)
names(train) ; class(train)

## Fit the logistic regression. Using data = train lets the formula refer to
## column names; mstatus, region, race and gender are already factors, so the
## as.factor() wrappers are unnecessary.
lm2 <- glm(y ~ age + edyears + whperw + mstatus + region + wclass + race + gender,
           data = train, family = binomial(link = "logit"), na.action = na.pass)
summary(lm2)
```
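The code above keeps a held-out third “to forecast later”. Here is a minimal, self-contained sketch of that step on simulated data (the names `fit`, `dat`, and `x` are illustrative, not from the post); with the real model you would call `predict` on `lm2` with `newdata = test` in the same way:

```r
# Self-contained sketch of the out-of-sample forecast step; the data are
# simulated, so `fit`, `dat`, and `x` are illustrative names only.
set.seed(1)
n   <- 300
x   <- rnorm(n)
y   <- rbinom(n, 1, plogis(0.8 * x))   # true model: logit(p) = 0.8 * x
dat <- data.frame(x = x, y = factor(y))
train <- dat[1:200, ]
test  <- dat[201:300, ]

fit   <- glm(y ~ x, data = train, family = binomial(link = "logit"))
p_hat <- predict(fit, newdata = test, type = "response")  # fitted probabilities
pred  <- ifelse(p_hat > 0.5, 1, 0)                        # 0.5 threshold
accuracy <- mean(pred == as.numeric(as.character(test$y)))
accuracy
```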