Marriage is good for your income

April 29, 2012
(This article was first published on Eran Raviv » R, and kindly contributed to R-bloggers)

For those of you who are into machine learning, here you can find a cool collection of databases to play around with using your favorite algorithm. I chose one of the 200 available and fit a logistic regression model. The idea is to see which properties are common among those who earn above 50K a year. The “y” variable is binary: it takes the value 1 if the individual earns above 50K and 0 otherwise. We know many things about each individual: level of education in years, age, marital status, country of origin, work sector, working hours per week, race, and more. Logistic regression is the standard choice for a binary dependent variable, and it lets us see which variables are important.
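As a reminder of what is being fitted, the model can be written as below. The notation is my own shorthand and does not appear elsewhere in this post: $x_i$ is assumed to collect the individual's characteristics, $\beta_0$ the intercept and $\beta$ the coefficients.

```latex
\[
P(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-(\beta_0 + x_i^\top \beta)}},
\qquad
\log\frac{P(y_i = 1 \mid x_i)}{1 - P(y_i = 1 \mid x_i)} = \beta_0 + x_i^\top \beta .
\]
```

The second form says that each coefficient is a change in the log-odds of earning above 50K for a one-unit change in the corresponding variable.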

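Result:

[Figure: Coefficients - Logistic Regression]

A value between 0 and 1 means that the variable has a negative influence on the probability of earning above 50K; this reading applies on the odds-ratio scale, i.e. to the exponentiated coefficient. The table below reports the raw estimates, where a negative sign means a negative influence, and the higher the coefficient, the more positive the influence. However, most of the coefficients are insignificant: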

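| Variable | Estimate | Std. Error | z value | p-value |
|---|---|---|---|---|
| (Intercept) | -8.9309 | 0.4451 | -20.06 | 0.0000 |
| age | 0.0428 | 0.0021 | 20.57 | 0.0000 |
| edyears | 0.3427 | 0.0115 | 29.85 | 0.0000 |
| whperw | 0.0321 | 0.0022 | 14.33 | 0.0000 |
| not-married | -1.0096 | 0.0550 | -18.37 | 0.0000 |
| Cambodia | 0.8904 | 0.8982 | 0.99 | 0.3216 |
| Canada | 0.7303 | 0.3816 | 1.91 | 0.0556 |
| China | -1.3751 | 0.6159 | -2.23 | 0.0256 |
| Columbia | -1.4248 | 0.9515 | -1.50 | 0.1343 |
| Cuba | -0.1931 | 0.5177 | -0.37 | 0.7091 |
| Dominican-Republic | -0.3022 | 0.8232 | -0.37 | 0.7136 |
| Ecuador | -1.0433 | 1.3723 | -0.76 | 0.4471 |
| El-Salvador | -0.8846 | 0.7955 | -1.11 | 0.2661 |
| England | -0.7531 | 0.4852 | -1.55 | 0.1207 |
| France | -0.8418 | 1.1822 | -0.71 | 0.4764 |
| Germany | 0.3424 | 0.3910 | 0.88 | 0.3812 |
| Greece | -1.1965 | 0.6605 | -1.81 | 0.0701 |
| Guatemala | -0.9350 | 1.1504 | -0.81 | 0.4164 |
| Haiti | -0.9143 | 1.1334 | -0.81 | 0.4199 |
| Honduras | -0.0409 | 1.3071 | -0.03 | 0.9750 |
| Hong Kong | 1.0995 | 1.2604 | 0.87 | 0.3830 |
| Hungary | -12.8463 | 882.7434 | -0.01 | 0.9884 |
| India | -0.7264 | 0.4826 | -1.51 | 0.1323 |
| Iran | 0.1945 | 0.5350 | 0.36 | 0.7161 |
| Ireland | 0.2540 | 0.7516 | 0.34 | 0.7353 |
| Italy | 1.1770 | 0.5147 | 2.29 | 0.0222 |
| Jamaica | 0.6685 | 0.5289 | 1.26 | 0.2063 |
| Japan | 0.2082 | 0.5782 | 0.36 | 0.7187 |
| Laos | -12.3934 | 432.8021 | -0.03 | 0.9772 |
| Mexico | -0.6878 | 0.3607 | -1.91 | 0.0566 |
| Nicaragua | -12.3164 | 244.3564 | -0.05 | 0.9598 |
| Guam-USVI-etc | -12.5394 | 421.4650 | -0.03 | 0.9763 |
| Peru | -0.3240 | 1.1425 | -0.28 | 0.7767 |
| Philippines | -0.4033 | 0.4323 | -0.93 | 0.3509 |
| Poland | -0.9650 | 0.6266 | -1.54 | 0.1236 |
| Portugal | 0.0386 | 0.8386 | 0.05 | 0.9632 |
| Puerto-Rico | -0.0004 | 0.5131 | -0.00 | 0.9994 |
| Scotland | -12.5727 | 447.6334 | -0.03 | 0.9776 |
| South | -1.4788 | 0.6520 | -2.27 | 0.0233 |
| Taiwan | 0.0304 | 0.5518 | 0.06 | 0.9560 |
| Thailand | -0.8068 | 0.9884 | -0.82 | 0.4143 |
| Trinadad & Tobago | -12.2504 | 307.6617 | -0.04 | 0.9682 |
| United-States | 0.0049 | 0.1879 | 0.03 | 0.9791 |
| Vietnam | -1.5021 | 0.8277 | -1.81 | 0.0696 |
| Yugoslavia | 0.0495 | 1.3221 | 0.04 | 0.9702 |
| Federal-gov | 1.1812 | 0.1939 | 6.09 | 0.0000 |
| Local-gov | 0.7243 | 0.1714 | 4.23 | 0.0000 |
| Never-worked | -9.7524 | 618.4062 | -0.02 | 0.9874 |
| Private | 0.7785 | 0.1491 | 5.22 | 0.0000 |
| Self-emp-inc | 1.4118 | 0.1863 | 7.58 | 0.0000 |
| Self-emp-not-inc | 0.4238 | 0.1678 | 2.53 | 0.0115 |
| State-gov | 0.5512 | 0.1887 | 2.92 | 0.0035 |
| Without-pay | -10.6696 | 622.2566 | -0.02 | 0.9863 |
| Asian-Pac-Islander | 0.4790 | 0.3931 | 1.22 | 0.2230 |
| Black | 0.0179 | 0.3375 | 0.05 | 0.9576 |
| race – Other | -0.6295 | 0.5792 | -1.09 | 0.2771 |
| White | 0.4567 | 0.3230 | 1.41 | 0.1573 |
| Male | 0.8382 | 0.0643 | 13.05 | 0.0000 |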
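To read a few of these rows on the odds scale, one can exponentiate the estimates and recompute the z values and two-sided p-values from the estimate and its standard error. The following is only a sketch: the four rows are typed in by hand from the table above, and the names `est`, `se`, `odds_ratio`, `z` and `p_value` are my own.

```r
# A few rows copied by hand from the coefficient table (names are illustrative).
est <- c("edyears" = 0.3427, "not-married" = -1.0096, "Italy" = 1.1770, "Black" = 0.0179)
se  <- c("edyears" = 0.0115, "not-married" = 0.0550, "Italy" = 0.5147, "Black" = 0.3375)

odds_ratio <- exp(est)            # > 1 raises the odds of earning above 50K, < 1 lowers them
z          <- est / se            # Wald z statistic
p_value    <- 2 * pnorm(-abs(z))  # two-sided p-value under a standard normal

print(round(cbind(odds_ratio, z, p_value), 4))
```

On this scale the not-married estimate corresponds to an odds ratio of roughly 0.36, i.e. between 0 and 1, while each extra year of schooling multiplies the odds by roughly 1.41.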

So… what is important?

  1. We can see that males have a better chance of earning more.
  2. Nice to see that race is not important, e.g. being black has no significant effect.
  3. Working for the government is a good thing.
  4. Being self-employed is a good thing.
  5. Being from Italy is a good thing.  :-o
  6. Working hard is a good thing; “whperw” is working hours per week. However, the value of the coefficient is not high, so don’t work too hard.
  7. Older is better; again, the coefficient, despite its significance, is not high.
  8. Being educated is important. “edyears” is the number of years of schooling.
  9. Being married is important.  :-D  If you are not married, there is a significant negative impact on the chance of earning more than 50K per year.
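Points 6 and 7 describe the coefficients for working hours and age as not high. It may help to remember that these are effects per single hour and per single year, so a larger change compounds. A small sketch of that arithmetic, with the two estimates copied by hand from the table (the variable names are my own, and the 10-unit step is an arbitrary choice):

```r
# Estimates copied by hand from the coefficient table (log-odds per one unit).
b_whperw <- 0.0321   # per extra working hour per week
b_age    <- 0.0428   # per extra year of age

# Multiplicative change in the odds for a 10-unit change in each variable.
or_10_hours <- exp(10 * b_whperw)
or_10_years <- exp(10 * b_age)

print(round(c(ten_hours = or_10_hours, ten_years = or_10_years), 2))
```

These are changes in the odds, not in the probability itself; the change in probability depends on where the individual starts.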

Notes:

We have a serious endogeneity problem here, in more than one place. For example, you probably wait until you have some money saved before getting married in the first place. As another example, you probably open your own company only after you earn enough that you pay less tax as an incorporated company.

So, we can interpret these results more as common features shared by the rich group, and less as causal effects. They can be used, for example, to slice the market into potential buyers (people with money) according to their characteristics, without the need to look at their bank statements. Thanks for reading, code and references below.

## stringsAsFactors = TRUE keeps the text columns as factors (no longer the default since R 4.0).
t2 = read.table("/incomedat.txt", sep = ",", header = FALSE, stringsAsFactors = TRUE)
## Some bookkeeping: drop what we don't use, rename what we do.
head(t2, 4) ; dim(t2) ; names(t2) ; class(t2)
t2 = t2[-NROW(t2), -c(3, 4, 8, 11, 12)]  # drop the last row and the unused columns
summary(t2)
names(t2) <- c("age","wclass","edyears","mstatus","occ","race","gender","whperw","region","y")
## Recode marital status into two groups. %in% is needed here: == c(3,4) would
## recycle the vector and compare alternate rows against 3 and 4.
mstatus = ifelse(as.numeric(t2$mstatus) %in% c(3, 4), "married", "not-married")
head(mstatus)
t2$mstatus <- as.factor(mstatus)
levels(t2$mstatus)
## Binary response; the -2 offset presumes the two income classes carry factor codes 2 and 3.
y = as.factor(as.numeric(t2$y) - 2)
t2$y = y ; levels(t2$y)
n_train = round(NROW(t2) * (2/3))
train = t2[1:n_train, ]
test = t2[(n_train + 1):NROW(t2), ]  # We might want to forecast later
dim(train) ; dim(test)
names(train) ; class(train)
## Referencing columns through data = train keeps the coefficient names readable.
lm2 = glm(y ~ age + edyears + whperw + mstatus + region + wclass + race + gender,
          data = train, family = binomial(link = "logit"), na.action = na.pass)
summary(lm2)
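The test split above is set aside for forecasting, which the post does not carry out. One way that step could look is sketched below on simulated stand-in data, so that it runs without the income file. Everything in it (the data frame `sim`, the coefficients used to simulate it, and the 0.5 cutoff) is an assumption made for illustration, not the author's code or results.

```r
set.seed(1)  # simulated stand-in data, not the income file
n   <- 600
sim <- data.frame(age     = round(runif(n, 20, 65)),
                  edyears = sample(8:16, n, replace = TRUE))
# Made-up coefficients, used only to generate a binary response.
sim$y <- rbinom(n, 1, plogis(-8 + 0.04 * sim$age + 0.45 * sim$edyears))

# Same 2/3 - 1/3 split as in the code above.
n_train   <- round(n * 2 / 3)
sim_train <- sim[1:n_train, ]
sim_test  <- sim[(n_train + 1):n, ]

# Variables are referenced through `data =` so that predict() can be given newdata.
fit <- glm(y ~ age + edyears, data = sim_train, family = binomial(link = "logit"))

p_hat     <- predict(fit, newdata = sim_test, type = "response")  # estimated P(y = 1)
predicted <- as.integer(p_hat > 0.5)                              # 0.5 cutoff is an arbitrary choice
confusion <- table(actual = sim_test$y, predicted = predicted)
print(confusion)
```

For the market-slicing use mentioned in the notes, the cutoff would presumably be tuned to how costly it is to approach someone who turns out not to be in the high-income group.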
