The significance of education on the salary of engineers in Sweden

[This article was first published on R Analystatistics Sweden , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my last posts, I analysed the significance of experience for different occupational groups. In this post, I will turn the interest towards education. I will again start with engineers and see if I can expand my analysis to all occupational groups.

First, define libraries and functions.

library (tidyverse) 
## -- Attaching packages -------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.0     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library (broom) 
library (car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library (splines)
#install_github("ZheyuanLi/SplinesUtils")
library(SplinesUtils)

readfile <- function (file1){  
  read_csv (file1, col_types = cols(), locale = readr::locale (encoding = "latin1"), na = c("..", "NA")) %>%  
    gather (starts_with("19"), starts_with("20"), key = "year", value = salary) %>%  
    drop_na() %>%  
    mutate (year_n = parse_number (year))
}

The data table is downloaded from Statistics Sweden. It is saved as a comma-delimited file without heading, 000000CY.csv, http://www.statistikdatabasen.scb.se/pxweb/en/ssd/.

The table: Average basic salary, monthly salary and women´s salary as a percentage of men´s salary by sector, occupational group (SSYK 2012), sex and educational level (SUN). Year 2014 – 2018 Monthly salary All sectors

We expect that education is an important factor in salaries. As a null hypothesis, we assume that education is not related to the salary and examine if we can reject this hypothesis with the data from Statistics Sweden.

The column level of education is renamed because TukeyHSD doesn’t handle variable names within quotes.

tb <- readfile("000000CY.csv") %>% 
  filter(`occuptional  (SSYK 2012)` == "214 Engineering professionals") %>%       
  mutate(edulevel = `level of education`)

tb %>%
  ggplot () +  
    geom_point (mapping = aes(x = year_n,y = salary, colour = `level of education`, shape=sex)) + 
  labs(
    x = "Year",
    y = "Salary (SEK/month)"
  )

SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 1: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

model <- lm (log(salary) ~ year_n + sex + edulevel, data = tb)

tb <- bind_cols(tb, as_tibble(exp(predict(model, tb, interval = "confidence"))))

The F-value from the Anova table for years is 40 (Pr(>F) < 2.2e-16), sufficient for rejecting the null hypothesis that education has no effect on the salary holding year as constant. The adjusted R-squared value is 0,833 implying a good fit of the model.

tb %>%
  ggplot () +  
    geom_point (mapping = aes(x = year_n,y = fit, colour = edulevel, shape = sex)) + 
  labs(
    x = "Year",
    y = "Salary (SEK/month)"
  )

Model fit, SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 2: Model fit, SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

summary(model) %>%  
  tidy() %>%
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Summary from linear model fit')
Table 1: Summary from linear model fit
termestimatestd.errorstatisticp.value
(Intercept)-27.38958936.6236266-4.13513490.0001120
year_n0.01890820.00328565.75492460.0000003
sexwomen-0.08374120.0092178-9.08476190.0000000
edulevelpost-graduate education (ISCED97 6)0.11145610.01710296.51679070.0000000
edulevelpost-secondary education 3 years or more (ISCED97 5A)0.00102050.01710290.05966990.9526168
edulevelpost-secondary education, less than 3 years (ISCED97 4+5B)-0.03320120.0171029-1.94125950.0569278
edulevelprimary and secondary education 9-10 years (ISCED97 2)-0.10898100.0175947-6.19397250.0000001
edulevelupper secondary education 3 years (ISCED97 3A)-0.07843050.0171029-4.58579360.0000235
edulevelupper secondary education, 2 years or less (ISCED97 3C)-0.10405030.0171029-6.08377610.0000001
summary(model)$adj.r.squared  
## [1] 0.8327314
Anova(model, type = 2) %>% 
  tidy() %>% 
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Anova report from linear model fit')
Table 1: Anova report from linear model fit
termsumsqdfstatisticp.value
year_n0.0484384133.119163e-07
sex0.1207084182.532900e+00
edulevel0.3529225640.217740e+00
Residuals0.087752960NANA

How much do the different levels of education affect the salary? We can calculate the differences between the levels with Tukey’s honest significant difference. All significant level differences are shown in the table below.

tukeytable <- TukeyHSD(aov(log(salary) ~ sex + edulevel, data = tb)) %>%
  tidy() %>%
  mutate(percdiff = (1 / exp(estimate) - 1) * 100)

tukeytable  %>% 
  filter(adj.p.value < 0.05) %>%
  arrange(estimate) %>%  
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Tukey HSD 95 % confidence intervals for the pairwise significant differences')
Table 2: Tukey HSD 95 % confidence intervals for the pairwise significant differences
termcomparisonestimateconf.lowconf.highadj.p.valuepercdiff
edulevelprimary and secondary education 9-10 years (ISCED97 2)-post-graduate education (ISCED97 6)-0.2160450-0.2822364-0.14985350.000000024.115819
edulevelupper secondary education, 2 years or less (ISCED97 3C)-post-graduate education (ISCED97 6)-0.2155065-0.2799325-0.15108040.000000024.049001
edulevelupper secondary education 3 years (ISCED97 3A)-post-graduate education (ISCED97 6)-0.1898866-0.2543126-0.12546060.000000020.911247
edulevelpost-secondary education, less than 3 years (ISCED97 4+5B)-post-graduate education (ISCED97 6)-0.1446573-0.2090834-0.08023130.000000115.564352
edulevelpost-secondary education 3 years or more (ISCED97 5A)-post-graduate education (ISCED97 6)-0.1104356-0.1748616-0.04600960.000044311.676444
edulevelprimary and secondary education 9-10 years (ISCED97 2)-post-secondary education 3 years or more (ISCED97 5A)-0.1056094-0.1718008-0.03941790.000164911.138763
edulevelupper secondary education, 2 years or less (ISCED97 3C)-post-secondary education 3 years or more (ISCED97 5A)-0.1050709-0.1694969-0.04064480.000111911.078932
edulevelprimary and secondary education 9-10 years (ISCED97 2)-no information about level of educational attainment-0.1045888-0.1707803-0.03839740.000195011.025400
edulevelupper secondary education, 2 years or less (ISCED97 3C)-no information about level of educational attainment-0.1040503-0.1684764-0.03962430.000133110.965630
sexwomen-men-0.0803151-0.1030666-0.05756370.00000008.362851
edulevelupper secondary education 3 years (ISCED97 3A)-post-secondary education 3 years or more (ISCED97 5A)-0.0794510-0.1438770-0.01502500.00667498.269249
edulevelupper secondary education 3 years (ISCED97 3A)-no information about level of educational attainment-0.0784305-0.1428565-0.01400440.00773748.158814
edulevelprimary and secondary education 9-10 years (ISCED97 2)-post-secondary education, less than 3 years (ISCED97 4+5B)-0.0713876-0.1375791-0.00519620.02645607.399745
edulevelupper secondary education, 2 years or less (ISCED97 3C)-post-secondary education, less than 3 years (ISCED97 4+5B)-0.0708491-0.1352751-0.00642310.02210877.341926
edulevelpost-graduate education (ISCED97 6)-no information about level of educational attainment0.11145610.04703010.17588220.0000370-10.546938

We can conclude from the summary table that there is a positive correlation between longer education and higher salaries.

From the table of Tukey’s honest significant difference, we can see the difference in salaries between the different education lengths. Note that the estimates are negative due to the log transformation, the untransformed differences are in the column percdiff.

Can we approximate how much the salaries increase by one year of education by assigning a numeric value to the factors in the table?

As a first approach, I will use the data in the following table.

I will use a B-spline function to approximate the increase in salaries over age. The rows for “no information about the level of educational attainment” is removed from the table from Statistics Sweden.

numedulevel <- read.csv("edulevel.csv") 

numedulevel %>%
  knitr::kable(
  booktabs = TRUE,
  caption = 'Initial approach, length of education') 
Table 3: Initial approach, length of education
level.of.educationeduyears
primary and secondary education 9-10 years (ISCED97 2)9
upper secondary education, 2 years or less (ISCED97 3C)11
upper secondary education 3 years (ISCED97 3A)12
post-secondary education, less than 3 years (ISCED97 4+5B)14
post-secondary education 3 years or more (ISCED97 5A)15
post-graduate education (ISCED97 6)19
no information about level of educational attainmentNA
tbnum <- tb %>% 
    right_join(numedulevel, by = c("level of education" = "level.of.education")) %>%
  filter(!is.na(eduyears))
## Warning: Column `level of education`/`level.of.education` joining character
## vector and factor, coercing into character vector
modelcont <- lm(log(salary) ~ bs(eduyears, knots = c(14)) + year_n + sex, data = tbnum)

tbnum <- bind_cols(tbnum, as_tibble(exp(predict(modelcont, tbnum, interval = "confidence"))))

The F-value from the Anova table for years is 146 and the adjusted R-squared value is 0,932 implying a good fit of the model. Both the F-value and the adjusted R-squared increased from then using the categorical predictors. (Removing the rows with “no information about level of educational attainment” improves the adjusted R-squared for the model with categorical predictors to 0.931.)

tbnum %>%
  ggplot () +  
    geom_point (mapping = aes(x = year_n,y = fit1, colour = eduyears, shape = sex)) + 
  labs(
    x = "Year",
    y = "Salary (SEK/month)"
  )

Model fit, SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 3: Model fit, SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

summary(modelcont) %>%  
  tidy() %>%
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Summary from linear model fit')
Table 4: Summary from linear model fit
termestimatestd.errorstatisticp.value
(Intercept)-35.11157424.5503183-7.71628980.0000000
bs(eduyears, knots = c(14))1-0.01137080.0237866-0.47803310.6346302
bs(eduyears, knots = c(14))20.07194530.04689301.53424290.1310323
bs(eduyears, knots = c(14))30.18501770.05677423.25883470.0019749
bs(eduyears, knots = c(14))40.22121790.011145619.84803390.0000000
year_n0.02268180.002256910.05003790.0000000
sexwomen-0.07432750.0063230-11.75515790.0000000
summary(modelcont)$adj.r.squared  
## [1] 0.9316461
Anova(modelcont, type = 2) %>% 
  tidy() %>% 
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Anova report from linear model fit')
Table 4: Anova report from linear model fit
termsumsqdfstatisticp.value
bs(eduyears, knots = c(14))0.34299014145.78660
year_n0.05940721101.00330
sex0.08127571138.18370
Residuals0.030584952NANA

What does the continous function from the model look like?

contspline <- RegBsplineAsPiecePoly(modelcont, "bs(eduyears, knots = c(14))")

tibble(eduyears = 9:19) %>%
  ggplot () + 
    geom_point (mapping = aes(x = eduyears,y = predict(contspline, eduyears))) +
  labs(
    x = "Years of education",
    y = "Salary"
  )

Model fit, SSYK 214, Correlation between education and salary

Figure 4: Model fit, SSYK 214, Correlation between education and salary

And it’s derivative.

tibble(eduyears = 9:19) %>%
  ggplot () + 
    geom_point (mapping = aes(x = eduyears,y = (exp(predict(contspline, eduyears, deriv = 1)) - 1) * 100)) +
  labs(
    x = "Years of education",
    y = "Salary difference (%)"
  )

Model fit, SSYK 214, The derivative for education

Figure 5: Model fit, SSYK 214, The derivative for education

Comparison between the categorical and the continuous predictor. Column withinconf states if the estimate from the numerical model is within the 95 % confidence interval from the Tukey’s honest significant difference table. All estimates from the model with the continuous predictor are within the 95 % confidence intervals from the Tukey HSD table.

tukeytable <- tukeytable %>% 
  rowwise() %>% 
  mutate(comp_from = unlist(strsplit(comparison, ")-"))[1]) %>%
  rowwise() %>% 
  mutate(comp_to = unlist(strsplit(comparison, ")-"))[2]) %>%             mutate(comp_from = paste (comp_from, ")", sep="")) %>%
  left_join(numedulevel, by = c("comp_from" = "level.of.education")) %>%
  left_join(numedulevel, by = c("comp_to" = "level.of.education")) %>%    mutate(numestimate = predict(contspline, eduyears.x) -           predict(contspline, eduyears.y)) %>% 
  mutate(withinconf = numestimate > conf.low && numestimate < conf.high) %>% 
  mutate(percdiffcont = (1 / exp(predict(contspline, eduyears.x) - predict(contspline, eduyears.y)) - 1) * 100)
## Warning: Column `comp_from`/`level.of.education` joining character vector and
## factor, coercing into character vector
## Warning: Column `comp_to`/`level.of.education` joining character vector and
## factor, coercing into character vector
tukeytable %>% 
  select(term, comparison, estimate, adj.p.value, numestimate, withinconf, percdiff, percdiffcont) %>%
  filter(adj.p.value < 0.05) %>%
  arrange(estimate) %>%
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Comparison between categorical and continous predictor')
Table 5: Comparison between categorical and continous predictor
termcomparisonestimateadj.p.valuenumestimatewithinconfpercdiffpercdiffcont
edulevelprimary and secondary education 9-10 years (ISCED97 2)-post-graduate education (ISCED97 6)-0.21604500.0000000-0.2212179TRUE24.11581924.7595300
edulevelupper secondary education, 2 years or less (ISCED97 3C)-post-graduate education (ISCED97 6)-0.21550650.0000000-0.2123268TRUE24.04900123.6551904
edulevelupper secondary education 3 years (ISCED97 3A)-post-graduate education (ISCED97 6)-0.18988660.0000000-0.1942616TRUE20.91124721.4413939
edulevelpost-secondary education, less than 3 years (ISCED97 4+5B)-post-graduate education (ISCED97 6)-0.14465730.0000001-0.1418336TRUE15.56435215.2384840
edulevelpost-secondary education 3 years or more (ISCED97 5A)-post-graduate education (ISCED97 6)-0.11043560.0000443-0.1117048TRUE11.67644411.8182726
edulevelprimary and secondary education 9-10 years (ISCED97 2)-post-secondary education 3 years or more (ISCED97 5A)-0.10560940.0001649-0.1095131TRUE11.13876311.5734729
edulevelupper secondary education, 2 years or less (ISCED97 3C)-post-secondary education 3 years or more (ISCED97 5A)-0.10507090.0001119-0.1006220TRUE11.07893210.5858529
edulevelprimary and secondary education 9-10 years (ISCED97 2)-no information about level of educational attainment-0.10458880.00019500.0000000FALSE11.0254000.0000000
edulevelupper secondary education, 2 years or less (ISCED97 3C)-no information about level of educational attainment-0.10405030.00013310.0088912FALSE10.965630-0.8851746
sexwomen-men-0.08031510.00000000.0000000FALSE8.3628510.0000000
edulevelupper secondary education 3 years (ISCED97 3A)-post-secondary education 3 years or more (ISCED97 5A)-0.07945100.0066749-0.0825568TRUE8.2692498.6060365
edulevelupper secondary education 3 years (ISCED97 3A)-no information about level of educational attainment-0.07843050.00773740.0269563FALSE8.158814-2.6596254
edulevelprimary and secondary education 9-10 years (ISCED97 2)-post-secondary education, less than 3 years (ISCED97 4+5B)-0.07138760.0264560-0.0793844TRUE7.3997458.2620369
edulevelupper secondary education, 2 years or less (ISCED97 3C)-post-secondary education, less than 3 years (ISCED97 4+5B)-0.07084910.0221087-0.0704932TRUE7.3419267.3037289
edulevelpost-graduate education (ISCED97 6)-no information about level of educational attainment0.11145610.00003700.2212179FALSE-10.546938-19.8458026

Now. let´s perform some diagnostics on the models. First, a look at the residuals for the model shows no apparent problem. We can see that the outlier at row 55 has disappeared in the plots for the continuous model.

Three out of four outliers when using categorical predictors were from the factor “no information about level of educational attainment”.

For the continuous predictors there are two outliers in the factor “upper secondary education, 2 years or less (ISCED97 3C)” and two outliers in the factor “upper secondary education 3 years (ISCED97 3A)” indicating that the model could be improved.

tb[20,]$edulevel
## [1] "no information about level of educational attainment"
tb[41,]$edulevel
## [1] "no information about level of educational attainment"
tb[55,]$edulevel
## [1] "no information about level of educational attainment"
tb[57,]$edulevel
## [1] "upper secondary education, 2 years or less (ISCED97 3C)"
tbnum[12,]$edulevel
## [1] "upper secondary education, 2 years or less (ISCED97 3C)"
tbnum[18,]$edulevel
## [1] "upper secondary education, 2 years or less (ISCED97 3C)"
tbnum[25,]$edulevel
## [1] "upper secondary education 3 years (ISCED97 3A)"
tbnum[29,]$edulevel
## [1] "upper secondary education 3 years (ISCED97 3A)"
plot(model, which = 1)

Residuals vs Fitted of model fit

Figure 6: Residuals vs Fitted of model fit

plot(modelcont, which = 1)

Residuals vs Fitted of model fit

Figure 7: Residuals vs Fitted of model fit

The Normal Q-Q shows some possible outliers.

plot(model, which = 2)

Normal Q-Q

Figure 8: Normal Q-Q

plot(modelcont, which = 2)

Normal Q-Q

Figure 9: Normal Q-Q

Again, the Standardised residuals show some possible outliers.

plot(model, which = 3)

Scale-Location

Figure 10: Scale-Location

plot(modelcont, which = 3)

Scale-Location

Figure 11: Scale-Location

The outliers are also found in the Leverage plot.

plot(model, which = 5)

Residuals vs Leverage

Figure 12: Residuals vs Leverage

plot(modelcont, which = 5)

Residuals vs Leverage

Figure 13: Residuals vs Leverage

To leave a comment for the author, please follow the link and comment on their blog: R Analystatistics Sweden .

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)