The significance of experience on the salary of engineers in Sweden

[This article was first published on R Analystatistics Sweden , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

What causes the salary to change during the working life of engineers in Sweden?

I will examine data from Statistics Sweden to see what information you can interpret from it.

Statistics Sweden is the Swedish government agency responsible for producing official statistics regarding Sweden, https://www.scb.se/en/.

“Everybody should have the opportunity to find facts and break the myths with our statistics!” Statistics Sweden twitter account, sverigeisiffror.

Salaries are reported for every occupational group SSYK (Standard for Swedish occupational classification). I will examine SSYK 214, architects, engineers and related professionals.

First, define libraries and functions.

library (tidyverse) 
## -- Attaching packages -------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.0     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library (broom) 
library (car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library (splines)
#install_github("ZheyuanLi/SplinesUtils")
library(SplinesUtils)

readfile <- function (file1){  
  read_csv (file1, col_types = cols(), locale = readr::locale (encoding = "latin1"), na = c("..", "NA")) %>%  
    gather (starts_with("19"), starts_with("20"), key = "year", value = salary) %>%  
    drop_na() %>%  
    mutate (year_n = parse_number (year))
}

The data table is downloaded from Statistics Sweden. It is saved as a comma-delimited file without heading, 00000031.csv, http://www.statistikdatabasen.scb.se/pxweb/en/ssd/.

The table: Average monthly pay (total pay), non-manual workers private sector (SLP), SEK by occupational group (SSYK), age, sex and year. Year 2014 – 2018

The average age within each age group is used as a numeric value for graphical presentation and the linear model.

As the salary increases exponentially each year, I have used the logarithm for years in my model.

As a proxy for experience years since graduation can be used. I will simplify by using age as a proxy for experience. I will use a B-spline function to approximate the increase in salaries over age. Spline function is used instead of a polynomial to avoid oscillation.

We expect that both years and age are important factors on the salary. As a null hypothesis, we assume that years and age is not related to the salary and examine if we can reject this hypothesis with the data from Statistics Sweden.

Every year the annual salary negotiation will increase the salaries. As age increases, we acquire more skill at work.

tb <- readfile("00000031.csv") %>% 
  rowwise() %>% 
  mutate(age_l = unlist(lapply(strsplit(substr(age, 1, 5), "-"), strtoi))[1]) %>%
  rowwise() %>% 
  mutate(age_h = unlist(lapply(strsplit(substr(age, 1, 5), "-"), strtoi))[2]) %>%
  mutate(age_n = (age_l + age_h) / 2) 
   
tb %>%
  ggplot () +  
    geom_point (mapping = aes(x = year_n,y = salary, colour = age, shape=sex)) + 
  labs(
    x = "Year",
    y = "Salary (SEK/month)"
  )

SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 1: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

    
model <- lm (log(salary) ~ year_n + sex + bs(age_n, knots = c(30, 40, 50, 60)), data = tb)

tb <- bind_cols(tb, as_tibble(exp(predict(model, tb, interval = "confidence"))))

From the summary table of the linear model, we can see that the wages have increased by 1,91 % +/- 0,35 % per year with 95 % confidence, holding age as constant. The R-value from the Anova table for years is 120, sufficient for rejecting the null hypothesis that years has no effect on the salary. Holding year as constant, the p values for age are small (< 2e-16) and the R-value from the Anova table value for age is 703, sufficient for rejecting the null hypothesis that experience has no effect on the salary. The adjusted R-squared value is 0,984 implying a good fit of the model.

tb %>%
  ggplot () +  
    geom_point (mapping = aes(x = year_n,y = fit, colour = age, shape=sex)) + 
  labs(
    x = "Year",
    y = "Salary (SEK/month)"
  )

Model fit, SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 2: Model fit, SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

    
summary(model) %>%  
  tidy() %>%
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Summary from linear model fit')
Table 1: Summary from linear model fit
termestimatestd.errorstatisticp.value
(Intercept)-28.18184643.5078741-8.03388200.0000000
year_n0.01908960.001740010.97096800.0000000
sexwomen-0.03387030.0050397-6.72071070.0000000
bs(age_n, knots = c(30, 40, 50, 60))10.01369220.02249210.60875660.5443286
bs(age_n, knots = c(30, 40, 50, 60))20.24176940.020715611.67088040.0000000
bs(age_n, knots = c(30, 40, 50, 60))30.45894400.017213026.66258400.0000000
bs(age_n, knots = c(30, 40, 50, 60))40.49634010.017680528.07278050.0000000
bs(age_n, knots = c(30, 40, 50, 60))50.50457820.019568825.78489660.0000000
bs(age_n, knots = c(30, 40, 50, 60))60.48785690.016988828.71642740.0000000
bs(age_n, knots = c(30, 40, 50, 60))70.48505480.014371833.75041460.0000000

summary(model)$r.squared   
## [1] 0.9839054
  
Anova(model, type=2) %>% 
  tidy() %>% 
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Anova report from linear model fit')
Table 1: Anova report from linear model fit
termsumsqdfstatisticp.value
year_n0.06878311120.362140
sex0.0258120145.167950
bs(age_n, knots = c(30, 40, 50, 60))2.81365697703.366080
Residuals0.048003384NANA

Let’s have a look at the spline function from the regression model. I use the package SplineUtils to extract the polynomial train from the model coefficients.

myspline <- RegBsplineAsPiecePoly(model, "bs(age_n, knots = c(30, 40, 50, 60))")

tibble(age_n = 21:65) %>%
  ggplot () + 
    geom_point (mapping = aes(x = age_n,y = predict(myspline, age_n))) +
  labs(
    x = "Age",
    y = "Salary"
  )

Model fit, SSYK 214, Correlation between age and salary

Figure 3: Model fit, SSYK 214, Correlation between age and salary

The derivative of the spline function for age is equal to the annual salary raise that comes from experience. I also add the salary raise that comes from the annual salary negotiation resulting in the total salary increase distributed over age.

summod <- tidy(summary (model))

myfun <- function(x){
  return (predict(myspline, x, deriv = 1) + summod$estimate[2])
}

tibble(age_n = 21:65) %>%
  ggplot () + 
    geom_point (mapping = aes(x = age_n,y = myfun(age_n))) +
  labs(
    x = "Age",
    y = "Salary raise (%)"
  )

Model fit, SSYK 214, The derivative for age

Figure 4: Model fit, SSYK 214, The derivative for age

Now. let´s perform some diagnostics on the model. First, a look at the residuals for the model shows no apparent problem.


plot(model, which = 1)

Residuals vs Fitted of model fit

Figure 5: Residuals vs Fitted of model fit

The Normal Q-Q shows some possible outliers.


plot(model, which = 2)

Normal Q-Q

Figure 6: Normal Q-Q

Again, the Standardised residuals show some possible outliers.


plot(model, which = 3)

Scale-Location

Figure 7: Scale-Location

The outliers are also found in the Leverage plot.


plot(model, which = 5)

Residuals vs Leverage

Figure 8: Residuals vs Leverage

The outliers 38, 56 and 75 are small groups with larger variance at either end of the age spectrum. Record 72 are women age 55-59.

tb[38,]$age
## [1] "18-24 years"
tb[56,]$age
## [1] "65-66 years"
tb[75,]$age
## [1] "65-66 years"

tb[72,]
## Source: local data frame [1 x 12]
## Groups: 
## 
## # A tibble: 1 x 12
##   `occuptional  (~ age   sex   year  salary year_n age_l age_h age_n    fit
##                          
## 1 214 Engineering~ 55-5~ women 2017   45000   2017    55    59    57 48433.
## # ... with 2 more variables: lwr , upr 

Wage statistics

The Mediation Institute is responsible for Sweden’s official salary statistics. This applies to three types of statistics: cyclical wage statistics, wage structure statistics and statistics supplied to the European statistical body Eurostat, https://www.mi.se/other-languages/about-the-mediation-office-the-swedish-model-and-wage-statistics-in-english/.

The statistics are collected and processed by Statistics Sweden. It is reported both on Statistics Sweden’s website and the Mediation Institute’s website.

https://www.r-bloggers.com/

https://rweekly.org

Last edited: 2019-12-10

To leave a comment for the author, please follow the link and comment on their blog: R Analystatistics Sweden .

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)