The significance of the region on the salary of engineers in Sweden

[This article was first published on R Analystatistics Sweden , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

So far I have analysed the effect of experience, education, gender and year on the salary of engineers in Sweden. In this post, I will have a look at the effect of the region on the salary of engineers in Sweden.

Statistics Sweden use NUTS (Nomenclature des Unités Territoriales Statistiques), which is the EU’s hierarchical regional division, to specify the regions.

First, define libraries and functions.

library (tidyverse) 
## -- Attaching packages --------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library (broom) 
library (car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library (swemaps) # devtools::install_github('reinholdsson/swemaps')
library(sjPlot)
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
readfile <- function (file1){  
  read_csv (file1, col_types = cols(), locale = readr::locale (encoding = "latin1"), na = c("..", "NA")) %>%  
    gather (starts_with("19"), starts_with("20"), key = "year", value = salary) %>%  
    drop_na() %>%  
    mutate (year_n = parse_number (year))
}
nuts <- read.csv("nuts.csv") %>%
  mutate(NUTS2_sh = substr(NUTS2, 1, 4))

map_ln_n <- map_ln %>%
  mutate(lnkod_n = as.numeric(lnkod)) 

The data table is downloaded from Statistics Sweden. It is saved as a comma-delimited file without heading, 000000CG.csv, http://www.statistikdatabasen.scb.se/pxweb/en/ssd/.

The table: Average basic salary, monthly salary and women´s salary as a percentage of men´s salary by region, sector, occupational group (SSYK 2012) and sex. Year 2014 – 2018 Monthly salary All sectors

We expect that the region is an important factor in salaries. As a null hypothesis, we assume that the region is not related to the salary and examine if we can reject this hypothesis with the data from Statistics Sweden.

tb <- readfile ("000000CG.csv") %>%
  filter (`occuptional  (SSYK 2012)` == "214 Engineering professionals") %>% 
  left_join(nuts %>% distinct (NUTS2_en, NUTS2_sh), by = c("region" = "NUTS2_en")) 
## Warning: Column `region`/`NUTS2_en` joining character vector and factor,
## coercing into character vector
tb_map <- readfile ("000000CG.csv") %>%
  filter (`occuptional  (SSYK 2012)` == "214 Engineering professionals") %>%
  left_join(nuts, by = c("region" = "NUTS2_en")) 
## Warning: Column `region`/`NUTS2_en` joining character vector and factor,
## coercing into character vector
tb_map %>%
  filter (sex == "men") %>%
  right_join(map_ln_n, by = c("Länskod" = "lnkod_n")) %>%
  ggplot() +
    geom_polygon(mapping = aes(x = ggplot_long, y = ggplot_lat, group = lnkod, fill = salary)) +
      facet_grid(. ~ year) + 
    coord_equal() 

SSYK 214, Architects, engineers and related professionals, men, Year 2014 - 2018

Figure 1: SSYK 214, Architects, engineers and related professionals, men, Year 2014 – 2018

tb_map %>%
  filter (sex == "women") %>%
  right_join(map_ln_n, by = c("Länskod" = "lnkod_n")) %>%
  ggplot() +
    geom_polygon(mapping = aes(x = ggplot_long, y = ggplot_lat, group = lnkod, fill = salary)) +
      facet_grid(. ~ year) + 
    coord_equal() 

SSYK 214, Architects, engineers and related professionals, women, Year 2014 - 2018

Figure 2: SSYK 214, Architects, engineers and related professionals, women, Year 2014 – 2018

tb %>%
  ggplot () +  
    geom_point (mapping = aes(x = year_n, y = salary, colour = region, shape=sex)) + 
  labs(
    x = "Year",
    y = "Salary (SEK/month)"
  )

SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 3: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

The F-value from the Anova table for the region is 40 (Pr(>F) < 2.2e-16), sufficient for rejecting the null hypothesis that the region has no effect on the salary holding year as constant. The adjusted R-squared value is 0,882 implying a good fit of the model.

model <-lm (log(salary) ~ year_n + sex + region , data = tb)

tb <- bind_cols(tb, as_tibble(exp(predict(model, tb, interval = "confidence"))))

tb %>%
  ggplot () +  
    geom_point (mapping = aes(x = year_n,y = fit, colour = region, shape=sex)) + 
  labs(
    x = "Year",
    y = "Salary (SEK/month)"
  )

Model fit, SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 4: Model fit, SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

summary(model) %>%  
  tidy() %>%
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Summary from linear model fit')
Table 1: Summary from linear model fit
termestimatestd.errorstatisticp.value
(Intercept)-29.22693353.7348614-7.8254400.00e+00
year_n0.01983030.001852610.7039850.00e+00
sexwomen-0.05870560.0052400-11.2034490.00e+00
regionSE12 East-Central Sweden-0.05748550.0104799-5.4852976.00e-07
regionSE21 Småland and islands-0.12863850.0104799-12.2747620.00e+00
regionSE22 South Sweden-0.04577250.0104799-4.3676424.26e-05
regionSE23 West Sweden-0.05135460.0104799-4.9002856.00e-06
regionSE31 North-Central Sweden-0.09480800.0104799-9.0466300.00e+00
regionSE32 Central Norrland-0.10997160.0104799-10.4935500.00e+00
regionSE33 Upper Norrland-0.13602350.0104799-12.9794430.00e+00
summary(model)$r.squared
## [1] 0.8817871
Anova(model, type=2) %>% 
  tidy() %>% 
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Anova report from linear model fit')
Table 1: Anova report from linear model fit
termsumsqdfstatisticp.value
year_n0.06291831114.575300
sex0.06892701125.517280
region0.1548911740.294190
Residuals0.038440170NANA

Let’s check what we have found.

model <-lm (log(salary) ~ year_n + sex + NUTS2_sh , data = tb) 
 
plot_model(model, type = "pred", terms = c("NUTS2_sh"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 5: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

plot(model, which = 1)

SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 6: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

tb[38,]
## # A tibble: 1 x 11
##   region sector `occuptional  (~ sex   year  salary year_n NUTS2_sh    fit
##                              
## 1 SE21 ~ 0 all~ 214 Engineering~ women 2016   34700   2016 SE21     38698.
## # ... with 2 more variables: lwr , upr 
tb[27,]
## # A tibble: 1 x 11
##   region sector `occuptional  (~ sex   year  salary year_n NUTS2_sh    fit
##                              
## 1 SE31 ~ 0 all~ 214 Engineering~ men   2015   38100   2015 SE31     41616.
## # ... with 2 more variables: lwr , upr 
tb[5,]
## # A tibble: 1 x 11
##   region sector `occuptional  (~ sex   year  salary year_n NUTS2_sh    fit
##                              
## 1 SE21 ~ 0 all~ 214 Engineering~ men   2014   41500   2014 SE21     39442.
## # ... with 2 more variables: lwr , upr 

Also examine the interaction between gender and region. The F-value from the Anova table for the interaction is 2,7 (Pr(>F) < 0.017) implying a relation to 95 % significance.

model <-lm (log(salary) ~ year_n + sex * NUTS2_sh , data = tb) 

summary(model) %>%  
  tidy() %>%
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Summary from linear model fit')
Table 2: Summary from linear model fit
termestimatestd.errorstatisticp.value
(Intercept)-29.22128643.4559410-8.45537760.0000000
year_n0.01983030.001714211.56789700.0000000
sexwomen-0.06999980.0137140-5.10426180.0000033
NUTS2_shSE12-0.07550670.0137140-5.50581320.0000007
NUTS2_shSE21-0.11614900.0137140-8.46937560.0000000
NUTS2_shSE22-0.05205420.0137140-3.79570180.0003332
NUTS2_shSE23-0.05353870.0137140-3.90394710.0002331
NUTS2_shSE31-0.10990090.0137140-8.01377910.0000000
NUTS2_shSE32-0.12930180.0137140-9.42845430.0000000
NUTS2_shSE33-0.13277960.0137140-9.68204900.0000000
sexwomen:NUTS2_shSE120.03604250.01939451.85838380.0677877
sexwomen:NUTS2_shSE21-0.02497910.0193945-1.28794410.2024762
sexwomen:NUTS2_shSE220.01256340.01939450.64778150.5194801
sexwomen:NUTS2_shSE230.00436830.01939450.22523130.8225283
sexwomen:NUTS2_shSE310.03018600.01939451.55641700.1246186
sexwomen:NUTS2_shSE320.03866050.01939451.99337060.0505553
sexwomen:NUTS2_shSE33-0.00648790.0193945-0.33452060.7390979
summary(model)$r.squared
## [1] 0.908906
Anova(model, type=2) %>% 
  tidy() %>% 
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Anova report from linear model fit')
Table 2: Anova report from linear model fit
termsumsqdfstatisticp.value
year_n0.06291831133.8162410.0000000
sex0.06892701146.5957360.0000000
NUTS2_sh0.1548911747.0609090.0000000
sex:NUTS2_sh0.008818472.6793270.0170929
Residuals0.029621663NANA
plot_model(model, type = "pred", terms = c("NUTS2_sh", "sex"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 7: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

plot(model, which = 1)

SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 8: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

tb[37,]
## # A tibble: 1 x 11
##   region sector `occuptional  (~ sex   year  salary year_n NUTS2_sh    fit
##                              
## 1 SE21 ~ 0 all~ 214 Engineering~ men   2016   40100   2016 SE21     41037.
## # ... with 2 more variables: lwr , upr 

And the interaction between year and region. The F-value from the Anova table for the region is 0,57 (Pr(>F) < 0,77) implying no significant relation.

model <-lm (log(salary) ~ year_n * NUTS2_sh + sex , data = tb) 
 
summary(model) %>%  
  tidy() %>%
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Summary from linear model fit')
Table 3: Summary from linear model fit
termestimatestd.errorstatisticp.value
(Intercept)-32.195566810.7951966-2.98239750.0040635
year_n0.02130280.00535483.97829330.0001818
NUTS2_shSE12-4.020490515.2667129-0.26335010.7931402
NUTS2_shSE210.648134515.26671290.04245410.9662710
NUTS2_shSE2218.287310315.26671291.19785510.2354607
NUTS2_shSE237.756516715.26671290.50806720.6131806
NUTS2_shSE31-5.489546415.2667129-0.35957620.7203666
NUTS2_shSE32-3.284560715.2667129-0.21514520.8303490
NUTS2_shSE339.227648215.26671290.60442930.5477290
sexwomen-0.05870560.0053548-10.96326320.0000000
year_n:NUTS2_shSE120.00196580.00757280.25958480.7960305
year_n:NUTS2_shSE21-0.00038530.0075728-0.05088020.9595820
year_n:NUTS2_shSE22-0.00909380.0075728-1.20085360.2343036
year_n:NUTS2_shSE23-0.00387300.0075728-0.51143120.6108371
year_n:NUTS2_shSE310.00267600.00757280.35336620.7249937
year_n:NUTS2_shSE320.00157470.00757280.20794190.8359451
year_n:NUTS2_shSE33-0.00464470.0075728-0.61333920.5418602
summary(model)$r.squared
## [1] 0.8888956
Anova(model, type=2) %>% 
  tidy() %>% 
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Anova report from linear model fit')
Table 3: Anova report from linear model fit
termsumsqdfstatisticp.value
year_n0.06291831109.71529450.0000000
NUTS2_sh0.1548911738.58501320.0000000
sex0.06892701120.19314090.0000000
year_n:NUTS2_sh0.002311570.57582460.7728945
Residuals0.036128563NANA
plot_model(model, type = "pred", terms = c("NUTS2_sh", "year_n"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 9: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

plot(model, which = 1)

SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 10: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

Finally the interaction between gender, year and region, the only significant interaction is between the region and gender, F-value from the Anova table for the interaction is 2,4 (Pr(>F) < 0.033)

model <-lm (log(salary) ~ year_n * NUTS2_sh * sex , data = tb) 
 
summary(model) %>%  
  tidy() %>%
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Summary from linear model fit')
Table 4: Summary from linear model fit
termestimatestd.errorstatisticp.value
(Intercept)-34.671520414.5497427-2.38296450.0211805
year_n0.02253380.00721713.12225850.0030380
NUTS2_shSE125.202183920.57644350.25282230.8014851
NUTS2_shSE215.508200220.57644350.26769450.7900815
NUTS2_shSE2225.530872120.57644351.24078160.2207168
NUTS2_shSE2316.116232820.57644350.78323700.4373355
NUTS2_shSE31-18.437163420.5764435-0.89603260.3747072
NUTS2_shSE327.685506720.57644350.37351000.7104136
NUTS2_shSE335.053004020.57644350.24557230.8070603
sexwomen4.893201720.57644350.23780600.8130438
year_n:NUTS2_shSE12-0.00261790.0102066-0.25649190.7986672
year_n:NUTS2_shSE21-0.00278990.0102066-0.27333930.7857651
year_n:NUTS2_shSE22-0.01268990.0102066-1.24331170.2197916
year_n:NUTS2_shSE23-0.00802070.0102066-0.78583920.4358239
year_n:NUTS2_shSE310.00909090.01020660.89069170.3775375
year_n:NUTS2_shSE32-0.00387640.0102066-0.37979400.7057736
year_n:NUTS2_shSE33-0.00257230.0102066-0.25202530.8020975
year_n:sexwomen-0.00246190.0102066-0.24120800.8104213
NUTS2_shSE12:sexwomen-18.445348729.0994854-0.63387200.5291736
NUTS2_shSE21:sexwomen-9.720131529.0994854-0.33403100.7398112
NUTS2_shSE22:sexwomen-14.487123629.0994854-0.49784810.6208644
NUTS2_shSE23:sexwomen-16.719432229.0994854-0.57456110.5682714
NUTS2_shSE31:sexwomen25.895234129.09948540.88988630.3779654
NUTS2_shSE32:sexwomen-21.940134829.0994854-0.75396990.4545502
NUTS2_shSE33:sexwomen8.349288429.09948540.28692220.7754068
year_n:NUTS2_shSE12:sexwomen0.00916740.01443430.63511070.5283723
year_n:NUTS2_shSE21:sexwomen0.00480910.01443430.33317270.7404549
year_n:NUTS2_shSE22:sexwomen0.00719230.01443430.49828000.6205623
year_n:NUTS2_shSE23:sexwomen0.00829550.01443430.57471130.5681705
year_n:NUTS2_shSE31:sexwomen-0.01282990.0144343-0.88884920.3785170
year_n:NUTS2_shSE32:sexwomen0.01090220.01443430.75529860.4537602
year_n:NUTS2_shSE33:sexwomen-0.00414470.0144343-0.28714520.7752371
summary(model)$r.squared
## [1] 0.9231133
Anova(model, type=2) %>% 
  tidy() %>% 
  knitr::kable( 
  booktabs = TRUE,
  caption = 'Anova report from linear model fit')
Table 4: Anova report from linear model fit
termsumsqdfstatisticp.value
year_n0.06291831120.79462900.0000000
NUTS2_sh0.16371271422.45045120.0000000
sex0.0777463818.65780500.0000000
year_n:NUTS2_sh0.002311570.63397280.7254512
year_n:sex0.000008510.01639690.8986442
NUTS2_sh:sex0.008818472.41860300.0332038
year_n:NUTS2_sh:sex0.002299870.63075500.7280524
Residuals0.025001848NANA
plot_model(model, type = "pred", terms = c("NUTS2_sh", "year_n", "sex"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 11: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

plot(model, which = 1)

SSYK 214, Architects, engineers and related professionals, Year 2014 - 2018

Figure 12: SSYK 214, Architects, engineers and related professionals, Year 2014 – 2018

To leave a comment for the author, please follow the link and comment on their blog: R Analystatistics Sweden .

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)