The Stack Overflow Developer Survey 2019 results are available here. The question I want to analyze is: "Which of the following programming, scripting, and markup languages have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the language and want to continue to do so, please check both boxes in that row.)"
An excellent resource to study networks, with R and other tools, is Katherine Ognyanova’s blog.
An example built from a bipartite network is the Product Space, a network that connects similar products, based on the countries that export those products.
Under a similar idea, connecting respondents to the programming/scripting/markup languages they use, I can create a network of similar programming languages.
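To make the idea concrete, here is a minimal sketch in base R of how a respondent-language (bipartite) table can be collapsed into a language-language relation. The matrix `m` and its respondent and language labels are made-up toy data, not survey results, and this is only an illustration of the projection idea, not the method used later in the post.

```r
# Toy respondent-by-language incidence matrix (hypothetical data):
# 1 means the respondent uses the language, 0 means they do not.
m <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1,
    0, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    paste0("resp", 1:4),
    c("R", "Python", "SQL")
  )
)

# One-mode projection onto languages: entry (i, j) counts the respondents
# that use both language i and language j. The diagonal holds the number
# of users of each language.
co_use <- crossprod(m) # same as t(m) %*% m
co_use
```

Languages that share many respondents end up with large off-diagonal counts, which is the raw material for a similarity network.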
Visualizing Networks With R
The first step is to read and re-arrange the data:
library(tidyverse)
library(janitor)

survey_results <- read_csv(
  "~/Downloads/developer_survey_2019/survey_results_public.csv") %>%
  clean_names() %>%
  select(respondent, language_worked_with, language_desire_next_year) %>%
  gather(category, answer, -respondent) %>%
  separate(answer, into = paste0("lang", 1:28), sep = ";") %>%
  gather(lang_aux, language, -respondent, -category) %>%
  select(-lang_aux) %>%
  drop_na() %>%
  mutate(language = str_replace_all(language, "Other.*", "Other(s)"))

users_per_language_worked_with <- survey_results %>%
  filter(category == "language_worked_with") %>%
  group_by(language) %>%
  summarise(n = n()) %>%
  mutate(share = n / sum(n))

binary_relation <- survey_results %>%
  group_by(respondent, language) %>%
  summarise(n = n()) %>%
  mutate(x = ifelse(n == 2, 1, 0)) %>%
  select(-n)
economiccomplexity functions are designed for country-product relations,
so here the “countries” are respondents and the “products” are languages.
Let’s explore the programming language complexity index. Here “complexity” is not related to difficulty but to specialization. A higher index value means that a language has a small group of users and/or is used for specific purposes.
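For some intuition about "specialization", the sketch below computes the two basic ingredients that complexity measures usually start from: diversity (how many languages each respondent uses) and ubiquity (how many respondents use each language). The matrix `m` is hypothetical toy data, and this is not what `ec_complexity_measures()` returns; it only illustrates the raw quantities under that assumption.

```r
# Toy respondent-by-language matrix (hypothetical data, not survey results)
m <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1,
    0, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("resp", 1:4), c("R", "Python", "SQL"))
)

# Diversity: number of languages used by each respondent
diversity <- rowSums(m)

# Ubiquity: number of respondents using each language
ubiquity <- colSums(m)

# Average diversity of the users of each language: a language with few users
# who themselves use many languages looks more "specialized" in this sense
avg_diversity <- drop(crossprod(m, diversity)) / ubiquity

diversity
ubiquity
avg_diversity
```

In this toy table SQL has the lowest ubiquity and the highest average user diversity, which is the kind of pattern a higher index value is meant to capture.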
library(economiccomplexity)

rca <- ec_rca(
  binary_relation, "respondent", "language", "x"
)

com <- ec_complexity_measures(rca, tbl = T)

names(com$complexity_index_p) <- c("language","complexity_index")

com$complexity_index_p

# A tibble: 28 x 2
   language    complexity_index
   <chr>                  <dbl>
 1 Erlang            27.6
 2 F#                 0.272
 3 WebAssembly        0.0729
 4 Elixir             0.0480
 5 Clojure            0.0102
 6 Dart               0.000784
 7 VBA                0.000288
 8 Objective-C        0.000178
 9 Scala              0.00000487
10 Assembly           0.00000480
# … with 18 more rows
At this point I shall apply a trick. As ec_proximity() computes both a
language-language relation and a respondent-respondent relation in this case,
and the two relations are computed independently inside the function, I shall
use the compute parameter to request only the language-language one. This
avoids a very large computation (75,816 respondents) that I won’t use.
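As a rough illustration of what a language-language proximity looks like, the sketch below uses the usual Product Space definition: the number of respondents two languages share, divided by the larger of the two ubiquities. This is an assumption about the definition, written in base R on hypothetical toy data; the package's internals may differ in details.

```r
# Toy respondent-by-language matrix (hypothetical data, not survey results)
m <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1,
    0, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("resp", 1:4), c("R", "Python", "SQL"))
)

# Number of respondents using each language
ubiquity <- colSums(m)

# Number of respondents shared by each pair of languages
co_use <- crossprod(m)

# Proximity: shared users divided by the larger ubiquity of the pair,
# i.e. the smaller of the two conditional probabilities
proximity <- co_use / outer(ubiquity, ubiquity, pmax)
diag(proximity) <- 0 # ignore self-links

proximity
```

Note that only a 3 x 3 language matrix is built here. The respondent-respondent counterpart would be 4 x 4 in the toy data and 75,816 x 75,816 in the survey, which is why skipping it matters.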
pro <- ec_proximity(rca, u = com$ubiquity, compute = "product", tbl = T)

pro$proximity_p

# A tibble: 378 x 3
   from                  to        value
   <chr>                 <chr>     <dbl>
 1 Bash/Shell/PowerShell Assembly 0.0532
 2 C                     Assembly 0.168
 3 C#                    Assembly 0.0252
 4 C++                   Assembly 0.101
 5 Clojure               Assembly 0.0269
 6 Dart                  Assembly 0.0293
 7 Elixir                Assembly 0.0236
 8 Erlang                Assembly 0.0260
 9 F#                    Assembly 0.0269
10 Go                    Assembly 0.0403
# … with 368 more rows
Finally I can create the network.
library(igraph)
library(ggraph)

set.seed(1724)

net <- ec_networks(
  pc = NULL,
  pp = pro$proximity_p,
  cutoff_p = 0.2,
  tbl = T,
  compute = "product"
)

share <- 100 * users_per_language_worked_with$share
names(share) <- users_per_language_worked_with$language

g <- net$network_p %>%
  rename(proximity = value) %>%
  graph_from_data_frame(directed = F)

g %>%
  ggraph(layout = "kk") +
  geom_edge_link(aes(edge_alpha = proximity, edge_width = proximity),
                 edge_colour = "#a8a8a8") +
  geom_node_point(colour = "darkslategray4",
                  size = share[match(V(g)$name, names(share))]) +
  geom_node_text(aes(label = name), vjust = 2.2) +
  ggtitle("Stackoverflow Developer Survey Languages Connection") +
  theme_void()