Comparing dependencies of popular machine learning packages with `pkgnet`
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
When looking through the CRAN list of packages, I stumbled upon this little gem:
pkgnet is an R library designed for the analysis of R libraries! The goal of the package is to build a graph representation of a package and its dependencies.
And I thought it would be fun to play around with it. The little analysis I ended up doing was to compare dependencies of popular machine learning packages.
- I first loaded the packages:
library(pkgnet)
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
##
##     filter
library(ggraph)
## Loading required package: ggplot2
- I then created a function that will
  - create the package report with `pkgnet::CreatePackageReport`
  - convert the edge (`report$DependencyReporter$edges`) and node (`report$DependencyReporter$nodes`) data into a graph object with `tidygraph::as_tbl_graph`
create_pkg_graph <- function(package_name, DependencyReporter = TRUE) {

  # build the pkgnet report for the given package
  report <- CreatePackageReport(pkg_name = package_name)

  if (DependencyReporter) {
    # package-level dependency network
    graph <- as_tbl_graph(report$DependencyReporter$edges,
                          directed = TRUE,
                          nodes = as.data.frame(report$DependencyReporter$nodes))
  } else {
    # function-level network inside the package
    graph <- as_tbl_graph(report$FunctionReporter$edges,
                          directed = TRUE,
                          nodes = as.data.frame(report$FunctionReporter$nodes))
  }

  return(graph)
}
To pick the machine learning packages, I looked at CRAN's Machine Learning task view.
These are the packages I ended up including:
pkg_list <- c("caret", "h2o", "e1071", "mlr")
Note: I wanted to include other packages, like `tensorflow`, `randomForest`, `gbm`, etc., but for those, `pkgnet` threw an error:
Error in data.table::data.table(node = names(igraph::V(self$pkg_graph)), : column or argument 1 is NULL
- Next, I ran them through my function from before and assigned them each a unique name.
for (pkg in pkg_list) {
  graph <- create_pkg_graph(pkg)
  assign(paste0("graph_", pkg), graph)
}
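The loop above stops at the first package for which `create_pkg_graph()` throws an error. One way to keep going is to wrap each call in `tryCatch()` and collect the failures separately. This is only a minimal sketch and not part of the original analysis: `build_graphs_safely()` and `fake_builder()` are hypothetical names, and `fake_builder()` merely simulates a failing package so the example runs without pkgnet installed.

```r
# Hypothetical helper: apply `builder` to each package name and
# keep going when one of the calls throws an error.
build_graphs_safely <- function(pkgs, builder) {
  graphs <- list()
  failed <- character(0)
  for (pkg in pkgs) {
    result <- tryCatch(
      builder(pkg),
      error = function(e) {
        message("Skipping ", pkg, ": ", conditionMessage(e))
        NULL
      }
    )
    if (is.null(result)) {
      failed <- c(failed, pkg)   # remember which packages did not work
    } else {
      graphs[[pkg]] <- result    # store successful results by package name
    }
  }
  list(graphs = graphs, failed = failed)
}

# Stand-in for create_pkg_graph(): fails for one made-up package name.
fake_builder <- function(pkg) {
  if (pkg == "brokenpkg") stop("column or argument 1 is NULL")
  paste("graph for", pkg)
}

res <- build_graphs_safely(c("caret", "brokenpkg", "mlr"), fake_builder)
names(res$graphs)  # packages that succeeded
res$failed         # packages that were skipped
```

With the real function, the call would presumably be `build_graphs_safely(pkg_list, create_pkg_graph)`. Keeping the graphs in a named list also avoids creating variables with `assign()`.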
- I combined these individual objects with `tidygraph` and calculated node centrality as the number of outgoing edges.
graph <- graph_caret %>%
  graph_join(graph_h2o, by = "name") %>%
  graph_join(graph_e1071, by = "name") %>%
  graph_join(graph_mlr, by = "name") %>%
  mutate(color = ifelse(name %in% pkg_list, "a", "b"),
         centrality = centrality_degree(mode = "out"))
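To see what `graph_join()` and `centrality_degree(mode = "out")` do, here is a small sketch on made-up data. The package names and edges below are invented for illustration and are not real pkgnet output; the sketch only assumes that tidygraph is installed.

```r
library(tidygraph)

# Two tiny directed graphs that share the node "utils" (made-up edges).
g1 <- as_tbl_graph(data.frame(from = c("pkgA", "pkgA"),
                              to   = c("stats", "utils"),
                              stringsAsFactors = FALSE))
g2 <- as_tbl_graph(data.frame(from = "pkgB",
                              to   = "utils",
                              stringsAsFactors = FALSE))

# Join on the node name so that "utils" appears only once,
# then count the outgoing edges of every node.
joined <- g1 %>%
  graph_join(g2, by = "name") %>%
  mutate(centrality = centrality_degree(mode = "out"))

# Pull the node table out of the graph object for inspection.
node_tbl <- joined %>%
  activate(nodes) %>%
  as_tibble()
node_tbl
```

In this toy graph, `pkgA` has two outgoing edges and `utils` has none, so which nodes end up with high out-degree depends entirely on the direction in which the edges are stored.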
- Finally, I plotted the dependency network with `ggraph`:
The bigger the node labels (package names), the higher their centrality. Seems like the more basic utilitarian packages have the highest centrality (not really a surprise…).
graph %>%
  ggraph(layout = 'nicely') +
    geom_edge_link(arrow = arrow()) +
    geom_node_point() +
    geom_node_label(aes(label = name, fill = color, size = centrality),
                    show.legend = FALSE, repel = TRUE) +
    theme_graph() +
    scale_fill_brewer(palette = "Set1")
- Because the complete network is a bit hard to make sense of, I plotted it again with only the packages I wanted to analyze plus the dependencies that had more than one outgoing edge (`centrality > 1`); now it is easier to see shared dependencies.
For example, `methods` and `stats` are dependencies of `caret`, `mlr` and `e1071` but not `h2o`, while `utils` is a dependency of all four.
graph %>%
  filter(centrality > 1 | color == "a") %>%
  ggraph(layout = 'nicely') +
    geom_edge_link(arrow = arrow()) +
    geom_node_point() +
    geom_node_label(aes(label = name, fill = color, size = centrality),
                    show.legend = FALSE, repel = TRUE) +
    theme_graph() +
    scale_fill_brewer(palette = "Set1")
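Shared dependencies like the ones read off the plot can also be computed directly from an edge list. The following base R sketch uses a made-up edge table (the names `pkgA`, `pkgB`, `pkgC` and their dependencies are invented, not taken from the pkgnet reports) to show one way of finding dependencies common to all packages.

```r
# Made-up dependency edges (package -> dependency), for illustration only.
edges <- data.frame(
  package    = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB", "pkgC"),
  dependency = c("stats", "utils", "methods", "stats", "utils", "utils"),
  stringsAsFactors = FALSE
)

# One character vector of dependencies per package.
deps_by_pkg <- split(edges$dependency, edges$package)

# Dependencies that every package has in common.
shared_by_all <- Reduce(intersect, deps_by_pkg)
shared_by_all

# How many packages use each dependency.
usage_counts <- table(edges$dependency)
usage_counts
```

In this toy table only `utils` is used by all three packages. The same idea should carry over to real edge data once the column names are adapted.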
It would of course be interesting to analyse a bigger network with more packages. Maybe someone knows how to get these other packages to work with `pkgnet`?
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 ggraph_1.0.1 ggplot2_2.2.1 tidygraph_1.1.0
## [5] pkgnet_0.2.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.16 RColorBrewer_1.1-2 plyr_1.8.4
## [4] compiler_3.5.0 pillar_1.2.2 formatR_1.5
## [7] futile.logger_1.4.3 bindr_0.1.1 viridis_0.5.1
## [10] futile.options_1.0.1 tools_3.5.0 digest_0.6.15
## [13] viridisLite_0.3.0 gtable_0.2.0 jsonlite_1.5
## [16] evaluate_0.10.1 tibble_1.4.2 pkgconfig_2.0.1
## [19] rlang_0.2.0 igraph_1.2.1 ggrepel_0.7.0
## [22] yaml_2.1.18 blogdown_0.6 xfun_0.1
## [25] gridExtra_2.3 stringr_1.3.0 dplyr_0.7.4
## [28] knitr_1.20 htmlwidgets_1.2 grid_3.5.0
## [31] rprojroot_1.3-2 glue_1.2.0 data.table_1.10.4-3
## [34] R6_2.2.2 rmarkdown_1.9 bookdown_0.7
## [37] udunits2_0.13 tweenr_0.1.5 tidyr_0.8.0
## [40] purrr_0.2.4 lambda.r_1.2.2 magrittr_1.5
## [43] units_0.5-1 MASS_7.3-49 scales_0.5.0
## [46] backports_1.1.2 mvbutils_2.7.4.1 htmltools_0.3.6
## [49] assertthat_0.2.0 ggforce_0.1.1 colorspace_1.3-2
## [52] labeling_0.3 stringi_1.1.7 visNetwork_2.0.3
## [55] lazyeval_0.2.1 munsell_0.4.3