I've got some data relating to how students move from one module to another. Each row gives a student ID, a module code and a presentation date. The flow is not well-behaved. Students may take multiple modules in one presentation, and modules may be taken in any order (which means there are loops…).
My first take on the data was just to treat it as a graph and chart flows without paying attention to time other than to direct the edges (module A taken directly after module B; if multiple modules are taken by the same student in the same presentation, they both have the same precursor(s) and follow-on module(s), if any). The following (dummy) data shows the sort of thing we can get out using networkx.
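As a rough sketch of the edge-building rule described above (not the code actually used for the charts), the following uses only the standard library. The rows, the presentation codes and the function name `module_flows` are all invented for illustration, and it assumes presentation codes sort chronologically as strings.

```python
from collections import Counter, defaultdict
from itertools import product

# Dummy rows: (student_id, module_code, presentation). Invented for illustration.
rows = [
    ("s1", "A", "2015J"), ("s1", "B", "2016B"), ("s1", "C", "2016B"),
    ("s1", "D", "2016J"),
    ("s2", "A", "2015J"), ("s2", "B", "2016J"),
    ("s3", "B", "2015J"), ("s3", "A", "2016B"),
]

def module_flows(rows):
    """Count student moves between modules taken in successive presentations."""
    by_student = defaultdict(lambda: defaultdict(set))
    for student, module, presentation in rows:
        by_student[student][presentation].add(module)

    flows = Counter()
    for presentations in by_student.values():
        # Assumption: presentation codes sort chronologically as strings.
        ordered = [presentations[p] for p in sorted(presentations)]
        for earlier, later in zip(ordered, ordered[1:]):
            # Modules sharing a presentation share precursors and follow-ons,
            # so every earlier module links to every later module.
            for edge in product(sorted(earlier), sorted(later)):
                flows[edge] += 1
    return flows

flows = module_flows(rows)
```

The resulting `(source, target) -> count` mapping includes both A to B and B to A, which is where the loops come from. The weighted edges could then be loaded into a networkx `DiGraph`, for example with `add_weighted_edges_from`.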
The data can also be filtered to show just the modules taken leading up to a particular module, or taken following a particular module.
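One way that filtering might be done on the raw rows, sketched under the same assumptions as before (invented dummy rows, presentation codes that sort chronologically, and a hypothetical helper name `split_around`), is to split each student's record around the first presentation in which they took the module of interest:

```python
# Dummy rows: (student_id, module_code, presentation). Invented for illustration.
rows = [
    ("s1", "A", "2015J"), ("s1", "B", "2016B"), ("s1", "C", "2016B"),
    ("s1", "D", "2016J"),
    ("s2", "A", "2015J"), ("s2", "B", "2016J"),
    ("s3", "B", "2015J"), ("s3", "A", "2016B"),
]

def split_around(rows, target):
    """Return (before, after) rows relative to each student's first take of target."""
    first_taken = {}
    for student, module, presentation in rows:
        if module == target:
            earliest = first_taken.get(student, presentation)
            first_taken[student] = min(presentation, earliest)

    # Students who never took the target module are dropped altogether.
    before = [r for r in rows
              if r[0] in first_taken and r[2] < first_taken[r[0]]]
    after = [r for r in rows
             if r[0] in first_taken and r[2] > first_taken[r[0]]]
    return before, after

before_b, after_b = split_around(rows, "B")
```

Modules taken in the same presentation as the target fall into neither group here; whether they should is a design choice. To chart the lead-up, the rows for the target module itself would need adding back in before building the edges.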
The R diagram package has a couple of functions that can generate similar sorts of network diagram, including its plotweb() function. For example, a simple flow graph:
Or something that looks a bit more like a finite state machine diagram:
(In passing, I also note the R diagram package can be used to draw electrical circuit diagrams/schematics.)
For example, the following could describe total flow from one module to another over a given period of time:
If there are a limited number of presentations (or modules) of interest, we could further break down each category to show the count of students taking a module in a particular presentation (or going directly on to / having directly come from a particular module; in this case, we may want an “other” group to act as a catch all for modules outside a set of modules we are interested in; getting the proportions right might also be a fudge).
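A minimal sketch of how the total flows might be tabulated as a square matrix, with an “other” group catching everything outside the modules of interest. The counts and the helper name `flow_matrix` are made up for illustration.

```python
# Hypothetical module-to-module student counts, invented for illustration.
flows = {
    ("A", "B"): 20, ("A", "C"): 5, ("B", "D"): 12,
    ("C", "D"): 3, ("B", "A"): 2, ("D", "E"): 1,
}

def flow_matrix(flows, of_interest, other="Other"):
    """Build a square from/to matrix, lumping unlisted modules into `other`."""
    labels = list(of_interest) + [other]
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for (source, target), count in flows.items():
        row = index.get(source, index[other])
        col = index.get(target, index[other])
        matrix[row][col] += count
    return labels, matrix

labels, matrix = flow_matrix(flows, ["A", "B"])
```

Rows are the "from" module and columns the "to" module. Flows between two lumped modules end up in the Other/Other cell, which may or may not be wanted in the final chart.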
Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.
The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use) looks like it could be useful here, if I can work out how to do the set-up correctly!
Again, it may be appropriate to introduce a catch-all category to group modules into a generic Other bin where there is only a small flow to/from that module to reduce clutter in the diagram.
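The small-flow version of the catch-all could be sketched like this. The threshold, the counts and the function name `lump_small_modules` are illustrative assumptions, and "small" is taken here to mean a module's total flow in plus out.

```python
from collections import Counter

# Hypothetical module-to-module student counts, invented for illustration.
flows = {
    ("A", "B"): 20, ("A", "C"): 5, ("B", "D"): 12,
    ("C", "D"): 3, ("B", "A"): 2, ("D", "E"): 1,
}

def lump_small_modules(flows, min_total, other="Other"):
    """Relabel modules whose total in+out flow is below min_total as `other`."""
    totals = Counter()
    for (source, target), count in flows.items():
        totals[source] += count
        totals[target] += count
    keep = {module for module, total in totals.items() if total >= min_total}

    lumped = Counter()
    for (source, target), count in flows.items():
        s = source if source in keep else other
        t = target if target in keep else other
        lumped[(s, t)] += count
    return lumped

lumped = lump_small_modules(flows, min_total=10)
```

With these numbers, modules C and E fall below the threshold, so their flows are folded into the Other node while A, B and D keep their own nodes.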
We can use the ipysankeywidget to render a simple graph data structure.
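As far as I can tell, the widget takes its links as a list of dicts with `source`, `target` and `value` keys, so a table of edge counts needs only a small reshaping step. The counts below are invented, and the widget call is left as a comment because it needs a Jupyter notebook with ipysankeywidget installed.

```python
# Hypothetical module-to-module student counts, invented for illustration.
flows = {("A", "B"): 20, ("A", "C"): 5, ("B", "D"): 12, ("C", "D"): 3}

# Reshape the edge counts into a list of link dicts.
links = [
    {"source": source, "target": target, "value": count}
    for (source, target), count in sorted(flows.items())
]

# In a notebook, something like the following should then render the diagram:
# from ipysankeywidget import SankeyWidget
# SankeyWidget(links=links)
```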
One big problem with the view I took of the data is that it doesn’t respect time, or the numbers of students taking a particular presentation of a course. This detail could help tell a story about the evolving curriculum as new modules come on stream, for example, and perhaps change student behaviour about the module they take next from a particular module. So how could we capture it?
If we can linearise the flow by using module_presentation keyed nodes, rather than just module identified nodes, and limit the data to just show students progressing from one presentation to the next, we should be able to use something like a categorical parallel co-ordinates plot, such as an alluvial diagram drawn in R.
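A sketch of that linearisation, again with invented rows and a hypothetical function name (`presentation_flows`): nodes are keyed as `module_presentation`, and an edge is only counted when a student appears in two immediately adjacent presentations. It assumes presentation codes sort chronologically as strings.

```python
from collections import Counter, defaultdict
from itertools import product

# Dummy rows: (student_id, module_code, presentation). Invented for illustration.
rows = [
    ("s1", "A", "2015J"), ("s1", "B", "2016B"), ("s1", "C", "2016B"),
    ("s1", "D", "2016J"),
    ("s2", "A", "2015J"), ("s2", "B", "2016J"),
    ("s3", "B", "2015J"), ("s3", "A", "2016B"),
]

def presentation_flows(rows):
    """Count flows between module_presentation nodes in adjacent presentations."""
    # Assumption: presentation codes sort chronologically as strings.
    all_presentations = sorted({p for _, _, p in rows})
    following = dict(zip(all_presentations, all_presentations[1:]))

    taken = defaultdict(lambda: defaultdict(set))
    for student, module, presentation in rows:
        taken[student][presentation].add(module)

    flows = Counter()
    for presentations in taken.values():
        for presentation, modules in presentations.items():
            nxt = following.get(presentation)
            # Skip students who sat out the very next presentation.
            if nxt is None or nxt not in presentations:
                continue
            for a, b in product(sorted(modules), sorted(presentations[nxt])):
                flows[(f"{a}_{presentation}", f"{b}_{nxt}")] += 1
    return flows

flows = presentation_flows(rows)
```

Because every edge points from one presentation to the next, the result has no loops. The cost is that students who skip a presentation (s2 in the dummy data) drop out of the picture unless a "no module" node is added for the gap.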
With time indexed modules, we can also explore richer Sankey style diagrams that require a one way flow (no loops).
So for example, here are a few more packages that might be able to help with that, as well as the aforementioned Python packages.
First up, the R networkD3 package includes support for Sankey diagrams. Data can be sourced from igraph and then exported into the JSON format expected by the package:
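If the flows are being prepared in Python rather than igraph, a JSON file of a similar shape could be written out directly. This sketch assumes the layout I believe networkD3's Sankey examples use: a `nodes` list of names plus `links` whose `source` and `target` are zero-based positions in that list. The counts are invented.

```python
import json

# Hypothetical loop-free module-to-module counts, invented for illustration.
flows = {("A", "B"): 20, ("A", "C"): 5, ("B", "D"): 12, ("C", "D"): 3}

# Give each module a zero-based position in the node list.
names = sorted({module for edge in flows for module in edge})
index = {name: i for i, name in enumerate(names)}

graph = {
    "nodes": [{"name": name} for name in names],
    "links": [
        {"source": index[source], "target": index[target], "value": count}
        for (source, target), count in flows.items()
    ],
}

payload = json.dumps(graph, indent=2)
```

The `payload` string can be saved to a file and read into R, where the nodes and links become the two tables passed to the Sankey function.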
The R riverplot package also supports Sankey diagrams, and the gallery includes a demo of how to recreate Minard’s visualisation of Napoleon’s 1812 march.
The R sankey package generates a simple Sankey diagram from a simple data table:
Back in the Python world, the pySankey package can generate a simple Sankey diagram from a pandas dataframe.
What I really need to do now is set up a Binder demo of each of them… but that will have to wait till another day…
If you know of any other R or Python packages / demos that might be useful for visualising flows, please let me know via the comments and I’ll add them to the post.