The network plot of Mutations

September 29, 2016

(This article was first published on My Data Science Journey, and kindly contributed to R-bloggers)

In a pet project, I created a network plot in R, to represent mutations and how combinations improved or worsened a mutation. I have tried to document the way I approached this whole problem in this post.


First let’s look at the input data.
An Excel sheet with a column of mutations and a column of Half Life Improvement factors (HIF) is enough as input.

Mutation       HIF
A1B            5
A1B B2C        6
A1B B2C C3D    3
C3D Z25A       7
A1C            4

Since the inputs I had were in xlsx format, I used the XLConnect package to read from and write to them.

I had to write some code to clean up the data. For example, sometimes the mutations were separated by ‘+’ instead of a single space, and so on. An input file might also contain a lot of irrelevant information, or duplicates.
Many small functions were needed to clean the data, depending on the input file.
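
As an illustration of this kind of clean-up, here is a minimal base R sketch. The helper name clean_mutations and the sample rows are made up for this example and are not the original code; it only covers the two cases mentioned above, ‘+’ separators and duplicate records.

```r
# Hypothetical helper (not from the original project):
# normalise a vector of raw mutation strings.
clean_mutations <- function(x) {
  x <- gsub("+", " ", x, fixed = TRUE)  # '+' separators become spaces
  x <- gsub("\\s+", " ", x)             # collapse runs of whitespace
  trimws(x)                             # drop leading/trailing blanks
}

# Made-up sample input with a '+' separator and a disguised duplicate
raw <- data.frame(
  Mutation = c("A1B", "A1B+B2C", " A1B  B2C ", "A1C"),
  HIF      = c(5, 6, 6, 4),
  stringsAsFactors = FALSE
)

raw$Mutation <- clean_mutations(raw$Mutation)

# Keep the first record for each mutation
cleaned <- raw[!duplicated(raw$Mutation), ]
rownames(cleaned) <- NULL
print(cleaned)
```

Real input files will likely need more rules than this; the point is only that each rule can stay a small, separate function.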

Creating Nodes

Then I had to create “nodes” from the list of mutations. For this the code involved getting unique records of “mutations” and I also added a bit of code to count the number of substitutions in the mutation. Now the table would look something like:

Mutation       HIF  NSubs
A1B            5    1
A1B B2C        6    2
A1B B2C C3D    3    3
C3D Z25A       7    2
A1C            4    1
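
A minimal sketch of this step in base R, assuming the substitutions within a mutation are separated by single spaces (as in the table). The data frame below is typed in by hand for the example, with one duplicate row added to show the de-duplication.

```r
# Example input, including one duplicate record of "A1B"
mutations <- data.frame(
  Mutation = c("A1B", "A1B B2C", "A1B B2C C3D", "C3D Z25A", "A1C", "A1B"),
  HIF      = c(5, 6, 3, 7, 4, 5),
  stringsAsFactors = FALSE
)

# Unique mutations become the nodes
nodes <- mutations[!duplicated(mutations$Mutation), ]
rownames(nodes) <- NULL

# Number of substitutions = number of space-separated tokens
nodes$NSubs <- lengths(strsplit(nodes$Mutation, " ", fixed = TRUE))
print(nodes)
```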

Creating Edges

Now that we have the nodes, we need to make their “edges” or “links”.
I sorted the data by number of substitutions and then, looping first over the substitution counts and then over the mutations, made connections by checking for matching mutations.
I had also decided to use the “networkD3” package in R, so each mutation needed to be converted to a number, with the edges defined by a “source” and a “target”, also as numbers.

Now, networkD3 is based on D3.js. Since that is a JavaScript library, the numbering must start from 0.

Our nodes would now look like:

ID  Mutation       HIF  NSubs
0   A1B            5    1
1   A1B B2C        6    2
2   A1B B2C C3D    3    3
3   C3D Z25A       7    2
4   A1C            4    1

And the edges would look like:

Source  ID  From       Mutation       HIF  NSubs
0       0   A1B        A1B            5    1
0       1   A1B        A1B B2C        6    2
0       2   A1B        A1B B2C C3D    3    3
3       3   C3D Z25A   C3D Z25A       7    2
4       4   A1C        A1C            4    1
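
The original edge-building code is not shown in this post, so the following is only a sketch of one possible matching rule, written in base R. The assumption (mine, for illustration) is that each mutation is linked from the matching mutation with the fewest substitutions, all of which it contains, and that a mutation with no such match is linked to itself. The helper find_source is hypothetical. On the example data this rule happens to give the same Source column as the table above.

```r
nodes <- data.frame(
  Mutation = c("A1B", "A1B B2C", "A1B B2C C3D", "C3D Z25A", "A1C"),
  HIF      = c(5, 6, 3, 7, 4),
  stringsAsFactors = FALSE
)
subs        <- strsplit(nodes$Mutation, " ", fixed = TRUE)
nodes$NSubs <- lengths(subs)
nodes$ID    <- seq_len(nrow(nodes)) - 1L   # 0-based IDs for D3.js

# Hypothetical rule: return the ID of the source node for node i
find_source <- function(i) {
  # nodes whose substitutions all occur in node i ...
  is_subset <- vapply(subs, function(s) all(s %in% subs[[i]]), logical(1))
  # ... and that have strictly fewer substitutions
  cand <- which(is_subset & nodes$NSubs < nodes$NSubs[i])
  if (length(cand) == 0L) return(nodes$ID[i])  # no match: link to itself
  cand <- cand[order(nodes$NSubs[cand])]       # fewest substitutions first
  nodes$ID[cand[1]]
}

source_id <- vapply(seq_len(nrow(nodes)), find_source, integer(1))

edges <- data.frame(
  Source   = source_id,
  ID       = nodes$ID,
  From     = nodes$Mutation[source_id + 1L],  # +1: R indexes from 1
  Mutation = nodes$Mutation,
  HIF      = nodes$HIF,
  NSubs    = nodes$NSubs,
  stringsAsFactors = FALSE
)
print(edges)
```

Other rules are equally plausible, for example linking each mutation from its closest match (the one with the most substitutions in common), which would connect A1B B2C C3D to A1B B2C instead of A1B.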

You may want to save these in an Excel workbook, with sheets named Node and Edges respectively.

Plotting the graph

As mentioned earlier, I used the networkD3 package, and within it the forceNetwork function. It has a lot more options for visual effects, which is why I used it in my project. networkD3 offers other types of visualization as well, all based on D3.js.

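 library(networkD3)

 # links: edge table with 0-based "Source" and "ID" columns
 # nodes: node table, which also needs a "group" column for colouring
 fn <- forceNetwork(Links = links, Nodes = nodes,
                    Source = "Source", Target = "ID", Value = "NSubs",
                    NodeID = "Mutation",
                    Nodesize = "HIF", Group = "group",
                    zoom = TRUE, bounded = FALSE, legend = TRUE,
                    opacity = 0.8,
                    fontSize = 16,
                    width = 1600, height = 1200)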

The output was then saved as an HTML file for sharing with end users.
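
One way to do the saving is saveWidget from the htmlwidgets package. The sketch below is not the original project code: it rebuilds the small example by hand, and the group column (here simply the number of substitutions) and the output file name are assumptions made for illustration.

```r
library(networkD3)
library(htmlwidgets)

# Small example data; "group" is an assumed grouping for the colour legend
nodes <- data.frame(
  Mutation = c("A1B", "A1B B2C", "A1B B2C C3D", "C3D Z25A", "A1C"),
  HIF      = c(5, 6, 3, 7, 4),
  group    = c(1, 2, 3, 2, 1),
  stringsAsFactors = FALSE
)
links <- data.frame(
  Source = c(0, 0, 0, 3, 4),
  ID     = c(0, 1, 2, 3, 4),
  NSubs  = c(1, 2, 3, 2, 1)
)

fn <- forceNetwork(Links = links, Nodes = nodes,
                   Source = "Source", Target = "ID", Value = "NSubs",
                   NodeID = "Mutation", Nodesize = "HIF", Group = "group",
                   zoom = TRUE, legend = TRUE, opacity = 0.8)

# Write the widget to an HTML file (file name chosen for this example).
# selfcontained = FALSE puts the JavaScript dependencies in a folder next
# to the file; selfcontained = TRUE gives a single file but needs pandoc.
out_file <- file.path(tempdir(), "mutation_network.html")
saveWidget(fn, file = out_file, selfcontained = FALSE)
```

For sharing with end users, a single self-contained file is usually more convenient, if pandoc is available on the machine that creates it.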

Customizing the results

Then I started needing customized features in the visualization. I found this link giving ideas for a few, and using it as inspiration added a search box among other things.

One can use htmlwidgets::onRender to add the JavaScript code, but what I did instead was to find the package files directly at /usr/local/lib/R/site-library/networkD3/htmlwidgets/ and edit them with sudo. To repackage, I used the command:

 sudo R CMD INSTALL /usr/local/lib/R/site-library/networkD3  

The HTML code for the search box was added in the R code itself, using the browsable function from the htmltools package. I got help for this part from a question I asked on Stack Overflow.

The code for adding the search:

 library(networkD3)

 # nodes also needs a "group" column; Source and ID are 0-based
 fn <- forceNetwork(Links = links, Nodes = nodes,
                    Source = "Source", Target = "ID", Value = "NSubs",
                    NodeID = "Mutation",
                    Nodesize = "HIF", Group = "group",
                    zoom = TRUE, bounded = FALSE, legend = TRUE,
                    opacity = 0.8,
                    fontSize = 16,
                    width = 1600, height = 1200)



Also included is HTML code for an information box that opens when a node is clicked. The file now begins to look like:

Single-clicking a node opens a box with information; double-clicking or searching for a node highlights it and its immediate neighbors.

The typical users of this would be protein designers, who could then see how the substitutions have been working and in what direction to make further substitutions to get the molecule they desire.

There is a lot more that can be done to improve this, but for now, this helps.
