How to simplify your code by using data flows

[This article was first published on gtdir, and kindly contributed to R-bloggers.]

How can one effectively develop and manage code in large complex data analysis projects?

In the past I relied on naming conventions for my R scripts: each script had a prefix that determined the order in which the scripts were run. I used this convention for several years until I came across a massive data analysis task. I needed to process data generated by a trading algorithm that managed a portfolio of hundreds of stocks. The initial solution seemed clear: write R scripts that manage other R scripts. So I persisted. However, some tasks still had to be run manually, such as launching a sequence of R instances that would process data in parallel. Finally, it became clear that script naming conventions and special folder structures were not an optimal solution, as I had an even more complicated challenge ahead: connecting the algorithm to the market. The workflow was no longer hierarchical but had a structure that could only be conveyed by a graph with loops.

Thus the pxWorks platform came into being. The platform is open source, and the code is published under AGPLv3.

Some of the features of this platform are as follows:

  • Running code in any scripting language, or any compiled code, in a code block. For example, one can easily mix R and Python code in a single project.
  • Easy debugging, because the code in a block can be run in isolation from the rest of the project and has user-defined inputs and outputs that are saved to disk.
  • Ability to implement any programming logic on a graph that determines data flows.
  • Ability to implement conditional loops easily by using conditional connections (sockets). No special blocks are required for that (as is usually the case in some visual programming environments).
  • Ability to modify programming blocks (graph nodes) on the fly by editing the underlying text files that define each block and then refreshing the block.
  • Extensible code block library. One can easily add a block of code into the library for reuse. Simple folders in the library directory are treated as ‘folders’ in the library menu, so blocks can be easily grouped.
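
To make the ideas of isolated blocks and conditional connections more concrete, here is a minimal sketch in Python. It is not pxWorks code and does not use its file formats or API: the names run_block, run_graph and GRAPH, the JSON files, and the two toy blocks are all assumptions made for illustration. Each block reads its input from a file and writes its output to a file, so any block can be rerun on its own, and a conditional connection from the second block back to the first produces a loop without any special loop block.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical illustration only: none of these names come from pxWorks.


def run_block(func, in_path, out_path):
    """Run one block in isolation: read its input file, write its output file."""
    data = json.loads(Path(in_path).read_text())
    result = func(data)
    Path(out_path).write_text(json.dumps(result))
    return result


def increment(data):
    """Toy block: add one to the value."""
    return {"value": data["value"] + 1}


def double(data):
    """Toy block: multiply the value by two."""
    return {"value": data["value"] * 2}


# Each node maps to (block function, list of (condition, next node)).
# The first connection whose condition holds is followed; if none holds,
# the flow stops. The connection from "double" back to "increment" is
# conditional, which is what creates the loop.
GRAPH = {
    "increment": (increment, [(lambda out: True, "double")]),
    "double": (double, [(lambda out: out["value"] < 20, "increment")]),
}


def run_graph(graph, start, initial, workdir):
    """Walk the graph, passing data between blocks through files on disk."""
    workdir = Path(workdir)
    in_path = workdir / "input.json"
    in_path.write_text(json.dumps(initial))
    node, step, trace = start, 0, []
    while node is not None:
        func, connections = graph[node]
        out_path = workdir / f"{step:03d}_{node}.json"
        out = run_block(func, in_path, out_path)
        trace.append((node, out["value"]))
        # Pick the next node from the first connection whose condition is met.
        node = next((target for cond, target in connections if cond(out)), None)
        in_path, step = out_path, step + 1
    return trace


with tempfile.TemporaryDirectory() as tmp:
    trace = run_graph(GRAPH, "increment", {"value": 1}, tmp)
    saved_files = sorted(p.name for p in Path(tmp).iterdir())

print(trace)
print(saved_files)
```

Because every intermediate result is kept as a separate file, a failing block could be debugged by calling run_block on that block alone with the saved input file, which is the property the second bullet describes.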

More details and technical specifications of the platform can be found in the forum on the website of the project.

I am currently looking for collaborators and feedback to help me improve the software and make it useful to as many people as possible. Let me know if you need any new features. Participating is easy: fork the code on GitHub and start extending the code base, or report any issues you have.

I am also developing a production-stage algorithmic trading system using this platform, so leave a comment on the forum of the website if you are interested in that trading code being open sourced.
