This came up recently on Stack Overflow. One of the answers was particularly helpful, and I plan to adopt its approach in my future work. In fact, it is close to what I already do, but a little more structured.
The idea is to break the code into four files, all stored in your project directory. These four files are to be processed in the following order.
- load.R: This file includes all code associated with loading the data. Usually, it will be a short file that reads in data from files.
- clean.R: This is where you do all the pre-processing of the data, such as taking care of missing values, merging data frames and handling outliers. By the end of this file, the data should be in a clean state, ready to use. It is much better to clean the data here than to edit the original data file, as this gives you a complete record of everything done to the data.
- functions.R: All of the functions needed to perform the actual analysis are stored here. This file should do nothing other than define the functions you need for analysis. (If you require your own functions for loading or cleaning the data, include them at the top of either load.R or clean.R.) In particular, functions.R should not do anything to the data. This means that you can modify this file and reload it without having to go back and repeat the loading and cleaning steps, which can take a long time to run for large data sets.
- Here is the code to actually do the analysis. This file will use the functions defined in functions.R to do the calculations, produce figures and tables, etc. All figures and tables that end up in your report, paper or thesis should be coded here. Never create figures and tables manually (i.e., with the mouse and menus), as then you can’t easily reproduce them.
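To make the division of labour concrete, here is a minimal sketch of the four stages collapsed into one script. It is only an illustration under assumptions: the inline data frame stands in for reading a real data file, and the names `raw`, `clean` and `summarise_value` are made up for this example.

```r
# --- load.R: read the raw data ---
# (an inline data frame stands in for a file read; the values are made up)
raw <- data.frame(
  id    = c(1, 2, 3, 4),
  value = c(10.5, NA, 12.0, 11.5)
)

# --- clean.R: pre-process, leaving the raw object untouched ---
clean <- raw[!is.na(raw$value), ]  # drop rows with missing values

# --- functions.R: define functions only; nothing here touches the data ---
summarise_value <- function(df) {
  c(n = nrow(df), mean = mean(df$value))
}

# --- fourth file: run the analysis using the functions defined above ---
result <- summarise_value(clean)
print(result)
```

Because the function definitions never touch `raw` or `clean`, that part can be edited and re-run on its own without repeating the load and clean stages.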
It is a good idea to save your workspace after each file is run.
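One way to do this is sketched below, assuming a hypothetical file name `after_clean.RData` and a temporary directory in place of the project directory. At the top level of a session, `save.image(file = ...)` is the usual shortcut for saving the whole workspace; the sketch uses the equivalent `save(list = ls(), ...)` form.

```r
# A small object standing in for the result of the cleaning stage (made-up values).
clean <- data.frame(id = 1:3, value = c(10.5, 12.0, 11.5))

# Save every object in the current environment once the stage has run.
# "after_clean.RData" is a hypothetical file name.
ws_file <- file.path(tempdir(), "after_clean.RData")
save(list = ls(), file = ws_file)

# Later, restore the saved objects instead of re-running the earlier stages.
rm(clean)
load(ws_file)
print(nrow(clean))
```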
There are many advantages to this setup. First, you don’t have to reload the data each time you make a change in a subsequent step. Second, if you come back to an old project, you will be able to work out what was done relatively quickly. It also forces a certain amount of structured thinking about what you are doing, which is helpful.
Often there will be bits and pieces of code that you write, but don’t end up using, yet don’t want to delete. These should either be commented out or saved in files with other names. All analysis from reading data to producing the final results should be reproducible by simply source()ing these four files in order with no further user intervention.
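A driver for this could look something like the sketch below. The fourth file name, `do.R`, is an assumption made for this example, and the stub files written to a temporary directory are placeholders so that the sketch runs on its own; in a real project the four files would already exist in the project directory.

```r
# Hypothetical project directory: a temporary directory with stub files
# stands in for a real project so that this sketch is self-contained.
project_dir <- file.path(tempdir(), "project")
dir.create(project_dir, showWarnings = FALSE)

writeLines("raw <- data.frame(x = c(1, NA, 3))",
           file.path(project_dir, "load.R"))
writeLines("clean <- raw[!is.na(raw$x), , drop = FALSE]",
           file.path(project_dir, "clean.R"))
writeLines("total_x <- function(df) sum(df$x)",
           file.path(project_dir, "functions.R"))
writeLines("result <- total_x(clean)",
           file.path(project_dir, "do.R"))  # assumed name for the analysis file

# The driver itself: source the four files in order, with no user intervention.
stages <- c("load.R", "clean.R", "functions.R", "do.R")
for (stage in stages) {
  source(file.path(project_dir, stage))
}
print(result)
```

Keeping the file order in a single vector means the whole analysis can be re-run from scratch with one loop, which is a quick check that nothing depends on leftover objects from an interactive session.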