prepdat: Preparing Experimental Data for Statistical Analysis
Guest post by Ayala S. Allon, School of Psychological Sciences, Tel-Aviv University
prepdat is an R package that helps researchers optimize and speed up their analysis by providing various cross sections of the data, making the results easier to understand. prepdat was created by Ayala S. Allon and Roy Luria. The full paper about prepdat was published in the Journal of Open Research Software on Nov 25, 2016, and can be downloaded here.
To better understand the abilities of the package let’s look at an example. Look at the two distributions in the figure below. The difference between the means of these two distributions is significant (t(20) = 8.65, p < 0.001). As can be seen, the “mdvc1” distribution is right skewed, and the “mdvc5” distribution is not skewed.
It could be that the “mdvc1” distribution, because of its skewness, is not best characterized by the mean, but rather by another dependent measure such as the 25th percentile. As such, one should consider examining and testing different cross sections and dependent measures of the data, because they can provide information about the source of the effect in question.
Yet in many studies, the comparison between experimental conditions at the statistical inference stage is done directly on the means, without examining other cross sections of the data.
prepdat, using the prep() function, outputs various dependent measures of the dependent variable (e.g., means after rejecting observations according to a flexible standard deviation criterion, and percentiles), enabling the user to better understand the results. In addition, prep() makes it possible to aggregate raw data tables in long format according to any number of grouping variables (i.e., independent variables).
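To make the idea of aggregating a long-format table concrete, here is a minimal base R sketch. It does not call prep() and is not how prepdat is implemented; the data frame and the column names (subject, condition, rt) are hypothetical and exist only for illustration.

```r
# Hypothetical long-format raw data: one row per trial.
set.seed(1)
raw <- data.frame(
  subject   = rep(1:3, each = 8),
  condition = rep(c("congruent", "incongruent"), times = 12),
  rt        = round(runif(24, 300, 900))
)

# Mean rt for every subject-by-condition cell (still in long format).
cell_means <- aggregate(rt ~ subject + condition, data = raw, FUN = mean)

# Reshape to wide format: one row per subject, one column per condition.
wide <- reshape(cell_means, idvar = "subject", timevar = "condition",
                direction = "wide")
print(wide)
```

The resulting table has one row per subject and one mean column per level of the grouping variable, which is the general shape of the finalized table described in this post.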
prep()
Aggregating a raw data table involves reducing the amount of data to the desired level of information, resulting in a finalized table, usually in wide format, in which each row refers to a specific subject (the variable that identifies the unit upon which the measurement took place; i.e., the id variable), and each cell usually reflects the averaged performance of that subject according to the desired grouping variables (i.e., the independent and dependent variables). This finalized table often contains only selected variables relative to the raw data table.
For each aggregated cell in the finalized table, prep() will output:
- Means before and after rejecting observations according to a flexible standard deviation criterion.
- Number of rejected observations according to the flexible standard deviation criterion.
- Proportions of rejected observations according to the flexible standard deviation criterion.
- Number of observations before rejection.
- Standard deviations.
- Medians.
- Additional percentiles (e.g., the 5th, 25th, 75th, and 95th percentiles, specified as 0.05, 0.25, 0.75, and 0.95).
- Means after rejecting observations according to procedures described in Van Selst & Jolicoeur (1994; suitable when measuring reaction-times).
- Harmonic means.
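Several of the measures listed above can be sketched in base R for a single cell (one subject in one condition). This is only an illustration under stated assumptions: the reaction times are made up, reject_by_sd() is a hypothetical helper, and the rule used here (drop observations farther than a given number of standard deviations from the cell mean) is one common reading of a standard deviation criterion, not necessarily the exact procedure prep() applies.

```r
# Hypothetical reaction times (ms) for one subject in one condition.
rt <- c(420, 455, 470, 480, 495, 510, 530, 560, 610, 1450)

# Hypothetical helper: keep observations within `criterion` SDs of the mean.
reject_by_sd <- function(x, criterion) {
  keep <- abs(x - mean(x)) <= criterion * sd(x)
  list(
    mean_after = mean(x[keep]),  # mean after rejection
    n_rejected = sum(!keep),     # number of rejected observations
    p_rejected = mean(!keep)     # proportion of rejected observations
  )
}

mean_before <- mean(rt)
after_2sd   <- reject_by_sd(rt, criterion = 2)

# Median and additional percentiles, given as probabilities.
cell_median      <- median(rt)
cell_percentiles <- quantile(rt, probs = c(0.05, 0.25, 0.75, 0.95))

# Harmonic mean: number of observations divided by the sum of reciprocals.
harmonic_mean <- length(rt) / sum(1 / rt)

print(after_2sd)
print(cell_percentiles)
print(harmonic_mean)
```

With a single extreme value such as 1450 ms, the mean before rejection is pulled upward, while the median, the lower percentiles, and the harmonic mean are much less affected. This is why it is worth inspecting more than one dependent measure.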
prep() is suitable for aggregating various types of experimental designs, such as between-subjects designs, within-subjects (i.e., repeated measures) designs, and mixed designs (i.e., designs that combine between-subjects and within-subjects independent variables). prep() is very easy to use: it only involves filling in various arguments and, when needed, making changes to the default procedures for removing outliers.
prep() accepts the following arguments:
prep(
    dataset = NULL,
    file_name = NULL,
    file_path = NULL,
    id = NULL,
    within_vars = c(),
    between_vars = c(),
    dvc = NULL,
    dvd = NULL,
    keep_trials = NULL,
    drop_vars = c(),
    keep_trials_dvc = NULL,
    keep_trials_dvd = NULL,
    id_properties = c(),
    sd_criterion = c(1, 1.5, 2),
    percentiles = c(0.05, 0.25, 0.75, 0.95),
    outlier_removal = NULL,
    keep_trials_outlier = NULL,
    decimal_places = 4,
    notification = TRUE,
    dm = c(),
    save_results = TRUE,
    results_name = "results.txt",
    results_path = NULL,
    save_summary = TRUE
)
file_merge()
In many research fields, the outcome of running an experiment is a raw data file (e.g., a text file) for each subject, containing a table in which each row describes one trial conducted during the experiment. For example, in Experimental Psychology, this file will contain a numerical description of the subject’s performance in the various experimental conditions. The columns in this raw data table will describe the independent variables, dependent variables, and various characteristics of the subject and the experiment (e.g., age, gender, and a numerical description of the stimulus in the experiment). The rows will describe the observations (i.e., trials) conducted during the experiment, such that each row corresponds to one observation. Usually, this raw data table has over a hundred lines, and the number of raw data files corresponds to the number of subjects in a given experiment. The next step (before aggregating the data into one finalized table) is to merge all files into one big raw data table. file_merge() makes it possible to merge the individual raw data files into one big raw data table containing a ‘chain’ of raw data from all subjects, one after the other.
file_merge() accepts the following arguments:
file_merge(
    folder_path = NULL,
    has_header = TRUE,
    new_header = c(),
    raw_file_name = NULL,
    raw_file_extension = NULL,
    file_name = "dataset.txt",
    save_table = TRUE,
    dir_save_table = folder_path,
    notification = TRUE
)
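The merging step itself can be sketched in base R as follows. This is a conceptual illustration only, not the implementation of file_merge(); the folder, the file names (subject1.txt and so on), and the columns are hypothetical and are created in a temporary directory so the example is self-contained.

```r
# Create a temporary folder holding three hypothetical per-subject files.
folder <- file.path(tempdir(), "raw_data_sketch")
dir.create(folder, showWarnings = FALSE)
for (s in 1:3) {
  one_subject <- data.frame(subject = s, trial = 1:4, rt = 400 + 10 * s + 1:4)
  write.table(one_subject, file.path(folder, paste0("subject", s, ".txt")),
              row.names = FALSE, quote = FALSE, sep = "\t")
}

# List the per-subject files, read each one, and stack them row-wise
# into one big raw data table.
files  <- list.files(folder, pattern = "^subject[0-9]+\\.txt$", full.names = TRUE)
tables <- lapply(files, read.table, header = TRUE, sep = "\t")
merged <- do.call(rbind, tables)

print(dim(merged))
```

Each subject's trials follow the previous subject's trials in the merged table, which mirrors the ‘chain’ of raw data described above.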
Installation
To install prepdat from CRAN:
install.packages("prepdat")
To install the most current version of prepdat, sometimes even before its official release on CRAN:
devtools::install_github("ayalaallon/prepdat")
To load prepdat in a current R session:
library(prepdat)
Summary
To summarize, prepdat enables the user to easily and quickly merge (using file_merge()) and aggregate (using prep()) raw data tables, while keeping track of and summarizing every step of the preparation.