I recently had a task to look at some assessment (audit) data. I was assuming, or rather hoping, that the data would be normally distributed, and thought it would be a quick case of running a Pearson correlation between two columns: “Duration” and “Score”. This was just conjecture at that point, as I did not yet understand what the assessment process that generated the scores and durations entailed, but I proposed that in general an assessment’s “Score” might decrease as “Duration” increases (a negative correlation). This could be explained by:
- The person conducting the assessment being more thorough, experienced, or skilled
- There being more negative findings, and thus more to assess
- The criticality of specific issues being so blatant that it prompts the assessor to delve into something they’d otherwise dismiss
- The first 3 questions out of an assessment of 30 questions being wrong, which changes the posture of the auditor for the rest of the audit. Similar to the difference between velocity and acceleration in physics, the more questions you get wrong up front, the higher your probability of a lower overall score.
- Causation: X is wrong, which requires the assessor to examine Y because the two are integral, and so on
I did not necessarily start with a hypothesis. Starting with a .csv and working in R, the task would include:
Check Model Assumptions:
1. Check the form of the model.
2. Check for outliers
It is well known that simply removing outliers is poor statistical practice, as they might be telling a story or highlighting an inherent flaw in the system or process that obtains or generates the data. This specific dataset had conspicuous problems, such as auditors who forgot to end the audit, leading to inordinate duration values. These problems stemmed from the initial implementation process, for example a lack of auditor training.
A business can reap benefits from a data analyst who strives to reduce variation, not just model it. It is a better use of time to discover the underlying causes of variation than to massage data in search of the correct distribution or transformation method for making low-score predictions.
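One way to act on this is to flag suspicious durations for review instead of deleting them. The sketch below is only an illustration: the data values are invented, and `flag_outliers()` is a hypothetical helper that applies Tukey's 1.5 × IQR rule. Only the “Duration” and “Score” column names come from the dataset described above.

```r
# Invented audit data: duration in minutes, score in percent.
audits <- data.frame(
  Duration = c(32, 41, 38, 45, 29, 36, 50, 43, 39, 1440),  # last audit was never closed
  Score    = c(88, 76, 81, 70, 92, 85, 64, 73, 80, 79)
)

# Hypothetical helper: TRUE for values beyond k * IQR from the quartiles.
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

audits$DurationOutlier <- flag_outliers(audits$Duration)

# Review the flagged rows to find the cause, rather than dropping them.
print(audits[audits$DurationOutlier, ])
```

Keeping the flag as a column leaves the original rows intact, so the flagged audits can be traced back to a cause such as missing auditor training.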
3. Check for independence.
4. Check for constant variance.
5. Check for normality.
The question data scientists expect the normality test to answer is: does the data deviate enough from the Gaussian ideal to forbid the use of tests that assume Gaussian distributions? Scientists intend for the normality test to indicate when to abandon conventional tests (ANOVA, etc.) and instead analyze transformed data or use a rank-based non-parametric test, resampling, or a bootstrap approach.
Generally testing for normality (or any distributional assumption) should consist of two parts:
a. Graphical inspection of data via either a normal probability plot or density estimator plot
b. Formal goodness-of-fit test such as the Shapiro-Wilk, Anderson-Darling, or Cramér-von Mises
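The two parts above can be sketched in base R as follows. This is a minimal example on simulated data, not the real audit data: the variable names and distribution parameters are assumptions chosen for illustration. `qqnorm()`, `qqline()`, and `shapiro.test()` ship with base R. To my knowledge the Anderson-Darling and Cramér-von Mises tests are available as `ad.test()` and `cvm.test()` in the separate nortest package, which is not used here.

```r
set.seed(42)  # reproducible simulated data, not the real audit data
scores    <- rnorm(200, mean = 75, sd = 10)  # roughly Gaussian
durations <- rexp(200, rate = 1 / 40)        # right-skewed

# a. Graphical inspection: normal probability (Q-Q) plots.
#    Drawn only in an interactive session so the script also runs headless.
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  qqnorm(scores, main = "Score")
  qqline(scores)
  qqnorm(durations, main = "Duration")
  qqline(durations)
  par(op)
}

# b. Formal goodness-of-fit test: Shapiro-Wilk.
sw_scores    <- shapiro.test(scores)
sw_durations <- shapiro.test(durations)

# W close to 1 is consistent with normality; a small p-value is evidence against it.
print(c(W = sw_scores$statistic[["W"]], p = sw_scores$p.value))
print(c(W = sw_durations$statistic[["W"]], p = sw_durations$p.value))
```

Points that bend away from the reference line in the Q-Q plot usually tell you more about how the data departs from normality than the p-value alone.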
I prefer a visual representation that conveys the distribution of the data over formal tests. It seems that most data analysts align themselves with George Box’s thoughts that, “To make a preliminary test on variances is rather like putting to sea in a row boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!”
I prefer a density plot over truehist() from the MASS package:
> d <- density(data)  # returns the kernel density estimate
> plot(d)
Deciding on the type of transformation to make data “normal”
Once we determine the distribution, we decide whether to transform the variables. The rationale might be to make the outcome more normally distributed, to equalize the outcome variance, or to linearize predictor effects. The drawback is that the original, untransformed variables might be more interpretable or credible, for example cost on its natural scale versus log cost.
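As a rough illustration of that trade-off, here is a sketch of a log transformation applied to simulated right-skewed durations. The log-normal data and its parameters are assumptions made for the example. A log transform is only one candidate and is not necessarily the right choice for the real dataset.

```r
set.seed(1)  # simulated right-skewed durations in minutes, not the real data
duration <- rlnorm(300, meanlog = log(40), sdlog = 0.6)

log_duration <- log(duration)

# Shapiro-Wilk W before and after the transformation (closer to 1 is more normal).
w_raw <- shapiro.test(duration)$statistic[["W"]]
w_log <- shapiro.test(log_duration)$statistic[["W"]]
print(c(raw = w_raw, log = w_log))

# The interpretability cost: a mean computed on the log scale back-transforms
# to the geometric mean, not the arithmetic mean on the natural scale.
geometric_mean  <- exp(mean(log_duration))
arithmetic_mean <- mean(duration)
print(c(geometric = geometric_mean, arithmetic = arithmetic_mean))
```

The second comparison is the drawback in practice: results reported on the log scale do not map back to the familiar average duration.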
Select a correlation test
I won’t go into the full details of how to implement each step in R or the proofs behind each mathematical method.
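Still, as a rough sketch of what this step might look like, the example below runs both a Pearson and a rank-based Spearman test with base R's `cor.test()`. The data is simulated with a built-in negative relationship, so the coefficients and noise levels are assumptions for illustration and say nothing about the real audit data.

```r
set.seed(7)  # simulated data with a built-in negative relationship
n <- 100
duration <- rlnorm(n, meanlog = log(40), sdlog = 0.5)    # skewed, in minutes
score    <- 100 - 10 * log(duration) + rnorm(n, sd = 3)  # drops as duration grows

# Pearson measures linear association and, for inference, assumes
# approximately normal data.
pearson <- cor.test(duration, score, method = "pearson")

# Spearman works on ranks, so it tolerates skew and monotone non-linear
# relationships.
spearman <- cor.test(duration, score, method = "spearman")

print(c(pearson_r    = pearson$estimate[["cor"]],
        spearman_rho = spearman$estimate[["rho"]]))
print(c(pearson_p = pearson$p.value, spearman_p = spearman$p.value))
```

When “Duration” is heavily skewed, as it would be with audits left open by mistake, the rank-based test is usually the safer default, and comparing the two coefficients shows how much the skew affects the Pearson result.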
To be continued…