# Blog Archives

## Exploratory Data Analysis – All Blog Posts on The Chemical Statistician

This series of posts introduced various methods of exploratory data analysis, providing theoretical backgrounds and practical examples.  Fully commented and readily usable R scripts are available for all topics for you to copy and paste for your own analysis!  Most of these posts involve data visualization and plotting, and I include a lot of detail and

## Performing Logistic Regression in R and SAS

Introduction: My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression.  Unfortunately, that advice has turned out to vastly underestimate the

## Online index of plots and corresponding R scripts

Dear Readers of The Chemical Statistician, While working in my job at the British Columbia Cancer Agency, I learned about a wonderful new data visualization resource from a colleague who works at the British Columbia Centre for Disease Control.  I want to share this with you, as I think that it will help you immensely in your efforts

## The Chi-Squared Test of Independence – An Example in Both R and SAS

Introduction: The chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data.  Given 2 categorical random variables, the chi-squared test of independence determines whether or not there exists a statistical dependence between them.  Formally, it is a hypothesis test with the following null and

## Side-by-Side Box Plots with Patterns From Data Sets Stacked by reshape2 and melt() in R

Introduction: A while ago, one of my co-workers asked me to group box plots by plotting them side-by-side within each group, and he wanted to use patterns rather than colours to distinguish between the box plots within a group; the publication that will display his plots prints in black-and-white only.  I gladly investigated how to

## Useful Functions in R for Manipulating Text Data

Introduction: In my current job, I study HIV at the genetic and biochemical levels.  Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of manipulating text.  (Strictly speaking, I analyze sequences of nucleotides from DNA that are reverse-transcribed from

## Rectangular Integration (a.k.a. The Midpoint Rule)

Introduction: Continuing on the recently born series on numerical integration, this post will introduce rectangular integration.  I will describe the concept behind rectangular integration, show a function in R for how to do it, and use it to check that the distribution actually integrates to 1 over its support set.  This post follows from my
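
As a rough illustration of the idea in this teaser, here is a minimal sketch of the midpoint rule in R. The function name `midpoint_integrate` and its arguments are hypothetical, not taken from the original post, and because the excerpt does not name the distribution being checked, the Beta(2, 5) density is used here purely as an assumed example.

```r
# Midpoint (rectangular) rule: split [a, b] into n equal-width strips
# and evaluate f at the midpoint of each strip.
# NOTE: midpoint_integrate is a hypothetical helper, not the post's function.
midpoint_integrate <- function(f, a, b, n = 1000) {
  h <- (b - a) / n                     # width of each rectangle
  mids <- a + h * (seq_len(n) - 0.5)   # midpoints of the n strips
  h * sum(f(mids))                     # total area of the rectangles
}

# Assumed example: the Beta(2, 5) density on its support [0, 1]
beta_density <- function(x) dbeta(x, shape1 = 2, shape2 = 5)
area_mid <- midpoint_integrate(beta_density, 0, 1)
print(area_mid)
```

The printed value should be very close to 1, which is what a valid probability density requires over its support.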

## Trapezoidal Integration – Conceptual Foundations and a Statistical Application in R

Introduction: Today, I will begin a series of posts on numerical integration, which has a wide range of applications in many fields, including statistics.  I will start with trapezoidal integration by discussing its conceptual foundations, write my own R function to implement trapezoidal integration, and use it to check that the Beta(2, 5) probability density
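
For readers who want a quick feel for the technique before opening the post, the following is a minimal sketch of trapezoidal integration in R, applied to the Beta(2, 5) density mentioned above. The function name `trapezoid_integrate` and the grid size are assumptions made for illustration; the post's own function may differ.

```r
# Trapezoidal rule on a grid: each pair of neighbouring points
# (x[i], y[i]) and (x[i+1], y[i+1]) defines one trapezoid.
# NOTE: trapezoid_integrate is a hypothetical helper, not the post's function.
trapezoid_integrate <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 2)
  widths <- diff(x)                             # base of each trapezoid
  mean_heights <- (y[-1] + y[-length(y)]) / 2   # average of the two sides
  sum(widths * mean_heights)
}

# Evaluate the Beta(2, 5) density on an evenly spaced grid over [0, 1]
grid <- seq(0, 1, length.out = 1001)
area_trap <- trapezoid_integrate(grid, dbeta(grid, shape1 = 2, shape2 = 5))
print(area_trap)
```

With this grid the result should differ from 1 only by a small discretization error, and a finer grid shrinks that error further.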

## Detecting an Unfair Die with Bayes’ Theorem

Introduction: I saw an interesting problem that requires Bayes’ Theorem and some simple R programming while reading a bioinformatics textbook.  I will discuss the math behind solving this problem in detail, and I will illustrate some very useful plotting functions to generate a plot from R that visualizes the solution effectively. The Problem: The following question is

## Exploratory Data Analysis: Quantile-Quantile Plots for New York’s Ozone Pollution Data

Introduction: Continuing my recent series on exploratory data analysis, today’s post focuses on quantile-quantile (Q-Q) plots, which are very useful plots for assessing how closely a data set fits a particular distribution.  I will discuss how Q-Q plots are constructed and use Q-Q plots to assess the distribution of the “Ozone” data from the built-in
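
To give a sense of how a Q-Q plot is built, here is a small sketch in R. It assumes the data in question are the `Ozone` column of R's built-in `airquality` data set and compares them against a normal distribution; the choice of reference distribution is an assumption for illustration, and the post may well use a different one.

```r
# Assumption: the "Ozone" data are airquality$Ozone from base R's datasets.
ozone <- airquality$Ozone
ozone <- ozone[!is.na(ozone)]   # drop missing measurements

# Compute the Q-Q coordinates without drawing:
# x holds theoretical normal quantiles, y holds the observed values.
qq <- qqnorm(ozone, plot.it = FALSE)

# Draw the plot with a reference line through the first and third quartiles
plot(qq$x, qq$y,
     xlab = "Theoretical normal quantiles",
     ylab = "Sample quantiles of Ozone",
     main = "Normal Q-Q plot of Ozone")
qqline(ozone)
```

Points that bend away from the reference line indicate where the sample departs from the assumed distribution.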