Blog Archives

Easy quick PCA analysis in R

May 22, 2019

Principal component analysis (PCA) is very useful for basic quality control (e.g. looking for batch effects) and for assessing how the data are distributed (e.g. finding outliers). A straightforward approach is to write your own wrapper function around prcomp and ggplot2; another is to use the one that comes with M3C (https://bioconductor.org/packages/devel/bioc/html/M3C.html)
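As a rough illustration of the "own wrapper" route, here is a minimal sketch. The function name quick_pca, its arguments, and the choice of the built-in iris data are assumptions made for this example, not code from the post or from M3C. It assumes samples are in rows and features in columns.

```r
library(ggplot2)

# Hypothetical helper: run prcomp on a numeric matrix/data frame
# (samples in rows) and build a PC1 vs PC2 scatter plot.
quick_pca <- function(x, groups = NULL, scale. = TRUE) {
  pca <- prcomp(x, center = TRUE, scale. = scale.)

  # Percentage of total variance captured by each component
  var_explained <- 100 * pca$sdev^2 / sum(pca$sdev^2)

  scores <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2])
  if (is.null(groups)) groups <- rep("all", nrow(scores))
  scores$group <- factor(groups)

  p <- ggplot(scores, aes(x = PC1, y = PC2, colour = group)) +
    geom_point(size = 2) +
    labs(x = sprintf("PC1 (%.1f%%)", var_explained[1]),
         y = sprintf("PC2 (%.1f%%)", var_explained[2])) +
    theme_bw()

  list(plot = p, scores = scores, var_explained = var_explained)
}

# Example on a built-in dataset; colouring by a known label makes
# batch-like structure and outliers easier to spot.
res <- quick_pca(iris[, 1:4], groups = iris$Species)
```

Calling print(res$plot) draws the figure; res$scores can be inspected directly to flag samples that sit far from the rest.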

Using clusterlab to benchmark clustering algorithms

January 15, 2019

Clusterlab is a CRAN package (https://cran.r-project.org/web/packages/clusterlab/index.html) for the routine testing of clustering algorithms. It can simulate positive controls (data-sets with >1 clusters) and negative controls (data-sets with 1 cluster). Why test clustering algorithms? Because they often fail to identify the true K in practice, published algorithms are not always well tested, and we need to know

Part 5: Code corrections to optimism corrected bootstrapping series

December 29, 2018

The truth is out there, R readers, but often it is not what we have been led to believe. The previous post examined the strong positive results bias in optimism corrected bootstrapping (a method of assessing a machine learning model’s predictive power) with increasing p (completely random features). There were 2 implementations of the method

Part 4: Why does bias occur in optimism corrected bootstrapping?

December 28, 2018

In the previous parts of the series we demonstrated a positive results bias in optimism corrected bootstrapping by simply adding random features to our labels. This problem is due to an ‘information leak’ in the algorithm, meaning the training and test datasets are not kept separate when estimating the optimism. Due to this, the optimism,

Part 3: Two more implementations of optimism corrected bootstrapping show shocking bias

December 27, 2018

Welcome to part III of debunking the optimism corrected bootstrap in high dimensions (a fairly high number of features) in the Christmas holidays. Previously we saw with a reproducible code implementation that this method is very biased when we have many features (50-100 or more). I suggest avoiding this method until at some point it has

Part 2: Optimism corrected bootstrapping is definitely biased, further evidence

December 26, 2018

Some people are very fond of the technique known as ‘optimism corrected bootstrapping’; however, this method is biased, and this becomes apparent as we increase the number of noise features to high numbers (as shown very clearly in my previous blog post). This needs exposing, I don’t have the time to do a publication on

Optimism corrected bootstrapping: a problematic method

December 25, 2018

There are lots of ways to assess how predictive a model is while correcting for overfitting. In caret the main methods I use are leave-one-out cross-validation, for when we have relatively few samples, and k-fold cross-validation when we have more. There is also another method called ‘optimism corrected bootstrapping’, that

Simulating NXN dimensional Gaussian clusters in R

August 25, 2018

Gaussian clusters are found in a range of fields, and simulating them is important because we often want to test a given class discovery tool's performance under conditions where the ground truth is known (e.g. K=6). However, a flexible Gaussian cluster simulator for simulating Gaussian clusters with defined variance, spacing, and size does not

How to perform consensus clustering without overfitting and reject the null hypothesis

August 21, 2018

The Monti et al. (2003) consensus clustering algorithm is one of the most widely used class discovery techniques in the genome sciences and is commonly used to cluster transcriptomic, epigenetic, proteomic, and a range of other types of data. It can automatically decide the number of classes (K) by resampling the data and for each

Forecasting the price of bitcoin with the CRAN forecast package

July 25, 2018

There is interest in bitcoin at the moment because it is displaying signs of steady year-to-year growth, with brief boosts followed by rapid declines. Investors consider it a risky investment, yet it has the potential for high returns over a fairly short period (1-2 years). John McAfee, inventor of McAfee anti virus
