Major Update for revss Package for R – v3.1.0
The revss Package for R Receives Major Update to Version 3.1.0
This is a big one! The revss package for R, which provides robust estimation for small samples, received a major, breaking update. The entire calculation engine was rewritten, new functionality added, and massive Monte Carlo analyses were run to calculate bias reduction factors. It should be at version 3.1.0, even though CRAN is showing 3.0.0. What happened and why? Where to even start?
Estimation
Normal Behavior
In most statistical endeavors, we rely on the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). These allow estimation and prediction of a variety of events with surprisingly good results. The CLT in particular describes the behavior of averages—well, standardized sums. The behavior of these averages becomes more and more “predictable” as more observations contribute to the average. Many times, statisticians have access to the volumes of data needed for reliance on these laws. They can treat their estimators as behaving normally, and by normal I mean having Gaussian behavior. Therefore, estimating the location and scale of the population from the sample is as easy as looking at the sample mean and standard deviation.
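As a quick illustrative sketch (the numbers here are arbitrary, not from the package), with a large Gaussian sample the naïve estimators land essentially on top of the true parameters:

```r
# With ample data, the sample mean and standard deviation are
# excellent estimators of a Gaussian population's location and scale.
set.seed(42)
x <- rnorm(1e5, mean = 10, sd = 3)  # large sample from N(10, 3)

mean(x)  # very close to 10
sd(x)    # very close to 3
```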
The Small Sample Problem
Sometimes, however, we may know that the underlying population is Gaussian but have too few samples to rely directly on the CLT. At these times, statisticians often need to turn to other measures of location and scale for robust estimation in small samples. These are measures which are less likely to be affected by an outlier—a much bigger problem at small sample sizes. One such class of measures is those which minimize the absolute value of the deviations rather than the square of the deviations. For location, the best known is the median. For scale, there are a few.
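A tiny sketch (with made-up numbers) of why this matters: in a small sample, a single outlier drags the mean and standard deviation far more than it moves the median and MAD.

```r
# Seven "good" points near 10, plus one gross outlier.
set.seed(1)
x <- c(rnorm(7, mean = 10, sd = 1), 50)

mean(x); sd(x)     # badly distorted by the single outlier
median(x); mad(x)  # barely affected
```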
Consider that when calculating a scale, there are three choices:
1. What to pick as the center
2. What penalty or distance function to apply
3. How to aggregate/distill the measure to a single value
The variance, for example, selects the mean for (1), squared distance for (2), and mean again for (3). For robust scale measures, one can pick the median as the center instead of the mean, the absolute distance as the penalty, and aggregate by median. Those three choices result in the median absolute deviation from the median, known more commonly as “median absolute deviation” or MAD. But there are many others simply by changing the choices above.
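The three choices can be made literal by passing them as functions. This is just an expository sketch, not how any package implements it; note the MAD produced this way is the raw statistic, without the consistency factor that `stats::mad` applies by default.

```r
# A scale measure as three choices: center, penalty, aggregator.
scale_measure <- function(x, center = mean, penalty = function(d) d^2,
                          aggregate = mean) {
  aggregate(penalty(x - center(x)))
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
scale_measure(x)  # mean squared deviation from the mean: 4
scale_measure(x, center = median, penalty = abs, aggregate = median)  # raw MAD: 0.5
```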
M-estimators and the R package
In robust estimation theory, there are a number of general classes of estimators, including M-, L-, and R-estimators. M-estimators, at the risk of oversimplifying, are maximum-likelihood-type estimators: estimators which minimize some kind of penalty. For example, the variance minimizes the squared distance between every point and the sample mean. In robust theory, there are other penalty functions. In 2002, Rousseeuw & Verboven (R&V) suggested an M-estimator-type algorithm for location and scale. Having implemented it in Excel for some reinsurance rating/ranking work I was doing, I decided to turn it into an R package, which I released in 2020. As the estimator sometimes uses MAD and sometimes the mean absolute deviation from the median, which R&V call the “average deviation from the median” or ADM, I built an ADM function as well.
The Rabbit Holes
Bias Factors?
Fast forward half a decade: in 2025, someone raised a GitHub issue with the package, correctly pointing out that R&V were using a bias-adjusted form of MAD in their work, often written MADn, with factors calculated by Croux & Rousseeuw (CR) in 1992. I figured it would be an easy fix: just add the factors, right? Wrong! Because then I thought, why limit it to MAD? Why not extend it to the following estimates:
- Mean Absolute Deviation from the Mean (MADM)
- Mean Absolute Deviation from the Median (MADMd / ADM)
- Median Absolute Deviation from the Mean (MdADM)
- Median Absolute Deviation from the Median (MdADMd / MAD)
- Robust Scale Estimate of CV with unknown location
- Robust Scale Estimate of CV with known location
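The first four metrics above are just the four center/aggregator combinations applied to absolute deviations. A sketch with arbitrary data (no bias factors applied; the two CV-style estimates, which divide a robust scale by a location estimate, are omitted here):

```r
x <- c(1, 2, 4, 8, 10)

madm   <- mean(abs(x - mean(x)))       # mean abs deviation from the mean (MADM)
madmd  <- mean(abs(x - median(x)))     # mean abs deviation from the median (ADM)
mdadm  <- median(abs(x - mean(x)))     # median abs deviation from the mean (MdADM)
mdadmd <- median(abs(x - median(x)))   # median abs deviation from the median (MAD)

c(madm, madmd, mdadm, mdadmd)  # 3.2, 3, 3, 3
```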
However, CR only published their factors for MAD. Moreover, they published three-digit factors, and I felt there should be accurate four-digit factors to match how the stats package uses 1.4826 in its implementation of MAD. So straight down the rabbit hole I went!
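For reference, that 1.4826 is the Gaussian consistency factor for MAD, which comes from the 75th percentile of the standard normal:

```r
# MAD's classical consistency factor for the Gaussian distribution:
1 / qnorm(0.75)            # 1.482602...
round(1 / qnorm(0.75), 4)  # 1.4826, the constant used by stats::mad
```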
Simulation Studies
Ideally, simulating small samples from a known distribution enough times generates a level of confidence that the bias for small samples can be accurately estimated, and then balanced out. In my case, I decided to look at all sample sizes from 2 through 100, based on Table 1 of CR. I would follow another idea of CR, which was to calculate specific factors, say b_n, for n ≤ 9, and a single asymptotic factor b_∞ which would be used to calculate b_n for n ≥ 10. I am currently writing a fuller treatment of the study, the factors, the results, and the package, which I hope to submit to the Journal of Statistical Software in the near future.
Monte Carlo Standard Errors
I first started with small-scale simulation studies to estimate the standard errors of the six metrics above. This would inform how many simulations would be needed, as the Monte Carlo Standard Error (MCSE) is defined as MCSE = σ/√R, where σ is the standard error of the parameter and R is the number of independent samples. For accuracy, one would want, at a minimum, the MCSE to be smaller than the desired precision. So for four-digit accuracy, the minimum acceptable MCSE would be strictly less than 10⁻⁴. Better would be for it to be less than half that amount (5 × 10⁻⁵), so that the estimate can be expected to “round” correctly. Of course, for every twofold reduction in MCSE you need four times the simulations, due to the square root.
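The arithmetic can be sketched directly (σ = 0.4 below is a purely hypothetical value, not one of the study's estimated standard errors):

```r
# MCSE = sigma / sqrt(R); solving for R gives R = (sigma / MCSE)^2.
mcse <- function(sigma, R) sigma / sqrt(R)
sims_needed <- function(sigma, target) round((sigma / target)^2)

sims_needed(0.4, 1e-4)  # 16 million simulations for a 1e-4 MCSE
sims_needed(0.4, 5e-5)  # halving the target quadruples R: 64 million
```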
How Many Simulations?!
The small-scale studies, repeated many times, indicated that to achieve the target MCSE, I would need something on the order of 660 million simulations for the individual factors and 66 million simulations for sample sizes between 10 and 100. With the package written in pure R, that would take days—if not weeks—to complete. That was not an option. So, the rabbit hole deepens; it’s refactoring time!
Complete Engine Rewrite
I left the R engine intact and built workhorse functions first in C and then in Fortran with a C bridge, using the original R as a check. I didn’t keep the timings, unfortunately, but moving all the heavy lifting to Fortran with thin C wrappers to shuttle the SEXPs proved fastest—even faster than doing the work directly in C. I have experience with this process, and it was not difficult to port the functionality. Once I brought the unit-test coverage back to 100% and was certain that the mechanism was correct, it was time for the micro-optimizations.
The Curmudgeon Uses the LLMs
Anyone following me for the past decade knows that I am an avowed skeptic of the proliferation of LLMs/AI. However, I have coded in R for over 20 years and in other languages supporting R packages for over 10 years. I think I can consider myself competent at creating R packages with compiled code. I’m also decent at optimizing code for speed. I’m no Donald Knuth, or even Tomas Kalibera. However, I do know how to envision data structures and make naïve data flows more efficient. So, I decided to make use of the more capable models, confident that I could spot the hallucinations, especially with a 100%-coverage test harness. Therefore, I used ChatGPT, Claude, and Grok. For the first two I used free accounts, but I could use the full Grok given my premium subscription to X.
Force Multipliers
I can, and probably should, write a completely separate blog post about my experiences, but in short, I was very impressed. The models do the job of scraping all the standard and non-standard sites—Stack Overflow, Rosetta Code, R-devel, etc. More than that, their pattern matching has developed to the point where they can provide meaningful suggestions. Yes, I found mistakes with each model; yes, they hallucinated; and yes, they can get obsequious when you catch them. And no, they still aren’t great at keeping the structure of the pseudocode in memory. That being said, there is no question that I learned a lot, and I will certainly apply some of these enhancements to my other packages. In my opinion, Grok responded the best overall, probably because it was the most advanced model available to me. Also, Claude consistently outperformed ChatGPT for the work I needed. Your mileage may vary.
Numbers, Numbers, and Even More Numbers!
After many iterations of updating and refining both the compiled and interpreted code, it was time to formulate the simulation itself. This also benefited from investigating different variate-generation and statistic-estimation schemas. I also decided that ensuring the MCSEs were under 10⁻⁴ was sufficient, so long as I disclosed it. The final schema consisted of two sets of simulations: one for the individual factors and one for the asymptotic factor.
Small Sample Factors
The individual-factor simulations consisted of 200M sets of the six metrics for each sample size between 2 and 9. In theory this would require 9.6B values, which exceeded even my 192GB of RAM. Instead, I used running-sum algorithms so I could discard the generated variates after each iteration. Even using 14 of my 16 cores in parallel, the process still took hours. That is much less than days, though!
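One common running-sum scheme—a sketch only; the package's exact accumulation scheme may differ—is Welford's online algorithm, which updates the mean and a sum of squared deviations one value at a time so nothing needs to be stored:

```r
# One-pass running-sum update: each value is folded into the state
# and can be discarded immediately, keeping memory flat.
welford_update <- function(state, x) {
  n     <- state$n + 1
  delta <- x - state$mean
  mu    <- state$mean + delta / n
  m2    <- state$m2 + delta * (x - mu)  # running sum of squared deviations
  list(n = n, mean = mu, m2 = m2)
}

state <- list(n = 0, mean = 0, m2 = 0)
for (x in c(2, 4, 4, 4, 5, 5, 7, 9)) state <- welford_update(state, x)

state$mean          # running mean: 5
state$m2 / state$n  # running (population) variance: 4
```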
Sample Asymptotes
For the asymptotic calculation, many fewer simulations were needed per sample size, since once n ≥ 10 the CLT starts peeking in again. This time, a sufficiently small MCSE for all estimates was possible using only 16.5M simulations. Of course, this was for 6 metrics and 91 different sample sizes, so it’s still over 9B values. So I used the same mean and error estimation algorithms.
Publish and Perish?
Armed with these simulations, I was able to generate more precise bias-reduction factors for all six scale metrics used in the revss package. Finally! I rewrote the functions admn, madn, robLoc, and robScale to use these factors, and I also allow use of the original CR factors for madn. See the package itself and the upcoming paper for more detail.
After these weeks of research and development, I settled on the engine and statistics of the package refactor. At the time, there were no other packages which depended on, imported, or even suggested revss, so I felt comfortable updating the major version and breaking the API. Once again, I spent hours bringing the test harness to 100% coverage and writing documentation. I roll my own Rd files; I prefer them to roxygen, or any of devtools for that matter. This ties back to my philosophy on dependency hell.
Repeating the change/build/test/check/covr cycle dozens of times, I thought I was ready for publication. I will put it down to my persistent insomnia and possible tiredness-induced euphoria, but I pushed to CRAN without changing the defaults to use the factors! That was a complete brain-freeze on my part. Thankfully, CRAN allowed me to push an update the next day. However, the updated version is still not live as of the publishing of this post.
Current State
So right now, the version on CRAN is the flawed one—3.0.0. It has conflicting information between its documentation and its functionality, stray unneeded functions, and some other embarrassing shortcomings. The proper release is 3.1.0, once it gets pushed to CRAN. For now, it lives on the master branch of its GitHub repository. The last commit is not tagged 3.1.0 because I only tag after CRAN acceptance. I learned that the hard way over the years! Once the CRAN version reads 3.1.0, it should be safe to download from CRAN as well. I will probably drop a blog note when that happens.
As always, I very much appreciate feedback, constructive criticism, and examples where my packages have helped others. Please reach out and let me know! Happy estimating!
References
Croux, C., and Rousseeuw, P. J. (1992), “Time-Efficient Algorithms for Two Highly Robust Estimators of Scale,” in Computational Statistics, eds. Y. Dodge and J. Whittaker, Heidelberg: Physica-Verlag HD, pp. 411–428.
Rousseeuw, P. J., and Verboven, S. (2002), “Robust estimation in very small samples,” Computational Statistics & Data Analysis, 40, 741–758. https://doi.org/10.1016/S0167-9473(02)00078-6.
The post Major Update for revss Package for R – v3.1.0 appeared first on Strange Attractors.