Articles by mark

Track changes in data with the lumberjack %>>%

June 23, 2017 | mark

So you are using this pipeline to have data treated by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R: data(retailers, … Continue reading → [Read more...]

Announcing the simputation package: make imputation simple

September 13, 2016 | mark

I am happy to announce that my simputation package has appeared on CRAN this weekend. This package aims to simplify missing value imputation. In particular it offers standardized interfaces that make it easy to define both imputation method and imputation … Continue reading → [Read more...]

stringdist 0.9.4.2 released

September 11, 2016 | mark

stringdist 0.9.4.2 was accepted on CRAN at the end of last week. This release just fixes a few bugs affecting the stringdistmatrix function when called with a single argument. From the NEWS file: bugfix in stringdistmatrix(a): value of p, for … Continue reading → [Read more...]

validate version 1.5 is out

June 24, 2016 | mark

A new version of the validate package for data validation was just accepted on CRAN and will be available on all mirrors in a few days. The most important addition is that you can now reference the data set as … Continue reading → [Read more...]

Easy data validation with the validate package

March 25, 2016 | mark

The validate package is our attempt to make checking data against domain knowledge as easy as possible. Here is an example. The summary gives an overview of the number of items checked. For an aggregated test, such as the … Continue reading → [Read more...]

settings 0.2.3

October 27, 2015 | mark

An updated version of the settings package has been accepted on CRAN. The settings package provides an alternative way to manage option settings in R. It aims to allow layered options management, where global options are the default that can easily … Continue reading → [Read more...]

stringdist 0.9: exercise all your cores

January 26, 2015 | mark

The latest release of the stringdist package for approximate text matching has two performance-enhancing novelties. First of all, encoding conversion got a lot faster since this is now done from C rather than from R. Secondly, stringdist now employs multithreading … Continue reading → [Read more...]

stringdist 0.8: now with soundex

August 22, 2014 | mark

An update to the stringdist package was released earlier this month. Thanks to a contribution from Jan van der Laan, the package now includes a method to compute soundex codes as defined here. Briefly, soundex encoding aims to translate words … Continue reading → [Read more...]

sort.data.frame

August 15, 2014 | mark

I came across this post on SO, where several solutions to sorting data.frames are presented. It must have been solved a million times, but here's a solution I like to use. It benefits from the fact that sort is an … Continue reading → [Read more...]

A bit of benchmarking with string distances

September 7, 2013 | mark

After my last post about the stringdist package, Zachary Mayer pointed out to me that the implementations of the Levenshtein and Jaro-Winkler distances in the RecordLinkage package are about two to three times faster. His benchmark compares randomly generated character strings … Continue reading → [Read more...]

Approximate string matching in R

August 9, 2013 | mark

I have released a new version of the stringdist package. Besides some new string distance algorithms, it now contains two convenient matching functions: amatch: Equivalent to R's match function but allowing for approximate matching. ain: Similar to R's %in% … Continue reading → [Read more...]

The stringdist package

February 26, 2013 | mark

String metrics have important applications in web search, spelling correction and computational biology amongst others. Many different metrics exist, but the most well-known are based on counting the number of basic edit operations it takes to turn one string into … Continue reading → [Read more...]

Deductive imputation with the deducorrect package

November 26, 2011 | mark

Missing data hinders statistical analyses. Estimating missing values (imputation) prior to analysis is one way to deal with that. In some cases, however, the missing values need not be estimated at all, since they can be derived with certainty from other … Continue reading → [Read more...]

A multidimensional “which” function

September 16, 2011 | mark

Update: Henrik Bengtsson commented that which(x, arr.ind=TRUE) gives the same result, rendering the blog below academic (thanks for the comment!). So, for academic interest, I'll leave it. In my defense, I implemented this kind of functionality in C some time … Continue reading → [Read more...]