Improving data quality with deducorrect

July 1, 2011

(This article was first published on Mark van der Loo, and kindly contributed to R-bloggers)

Does your raw numerical data suffer from typos? Sign errors? Variable swaps? Rounding errors? You may be able to fix all that with the deducorrect package. Today, we (that is, Edwin de Jonge, Sander Scholtus and myself) uploaded the 1.0-0 release to CRAN.

The deducorrect package implements methods to solve common errors in numerical data records. To detect errors, you first have to define the rules your data must obey. For example, suppose you have a data.frame with three columns: profit, turnover, and cost, subject to the rules that all values must be positive, the account balance cost + profit = turnover must hold, and the profit-to-turnover ratio may not exceed 0.6 (some kind of sanity check). The rules can be defined as follows:

E <- editmatrix(c(
    "cost > 0",
    "profit > 0",
    "turnover > 0",
    "cost + profit == turnover",
    "0.6*turnover >= profit")
)

Here, the editmatrix function from the editrules package was used to create an object of class editmatrix, which holds all the information about the restrictions.

Now let's look at some simple data.

dat <- data.frame(
    cost     = c(-100, 325, 326 ),
    profit   = c( 150, 457, 475 ),
    turnover = c( 250, 800, 800 )
)

Obviously, every record contains some error. In the first record "cost" is wrongly negative, the second appears to have a typo in the "profit" value and the third record has a rounding error in one of the variables.
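
To make the violations concrete without any package, here is a minimal base R sketch that evaluates each rule as a logical expression per record. The names `rules` and `check_rules` are hypothetical and exist only for this illustration; they are not part of editrules or deducorrect, which do their own checking on the editmatrix object.

```r
# Base R sketch: evaluate each rule as a logical expression per record.
# `rules` and `check_rules` are hypothetical names used for illustration only.
dat <- data.frame(
    cost     = c(-100, 325, 326),
    profit   = c( 150, 457, 475),
    turnover = c( 250, 800, 800)
)

rules <- c(
    "cost > 0",
    "profit > 0",
    "turnover > 0",
    "cost + profit == turnover",
    "0.6*turnover >= profit"
)

check_rules <- function(rules, dat) {
    # One column per rule, one row per record; TRUE means the rule is violated.
    violated <- sapply(rules, function(r) !eval(parse(text = r), envir = dat))
    matrix(violated, nrow = nrow(dat), dimnames = list(NULL, rules))
}

violations <- check_rules(rules, dat)
print(violations)
```

In this sketch every record fails the balance rule, and the first record additionally fails cost > 0.
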
Using functions from the deducorrect package, such errors can be repaired:
> (dat <- correctSigns(E,dat)$corrected)
  cost profit turnover
1  100    150      250
2  325    457      800
3  326    475      800

The sign error disappeared in the first record. Now let's fix the typo:
> (dat <- correctTypos(E,dat)$corrected)
  cost profit turnover
1  100    150      250
2  325    475      800
3  326    475      800
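
To see why 457 counts as a typo of 475, it helps to compute the distance between the two digit strings. The sketch below uses the optimal string alignment distance, a restricted variant of Damerau-Levenshtein that counts an adjacent transposition as a single edit. `osa_distance` is a hypothetical helper written for this post's example, and the suggested value is assumed to come from the balance rule; this is not necessarily how deducorrect computes it internally.

```r
# Optimal string alignment distance: a restricted Damerau-Levenshtein distance
# counting insertions, deletions, substitutions and adjacent transpositions.
# `osa_distance` is a hypothetical helper, not a function from deducorrect.
osa_distance <- function(a, b) {
    x <- strsplit(a, "")[[1]]
    y <- strsplit(b, "")[[1]]
    n <- length(x)
    m <- length(y)
    # d[i + 1, j + 1] is the distance between the first i characters of a
    # and the first j characters of b.
    d <- matrix(0, n + 1, m + 1)
    d[, 1] <- 0:n
    d[1, ] <- 0:m
    for (i in seq_len(n)) {
        for (j in seq_len(m)) {
            cost <- if (x[i] == y[j]) 0 else 1
            d[i + 1, j + 1] <- min(
                d[i, j + 1] + 1,   # deletion
                d[i + 1, j] + 1,   # insertion
                d[i, j] + cost     # substitution or match
            )
            if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
                # adjacent transposition
                d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + 1)
            }
        }
    }
    d[n + 1, m + 1]
}

observed  <- 457
suggested <- 800 - 325   # profit implied by cost + profit == turnover
typo_dist <- osa_distance(as.character(observed), as.character(suggested))
print(typo_dist)
```

A distance of 1 means a single keystroke slip (here, two swapped digits) explains the difference, so the suggestion is plausible as a typo correction.
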

And finally, the rounding error:

> correctRounding(E,dat)$corrected
  cost profit turnover
1  100    150      250
2  325    475      800
3  326    475      801

And now we have a completely consistent data set :) . So how does it all work? Well, basically, correcting signs works by a smart trial-and-error procedure, correcting typos works by deriving a set of correction suggestions and testing whether they correspond to typos (using the Damerau-Levenshtein distance), and correcting rounding errors works by randomly selecting a sufficient number of variables to change so that the data restrictions can be satisfied.
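
As a rough picture of the trial-and-error idea for signs, the base R sketch below flips the sign of ever larger subsets of variables until the record satisfies all rules. This is a brute-force illustration only: `rules_hold` and `flip_signs` are hypothetical helpers, and the search in deducorrect is smarter than this loop.

```r
# `rules_hold` and `flip_signs` are hypothetical helpers for illustration only;
# this brute-force loop is not the algorithm implemented in deducorrect.
rules_hold <- function(rec) {
    with(rec, cost > 0 && profit > 0 && turnover > 0 &&
              cost + profit == turnover && 0.6 * turnover >= profit)
}

flip_signs <- function(rec) {
    if (rules_hold(rec)) return(rec)   # nothing to repair
    vars <- names(rec)
    # Try flipping 1, 2, ... variables and stop at the first consistent record.
    for (k in seq_along(vars)) {
        for (s in combn(vars, k, simplify = FALSE)) {
            candidate <- rec
            candidate[s] <- lapply(candidate[s], function(v) -v)
            if (rules_hold(candidate)) return(candidate)
        }
    }
    rec   # no combination of sign flips satisfies the rules
}

record <- list(cost = -100, profit = 150, turnover = 250)
fixed  <- flip_signs(record)
print(unlist(fixed))
```

Trying the smallest subsets first means the sketch changes as few values as possible, which is the natural preference when repairing a record.
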

The package can handle multiple sign or typing errors, possibly masked by rounding errors as well. It logs all changes it makes in a deducorrect object, so changes to your data are reproducible.

Interested? Download the package and start trying! We included all relevant papers in the package documentation and there's an extensive vignette as well. We're happy to hear suggestions and bug reports. WARNING: SHAMELESS SELF-PROMOTION FOLLOWS: We're talking about the deducorrect and editrules packages at useR! 2011, so we hope to see you there!
