hash-2.0.0

April 30, 2010

(This article was first published on Open Data Group » R, and kindly contributed to R-bloggers)

The hash-2.0.0 package has been uploaded to CRAN. This version was developed in conjunction with R-2.11.0 and was refactored for performance. hash-2.0.0 requires R-2.10.0 or later and will not be supported on earlier versions of R. This is a result of recent changes to the language itself.

Important: hash-2.0.0 breaks backward compatibility. Code written with previous versions of the hash package is not guaranteed to work with this or future versions. This is due to changes made to achieve much higher performance: assignments and look-ups are faster thanks to direct inheritance from environments, the stripping of non-essential customizations, and reliance on core and primitive functions.
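To make the "inheritance of environments" point concrete, here is a minimal sketch of assignment and look-up on a plain hashed environment, the base R structure that hash objects build on. It uses only base R functions, not the hash package's own accessors, and the keys are made up for illustration.

```r
# A plain hashed environment (illustrative keys, base R only).
e <- new.env(hash = TRUE)

# Three equivalent ways to assign a value to a key.
assign("apple", 1, envir = e)
e[["banana"]] <- 2
e$cherry <- 3

# Look-ups go through primitive functions, with no extra dispatch layer.
apple_value <- get("apple", envir = e)
all_keys <- sort(ls(e))

# Check for a key without searching enclosing environments.
has_durian <- exists("durian", envir = e, inherits = FALSE)
```

Because the work is done by core and primitive functions, very little R-level code runs on each access, which is where a speed-up of this kind would come from.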

Here is a summary of major changes:

  • Coercion of keys to valid R names (i.e. non-blank character values) is now the responsibility of the user. The four accessor functions, [, [[, $ and values, no longer do this automatically. An error results if a proper R name is not provided.
  • The default for missing keys has changed from NA to NULL. This matches the behavior of lists when accessing non-existent elements in R. (For a more complete discussion, see my previous blog post on the differences between NA and NULL.)

    • Custom behavior for accessing non-existent keys has been removed. Access to non-existent keys will always yield NULL. Consistency is often better than customization.
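The behaviors that these changes align with can be seen in base R itself. The sketch below assumes nothing about the hash package's internals. It only shows how lists, named vectors and environments treat a missing key, and why a non-character key has to be coerced first. All object names are illustrative.

```r
# Illustrative objects holding the same single entry.
lst <- list(a = 1)
vec <- c(a = 1)
env <- new.env(hash = TRUE)
env[["a"]] <- 1

# Missing keys: lists and environments give NULL, named vectors give NA.
from_list <- lst[["missing"]]   # NULL
from_vec  <- vec["missing"]     # NA
from_env  <- env[["missing"]]   # NULL

# Keys must be character values, so coerce explicitly before use.
key <- as.character(42)
env[[key]] <- "answer"

# A numeric key is rejected by an environment rather than silently coerced.
numeric_key_result <- tryCatch(env[[42]], error = function(cond) "error")
```

Returning NULL for a missing key means a test such as is.null(x) works the same way on a list and on an environment-backed object.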

ChangeLog and TODO track many technical details; here I will discuss only the more important changes:

Performance

Included in this version is a demo script that runs benchmarks (demo(hash-benchmarks)). One question that has been posed repeatedly, often in the context of look-ups, is: how does this compare to native R named lists and vectors? In other words, how much quicker is accessing a value on a hash / environment than on a list (or vector)? This is a difficult question, and the answer generally depends on the size of the hash or list. My rule of thumb is that look-ups are quicker on lists and vectors with fewer than about 500 elements. Beyond ~500 elements, hashes and environments greatly outperform lists, and the difference grows with the size of the object. However, look-ups on all of these objects are very fast when the objects are small (>120,000 / sec). So unless you are doing many serial look-ups, hashes are likely the better option.
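For readers who want to try the comparison themselves, here is a rough timing sketch. It is not the package's benchmark demo. It compares a named list with a plain hashed environment using base R only, and the size n and the number of probes are arbitrary choices. Timings will vary by machine and R version.

```r
# Rough timing sketch: named list vs. hashed environment (base R only).
set.seed(1)
n <- 10000                                  # arbitrary size, well above ~500
keys <- paste0("key", seq_len(n))

lst <- setNames(as.list(seq_len(n)), keys)  # named list
env <- list2env(lst, envir = new.env(hash = TRUE))

probe <- sample(keys, 1000)                 # keys to look up

# Serial look-ups by name on each structure.
t_list <- system.time(for (k in probe) lst[[k]])
t_env  <- system.time(for (k in probe) env[[k]])

print(rbind(list = t_list, environment = t_env)[, "elapsed", drop = FALSE])
```

Rerunning with a smaller n (for example 100) gives a feel for where the crossover sits on a given machine.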

I have written previously about hashes in R [1] [2], and will continue to discuss the evolution of R hashes on this blog. Additionally, I will be speaking on this and related work at useR!2010 (July 20-23).
