I wanted yet another opportunity to get to use the fabulous caret package, but also to finally give plot.ly a try. To scratch both itches, I dipped into the UCI machine learning library yet again and came up with a

Inspired by this post about visualizing shrinkage on Coppelia, and this thread about visualizing mixed models on Stack Exchange, I started thinking about how to visualize shrinkage in more than one dimension. One might find themselves in this situation with a varying slope, varying intercept hierarchical (mixed effects) model, a model with two varying intercepts, etc. Then...

by Bob Horton, Data Scientist, Revolution Analytics. From electronic medical records to genomic sequences, the data deluge is affecting all aspects of health care. The Masters of Science in Health Informatics (MSHI) program at the University of San Francisco, now in its second year, is designed to help students develop the practical computing skills and quantitative perspicacity they need...

Last week version 1.0 of the caretEnsemble package was released to CRAN. I have co-authored this package with Zach Mayer, who had the original idea of allowing for ensembles of train objects in the caret package. The package is designed to make it easy...

Thanks to user dnlbrky, we now have a third way to sessionize log data for any arbitrary timeout period (see methods 1 and 2), this time using data.table from R along with magrittr for piping: I agree with dnlbrky that this feels a little better than the dplyr method for heavy SQL users
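As a rough illustration of what sessionizing with a timeout means, here is a minimal sketch in base R. It is not the data.table and magrittr code the post refers to; the `logs` data frame, its column names, and the 30-minute timeout are all made up for the example. A new session starts whenever the gap since a user's previous event exceeds the timeout.

```r
# Hypothetical log data: one row per event (names and values are illustrative)
logs <- data.frame(
  user = c("a", "a", "a", "b", "b"),
  ts = as.POSIXct(c("2015-01-01 10:00:00", "2015-01-01 10:10:00",
                    "2015-01-01 11:00:00", "2015-01-01 10:05:00",
                    "2015-01-01 10:50:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Assumed timeout: 30 minutes, expressed in seconds
timeout_secs <- 30 * 60

# Sort so that each user's events are in time order
logs <- logs[order(logs$user, logs$ts), ]

# Within each user, flag events whose gap to the previous event exceeds
# the timeout, then take a running count of those flags as the session id
logs$session <- ave(
  as.numeric(logs$ts), logs$user,
  FUN = function(t) cumsum(c(TRUE, diff(t) > timeout_secs))
)

print(logs)
```

Changing `timeout_secs` is all it takes to try a different timeout period; the same grouping-plus-running-count idea carries over to data.table or dplyr.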

"Would love to get a post from you for KDnuggets" (Gregory Piatetsky, KDnuggets President). Some days ago, Gregory Piatetsky invited me to write a post for KDnuggets. I couldn't say no. He suggested some topics to me, and I decided to experiment with climate change to demonstrate how easy it is to see some evidence of

Since the last CRAN release of Rcpp11, I've started to work on the next iteration of R/C++ support with Rcpp14 by propagating changes to both implementations, e.g. the Strict class that I mentioned in this post. But now, I'm starting to make unique chan...

Bruno Rodrigues teaches a class on applied econometrics at the University of Strasbourg, with a focus on implementing econometric concepts in the R language. Since many of the students don't have any previous programming background, he's put together a tutorial on the basics of applied econometrics with R. The first two chapters serve as a general-purpose beginners' introduction to...

A considerable share of Twitter accounts is not actually run by humans. According to a recent release by Twitter, 'up to approximately 8.5%' of the active users are bots or third-party software that automatically aggregates tweets. Bots can follow other users, retweet content or post content on their own. What they say is essentially generated by scripts. Take...

I am very glad that ggtree is now available via Bioconductor. This is my 6th Bioconductor package. ggtree now supports parsing output files from BEAST, PAML, HYPHY, EPA and PPLACER and can annotate phylogenetic trees directly using plot methods. Find out more at http://www.bioconductor.org/packages/3.1/bioc/html/ggtree.html and check out the vignette, http://www.bioconductor.org/packages/3.1/bioc/vignettes/ggtree/inst/doc/ggtree.html

Business Intelligence (BI) can be simply described as extracting useful information from data. This is quite a broad process, as the structure (and quality) of the source data can vary, as can the structure of the useful information. More technically, this transformation process can be described as ETL (extract, transform, load), plus presentation of the useful information. The...

It is an amusing coincidence that another MOOC that I took this week (Geospatial Intelligence & the Geospatial revolution) mentioned disasters. About the other course, see my recent Disasters: Myth or the Reality post. In Geospatial Intelligence they gave a weird assignment: one needs to mark the location on the world map where...

Every now and then we get reports from CRAN about our packages failing a test there. A challenging one concerns UBSAN, or Undefined Behaviour Sanitizer. For background on UBSAN, see this RedHat blog post for gcc and this one from LLVM about clang. I had written briefly about this before in a blog post introducing the...

D Kelly O’Day did a great post on charting NASA’s Goddard Institute for Space Studies (GISS) temperature anomaly data, but it sticks with base R for data munging & plotting. While there’s absolutely nothing wrong with base R operations, I thought a modern take on the chart using dplyr, magrittr & tidyr for data manipulation

In this post I will run the SAS example Logistic Regression Random-Effects Model in four R-based solutions: Jags, STAN, MCMCpack and LaplacesDemon. To quote the SAS manual: 'The data are taken from Crowder (1978). The Seeds data set is a 2 x 2 fa...

With this post I want to introduce my newly bred ‘onls’ package which conducts Orthogonal Nonlinear Least-Squares Regression (ONLS): http://cran.r-project.org/web/packages/onls/index.html. Orthogonal nonlinear least squares (ONLS) is an infrequently applied and perhaps overlooked regression technique that comes into play when one encounters an “error in variables” problem. While classical nonlinear least squares (NLS) aims

Last week, in our Inequality course, we've been looking at data. We started with some simulated data, only a few of them

    > library("ineq")
    > load(url("http://freakonometrics.free.fr/income_5.RData"))
    > (income=sort(income))
    19233 23707 53297 61667 218662

How could we say that there is inequality in this sample? If we look at the wealth owned by the poorest, the poorest person (1...
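The question of how much the poorest own can be made concrete with the five incomes printed above. The sketch below hard-codes those values instead of loading the remote file, and uses base R rather than the ineq package. It computes the cumulative income share held by the poorest k people (the points of a Lorenz curve) and a Gini index from the usual formula for sorted data. The variable names are illustrative and are not taken from the original post.

```r
# Five sorted incomes, copied from the printed output above
income <- c(19233, 23707, 53297, 61667, 218662)
n <- length(income)

# Cumulative share of the population (poorest first) and the
# cumulative share of total income those people hold
pop_share <- seq_len(n) / n
income_share <- cumsum(income) / sum(income)

lorenz <- data.frame(pop_share = pop_share,
                     income_share = round(income_share, 3))
print(lorenz)

# Gini index for sorted data (sample formula without small-sample correction):
# G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
gini <- 2 * sum(seq_len(n) * income) / (n * sum(income)) - (n + 1) / n
print(round(gini, 3))
```

With perfectly equal incomes, each row of `lorenz` would have `income_share` equal to `pop_share` and the Gini index would be 0, so the gap between the two columns is a direct reading of the inequality in the sample.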

One of the most frequently asked questions about the BayesFactor package is how to do multiple comparisons; that is, given that some effect exists across factor levels or means, how can we test whether two specific effects are unequal. In the next two posts, I'll explain how this can be done in two cases: in Part 1, I'll cover...

Building on Dockerising Open Data Databases – First Fumblings and my Book Extras – Data Files, Code Files and a Dockerised Application, I just figured out how to get the ergast db into a MySQL docker container and then query it from RStudio: download and unzip the f1db.sql.gz file to f1db.sql; install these docker-mysql-scripts; run

While skimming Professor Hadley Wickham's Advanced R I got to thinking about the nature of the square-bracket or extract operator in R. It turns out "[]" is a bit more irregular than I remembered. The subsetting section of Advanced R has a very good discussion on the subsetting and selection operators found in R. In particular

Almost any PC today is multicore. Dual-core is standard, quad-core is easily attainable for the home, and larger systems, say 16-core, are easily within reach of even smaller research projects. In addition, large multicore systems can be "rented" on Amazon EC2 and so on. The most popular way to program on multicore machines is to

I rolled out a big update to the rNOMADS package in R about two weeks ago. Now, the list of real-time weather, ocean, and sea ice models available through rNOMADS updates automatically by scraping the NOMADS web site. This way, changes in model inventories will be instantly reflected in rNOMADS without the need for

Harvard University is offering a free 5-week on-line course on Statistics and R for the Life Sciences on the edX platform. The course promises you will learn the basics of statistical inference and the basics of using R scripts to conduct reproducible research. You'll just need a background in basic math and programming to follow along and complete homework...