## Extracting and Enriching Ocean Biogeographic Information System (OBIS) Data with R

January 25, 2017

Programmatic access to biodiversity data is revolutionising large-scale, reproducible biodiversity research. In the marine realm, the largest global database of species occurrence records is the Ocean Biogeographic Information System, OBIS. As of January 2017, OBIS contains 47.78 million occurrences of 117,345 species, all openly available and accessible via the OBIS API. The number of questions to address...

## Modelling extremes using generalized additive models

January 25, 2017

Quite some years ago, whilst working on the EU Sixth Framework project Euro-limpacs, I organized a workshop on statistical methods for analyzing time series data. One of the sessions was on the analysis of extremes, ably given by Paul Northrop (UCL Department of Statistical Science). That intro certainly whetted my appetite but I never quite found the time to...

## A Glimpse into The Daily Life of a Data Scientist

January 24, 2017

A couple of weeks ago, I had a discussion with a co-worker regarding a project I was involved in, and I felt that there was no clear understanding of the daily challenges data scientists face. A few days later, I was at rstudio::conf 2017, where I met lots of data scientists from academia and industry. Later on, I described one of...

## a typo that went under the radar

January 24, 2017

A chance occurrence on X validated: a question about an incomprehensible formula for Bayesian model choice which, most unfortunately, appeared in Bayesian Essentials with R! Eeech! It looks like one line in our LaTeX file got erased and the likelihood part in the denominator vanished altogether. Apologies to all readers confused by this nonsensical formula!

## Building a machine learning model with the MicrosoftML package

January 24, 2017

Microsoft R Server 9 includes a new R package for machine learning: MicrosoftML. (So do the Data Science Virtual Machine and the free Microsoft R Client edition, incidentally.) This package includes a suite of fast predictive modeling functions implemented by Microsoft Research, including: Linear (rxFastLinear) and logistic (rxLogisticRegression) model functions based on the Stochastic Dual Coordinate Ascent method; Classification/regression...

## “smooth” package for R. es() function. Part IV. Model selection and combination of forecasts

January 24, 2017

Mixed models: In the previous posts we discussed pure additive and pure multiplicative exponential smoothing models. The next logical step would be to discuss mixed models, where some components are additive and others are multiplicative. But we won’t spend much time on them because I personally think that they do not make

## Descriptive Analysis of MLST Data for MRSA

January 24, 2017

During one of my summers, I had the opportunity to conduct some research on the prevalence of methicillin-resistant Staphylococcus aureus (MRSA) in vulnerable populations and to examine US emergency department data. I thought this would be a pretty interesting topic to expand on for my thesis in light of the increasing concerns about antimicrobial resistance, …

## Building Shiny App Exercises (part 5)

January 24, 2017

RENDER FUNCTIONS: In the fourth part of our series we just “scratched the surface” of reactivity by analyzing some of the properties of the renderTable function. Now it is time to go deeper and learn how to use the rest of the render functions that shiny provides. As you were told in part 4 these

## Distribution of Mean of the Combinations of a Set.

January 24, 2017

For some purpose I found myself generating and analyzing the average of the combinations of a set, and when I generated the corresponding histogram I was surprised by its shape. It should be remembered that the combinations C(m, n) of a set are the number of ...
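As a rough illustration of the idea in this teaser (not the original author's code), the sketch below enumerates every k-element combination of a small set with base R's combn(), takes the mean of each one, and draws the histogram. The set x and the subset size k are hypothetical values chosen only for the example.

```r
# Hypothetical example set and subset size (assumptions, not from the post)
x <- 1:10
k <- 3

# combn() enumerates all k-element combinations of x;
# FUN = mean returns the mean of each combination instead of the combination itself
combo_means <- combn(x, k, FUN = mean)

# There is one mean per combination, i.e. choose(length(x), k) of them
n_combos <- length(combo_means)
n_combos == choose(length(x), k)

# Histogram of the combination means
hist(combo_means,
     breaks = 20,
     main = "Means of all 3-element combinations of 1:10",
     xlab = "Mean of combination")
```

By symmetry, every element appears in the same number of combinations, so the average of combo_means equals mean(x); the histogram shows how the means spread around that value.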

## xml2 1.1.1

January 24, 2017

Today we are pleased to release version 1.1.1 of xml2. xml2 makes it easy to read, create, and modify XML with R. You can install it with: install.packages("xml2"). As well as fixing many bugs, this release: makes it easier to create and modify XML; improves roundtrip support between XML and lists; adds support for XML

## Creating a “balloon plot” as alternative to a heat map with ggplot2

January 24, 2017

Heat maps are great for comparing observations with lots of variables (which must be comparable in terms of unit, domain, …

## sparklyr 0.5

January 24, 2017

We’re happy to announce that version 0.5 of the sparklyr package is now available on CRAN. The new version comes with many improvements over the first release, including: extended dplyr support by implementing do() and n_distinct(); new functions including sdf_quantile(), ft_tokenizer() and ft_regex_tokenizer(); improved compatibility: sparklyr now respects the value of the ‘na.action’ R option and dim(), nrow() and ncol(); experimental

## Euler Problem 9 : Special Pythagorean Triple

January 24, 2017

Solution to Euler Problem 9 in the R Language: Find the Pythagorean triple for which a+b+c equals 1000. The post Euler Problem 9 : Special Pythagorean Triple appeared first on The Devil is in the Data.
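For readers who want to try the problem before following the link, here is a minimal brute-force sketch in base R. It is not the linked post's solution, just one straightforward way to search for a < b < c with a + b + c = 1000 and a² + b² = c²; the variable names are made up for this example.

```r
# Brute-force search for the Pythagorean triple a < b < c with a + b + c = 1000
target <- 1000
triple <- NULL

for (a in 1:(target %/% 3)) {          # a is the smallest side, so a < target / 3
  for (b in (a + 1):(target %/% 2)) {  # b is the middle side, so b < target / 2
    c_side <- target - a - b           # the third side is fixed by the sum
    if (c_side > b && a^2 + b^2 == c_side^2) {
      triple <- c(a = a, b = b, c = c_side)
    }
  }
}

triple        # the triple that was found
prod(triple)  # the product a * b * c, which is what the Euler problem asks for
```

The integer values involved are small enough that the squared terms are represented exactly in double precision, so the equality test is safe here.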

## How to do an analysis in R (part 2, visualization and analysis)

January 24, 2017

In several recent blog posts, I've emphasized the importance of data analysis. My main point has been that if you want to learn data science, you need to learn data analysis. Data analysis is the foundation of practical data science. With that statement in mind, I want to show you step-by-step what an analysis looks like in R ...

## How to use viridis colors with plotly and leaflet?

“… avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.” - Envisioning Information, Edward Tufte, Graphics Press, 1990 Choosing colors for your plot is not so simple. Why is that so? First of all, it depends on numerous things… What plot are you creating? What is...

## Parallel Computation with R and XGBoost

January 23, 2017

XGBoost is a comprehensive machine learning library for gradient boosting. It originated in the Kaggle community for online machine learning challenges, and has since been maintained through the collaborative efforts of developers in the community. It is well known for its accuracy, efficiency and flexibility across various interfaces: the computational module is implemented in C++,

## French villages and a sort of resolution

January 23, 2017

Sort of introduction to this post and hopefully the next ones: I usually don’t have any New Year resolution. However, recent tweets about productivity – from people I actually find productive and inspiring – made me ponder a bit on my unfinished...

## Upcoming R Conferences

January 23, 2017

Since a few new events have been announced recently, I thought I'd give a run-down on some major R conferences coming up in the next six months. February 18: satRdays, Cape Town (South Africa). This is the second in a series of one-day conferences inspired by an R Consortium proposal. The first event in Budapest was a great success,...

## Principal Component Analysis in R

January 23, 2017

Principal component analysis (PCA) is routinely employed on a wide range of problems. From the detection of outliers to predictive modeling, PCA can project observations described by many variables onto a few orthogonal components, defined along the directions in which the data ‘stretch’ the most, rendering a simplified overview. PCA is particularly powerful in dealing with multicollinearity and variables that …
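To make the description concrete, here is a small sketch using base R's prcomp() on the built-in USArrests data. The dataset and the choice to center and scale the variables are assumptions made for illustration; the original post may use different data and settings.

```r
# PCA on a built-in dataset (USArrests is an illustrative choice, not the post's data)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# The loadings are orthogonal directions: their cross-product is the identity matrix
round(crossprod(pca$rotation), 10)

# Scores of the first few observations on the first two components
head(pca$x[, 1:2])
```

Scaling matters here because the USArrests variables are measured in different units; without it, the variable with the largest variance would dominate the first component.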

## Choosing Software to Publish your Data Science Portfolio

January 23, 2017

I’ve recently spoken to several people who have decided to create a portfolio of their data science projects and are new to online publishing. They frequently have... The post Choosing Software to Publish your Data Science Portfolio appeared first on AriLamstein.com.

## 2016, the Earthquake Annus Horribilis of Italy

January 23, 2017

There is no exaggeration in stating that historic heritage is one of the most outstanding and valuable assets of Italy. The smallest villages or the largest cities, all boast hundred- (sometimes thousand-) year old buildings of great cultural, architectural, or artistic interest. Amazingly enough, a vast majority of these ...

## Where Cohen went wrong – the proportion of overlap between two normal distributions

January 23, 2017

I've received many emails regarding the percent of overlap reported in my Cohen's d visualization. Observant readers have noted that I report a different number than Cohen (and other authors). For instance, if we open p. 22 in Cohen's Statistical Power Analysis for the Behavioral Sciences, we see that Cohen writes that d = 0.5 means a 33...
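One way to check an overlap figure yourself is sketched below. It assumes two normal distributions with equal standard deviation whose means differ by d standard deviations; in that case the densities cross at the midpoint d/2, and the shared area under both curves is 2Φ(−|d|/2). The function names are hypothetical, and this is only one possible definition of "overlap", not necessarily the one used in the post or by Cohen.

```r
# Overlapping area of N(0, 1) and N(d, 1): closed form under the equal-variance assumption
overlap_closed_form <- function(d) {
  2 * pnorm(-abs(d) / 2)
}

# The same quantity by numerically integrating the pointwise minimum of the two densities
overlap_numeric <- function(d) {
  shared_density <- function(x) pmin(dnorm(x), dnorm(x, mean = d))
  integrate(shared_density, lower = -10, upper = 10 + abs(d))$value
}

d <- 0.5
ovl_exact <- overlap_closed_form(d)
ovl_check <- overlap_numeric(d)

c(closed_form = ovl_exact, numeric = ovl_check)
```

The two values should agree up to numerical integration error, which gives a quick sanity check on whichever overlap number is being reported.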

## Releasing RQGIS 0.2.0

January 23, 2017

Today we are happy to announce a new version of RQGIS! RQGIS establishes an interface between R and QGIS, i.e. it allows the user to access the more than 1000 QGIS geoalgorithms from within R.

## Trumpworld Analysis : Ownership Relations in his Business Network

January 23, 2017

Analysis of the ownership relationships between organisations associated with Donald J. Trump. A social network analysis of Trumpland using the igraph package in R. The post Trumpworld Analysis : Ownership Relations in his Business Network appeared first on The Devil is in the Data.

## Detect Lines in Digital Images

January 23, 2017

As part of our data science training initiative, bnosac is also providing a course on computer vision with R & Python, which will be held on March 9-10 in Leuven, Belgium (subscribe here or have a look at our full training offer here). Part of the course covers finding blobs, corners, gradients, edges & lines in images. For...

January 23, 2017

A handy little trick I picked up today when using readr. Some background: I needed a mapping between ZIP Code Tabulation Areas and counties (to link to some urban/rural data). The Census Bureau provides a CSV-style table that includes information about each of the ZCTAs (e.g.,...

## Monotonic Binning with Smbinning Package

January 22, 2017

The R package smbinning (http://www.scoringmodeling.com/rpackage/smbinning) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in scorecard development. However, there are several improvement opportunities in my view: 1. First of all, the underlying algorithm in the smbinning() function uses recursive partitioning, which does not necessarily guarantee monotonicity. 2.

## Applying diffusion theory to Google Trends

January 22, 2017

Using the example of Candy Crush Saga adoption

## Interactive BMI Chart

January 22, 2017

I was recently listening to the #WhoIsFat Joe Rogan podcast where comedians Bert Kreischer and Tom Segura had their weight loss challenge weigh-ins. The challenge was for both guys to get out of the “obese” category and into the merely “overweight” category. If one made it and the other didn’t, the loser would pay for a trip to Paris...