Waffles are for Breakfast. It’s been a long time since my last update and I’ve decided to start with Tableau, of all topics! Although open source advocates do not look kindly upon Tableau, I find myself using it frequently and relearning all the...

Three members of the rOpenSci team - Scott Chamberlain, Jenny Bryan, and Rich FitzJohn - as well as many community members will give talks at useR!2019. Many other package authors, maintainers, reviewers and unconf participants will be there too. Don’t hesitate to ask them about rOpenSci packages, software peer review, community, or just say hello if you’re looking for...

This morning swephR version 0.2.1 made it onto CRAN and is now propagating to the mirrors. The goal of swephR is to provide an R interface to the Swiss Ephemeris, a high-precision ephemeris based upon the DE431 ephemeris from NASA’s JPL. It covers the time range 13201 BCE to 17191 CE. This new version comes closely after last week’s release and contains only a single albeit...

Today I am so pleased to introduce a new package for calculating weighted log odds ratios, tidylo. Often in data analysis, we want to measure how the usage or frequency of some feature, such as words, differs across some group or set, such as documents. One statistic often used to find these kinds of differences in text data is tf-idf....

Chunk Average (CA) is an interesting concept proposed by Matloff in chapter 13 of his book “Parallel Computing for Data Science”. The basic idea is to partition the entire model estimation sample into chunks and then to estimate a glm for each chunk. Under the i.i.d. assumption, the CA estimator with the chunked data
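
The chunk-then-estimate idea can be sketched in base R roughly as follows. This is only an illustration under assumptions that are not in the excerpt: the data are simulated, the model is a logistic regression, the number of chunks (10) is arbitrary, and the per-chunk coefficients are combined with a plain unweighted mean.

```r
# Chunk averaging sketch: fit one glm per chunk, then average the coefficients.
# Simulated data and all names below are hypothetical, chosen for illustration.
set.seed(1)
n  <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.0 * x1 - 0.7 * x2))
df <- data.frame(y, x1, x2)

k        <- 10                                # number of chunks (arbitrary)
chunk_id <- rep(seq_len(k), length.out = n)   # assign rows to chunks in turn

# One column of coefficients per chunk
chunk_coefs <- sapply(split(df, chunk_id), function(d) {
  coef(glm(y ~ x1 + x2, family = binomial, data = d))
})

ca_est   <- rowMeans(chunk_coefs)             # chunk-average estimate
full_est <- coef(glm(y ~ x1 + x2, family = binomial, data = df))

round(cbind(chunk_average = ca_est, full_sample = full_est), 3)
```

Because each chunk is fitted independently, the per-chunk fits could be distributed across workers; the sketch keeps them serial for simplicity.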

Gaussian processes are a widely employed statistical tool because of their flexibility and computational tractability. (For instance, one recent area where Gaussian processes are used is in machine learning for hyperparameter optimization.) A stochastic process is a Gaussian process if …

In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason for not regressing directly against income …

In my previous post https://statcompute.wordpress.com/2019/02/03/sobol-sequence-vs-uniform-random-in-hyper-parameter-optimization/, I showed the difference between uniform pseudo-random and quasi-random number generators in the hyper-parameter optimization of machine learning. Latin Hypercube Sampling (LHS) is another interesting way to generate near-random sequences with a very simple idea. Let’s assume that we’d like to perform LHS for 10 data
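
A minimal base R sketch of the general LHS idea, not necessarily the construction used in the post: for each dimension, the unit interval is cut into n equal strata, one point is drawn uniformly inside each stratum, and the strata are shuffled independently per dimension. The helper name `lhs_sample` and the choice of two dimensions are assumptions made for illustration.

```r
# Latin Hypercube Sampling sketch on the unit hypercube [0, 1]^d.
# lhs_sample is a hypothetical helper, not a function from any package.
lhs_sample <- function(n, d) {
  vapply(seq_len(d), function(j) {
    # sample(n) shuffles the strata 1..n; subtracting a uniform draw
    # places one point inside each stratum ((i - 1) / n, i / n)
    (sample(n) - runif(n)) / n
  }, numeric(n))
}

set.seed(2019)
pts <- lhs_sample(10, 2)   # 10 points in 2 dimensions

# Stratum index of every point, sorted per dimension:
# each of the 10 strata should appear exactly once in each column
apply(pts, 2, function(col) sort(ceiling(col * 10)))
```

Unlike plain uniform sampling, every one-dimensional projection of the sample covers all strata, which is the property that makes LHS attractive for spreading a small number of hyper-parameter candidates.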

Normally when we visualize monthly precipitation anomalies, we simply use a bar graph indicating negative and positive values with red and blue. However, it does not explain the general context of these anomalies. For example, what was the highest or lowest anomaly in each month? In principle, we could use a boxplot to visualize the distribution of the anomalies,...

Unfortunately, I haven’t had as much time to make blog posts in the past year or so. I started taking classes as part of Georgia Tech’s Online Master of Science in Analytics (OMSA) program last summer (2018) while continuing to work full-time, so extra time to code and write hasn’t been abundant for me. Anyways, I figured I would share one neat thing I learned as...

In my last post I scraped some character statistics from the mobile game Star Wars: Galaxy of Heroes. In this post, I’ll be aiming to try out k-means clustering in order to see if it comes out with an intuitive result, and to learn how to integrate this kind of analysis into a tidy workflow using broom. First I’ll load...

I’ve released a version of my pqR implementation of R that has extensions for automatic differentiation. This is not a stable release, but it can be downloaded from pqR-project.org — look for the test version at the bottom — and installed the same as other pqR versions (from source, so you’ll need C and Fortran compilers).

Here is a simple modeling problem in R. We want to fit a linear model where the names of the data columns carrying the outcome to predict (y), the explanatory variables (x1, x2), and per-example row weights (wt) are given to us as strings. Let’s start with our example data and parameters. The point is: we …
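
One possible base R approach to this problem, offered as a sketch rather than as the solution the post arrives at: build the formula from the strings with `reformulate()` and look up the weights column by name. The toy data values and the variable names `outcome_name`, `variable_names` and `weight_name` are made up for illustration.

```r
# Fitting lm() when column names arrive as strings (illustrative sketch).
# The data values below are invented; only the column roles come from the text.
d <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2, 1, 4, 3, 6, 5),
  y  = c(1.1, 2.3, 2.9, 4.2, 5.1, 5.8),
  wt = c(1, 1, 2, 2, 1, 1)
)

outcome_name   <- "y"
variable_names <- c("x1", "x2")
weight_name    <- "wt"

# reformulate() builds the formula y ~ x1 + x2 from the strings
f <- reformulate(variable_names, response = outcome_name)

# The weights column is extracted by name instead of being written literally
model <- lm(f, data = d, weights = d[[weight_name]])
coef(model)
```

This works at the top level of a script; inside a function, the way `lm()` evaluates its `weights` argument needs more care, and the printed call will show `d[[weight_name]]` rather than the actual column name.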

For a lonely soul, you’re having such a nice time (Nothing in my way, Keane) In my previous post, I created the P2 Penrose tessellation according to the instructions of this post. Now it’s time to create the P3 tessellation following the same technique I described already. This is the image of the P3 tessellation: …

I’ve been an R user for a few years now and the data.table package has been my staple package for most of it. In this post I wanted to talk about why almost every script and RMarkdown report I write starts with: library(data.table) My memory issues: I started working on my licentiate thesis (the Argentinian equivalent to a Master’s degree) around mid...

Integration is the process of evaluating integrals. It is one of the two central ideas of calculus and is the inverse of the other central idea of calculus, differentiation. Generally, we can speak of integration in two different contexts: the...

With Alfred Galichon and Lucas Vernet, we recently uploaded a paper entitled “optimal transport on large networks” to arXiv. This article presents a set of tools for the modeling of a spatial allocation problem in a large geographic market and gives examples of applications. In our setting, the market is described by a network that maps the cost of...

Motivation: There are several wonderful tools for retrieving information about R packages, some of which are listed below: cranlogs, dlstats and packageRank for R package download stats; pkgsearch and packagefinder for searching CRAN R packages; crandb, which provides an API for programmatically accessing metadata; and cchecks for CRAN check results. We have used some or all of these to track/monitor our own R packages available on CRAN. Over...

This morning swephR version 0.2.0 made it onto CRAN and is now propagating to the mirrors. The goal of swephR is to provide an R interface to the Swiss Ephemeris, a high-precision ephemeris based upon the DE431 ephemeris from NASA’s JPL. It covers the time range 13201 BCE to 17191 CE. The new version 0.2.0 brings two important changes. First, the version of the included Swiss...

A few days ago, I released a new version of my R package, groupdata2, on CRAN. groupdata2 contains a set of functions for grouping data, such as creating balanced partitions… (From the post “groupdata2 version 1.1.0 released on CRAN”.)

This morning, digest version 0.6.20 went to CRAN, and I will send a package to Debian shortly as well. digest creates hash digests of arbitrary R objects (using the md5, sha-1, sha-256, sha-512, crc32, xxhash32, xxhash64, murmur32, and spookyhash algorithms) permitting easy comparison of R language objects. This version contains only internal changes with a switch to the (excellent) tinytest package....

After a fairly long life on GitHub, my R package, cvms, for cross-validating linear and logistic regression, is finally on CRAN! With a few additions in the past months, this… (From the post “cvms 0.1.0 released on CRAN”.)

As part of the development of a Shiny application for production using {golem}, we recommend, among other things, working with Shiny-modules. The communication of data between the different modules can be complex. At ThinkR we use a strategy: the stratégie du petit r. We explain everything in this article. What is a module? A module is the combination of...