## Using cosine similarity to find matching documents: a tutorial using Seneca’s letters to his friend Lucilius

Lately I’ve been interested in trying to cluster documents, and to find similar documents based on their contents. In this blog post, I will use Seneca’s Moral letters to Lucilius and compute the pairwise cosine similarity of his 124 letters. Computing the cosine similarity between two vectors returns how similar these vectors are. A cosine similarity of 1 means that the angle between...
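
As a rough illustration of the computation this post describes (a minimal sketch, not the post's own code), cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper name `cosine_similarity` and the toy term-count vectors below are assumptions made up for the example.

```r
# Hypothetical helper: cosine similarity of two numeric vectors.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy term-count vectors standing in for three documents (made-up data).
doc1 <- c(2, 0, 1, 3)
doc2 <- c(4, 0, 2, 6)   # same direction as doc1, twice the length
doc3 <- c(0, 5, 0, 0)   # shares no terms with doc1

cosine_similarity(doc1, doc2)  # approximately 1: the vectors point the same way
cosine_similarity(doc1, doc3)  # 0: the vectors are orthogonal

# Pairwise similarities for all documents at once (rows are documents).
m     <- rbind(doc1, doc2, doc3)
norms <- sqrt(rowSums(m^2))
sim   <- tcrossprod(m) / outer(norms, norms)
```

With one row per letter, the same `tcrossprod()` approach would give the full pairwise similarity matrix in a single step.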

## Kites and Darts: the Penrose Tiling

June 3, 2019

Agarrada a mis costillas le cuelgan las piernas (Godzilla, Leiva) Penrose tilings are amazing. Apart from the inner beauty of tessellations, they have two interesting properties: they are non-periodic (they lack any translational symmetry) and self-similar (any finite region appears an infinite number of times in the tiling). Both characteristics make them a kind of …

## Pareto Models for Top Incomes

June 3, 2019

With Emmanuel Flachaire, we uploaded to HAL a paper on Pareto Models for Top Incomes. Top incomes are often related to the Pareto distribution. To date, economists have mostly used the Pareto Type I distribution to model the upper tail of income and wealth distributions. It is a parametric distribution, with an attractive property, that can be easily linked to economic...

## Learning R: Permutations and Combinations with base R

June 3, 2019

The area of combinatorics, the art of systematic counting, is dreaded territory for many people, so let us bring some light into the matter: in this post we will explain the difference between permutations and combinations, with and without repetitions, calculate the number of possibilities, and present efficient R code to enumerate all of …
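
To make the four cases concrete, here is a small base R sketch (my own illustration, not the code from the post). The values of `n` and `k` are arbitrary example choices.

```r
n <- 5  # size of the set (arbitrary example value)
k <- 3  # number of items drawn (arbitrary example value)

# Number of possibilities for each of the four cases
perm_with_rep    <- n^k                              # ordered, repetition allowed
perm_without_rep <- factorial(n) / factorial(n - k)  # ordered, no repetition
comb_without_rep <- choose(n, k)                     # unordered, no repetition
comb_with_rep    <- choose(n + k - 1, k)             # unordered, repetition allowed

# Enumerate all combinations without repetition; each column is one combination
combos <- combn(n, k)
ncol(combos) == comb_without_rep
```

`combn()` only covers the unordered, no-repetition case; the other three need either `expand.grid()` or a dedicated package.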

## What is a Statistic?

June 2, 2019

What is a Statistic? A statistic is a mathematical operation on a data set, performed to get information from the data. Below is the R code that generates 20 random samples. The samples are uniformly distributed between 0 and 1. Uniformly means all the data samples are equally likely, like when you flip a coin heads and...
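
The code the teaser refers to is not part of this excerpt. A minimal sketch of what such code might look like, assuming base R's `runif()` and an arbitrary seed, with the mean and range chosen here only as example statistics:

```r
set.seed(42)            # arbitrary seed so the draw is reproducible
samples <- runif(20)    # 20 draws, uniform on [0, 1]

# Each of these is a statistic: a value computed from the data alone
sample_mean  <- mean(samples)
sample_range <- range(samples)
```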

## Become a Bayesian master with bayestestR (0.2)

June 2, 2019

bayestestR 0.2 is here! As you might have heard from previous posts, we have recently started to collaborate around the new easystats project, a suite of packages designed to make your life easier. One of the packages, bayestestR, has just been up...

## Management accounting with balance sheet and income statement in R

June 2, 2019

The demand for data analysis/science and data management is increasing in the field of management accounting. In this article, you learn how to get data for management accounting with the balance sheet and income statement in R. Furthermore you learn how...

## Same name, different bird

June 2, 2019

What do we mean when we see a bird and say that it’s a robin? A simple description would be a small brownish bird with a red breast. But that’s a superficial description, and when we say “robin” what we mean depends on your location; you don’t have to look very closely to see that the European and American...

## Visualizing the Green New Deal with R Shiny

June 2, 2019

New York has pressed on with environmental legislation while the nation dawdles - where and by whom will the impact be felt? As mentioned in my last post, effective and timely legislation will be a key weapon in the battle to keep greenhouse gas emissions at sustainable levels and avoiding an

## Trawling Through iOS Backups For Treasure (a.k.a. How to fish for target files in iOS backups) with R

June 2, 2019

In a recent post I brazenly talked over the “hard parts” of how I got to the target SQLite file that houses “mowing history” for what has become my weekend obsession. So, we’ll cover just how to do that (find things in iOS backups) in this post along with how to deal with some...

## Quick Hit: Handling Cocoa Core Data Timestamps in R

June 1, 2019

For the first time ever we got a new riding mower this weekend. We’ve always haggled to keep the one sellers were using with any given house we’ve purchased over the years (that was big enough for a yard that “requires” a riding mower). We ended up getting a model from John Deere and the...

June 1, 2019

It’s a common situation: you want to code and debug in R *and* leverage RMarkdown for a presentation or document. The challenge: file paths. Executing code in the console and from within a saved RMarkdown document typically requires distinct file paths to locate data files. While you’re writing your code and debugging, you’ve probably got your source …

## Running cross_validate from cvms in parallel

May 31, 2019

The cvms package is useful for cross-validating a list of linear and logistic regression model formulas in R. To speed up the process, I’ve added the option to cross-validate the…

## Getting an environment’s name in R: the envnames package

May 31, 2019

The name of user-defined environments cannot be retrieved in base R. Examples are shown on how the envnames package is used to circumvent this problem, in addition to looking for objects defined in user-defined environments. The post Getting an environment's name in R: the envnames package appeared first on MilanoR.
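
The limitation mentioned here is easy to see in base R. The sketch below only illustrates the problem with `environmentName()`; it does not use the envnames package, and the variable name `my_env` is just an example.

```r
# A user-defined environment has no name that base R can report.
my_env <- new.env()
environmentName(my_env)       # "" (empty string)

# Built-in environments, by contrast, do have names.
environmentName(globalenv())  # "R_GlobalEnv"
environmentName(baseenv())    # "base"
```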

## Making a Command Line HTML Rendering Script for “The Art of the Command Line” (in R)

May 31, 2019

The Feedly category I have set up for git-stalking has indicated a fairly massive interest in Joshua Levy’s The Art of the Command Line. What is “The Art of the Command Line”? To quote the author(s): Fluency on the command line is a skill often neglected or considered arcane, but it improves your flexibility and productivity...

## Full EARL London 2019 agenda available

May 31, 2019

Once again, we are delighted to announce a stellar line up of speakers for this year’s EARL Conference; from Retail and Insurance to Media, Manufacturing and Pharmaceutical, the range of industries now using R stats in their workflow continues to grow. If you are interested to hear why companies such as BBC News, BMW Group, Arla Foods, GSK, Microsoft, Hiscox, Mumsnet and...

## RODBC helper function

May 31, 2019

The number of times I have to connect to SQL and I forget part of the RODBC command to connect to an internal data table. As part of a project I am working on I have been connecting to lots of different sources and became tired of typing lots of lines and repeating the same...

## My RStudio Configuration

May 30, 2019

Whenever I need to install RStudio on a new machine, I have to think a bit about the configuration options I’ve tweaked. Invariably, I miss a checkbox that leaves me with slightly different RStudio behavior on each system. This post includes screenshots of my RStudio configuration and custom keyboard shortcuts for RStudio 1.3, macOS, so …

## How to start a new package with testing in R

May 30, 2019

    # Navigate where you want your folder to be located
    setwd("C:/Users/chief/Documents/Github")
    # Assumes usethis is installed
    usethis::create_package("foo")
    # Say yes or no to next (annoying) popup window, it doesn't matter.
    # Add a test environment
    setwd("foo")
    usethis::use_testthat()
    # Add first test function to at least get something in that folder.
    # Go to foo\tests\testthat
    # and add this file.
    context("foo")
    library(foo)
    test_that("I'm testing something", {
      # do something with your code
      expect_equal(1:4,...

## More Bayes and multiple comparisons

In my last post I had a little fun comparing perspectives among Bayesian, frequentist and programmer methodologies. I took a nice post from Anindya Mozumdar from the R Bloggers feed and investigated the world’s fastest man. I’ve found that in writing these posts two things always happen. I learn a lot, and I have follow-on questions or thoughts. This time is no exception, the last post made...

## 78th #TokyoR Meetup Roundup!

May 30, 2019

With the arrival of summer, another TokyoR User Meetup! On May 25th, useRs from all over Tokyo (and some even from further afield - including Kan Nishida of Exploratory, all the way from California!) flocked to Jimbocho, Tokyo for...

## How to Become a Data Scientist

May 30, 2019

This question and its variations are the most searched topics on Google. As a practicing data science professional, and manager to boot, dozens of people ask me this question every week. This post is my honest and detailed answer. Step 1 – Coding & ML skills You need to master programming in either R or Python.

## Quick and easy t-SNE analysis in R

May 30, 2019

t-SNE is a useful dimensionality reduction method that allows you to visualise data embedded in a lower number of dimensions, e.g. 2, in order to see patterns and trends in the data. It can deal with more complex patterns of Gaussian clusters in multidimensional space compared to PCA. Although it is not suited to finding outliers

## Modeling the Law of Practice in Stan

May 30, 2019

Practice makes better. And faster. But what exactly is the relation between practice and reaction time? In this blog post, we will focus on two contenders: the power law and exponential function. We will implement these models in Stan and extend them to account for learning plateaus and the fact that, with increased practice, not only the mean reaction...
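
For orientation, the two contenders can be written as simple R functions. This is a sketch using common textbook forms of the two curves; the post's Stan parameterization may well differ, and the argument names (`asymptote`, `gain`, `rate`) and parameter values are assumptions chosen for illustration.

```r
# Power law of practice: reaction time falls as a power of the trial number.
power_law <- function(n, asymptote, gain, rate) {
  asymptote + gain * n^(-rate)
}

# Exponential law of practice: reaction time falls by a constant proportion.
exponential <- function(n, asymptote, gain, rate) {
  asymptote + gain * exp(-rate * n)
}

trials   <- 1:50
rt_power <- power_law(trials, asymptote = 0.3, gain = 1, rate = 0.5)
rt_exp   <- exponential(trials, asymptote = 0.3, gain = 1, rate = 0.1)
```

Both curves decrease toward the same asymptote; they differ in how quickly the improvement slows down with practice.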

## xaibot – conversations with predictive models!

May 30, 2019

If you could talk to a predictive machine learning model, what would you ask for? Try! Michał Kuźba is developing a mind-blowing project: xai chat-bot, a dialog-based system that helps to explore and understand predictive models through natural language conversations (type, speak or phone the model 😉 ). For example, imagine that you have …

## Cognitive capitalism chapter reworked

May 29, 2019

The Cognitive capitalism chapter of my evidence-based software engineering book took longer than expected to polish; in fact it got reworked, rather than polished (which still needs to happen, and there might be more text moving from other chapters). Changing the chapter title, from Economics to Cognitive capitalism, helped clarify lots of decisions about the

## April 2019: “Top 40” New CRAN Packages

May 29, 2019

One hundred eighty-seven new packages made it to CRAN in April. Here are my picks for the “Top 40”, organized into ten categories: Biotechnology, Data, Econometrics, Machine Learning, Medicine, Science, Statistics, Time Series, Utilities, and Visualization. Biotechnology genpwr v1.00: Provides functions for power and sample size calculations for genetic association studies allowing for mis-specification of the model of genetic susceptibility....