(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

I'm excited to share that one of my data science heroes will be a presenter at the Microsoft Data Science Summit in Atlanta, September 26-27. Edward Tufte, the data visualization pioneer, will deliver a keynote address on the future of data analysis and how to make more credible conclusions based on data.

If you're not familiar with Tufte, a great place to start is to read his seminal book The Visual Display of Quantitative Information. First published in 1983 — well before the advent of mainstream data visualization software — this is the book that introduced and/or popularized many concepts familiar in data visualization today, such as small multiples, sparklines, and the data-ink ratio. Check out this 2011 Washington Monthly profile for more background on Tufte's career and influence. Tufte's work also influenced R: you can easily recreate many of Tufte's graphics in the R graphics system, including this famous weather chart.
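As a minimal sketch of what that looks like in practice (assuming the ggplot2 and ggthemes packages are installed — ggthemes is a community package that implements several Tufte-inspired elements):

```r
# ggthemes provides theme_tufte(), which strips non-data ink in the spirit
# of maximizing the data-ink ratio, and geom_rangeframe(), which replaces
# the full plot box with Tufte-style range frames on the axes.
library(ggplot2)
library(ggthemes)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_rangeframe() +
  theme_tufte()
```

This is only one of several Tufte-style idioms available; sparklines and small multiples can be approximated with ggplot2 facets and minimal themes in a similar way.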

The program for the Data Science Summit looks fantastic, and will also include keynote presentations from Microsoft CEO Satya Nadella and Data Group CVP Joseph Sirosh. There's also a fantastic crop of Microsoft data scientists (plus yours truly) giving a wealth of practical presentations on how to use Microsoft tools and open-source software for data science. Here's just a sample:

- Jennifer Marsman will speak about building intelligent applications with the Cognitive Services APIs
- Danielle Dean will describe deploying real-world predictive maintenance solutions based on sensor data
- Brandon Rohrer will give a live presentation of his Data Science for Absolutely Everybody series
- Frank Seide will introduce CNTK, Microsoft's open source deep learning toolkit
- Maxim Lukiyanov will share some best practices for interactive data analysis and scalable machine learning with Apache Spark
- Rafal Lukawiecki will explain how to apply data science in a business context
- Debraj GuhaThakurta and Max Kaznady will demonstrate statistical modeling on huge data sets with Microsoft R Server and Spark
- David Smith (that's me!) will give some examples of how data science at Microsoft (and R!) is being used to improve the lives of disabled people
- … and many many more!

Check out the agenda for the breakout sessions on the Data Science Summit page for more. I hope to see you there: it will be a great opportunity to meet with Microsoft's data science team and see some great talks as well. To register, follow the link below.

Microsoft Data Science Summit, September 26-27, Atlanta, GA: Register Now


(This article was first published on **Variance Explained**, and kindly contributed to R-bloggers)

I was amused by a Guardian article last month that declared “I’m a serious academic, not a professional Instagrammer,” arguing that social media is a distraction for scientific research. This attitude was, to say the least, not popular on academic Twitter, which responded with the #seriousacademic hashtag.

When someone tries to claim that a #seriousacademic should not use twitter… pic.twitter.com/ocL753NFSw

— Kitty Kat, PhD. (@academickitty) August 18, 2016

One part of the article that struck me as especially misguided was the author’s dismissal of conference tweeting:

I see more and more of them live tweeting and hashtagging their way through events…. When did it become acceptable to use your phone throughout a lecture, let alone an entire conference? No matter how good you think you are at multitasking, you will not be truly focusing your attention on the speaker, who has no doubt spent hours preparing for this moment.

I personally haven’t been a #seriousacademic in over a year, having since become a #sillydatascientist. So I felt no shame in live-tweeting the heck out of the two conferences I attended this summer: useR and JSM (the Joint Statistical Meetings).

There’s a great analysis here of why people tweet during conferences. For me, Twitter serves best as a sort of **“public diary”**: I’m not very good at taking notes, and tweeting lets me create a record of what I found most interesting that I can refer to later. So as the summer ends, I’m looking back at my “notes” from these two conferences and sharing my thoughts.

(If you’re a Serious Academic who is against live-tweeting of conferences, get out now, this post isn’t going to get any better.)

At both conferences I gave a talk on my broom package for tidying model outputs. You can find the slides here, along with the useR video. I was especially impressed by the turnout for my talk at the useR conference, where I got to introduce broom to a large audience.
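For readers who haven't used it, broom's core idea can be shown in a few lines (broom is on CRAN; `tidy`, `glance`, and `augment` are its three main verbs):

```r
# broom converts messy model objects into tidy data frames,
# so model outputs can flow into dplyr/ggplot2 pipelines.
library(broom)

fit <- lm(mpg ~ wt + cyl, data = mtcars)

tidy(fit)     # one row per coefficient: estimate, std.error, statistic, p.value
glance(fit)   # one-row model summary: r.squared, AIC, ...
augment(fit)  # original data plus fitted values and residuals
```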

"Tidy data works until you start doing statistical modeling" – @drob (creator of awesome 'broom' package) #useR2016 pic.twitter.com/DGHwXIakFi

— Mikhail Popov (@bearloga) June 29, 2016

Slides from my #useR2016 talk yesterday are here: https://t.co/7brRt7eBV0 #rstats pic.twitter.com/24BOG0DHdl

— David Robinson (@drob) June 29, 2016

At JSM I also got to present an e-poster about some of the analysis of software developers I’ve done at Stack Overflow:

#JSM2016 folks: come see my e-poster on what @StackOverflow data tells us about landscape of software development pic.twitter.com/nkzcOYZsBK

— David Robinson (@drob) August 1, 2016

You can find the full slides here. The short version is that we can cluster programming languages based on which get used together:
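The underlying idea can be illustrated with a toy sketch (the technology names and usage matrix below are invented for illustration, not the actual Stack Overflow data): represent each technology by who uses it, then cluster technologies whose usage vectors are correlated.

```r
# Toy co-occurrence data: rows are (hypothetical) developers,
# columns are technologies; 1 means the developer uses it.
usage <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    0, 0, 1, 1,
    0, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("r", "ggplot2", "c", "c++"))
)

# Turn correlation between usage vectors into a distance,
# then hierarchically cluster the technologies.
d <- as.dist(1 - cor(usage))
plot(hclust(d))
```

In this toy example r/ggplot2 and c/c++ fall into two clean clusters, mirroring the data science vs. systems programming split described below.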

This also serves as a useful 2-dimensional layout for comparing technologies. For example, we could visualize which areas of the developer landscape are currently growing (red) or shrinking (blue), in terms of % of Stack Overflow questions:

For instance, we can see that the data science cluster (including R and machine-learning) is growing in its share of Stack Overflow questions, while the C/C++ cluster below it is mostly shrinking.

Many of my favorite talks were about data science education. This included my single favorite talk of either conference: Deborah Nolan’s keynote address at useR on “Statistical Thinking in a Data Science Course” (video here):

Just imagine #stats course without simplified scenarios, canned data, non-coding, & always normal distrib #useR2016 pic.twitter.com/qsm7qT4rf2

— Alice Data (@alice_data) June 30, 2016

I’d heard these problems with statistical education described before, but never with this much clarity and evidence.

But alongside that talk, the talks about education at both conferences were uniformly excellent.

.@AmeliaMN shares an HS data science curriculum, including 400-page Introduction to Data Science doc. Wow! #useR2016 https://t.co/Czh2ViaaF4

— David Robinson (@drob) June 28, 2016

Some of the common themes included:

- **That programming should be taught alongside statistics, and not as an afterthought**: this was a nearly universal complaint among educators who have had to deal with statistics curricula.
- **That students should be able to do powerful things immediately**: many of the speakers focused on this point, and it is part of the foundation of my opinion that instructors should teach ggplot2 first. (Jeff Leek was also at JSM, where he doubled down on his dissenting opinion that plotting should be either difficult or ugly.)
- **That we should teach permutation and bootstrapping rather than normal theory**: this is a promising way to make statistics more intuitive and to focus less on math early on. I did discuss with some people that this is a very frequentist approach to statistical education. What would be an equivalent math-lite Bayesian introduction?

"Randomize, Repeat, Reject"- Deborah Nolan suggests teaching permutation+bootstrap rather than normal theory #useR2016

— David Robinson (@drob) June 30, 2016
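The "randomize, repeat, reject" approach needs only a few lines of base R. Here's a small sketch of a permutation test for a difference in group means (the data are made up for illustration):

```r
# Permutation test: is the observed difference in means between
# groups a and b larger than we'd expect if labels were random?
set.seed(42)
a <- c(5.1, 4.9, 6.2, 5.8, 5.5)
b <- c(4.2, 4.8, 4.5, 5.0, 4.1)

observed <- mean(a) - mean(b)

# Shuffle the pooled values many times and recompute the difference
pooled <- c(a, b)
perm_diffs <- replicate(10000, {
  shuffled <- sample(pooled)
  mean(shuffled[1:5]) - mean(shuffled[6:10])
})

# Two-sided permutation p-value: the fraction of shuffles at least
# as extreme as what we observed
mean(abs(perm_diffs) >= abs(observed))
```

No normal theory required: the null distribution comes from the shuffling itself, which is what makes the approach so teachable.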

**That reproducible research should be a core part of education.** This brings me to another great set of talks:

Like much of the R community, I’ve internalized a lot of the lessons and practices of reproducible research. (For instance, these blog posts are reproducible from R markdown files in this directory.) But it was still great to hear people communicate these messages well, and the reproducibility session chaired by Amelia McNamara featured an all-star cast.

.@minebocek makes an important point- requiring reproducibility makes work easier for students, not harder #JSM2016 pic.twitter.com/RU1CWHgbwy

— David Robinson (@drob) August 3, 2016

.@kwbroman's colleague was "sorry you did all that work on incomplete dataset"- reproducibility for the win #JSM2016 pic.twitter.com/cUTR2LRA1J

— David Robinson (@drob) August 3, 2016

I particularly liked Yihui Xie’s idea about teaching reproducibility:

.@xieyihui had "evil" idea to teach reproducibility:

Week 1: Students analyze a dataset

Week 2: "I updated the data, start over" #JSM2016

— David Robinson (@drob) August 3, 2016

(Interestingly, a number of responses were along the lines of “How is that evil, that’s exactly what I’ve been doing!”)

In the same session, Karthik Ram from rOpenSci described JOSS, the Journal of Open Source Software. It's an excellent way to get citations and credit for software, and it encourages scientists to think of software packages (and not just papers) as units of research output.

.@_inundata promotes Journal of Open Source Software, w/ tidytext article by @juliasilge+me as example 🎉🎉 #JSM2016 pic.twitter.com/IOdF9Vt2hK

— David Robinson (@drob) August 3, 2016

Many of my other favorite talks were about making online interactive visualizations.

Screw the dance party; after this #JSM2016 session I sorta want to spend tonight making interactive graphs #rstats pic.twitter.com/18otBYdgLP

— David Robinson (@drob) August 2, 2016

A lot of my knowledge about interactive graphics revolves around Shiny. Shiny’s a terrific tool, but it requires an R backend, which adds effort and cost for deployment and scaling. So I was excited to see what people were doing to build interactive graphics in R that could be deployed entirely in HTML and JavaScript.

.@jcheng demos crosstalk #rstats pkg for interactive web graphs- define in R, deploy in Javascript. Wow! #JSM2016 pic.twitter.com/iyEwOBEiiJ

— David Robinson (@drob) August 2, 2016

Ryan Hafen’s rbokeh package is really exciting and something I hadn’t seen before (like others, I thought of Bokeh as a Python visualization package, but it turns out the backend is flexible). Since it plots to an HTML canvas it also doesn’t need an R backend, and I appreciated how the syntax used the `%>%` pipe and followed the grammar of graphics.
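To give a sense of that piped syntax, here's a short sketch along the lines of the rbokeh examples (assumes the rbokeh package is installed; the output is a self-contained interactive HTML plot):

```r
# rbokeh builds a figure object, then layers glyphs onto it with ly_* verbs,
# all chained with the %>% pipe.
library(rbokeh)

figure(width = 600, height = 400) %>%
  ly_points(Sepal.Length, Sepal.Width, data = iris,
            color = Species, hover = c(Sepal.Length, Sepal.Width))
```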

.@hafenstats trying 20 examples of rbokeh in 5 minutes https://t.co/EOWEfeFW0x #useR2016 pic.twitter.com/p2UWAEVqzj

— David Robinson (@drob) June 30, 2016

Carson Sievert has taken the “make ggplot2 interactive” conversation to a whole new level, though, with the plotly package, an R interface to the popular Plotly software that among other features includes conversion of ggplot2 objects into interactive plotly graphs.
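That conversion is a one-function call (assumes the plotly package is installed):

```r
# ggplotly() takes an existing ggplot2 object and returns an interactive
# plotly graph with tooltips, zooming, and panning.
library(ggplot2)
library(plotly)

p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point()

ggplotly(p)
```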

Teaser for my #JSM2016 on interactive graphics with @plotlygraphs #rstats pic.twitter.com/koAqMg7t8j

— Carson Sievert (@cpsievert) August 1, 2016

I think Yihui Xie might have won the day, though. He not only led a terrific discussion of the previous talks (packed with GIFs, as is his habit), but also demonstrated a Shiny app that does something I’d never seen before in R:

.@xieyihui is customizing graphs with his voice and the room is Flipping. Out. #JSM2016 #rstats pic.twitter.com/HlFOL7rXGK

— David Robinson (@drob) August 2, 2016

.@xieyihui: "Change title to Make America Great Again", graph complies. Statistical applications are clear #JSM2016 https://t.co/1EhesE3s1D

— David Robinson (@drob) August 2, 2016

(Incidentally, I did end up going to the dance party, and still haven’t had the chance to use these interactive graphics for more than toy problems. Looking forward to the right opportunity to use them!)

RStudio has been one of the most influential companies in the modern R world, not only developing the eponymous IDE but also supporting many important open-source packages, and the company was at both conferences in full force.

RStudio CEO J.J. Allaire talked about the new interactive notebook features of the RStudio IDE, analogous to Jupyter notebooks:

comparison betw R markdown & notebooks in JJ Allaire’s intro to @RStudio’s awesome new Rmd notebooks #UseR2016 pic.twitter.com/4QFp8hhBuu

— Karl Broman (@kwbroman) June 29, 2016

Hadley Wickham gave a useR keynote that pushed for a seismic shift in terminology:

.@hadleywickham proposes we stop saying "Hadleyverse", start saying "tidyverse" #useR2016 #rstats pic.twitter.com/Z4zs4tw2Vn

— David Robinson (@drob) June 29, 2016

He also stepped in for Yihui Xie to introduce Yihui’s terrific bookdown package:

The philosophy behind #rstats bookdown: easy, community-written, freq updated @hadleywickham + @xieyihui #useR2016 pic.twitter.com/WudsaLbHzY

— David Robinson (@drob) June 30, 2016

I was inspired enough by this package that one week later Julia Silge and I started writing the book Tidy Text Mining in R. Bookdown has been a real treat: it’s solved so many of the hassles around creating HTML and PDF manuscripts from knitr.

I was excited to meet RStudio’s Garrett Grolemund for the first time and talk statistics, education and R. He’s the creator of an awesome set of cheat sheets (available online) on R, and grabbing some hard copies at the RStudio booth was a nice bonus.

#useR2016 people: remember to stop by @rstudio booth to pick up awesome cheatsheets by @StatGarrett pic.twitter.com/9KWE9J416S

— David Robinson (@drob) June 30, 2016

Finally, I was crazy excited that RStudio had printed my first set of broom hex stickers:

@drob i have this for you! pic.twitter.com/8fpb2v738x

— Hadley Wickham (@hadleywickham) June 28, 2016

(The stickers are available on stickermule, and I’ll usually have some on hand at conferences and meetups if you run into me).

There are a lot of important ongoing conversations about diversity and inclusion in the R community (e.g. the TaskForce on Women in R, which presented on its findings at useR). But Jonathan Godfrey, a lecturer at Massey University in New Zealand, alerted me to another dimension of diversity I hadn’t considered before.

Jonathan Godfrey showing us how he uses a Braille keyboard to develop R code. Just amazing. #rstats #useR2016 pic.twitter.com/zYUFOeJXmY

— David Robinson (@drob) June 28, 2016

Dr. Godfrey is blind, and along with teaching and consulting on statistics, he develops tools to help vision-impaired statisticians work with R, such as an in-progress e-book and the BrailleR package. For example, BrailleR can convert visualizations into text descriptions that can be understood via screen readers or Braille keyboards:

```
library(BrailleR)
x <- rnorm(1000)
VI(hist(x))
```

```
## This is a histogram, with the title: Histogram of x
## "x" is marked on the x-axis.
## Tick marks for the x-axis are at: -3, and 3
## There are a total of 1000 elements for this variable.
## Tick marks for the y-axis are at: 0, 50, 100, and 150
## It has 14 bins with equal widths, starting at -3.5 and ending at 3.5 .
## The mids and counts for the bins are:
## mid = -3.25 count = 3
## mid = -2.75 count = 7
## mid = -2.25 count = 23
## mid = -1.75 count = 39
## mid = -1.25 count = 104
## mid = -0.75 count = 146
## mid = -0.25 count = 186
## mid = 0.25 count = 171
## mid = 0.75 count = 153
## mid = 1.25 count = 102
## mid = 1.75 count = 36
## mid = 2.25 count = 19
## mid = 2.75 count = 6
## mid = 3.25 count = 5
```

Talking to him made me realize what great strides had been made in statistics and programming for the blind (here’s more on that general topic), but also what obstacles remained for R in particular. I take RStudio for granted, but according to Jonathan it’s effectively unusable for blind users (too many buttons, tabs and drop-down menus, which are difficult to navigate with a screen reader). To address this, he’s been working on the WriteR IDE for accessible programming in R and R markdown:

Jonathan Godfrey talks WriteR: IDE accessible to blind. Looking for Python-savvy volunteers to contribute #useR2016 pic.twitter.com/HEYum541jE

— David Robinson (@drob) June 29, 2016

(You can find the video of his talk on WriteR, and on the pitfalls of markdown for blind users, here).

I also asked him about the topic of data sonification to replace visualization for blind scientists. Jonathan was very skeptical: among other issues, he pointed out that sight and sound often provide different but complementary channels of information, which is the reason sighted statisticians can find sonification useful. He also noted that he often works with at least one sighted collaborator, so there’s still an opportunity for visualization to surprise someone in ways a model cannot. I don’t know much about the topic and I wonder if there are other perspectives.

Probably my favorite part of visiting a conference is the people I got to see, whether meeting them for the first time or seeing them again.

Table of #useR2016 folks, chatting, as one does, about how much fun it is to be called Dr. pic.twitter.com/goxtTlmNcE

— David Robinson (@drob) July 1, 2016

Aside from the people I’ve already mentioned (and many others), it was great to meet up with the Hopkins/ex-Hopkins Biostatistics crowd.

Four statisticians setting up an eposter

"Does anyone know how to use Windows?" #JSM2016 @jtleek @acfrazee @hspter pic.twitter.com/d1hfD4ElBi

— David Robinson (@drob) July 31, 2016

Hilary Parker chaired an excellent session on data science in industry. I missed Jeff Leek’s talk since mine was at the same time, but our feud did make an appearance in his slides:

😄😄😄 @jtleek s/o to @drob and the actual reasons people use methods #JSM2016 pic.twitter.com/gRHXYgEbjh

— Hilary Parker (@hspter) August 1, 2016

I'm OK with this analogy from @jtleek's talk because I get to be Batman #rstats #JSM2016 pic.twitter.com/nBpoW7qunx

— David Robinson (@drob) August 4, 2016

I was happy to see my former colleague Chee Chen present at JSM on work he and I had done together:

Rather than defining 1 p-value threshold for FDR control, fFDR has threshold vary w/informative variable Z #JSM2016 pic.twitter.com/HQAUAyRrPG

— David Robinson (@drob) August 4, 2016

And I got to see my former adviser John Storey, who has always been a compelling statistical communicator.

This slide by @johnstorey sets great example for math presentation- graphics help audience follow equations #JSM2016 pic.twitter.com/svDULAXZyc

— David Robinson (@drob) August 1, 2016

Among the new people I got to meet were Amelia McNamara and Craig Citro, who gave me a run for my money at talking quickly:

fast_talking_R_mafia <- c("@drob", "@AmeliaMN", "@craigcitro") #useR2016 pic.twitter.com/jLrAzRiPrM

— Karthik Ram (@_inundata) July 1, 2016

There were so many other people and so many other talks I could have mentioned here (if you scroll through my twitter feed under #JSM2016 and #useR2016 you’d see a lot more). Overall I’m very glad I attended both, and that I had the chance to learn from and contribute to the great R and statistics communities. And I’m glad I wasn’t too #serious to share that.

Story of how I once was a #seriousacademic, but Twitter turned me into a #sillydatascientist https://t.co/QGZRJbOUZM

— David Robinson (@drob) August 5, 2016


When someone tries to claim that a #seriousacademic should not use twitter… pic.twitter.com/ocL753NFSw

— Kitty Kat, PhD. (@academickitty) August 18, 2016

One part of the article that struck me as especially misguided was the author’s dismissal of conference tweeting:

I see more and more of them live tweeting and hashtagging their way through events…. When did it become acceptable to use your phone throughout a lecture, let alone an entire conference? No matter how good you think you are at multitasking, you will not be truly focusing your attention on the speaker, who has no doubt spent hours preparing for this moment.

I personally haven’t been a #seriousacademic in over a year, having since become a #sillydatascientist. So I felt no shame in live-tweeting the heck out of the two conferences I attended this summer: userR and JSM (Joint Statistical Meetings).

There’s a great analysis here of why people tweet during conferences. For me Twitter best serves as sort of **“public diary”**– I’m not very good at taking notes, and tweeting lets me create a record of what I found most interesting that I can refer to later. So as the summer ends, I’m looking back at my “notes” from these two conferences, and sharing my thoughts.

(If you’re a Serious Academic who is against live-tweeting of conferences, get out now, this post isn’t going to get any better.)

At both conferences I gave a talk on my broom package for tidying model outputs. You can find the slides here, along with the useR video. I was especially impressed by the turnout for my talk at the useR conference, where I got to introduce broom to a large audience.

"Tidy data works until you start doing statistical modeling" – @drob (creator of awesome 'broom' package) #useR2016 pic.twitter.com/DGHwXIakFi

— Mikhail Popov (@bearloga) June 29, 2016

Slides from my #useR2016 talk yesterday are here: https://t.co/7brRt7eBV0 #rstats pic.twitter.com/24BOG0DHdl

— David Robinson (@drob) June 29, 2016

At JSM I also got to present an e-poster about some of the analysis of software developers I’ve done at Stack Overflow:

#JSM2016 folks: come see my e-poster on what @StackOverflow data tells us about landscape of software development pic.twitter.com/nkzcOYZsBK

— David Robinson (@drob) August 1, 2016

You can find the full slides here. The short version is that we can cluster programming languages based on which get used together:

This also serves as a useful 2-dimensional layout for comparing technologies. For example, we could visualize which areas of the developer landscape are currently growing (red) or shrinking (blue), in terms of % of Stack Overflow questions:

For instance, we can see that the data science cluster (including R and machine-learning) is growing in its share of Stack Overflow questions, while the C/C++ cluster below it is mostly shrinking.

Many of my favorite talks were about data science education. This included my single favorite talk of either session- Deborah Nolan’s keynote address at useR on “Statistical Thinking in a Data Science Course” (video here):

Just imagine #stats course without simplified scenarios, canned data, non-coding, & always normal distrib #useR2016 pic.twitter.com/qsm7qT4rf2

— Alice Data (@alice_data) June 30, 2016

I’d heard these problems with statistical education described before, but never with this much clarity and evidence.

But alongside that talk, the talks about education at both conferences were uniformly excellent.

.@AmeliaMN shares an HS data science curriculum, including 400-page Introduction to Data Science doc. Wow! #useR2016 https://t.co/Czh2ViaaF4

— David Robinson (@drob) June 28, 2016

Some of the common themes included:

**That programming should be taught alongside statistics, and not as an afterthought**– this was a nearly universal complaint among educators who have had to deal with statistics curricula.**That students should be able to do powerful things immediately**– many of the speakers focused on this point, and this is part of the foundation of my opinion that instructors should teach ggplot2 first. (Jeff Leek was also at JSM, where he doubled down on his dissenting opinion that plotting should be either difficult or ugly).**That we should teach permutation and boostrapping rather than normal theory**– this is a promising way to make statistics more intuitive and focus less on math early on. I did discuss with some people that this is a very frequentist approach to statistical education. What would be an equivalent math-lite Bayesian introduction?

"Randomize, Repeat, Reject"- Deborah Nolan suggests teaching permutation+bootstrap rather than normal theory #useR2016

— David Robinson (@drob) June 30, 2016

**That reproducible research should be a core part of education.**This brings me to another great set of talks:

Like much of the R community, I’ve internalized a lot of the lessons and practices of reproducible research. (For instance, these blog posts are reproducible from R markdown files in this directory)). But it was still great to hear people communicate these messages well, and the reproducibility session chaired by Amelia McNamara was an all-star cast.

.@minebocek makes an important point- requiring reproducibility makes work easier for students, not harder #JSM2016 pic.twitter.com/RU1CWHgbwy

— David Robinson (@drob) August 3, 2016

.@kwbroman's colleague was "sorry you did all that work on incomplete dataset"- reproducibility for the win #JSM2016 pic.twitter.com/cUTR2LRA1J

— David Robinson (@drob) August 3, 2016

I particularly liked Yihui Xie’s idea about teaching reproducibility:

.@xieyihui had "evil" idea to teach reproducibility:

Week 1: Students analyze a dataset

Week 2: "I updated the data, start over"#JSM2016

— David Robinson (@drob) August 3, 2016

(Interestingly, a number of responses were along the lines of “How is that evil, that’s exactly what I’ve been doing!”)

In the same session, Karthik Ram from rOpenSci described JOSS, the Journal of Open Source Software. This is an excellent way to get citations and credit for software, and therefore for scientists to think of software packages (and not just papers) as units of research output.

.@_inundata promotes Journal of Open Source Software, w/ tidytext article by @juliasilge+me as example #JSM2016 pic.twitter.com/IOdF9Vt2hK

— David Robinson (@drob) August 3, 2016

Many of my other favorite talks were about making online interactive visualizations.

Screw the dance party; after this #JSM2016 session I sorta want to spend tonight making interactive graphs #rstats pic.twitter.com/18otBYdgLP

— David Robinson (@drob) August 2, 2016

A lot of my knowledge about interactive graphics revolves around Shiny. Shiny’s a terrific tool, but it requires an R backend, which requires some effort and cost for deployment and scaling. I was excited to see what people were up to about building interactive graphics in R that could be deployed entirely in HTML and Javascript.

.@jcheng demos crosstalk #rstats pkg for interactive web graphs- define in R, deploy in Javascript. Wow! #JSM2016 pic.twitter.com/iyEwOBEiiJ

— David Robinson (@drob) August 2, 2016

Ryan Hafen’s rbokeh package is really exciting and something I hadn’t seen before (like others, I thought of Bokeh as a Python visualization package, but it turns out the backend is flexible). Since it plots in an HTML canvas it also doesn’t need an R backend, and I appreciated how the syntax used the `%>%`

pipe and followed the grammar of graphics.

.@hafenstats trying 20 examples of rbokeh in 5 minutes https://t.co/EOWEfeFW0x #useR2016 pic.twitter.com/p2UWAEVqzj

— David Robinson (@drob) June 30, 2016

Carson’s Sievert has taken the “make ggplot2 interactive” conversation to a whole new level, though, with the plotly package, an R interface to the popular Plotly software that among other features includes conversion of ggplot2 objects into interactive plotly graphs.

Teaser for my #JSM2016 on interactive graphics with @plotlygraphs #rstats pic.twitter.com/koAqMg7t8j

— Carson Sievert (@cpsievert) August 1, 2016

I think Yihui Xie might have won the day, though. He not only led a terrific discussion of the previous talks (packed with GIFs, as is his habit), but also demonstrated a Shiny app that does something I’d never seen before in R:

.@xieyihui is customizing graphs with his voice and the room is Flipping. Out. #JSM2016 #rstats pic.twitter.com/HlFOL7rXGK

— David Robinson (@drob) August 2, 2016

.@xieyihui: "Change title to Make America Great Again", graph complies. Statistical applications are clear #JSM2016 https://t.co/1EhesE3s1D

— David Robinson (@drob) August 2, 2016

(Incidentally, I did end up going to the dance party, and *still* haven’t had the chance to use these interactive graphics for more than toy problems. Looking forward to the right opportunity to use them!)

RStudio has been one of the most influential companies in the modern R world, not only developing the eponymous IDE but supporting many important open source packages, and the company was at both conferences in full force.

RStudio CEO J.J. Allaire talked about the new interactive notebook features of the RStudio IDE, analogous to Jupyter notebooks:

comparison betw R markdown & notebooks in JJ Allaire’s intro to @RStudio’s awesome new Rmd notebooks #UseR2016 pic.twitter.com/4QFp8hhBuu

— Karl Broman (@kwbroman) June 29, 2016

Hadley Wickham gave a useR keynote that pushed for a seismic shift in terminology:

.@hadleywickham proposes we stop saying "Hadleyverse", start saying "tidyverse" #useR2016 #rstats pic.twitter.com/Z4zs4tw2Vn

— David Robinson (@drob) June 29, 2016

He also stepped in for Yihui Xie to introduce Yihui’s terrific bookdown package:

The philosophy behind #rstats bookdown: easy, community-written, freq updated @hadleywickham + @xieyihui #useR2016 pic.twitter.com/WudsaLbHzY

— David Robinson (@drob) June 30, 2016

I was inspired enough by this package that one week later Julia Silge and I started writing the book Tidy Text Mining in R. Bookdown has been a real treat: it’s solved so many of the hassles around creating HTML and PDF manuscripts from knitr.

I was excited to meet RStudio’s Garrett Grolemund for the first time and talk statistics, education and R. He’s the creator of an awesome set of cheat sheets (available online) on R, and grabbing some hard copies at the RStudio booth was a nice bonus.

#useR2016 people: remember to stop by @rstudio booth to pick up awesome cheatsheets by @StatGarrett pic.twitter.com/9KWE9J416S

— David Robinson (@drob) June 30, 2016

Finally, I was *crazy* excited that RStudio had printed my first set of broom hex stickers:

@drob i have this for you! pic.twitter.com/8fpb2v738x

— Hadley Wickham (@hadleywickham) June 28, 2016

(The stickers are available on stickermule, and I’ll usually have some on hand at conferences and meetups if you run into me).

There are a lot of important ongoing conversations about diversity and inclusion in the R community (e.g. the TaskForce on Women in R, which presented on its findings at useR). But Jonathan Godfrey, a lecturer at Massey University in New Zealand, alerted me to another dimension of diversity I hadn’t considered before.

Jonathan Godfrey showing us how he uses a Braille keyboard to develop R code. Just amazing. #rstats #useR2016 pic.twitter.com/zYUFOeJXmY

— David Robinson (@drob) June 28, 2016

Dr. Godfrey is blind, and along with teaching and consulting on statistics, he develops some tools for helping vision impaired statisticians work with R, such as an in-progress e-book and the BrailleR package. One example of the BrailleR package is converting visualizations so that they could be understood by screen readers or Braille keyboards:

Talking to him made me realize what great strides had been made in statistics and programming for the blind (here’s more on that general topic), but also what obstacles remained for R in particular. I take RStudio for granted, but according to Jonathan it’s effectively unusable for blind users (too many buttons, tabs and drop-down menus, which are difficult to navigate with a screenreader). Towards that goal he’s been working on the WriteR IDE for accessible programming in R and R markdown:

Jonathan Godfrey talks WriteR: IDE accessible to blind. Looking for Python-savvy volunteers to contribute #useR2016 pic.twitter.com/HEYum541jE

— David Robinson (@drob) June 29, 2016

(You can find the video of his talk on WriteR, and on the pitfalls of markdown for blind users, here).

I also asked him about the topic of data sonification to replace visualization for blind scientists. Jonathan was very skeptical: among other issues, he pointed out that sight and sound often provide different but complementary channels of information, which is why sighted statisticians can find sonification useful. He also noted that he often works with at least one sighted collaborator, so there’s still an opportunity for visualization to surprise someone in ways a model cannot. I don’t know much about the topic and wonder if there are other perspectives.

Probably my favorite part of visiting a conference was the people I got to see, whether meeting them for the first time or seeing them again.

Table of #useR2016 folks, chatting, as one does, about how much fun it is to be called Dr. pic.twitter.com/goxtTlmNcE

— David Robinson (@drob) July 1, 2016

Aside from the people I’ve already mentioned (and many others), it was great to meet up with the Hopkins/ex-Hopkins Biostatistics crowd.

Four statisticians setting up an eposter

"Does anyone know how to use Windows?"#JSM2016 @jtleek @acfrazee @hspter pic.twitter.com/d1hfD4ElBi

— David Robinson (@drob) July 31, 2016

Hilary Parker chaired an excellent session on data science in industry. I missed Jeff Leek’s talk since mine was at the same time, but our feud did make an appearance in his slides:

@jtleek s/o to @drob and the actual reasons people use methods #JSM2016 pic.twitter.com/gRHXYgEbjh

— Hilary Parker (@hspter) August 1, 2016

I'm OK with this analogy from @jtleek's talk because I get to be Batman #rstats #JSM2016 pic.twitter.com/nBpoW7qunx

— David Robinson (@drob) August 4, 2016

I was happy to see my former colleague Chee Chen present at JSM on work he and I had done together:

Rather than defining 1 p-value threshold for FDR control, fFDR has threshold vary w/informative variable Z #JSM2016 pic.twitter.com/HQAUAyRrPG

— David Robinson (@drob) August 4, 2016

And see my former adviser John, who has always been a compelling statistical communicator.

This slide by @johnstorey sets great example for math presentation- graphics help audience follow equations #JSM2016 pic.twitter.com/svDULAXZyc

— David Robinson (@drob) August 1, 2016

Among the new people I got to meet were Amelia McNamara and Craig Citro, who gave me a run for my money at talking quickly:

fast_talking_R_mafia <- c(" @drob", "@AmeliaMN", "@craigcitro")#User2016 pic.twitter.com/jLrAzRiPrM

— Karthik Ram (@_inundata) July 1, 2016

There were so many other people and so many other talks I could have mentioned here (if you scroll through my twitter feed under #JSM2016 and #useR2016 you’d see a lot more). Overall I’m very glad I attended both, and that I had the chance to learn from and contribute to the great R and statistics communities. And I’m glad I wasn’t too #serious to share that.

Story of how I once was a #seriousacademic, but Twitter turned me into a #sillydatascientist https://t.co/QGZRJbOUZM

— David Robinson (@drob) August 5, 2016

To **leave a comment** for the author, please follow the link and comment on their blog: ** Variance Explained**.


(This article was first published on ** Mood Stochastic**, and kindly contributed to R-bloggers)

Version 0.1-3 of the Rborist Random Forest package can now be downloaded from CRAN. This version follows closely on the short-lived 0.1-2, which failed to install on Solaris.

The new version features incremental performance improvements, as well as internal changes needed to support a sparse representation envisioned for the next release. All reported bugs have been repaired, including a problem with small forest sizes revealed by the caret package.
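For readers new to the package, a minimal sketch of a fit (this assumes the `Rborist(x, y)` calling convention, predictors first and response second; consult the package documentation for the full interface and its defaults):

```r
# A minimal sketch, assuming the Rborist(x, y) calling convention:
# predictors first, response second, as with most random-forest fitters.
library(Rborist)

x <- mtcars[, c("hp", "wt", "disp")]
y <- mtcars$mpg

rb   <- Rborist(x, y)     # train a regression forest
pred <- predict(rb, x)    # predictions on (here) the training data
```
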

Thanks are due to a number of people whose comments and contributions helped ensure a good release. These include Carlos Ortega, Christopher Brown, Junwen Huang, David Shaub and the caret team.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Mood Stochastic**.


(This article was first published on ** Mango Solutions » R Blog**, and kindly contributed to R-bloggers)

On September 13th-15th, Mango Solutions are running the EARL (Effective Applications of the R Language) Conference for all users, enthusiasts and beginners of the R programming language.

It is an event not to be missed, and here are the Top 10 Reasons why…

**1. Amazing Venue**

The Tower Hotel in London, with its amazing view, is the venue again for this year’s EARL Conference. Nestled between the River Thames and St Katharine’s Dock, alongside two World Heritage Sites (Tower Bridge and the Tower of London), the Tower Hotel is within easy reach of key places in London.

**2. Three Streams of Expert Speakers**

With speakers from Syngenta, eBay, the FT, Worldpay, Oracle, The British Museum, Microsoft and Capital One, to name a few, you can expect an extremely high calibre of talks. With three presentation streams you will be spoilt for choice.

Check out the full agenda here

**3. Six Keynote Speakers**

With so much experience between them, the keynote speeches on Wednesday and Thursday are not to be missed.

Kenneth Cukier – The Economist @kncukier

Garrett Grolemund – RStudio @StatGarrett

David Smith – Revolution Analytics @revodavid

Joe Cheng – RStudio @jcheng

Lou Bajuk-Yorgan – TIBCO Spotfire @LouBajuk

Gabor Csardi – Mango Solutions @GaborCsardi

**4. Six Great Workshops**

The conference comprises one day of workshops and two days devoted to the most innovative R implementations by the world’s leading practitioners. *Due to high demand, some of our workshops are now sold out!*

The workshops, **which are also open to non-conference attendees**, are as follows:

Workshop 1: Advanced Shiny Workshop – Full day **SOLD OUT!**

Workshop 2: A Crash Course in R – Full day **SOLD OUT!**

Workshop 3: Introduction to ggplot2 – Half day

Workshop 4: Using R with Microsoft Office Products – Half day **SOLD OUT!**

Workshop 5: Getting Started with Shiny – Half day

Workshop 6: Package Development in R – Half day **SOLD OUT!**

**5. The Exclusive Conference Reception at The Tower of London**

The Conference Main Evening Reception will take place on Wednesday 14th September in the historic Tower of London. An event not to be missed, it includes a private tour of the Tower and Crown Jewels led by the famous Beefeaters, champagne on arrival and canapes and drinks throughout the evening.

The reception will take place in the White Tower where guests can view Henry VIII’s suits of armour and see where the remains of the Princes in the Tower were discovered. Please note there is limited availability for this event so reserve your ticket now.

We are grateful to Microsoft for sponsoring this amazing event.


**6. Varied Range of Topics on the Agenda**

With over 48 presentations, you are sure to find many that suit your needs. See the full agenda here.

**7. Two Panel Discussions**

For the first time this year, the final slot on Wednesday and Thursday will feature a panel discussion in one of the streams. The discussions are:

- **Creating a Corporate R Infrastructure**
- **Fostering an R Culture in a Commercial Organization**

To find out more about who will be taking part, please see the Agenda.

**8. Networking**

The EARL Conference offers a great chance to network with a diverse range of R people from a wide range of industries and levels of expertise. With lots of breaks throughout the days and two drinks receptions, there are plenty of opportunities to meet like-minded professionals.

**9. For your Personal Growth and Development**

It goes without saying that you are going to learn so much at the conference.

**10. Look at the statistics from last year’s EARL London Conference**

70% of people who took the survey and attended EARL London 2015 said that they would **Probably, Very Probably or Definitely** attend this year’s EARL Conference, with 19% of people saying they

To get your ticket to this great R conference, click here.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Mango Solutions » R Blog**.


(This article was first published on ** R – Tech and Mortals**, and kindly contributed to R-bloggers)

*Developer*: Akash Tandon

*Mentors*: Joshua Ulrich, Toby Dylan Hocking

*Official Project Link*: Rperform: performance analysis of R package code

This project dealt primarily with developing Rperform’s functionality to let developers gauge the potential performance impact of a pull request (PR) without having to merge it, extending the package’s existing performance-metric measurement and visualization functions, and building a coherent user interface for developers to interact with.

Rperform is a package that allows R developers to track and visualize quantitative performance metrics of their code.

It focuses on providing changes in a package’s performance metrics, related to runtime and memory, over different git versions and across git branches. Rperform can be integrated with Travis-CI to do performance testing during Travis builds by making changes to the repo’s .travis.yml file. **It can prove to be particularly useful while measuring the possible changes which can be introduced by a pull request (PR).**

More information about the package can be found on its Github Wiki.

- Commits made to Rperform’s master branch can be accessed here.
- Created a test and demo package for Rperform. It can be found here. List of 8 commits made to this package during GSoC 2016 is given at the end of this document.
- Created the tutorials, Using Rperform with packages and Using Rperform with Travis CI.
- Created the wiki page, Obtaining package performance data using Rperform, and updated the wiki pages, Integrating Rperform with Travis CI and Plotting package metrics with Rperform.

- *Visualization*: Implemented plot_branchmetrics() and its helper functions. This function can be used to compare code performance across two branches. More information about the same can be found here.
- *Travis CI integration*: Wrote functions and shell scripts to allow for performance testing using Rperform during a package repo’s Travis CI builds. More information about using Rperform with Travis CI can be found by going through this wiki page and this tutorial.
- *Refactoring*: Refactored portions of existing code.
- *Documentation*: Updated the R package’s documentation and Github wiki.
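To give a flavour of the branch comparison mentioned above, here is a hedged sketch of a `plot_branchmetrics()` call; the function name comes from the report itself, but the argument names below are illustrative assumptions, so check the Rperform wiki for the actual signature:

```r
# Hypothetical call shape -- argument names here are illustrative
# assumptions, not verified signatures; see the Rperform wiki.
library(Rperform)  # installed from GitHub, e.g. via devtools

plot_branchmetrics(test_path = "tests/testthat/test-example.R",
                   metric    = "time",
                   branch1   = "my-feature-branch",
                   branch2   = "master")
```
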

- *Interactive visualizations*: Work was done towards implementing interactive visualizations in Rperform. The animint package was used for this. However, more work needs to be done.
- *User interface*: As of now, the webpages containing results from performance tests during Travis builds are static. Work needs to be done towards implementing a coherent UI.

**Note:
This report is also available as a Github gist.
If you read and liked the report, sharing it would be a good next step.
Drop me a mail, or hit me up on Twitter or Quora in case you want to get in touch.**

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Tech and Mortals**.


(This article was first published on ** blogR**, and kindly contributed to R-bloggers)

Residuals. Now there’s something to get you out of bed in the morning!

OK, maybe residuals aren’t the sexiest topic in the world. Still, they’re an essential means of identifying potential problems with any statistical model. For example, the residuals from a linear regression model should be homoscedastic. If not, this indicates an issue with the model, such as non-linearity in the data.

This post will cover various methods for visualising residuals from regression-based models. Here are some examples of the visualisations that we’ll be creating:

To get the most out of this post, there are a few things you should be aware of. Firstly, if you’re unfamiliar with the meaning of residuals, or what seems to be going on here, I’d recommend that you first do some introductory reading on the topic. Some places to get started are Wikipedia and this excellent section on Statwing.

You’ll also need to be familiar with running regression (linear and logistic) in R, and using the following packages: ggplot2 to produce all graphics, and dplyr and tidyr to do data manipulation. In most cases, you should be able to follow along with each step, but it will help if you’re already familiar with these.

Before diving in, it’s good to remind ourselves of the default options that R has for visualising residuals. Most notably, we can directly `plot()` a fitted regression model. For example, using the `mtcars` data set, let’s regress the number of miles per gallon for each car (`mpg`) on their horsepower (`hp`) and visualise information about the model and residuals:

```
fit <- lm(mpg ~ hp, data = mtcars) # Fit the model
summary(fit) # Report the results
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.7121 -2.1122 -0.8854 1.5819 8.2360
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
#> hp -0.06823 0.01012 -6.742 1.79e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.863 on 30 degrees of freedom
#> Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
#> F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(fit) # Plot the model information
```

```
par(mfrow = c(1, 1)) # Return plotting panel to 1 section
```

These plots provide a traditional method to interpret residual terms and determine whether there might be problems with our model. We’ll now be thinking about how to supplement these with some alternative (and more visually appealing) graphics.

The general approach behind each of the examples that we’ll cover below is to:

- Fit a regression model to predict variable (Y).
- Obtain the predicted and residual values associated with each observation on (Y).
- Plot the actual and predicted values of (Y) so that they are distinguishable, but connected.
- Use the residuals to make an aesthetic adjustment (e.g. a red colour when the residual is very high) to highlight points which are poorly predicted by the model.

We’ll start with simple linear regression, which is when we regress one variable on just one other. We can take the earlier example, where we regressed miles per gallon on horsepower.

First, we will fit our model. In this instance, let’s copy the `mtcars` dataset to a new object `d` so we can manipulate it later:

```
d <- mtcars
fit <- lm(mpg ~ hp, data = d)
```

Next, we want to get predicted and residual values to add supplementary information to this graph. We can do this as follows:

```
d$predicted <- predict(fit) # Save the predicted values
d$residuals <- residuals(fit) # Save the residual values
# Quick look at the actual, predicted, and residual values
library(dplyr)
d %>% select(mpg, predicted, residuals) %>% head()
#> mpg predicted residuals
#> Mazda RX4 21.0 22.59375 -1.5937500
#> Mazda RX4 Wag 21.0 22.59375 -1.5937500
#> Datsun 710 22.8 23.75363 -0.9536307
#> Hornet 4 Drive 21.4 22.59375 -1.1937500
#> Hornet Sportabout 18.7 18.15891 0.5410881
#> Valiant 18.1 22.93489 -4.8348913
```

Looking good so far.

Plotting these values takes a couple of intermediate steps. First, we plot our actual data as follows:

```
library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) + # Set up canvas with outcome variable on y-axis
geom_point() # Plot the actual points
```

Next, we plot the predicted values in a way that they’re distinguishable from the actual values. For example, let’s change their shape:

```
ggplot(d, aes(x = hp, y = mpg)) +
geom_point() +
geom_point(aes(y = predicted), shape = 1) # Add the predicted values
```

This is on track, but it’s difficult to see how our actual and predicted values are related. Let’s connect the actual data points with their corresponding predicted value using `geom_segment()`:

```
ggplot(d, aes(x = hp, y = mpg)) +
geom_segment(aes(xend = hp, yend = predicted)) +
geom_point() +
geom_point(aes(y = predicted), shape = 1)
```

We’ll make a few final adjustments:

- Clean up the overall look with `theme_bw()`.
- Fade out connection lines by adjusting their `alpha`.
- Add the regression slope with `geom_smooth()`:

```
library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") + # Plot regression slope
geom_segment(aes(xend = hp, yend = predicted), alpha = .2) + # alpha to fade lines
geom_point() +
geom_point(aes(y = predicted), shape = 1) +
theme_bw() # Add theme for cleaner look
```

Finally, we want to make an adjustment to highlight the size of the residual. There are MANY options. To make comparisons easy, I’ll make adjustments to the actual values, but you could just as easily apply these, or other changes, to the predicted values. Here are a few examples building on the previous plot:

```
# ALPHA
# Changing alpha of actual values based on absolute value of residuals
ggplot(d, aes(x = hp, y = mpg)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
# > Alpha adjustments made here...
geom_point(aes(alpha = abs(residuals))) + # Alpha mapped to abs(residuals)
guides(alpha = FALSE) + # Alpha legend removed
# <
geom_point(aes(y = predicted), shape = 1) +
theme_bw()
```

```
# COLOR
# High residuals (in absolute terms) made more red on actual values.
ggplot(d, aes(x = hp, y = mpg)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
# > Color adjustments made here...
geom_point(aes(color = abs(residuals))) + # Color mapped to abs(residuals)
scale_color_continuous(low = "black", high = "red") + # Colors to use here
guides(color = FALSE) + # Color legend removed
# <
geom_point(aes(y = predicted), shape = 1) +
theme_bw()
```

```
# SIZE AND COLOR
# Same coloring as above, size corresponding as well
ggplot(d, aes(x = hp, y = mpg)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
# > Color AND size adjustments made here...
geom_point(aes(color = abs(residuals), size = abs(residuals))) + # size also mapped
scale_color_continuous(low = "black", high = "red") +
guides(color = FALSE, size = FALSE) + # Size legend also removed
# <
geom_point(aes(y = predicted), shape = 1) +
theme_bw()
```

```
# COLOR UNDER/OVER
# Color mapped to residual with sign taken into account.
# i.e., whether actual value is greater or less than predicted
ggplot(d, aes(x = hp, y = mpg)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
# > Color adjustments made here...
geom_point(aes(color = residuals)) + # Color mapped here
scale_color_gradient2(low = "blue", mid = "white", high = "red") + # Colors to use here
guides(color = FALSE) +
# <
geom_point(aes(y = predicted), shape = 1) +
theme_bw()
```

I particularly like this last example, because the colours nicely help to identify non-linearity in the data. For example, we can see that there is more red for extreme values of `hp`, where the actual values are greater than what is being predicted. There is more blue in the centre, however, indicating that the actual values are less than what is being predicted. Together, this suggests that the relationship between the variables is non-linear, and might be better modelled by including a quadratic term in the regression equation.
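As a quick sketch of that last suggestion, we can fit the quadratic and compare it against the linear model (a fuller treatment would also inspect the new residual plots):

```r
fit_lin  <- lm(mpg ~ hp, data = mtcars)            # the model used above
fit_quad <- lm(mpg ~ hp + I(hp^2), data = mtcars)  # add a quadratic term

summary(fit_lin)$r.squared   # about 0.60, matching the earlier summary()
summary(fit_quad)$r.squared  # higher, consistent with the visual hint
anova(fit_lin, fit_quad)     # F-test of the quadratic term
```
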

Let’s crank up the complexity and get into multiple regression, where we regress one variable on two or more others. For this example, we’ll regress miles per gallon (`mpg`) on horsepower (`hp`), weight (`wt`), and displacement (`disp`).

```
# Select out data of interest:
d <- mtcars %>% select(mpg, hp, wt, disp)
# Fit the model
fit <- lm(mpg ~ hp + wt + disp, data = d)
# Obtain predicted and residual values
d$predicted <- predict(fit)
d$residuals <- residuals(fit)
head(d)
#> mpg hp wt disp predicted residuals
#> Mazda RX4 21.0 110 2.620 160 23.57003 -2.5700299
#> Mazda RX4 Wag 21.0 110 2.875 160 22.60080 -1.6008028
#> Datsun 710 22.8 93 2.320 108 25.28868 -2.4886829
#> Hornet 4 Drive 21.4 110 3.215 258 21.21667 0.1833269
#> Hornet Sportabout 18.7 175 3.440 360 18.24072 0.4592780
#> Valiant 18.1 105 3.460 225 20.47216 -2.3721590
```

Let’s create a relevant plot using ONE of our predictors, horsepower (`hp`). Again, we’ll start by plotting the actual and predicted values. In this case, plotting the regression slope is a little more complicated, so we’ll exclude it to keep the focus on the residuals.

```
ggplot(d, aes(x = hp, y = mpg)) +
geom_segment(aes(xend = hp, yend = predicted), alpha = .2) + # Lines to connect points
geom_point() + # Points of actual values
geom_point(aes(y = predicted), shape = 1) + # Points of predicted values
theme_bw()
```

Again, we can make all sorts of adjustments using the residual values. Let’s apply the same changes as the last plot above - with blue or red for actual values that are greater or less than their predicted values:

```
ggplot(d, aes(x = hp, y = mpg)) +
geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
geom_point(aes(color = residuals)) +
scale_color_gradient2(low = "blue", mid = "white", high = "red") +
guides(color = FALSE) +
geom_point(aes(y = predicted), shape = 1) +
theme_bw()
```

So far, there’s nothing new in our code. All that has changed is that the predicted values don’t line up neatly, because we’re now doing multiple regression.

Plotting one independent variable is all well and good, but the whole point of multiple regression is to investigate multiple variables!

To visualise this, we’ll make use of one of my favourite tricks: using the tidyr package to `gather()` our independent variable columns, and then using `facet_*()` in our ggplot to split them into separate panels. For relevant examples, see here, here, or here.

Let’s recreate the last example plot, but separately for each of our predictor variables.

```
d %>%
gather(key = "iv", value = "x", -mpg, -predicted, -residuals) %>% # Get data into shape
ggplot(aes(x = x, y = mpg)) + # Note use of `x` here and next line
geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
geom_point(aes(color = residuals)) +
scale_color_gradient2(low = "blue", mid = "white", high = "red") +
guides(color = FALSE) +
geom_point(aes(y = predicted), shape = 1) +
facet_grid(~ iv, scales = "free") + # Split panels here by `iv`
theme_bw()
```

Let’s try this out with another data set. We’ll use the `iris` data set, and regress `Sepal.Width` on all other variables (including the categorical variable, `Species`):

```
d <- iris
# Fit the model
fit <- lm(Sepal.Width ~ ., data = iris)
# Obtain predicted and residual values
d$predicted <- predict(fit)
d$residuals <- residuals(fit)
# Create plot
d %>%
gather(key = "iv", value = "x", -Sepal.Width, -predicted, -residuals) %>%
ggplot(aes(x = x, y = Sepal.Width)) +
geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
geom_point(aes(color = residuals)) +
scale_color_gradient2(low = "blue", mid = "white", high = "red") +
guides(color = FALSE) +
geom_point(aes(y = predicted), shape = 1) +
facet_grid(~ iv, scales = "free") +
theme_bw()
```

To make this plot, after the regression, the only change to our previous code was to change `mpg` to `Sepal.Width` in two places: the `gather()` and `ggplot()` lines.

We can now see how the actual and predicted values compare across our predictor variables. In case you’d forgotten, the coloured points are the actual data, and the white circles are the predicted values. With this in mind, we can see, as expected, that there is less variability in the predicted values than the actual values. It also appears that the sepal width of the setosa species is not as well accounted for as the other species.

To round this post off, let’s extend our approach to logistic regression. It’s going to require the same basic workflow, but we will need to extract predicted and residual values for the responses. Here’s an example predicting V/S (`vs`), which is 0 or 1, with `hp`:

```
# Step 1: Fit the data
d <- mtcars
fit <- glm(vs ~ hp, family = binomial(), data = d)
# Step 2: Obtain predicted and residuals
d$predicted <- predict(fit, type="response")
d$residuals <- residuals(fit, type = "response")
# Steps 3 and 4: plot the results
ggplot(d, aes(x = hp, y = vs)) +
geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
geom_point(aes(color = residuals)) +
scale_color_gradient2(low = "blue", mid = "white", high = "red") +
guides(color = FALSE) +
geom_point(aes(y = predicted), shape = 1) +
theme_bw()
```

If we only want to flag cases that would be scored as the incorrect category, we can do something like the following (with some help from the dplyr function, `filter()`):

```
ggplot(d, aes(x = hp, y = vs)) +
geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
geom_point() +
# > This plots large red circle on misclassified points
geom_point(data = d %>% filter(vs != round(predicted)),
color = "red", size = 2) +
# <
scale_color_gradient2(low = "blue", mid = "white", high = "red") +
guides(color = FALSE) +
geom_point(aes(y = predicted), shape = 1) +
theme_bw()
```

I’ll leave it to you to combine this with instructions from the previous sections if you’d like to extend it to multiple logistic regression. But, hopefully, you should now have a good idea of the steps involved and how to create these residual visualisations!
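For instance, one way to put those pieces together for multiple logistic regression (reusing the `gather()` and `facet_grid()` trick from the multiple regression section) is the following sketch:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Fit a logistic model with two predictors this time
d <- mtcars %>% select(vs, hp, wt)
fit <- glm(vs ~ hp + wt, family = binomial(), data = d)

# Response-scale predictions and residuals, as in the single-predictor case
d$predicted <- predict(fit, type = "response")
d$residuals <- residuals(fit, type = "response")

# One faceted panel per predictor
d %>%
  gather(key = "iv", value = "x", -vs, -predicted, -residuals) %>%
  ggplot(aes(x = x, y = vs)) +
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free") +
  theme_bw()
```
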

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

]]>
(This article was first published on ** blogR**, and kindly contributed to R-bloggers)

Residuals. Now there’s something to get you out of bed in the morning!

OK, maybe residuals aren’t the sexiest topic in the world. Still, they’re an essential element and means for identifying potential problems of any statistical model. For example, the residuals from a linear regression model should be homoscedastic. If not, this indicates an issue with the model such as non-linearity in the data.

This post will cover various methods for visualising residuals from regression-based models. Here are some examples of the visualisations that we’ll be creating:

To get the most out of this post, there are a few things you should be aware of. Firstly, if you’re unfamiliar with the meaning of residuals, or what seems to be going on here, I’d recommend that you first do some introductory reading on the topic. Some places to get started are Wikipedia and this excellent section on Statwing.

You’ll also need to be familiar with running regression (linear and logistic) in R, and using the following packages: ggplot2 to produce all graphics, and dplyr and tidyr to do data manipulation. In most cases, you should be able to follow along with each step, but it will help if you’re already familiar with these.

Before diving in, it’s good to remind ourselves of the default options that R has for visualising residuals. Most notably, we can directly `plot()`

a fitted regression model. For example, using the `mtcars`

data set, let’s regress the number of miles per gallon for each car (`mpg`

) on their horsepower (`hp`

) and visualise information about the model and residuals:

```
fit <- lm(mpg ~ hp, data = mtcars) # Fit the model
summary(fit) # Report the results
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.7121 -2.1122 -0.8854 1.5819 8.2360
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
#> hp -0.06823 0.01012 -6.742 1.79e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.863 on 30 degrees of freedom
#> Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
#> F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(fit) # Plot the model information
```

```
par(mfrow = c(1, 1)) # Return plotting panel to 1 section
```

These plots provide a traditional method to interpret residual terms and determine whether there might be problems with our model. We’ll now be thinking about how to supplement these with some alternative (and more visually appealing) graphics.

The general approach behind each of the examples that we’ll cover below is to:

- Fit a regression model to predict variable (Y).
- Obtain the predicted and residual values associated with each observation on (Y).
- Plot the actual and predicted values of (Y) so that they are distinguishable, but connected.
- Use the residuals to make an aesthetic adjustment (e.g. red colour when residual in very high) to highlight points which are poorly predicted by the model.

We’ll start with simple linear regression, which is when we regress one variable on just one other. We can take the earlier example, where we regressed miles per gallon on horsepower.

First, we will fit our model. In this instance, let’s copy the `mtcars`

dataset to a new object `d`

so we can manipulate it later:

```
d <- mtcars
fit <- lm(mpg ~ hp, data = d)
```

Next, we want to get predicted and residual values to add supplementary information to this graph. We can do this as follows:

```
d$predicted <- predict(fit) # Save the predicted values
d$residuals <- residuals(fit) # Save the residual values
# Quick look at the actual, predicted, and residual values
library(dplyr)
d %>% select(mpg, predicted, residuals) %>% head()
#> mpg predicted residuals
#> Mazda RX4 21.0 22.59375 -1.5937500
#> Mazda RX4 Wag 21.0 22.59375 -1.5937500
#> Datsun 710 22.8 23.75363 -0.9536307
#> Hornet 4 Drive 21.4 22.59375 -1.1937500
#> Hornet Sportabout 18.7 18.15891 0.5410881
#> Valiant 18.1 22.93489 -4.8348913
```

Looking good so far.

Plotting these values takes a couple of intermediate steps. First, we plot our actual data as follows:

```
library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) + # Set up canvas with outcome variable on y-axis
geom_point() # Plot the actual points
```

Next, we plot the predicted values in a way that they’re distinguishable from the actual values. For example, let’s change their shape:

```
ggplot(d, aes(x = hp, y = mpg)) +
geom_point() +
geom_point(aes(y = predicted), shape = 1) # Add the predicted values
```

This is on track, but it’s difficult to see how our actual and predicted values are related. Let’s connect the actual data points with their corresponding predicted value using `geom_segment()`

:

```
ggplot(d, aes(x = hp, y = mpg)) +
geom_segment(aes(xend = hp, yend = predicted)) +
geom_point() +
geom_point(aes(y = predicted), shape = 1)
```

We’ll make a few final adjustments:

- Clean up the overall look with
`theme_bw()`

. - Fade out connection lines by adjusting their
`alpha`

. - Add the regression slope with
`geom_smooth()`

:

```
library(ggplot2)
ggplot(d, aes(x = hp, y = mpg)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") + # Plot regression slope
geom_segment(aes(xend = hp, yend = predicted), alpha = .2) + # alpha to fade lines
geom_point() +
geom_point(aes(y = predicted), shape = 1) +
theme_bw() # Add theme for cleaner look
```

Finally, we want to make an adjustment to highlight the size of the residual. There are MANY options. To make comparisons easy, I’ll make adjustments to the actual values, but you could just as easily apply these, or other changes, to the predicted values. Here are a few examples building on the previous plot:

```
# ALPHA
# Changing alpha of actual values based on absolute value of residuals
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  # > Alpha adjustments made here...
  geom_point(aes(alpha = abs(residuals))) +  # Alpha mapped to abs(residuals)
  guides(alpha = FALSE) +                    # Alpha legend removed
  # <
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()
```

```
# COLOR
# High residuals (in absolute terms) made more red on actual values.
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  # > Color adjustments made here...
  geom_point(aes(color = abs(residuals))) +              # Color mapped to abs(residuals)
  scale_color_continuous(low = "black", high = "red") +  # Colors to use here
  guides(color = FALSE) +                                # Color legend removed
  # <
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()
```

```
# SIZE AND COLOR
# Same coloring as above, size corresponding as well
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  # > Color AND size adjustments made here...
  geom_point(aes(color = abs(residuals), size = abs(residuals))) +  # size also mapped
  scale_color_continuous(low = "black", high = "red") +
  guides(color = FALSE, size = FALSE) +  # Size legend also removed
  # <
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()
```

```
# COLOR UNDER/OVER
# Color mapped to residual with sign taken into account.
# i.e., whether actual value is greater or less than predicted
ggplot(d, aes(x = hp, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  # > Color adjustments made here...
  geom_point(aes(color = residuals)) +                                # Color mapped here
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +  # Colors to use here
  guides(color = FALSE) +
  # <
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()
```

I particularly like this last example, because the colours nicely help to identify non-linearity in the data. For example, we can see that there is more red for extreme values of `hp`, where the actual values are greater than what is being predicted. There is more blue in the centre, however, indicating that the actual values are less than what is being predicted. Together, this suggests that the relationship between the variables is non-linear, and might be better modelled by including a quadratic term in the regression equation.
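One quick way to test that hunch (a sketch of my own, not part of the original walkthrough) is to refit the model with a quadratic term via `I(hp^2)` and compare the two fits:

```
# Compare the simple linear model to one with a quadratic hp term
fit1 <- lm(mpg ~ hp, data = mtcars)
fit2 <- lm(mpg ~ hp + I(hp^2), data = mtcars)

summary(fit1)$r.squared  # Variance explained by the linear model
summary(fit2)$r.squared  # Should be higher if the relationship curves

anova(fit1, fit2)  # F-test for whether the quadratic term helps
```

If the quadratic model explains noticeably more variance, that backs up what the red-and-blue pattern in the plot suggests.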

Let’s crank up the complexity and get into multiple regression, where we regress one variable on two or more others. For this example, we’ll regress miles per gallon (`mpg`) on horsepower (`hp`), weight (`wt`), and displacement (`disp`).

```
# Select out data of interest:
d <- mtcars %>% select(mpg, hp, wt, disp)
# Fit the model
fit <- lm(mpg ~ hp + wt + disp, data = d)
# Obtain predicted and residual values
d$predicted <- predict(fit)
d$residuals <- residuals(fit)
head(d)
#>                    mpg  hp    wt disp predicted  residuals
#> Mazda RX4         21.0 110 2.620  160  23.57003 -2.5700299
#> Mazda RX4 Wag     21.0 110 2.875  160  22.60080 -1.6008028
#> Datsun 710        22.8  93 2.320  108  25.28868 -2.4886829
#> Hornet 4 Drive    21.4 110 3.215  258  21.21667  0.1833269
#> Hornet Sportabout 18.7 175 3.440  360  18.24072  0.4592780
#> Valiant           18.1 105 3.460  225  20.47216 -2.3721590
```

Let’s create a relevant plot using ONE of our predictors, horsepower (`hp`). Again, we’ll start by plotting the actual and predicted values. In this case, plotting the regression slope is a little more complicated, so we’ll exclude it to stay focused.

```
ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +  # Lines to connect points
  geom_point() +                                                # Points of actual values
  geom_point(aes(y = predicted), shape = 1) +                   # Points of predicted values
  theme_bw()
```

Again, we can make all sorts of adjustments using the residual values. Let’s apply the same changes as in the last plot above, with red or blue for actual values that are greater or less than their predicted values:

```
ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()
```

So far, there’s not anything new in our code. All that has changed is that the predicted values don’t line up neatly, because we’re now doing multiple regression.

Plotting one independent variable is all well and good, but the whole point of multiple regression is to investigate multiple variables!

To visualise this, we’ll make use of one of my favourite tricks: using the tidyr package to `gather()` our independent variable columns, and then using `facet_*()` in our ggplot to split them into separate panels. For relevant examples, see here, here, or here.

Let’s recreate the last example plot, but separately for each of our predictor variables.

```
library(tidyr)
d %>%
  gather(key = "iv", value = "x", -mpg, -predicted, -residuals) %>%  # Get data into shape
  ggplot(aes(x = x, y = mpg)) +  # Note use of `x` here and next line
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free") +  # Split panels here by `iv`
  theme_bw()
```

Let’s try this out with another data set. We’ll use the `iris` data set, and regress `Sepal.Width` on all other variables (including the categorical variable, `Species`):

```
d <- iris
# Fit the model
fit <- lm(Sepal.Width ~ ., data = iris)
# Obtain predicted and residual values
d$predicted <- predict(fit)
d$residuals <- residuals(fit)
# Create plot
d %>%
  gather(key = "iv", value = "x", -Sepal.Width, -predicted, -residuals) %>%
  ggplot(aes(x = x, y = Sepal.Width)) +
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free") +
  theme_bw()
```

To make this plot, after the regression, the only change to our previous code was to change `mpg` to `Sepal.Width` in two places: the `gather()` and `ggplot()` lines.

We can now see how the actual and predicted values compare across our predictor variables. In case you’d forgotten, the coloured points are the actual data, and the white circles are the predicted values. With this in mind, we can see, as expected, that there is less variability in the predicted values than the actual values. It also appears that the sepal width of the setosa species is not as well accounted for as the other species.
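Incidentally, that reduced variability is guaranteed for least squares: the variance of the fitted values equals R-squared times the variance of the outcome. A quick check (my addition), repeating the fit from above:

```
d <- iris
fit <- lm(Sepal.Width ~ ., data = d)
d$predicted <- predict(fit)

# var(fitted) = R-squared * var(outcome), so it is always the smaller of the two
var(d$predicted)
var(d$Sepal.Width)
```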

To round this post off, let’s extend our approach to logistic regression. It’s going to require the same basic workflow, but we will need to extract predicted and residual values for the responses. Here’s an example predicting V/S (`vs`), which is 0 or 1, with `hp`:

```
# Step 1: Fit the data
d <- mtcars
fit <- glm(vs ~ hp, family = binomial(), data = d)
# Step 2: Obtain predicted and residual values
d$predicted <- predict(fit, type = "response")
d$residuals <- residuals(fit, type = "response")
# Steps 3 and 4: Plot the results
ggplot(d, aes(x = hp, y = vs)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()
```

If we only want to flag cases that would be scored as the incorrect category, we can do something like the following (with some help from the dplyr function, `filter()`):

```
ggplot(d, aes(x = hp, y = vs)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = .2) +
  geom_point() +
  # > This plots large red circles on misclassified points
  geom_point(data = d %>% filter(vs != round(predicted)),
             color = "red", size = 2) +
  # <
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()
```

I’ll leave it to you to combine this with instructions from the previous sections if you’d like to extend it to multiple logistic regression. But, hopefully, you should now have a good idea of the steps involved and how to create these residual visualisations!

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

To **leave a comment** for the author, please follow the link and comment on their blog: ** blogR**.

(This article was first published on ** rOpenSci Blog - R**, and kindly contributed to R-bloggers)

The R package ecosystem for natural language processing has been flourishing in recent days. R packages for text analysis have usually been based on the classes provided by the NLP or tm packages. Many of them depend on Java. But recently there have been a number of new packages for text analysis in R, most notably text2vec, quanteda, and tidytext. These packages are built on top of Rcpp instead of rJava, which makes them much more reliable and portable. And instead of the classes based on NLP, which I have never thought to be particularly idiomatic for R, they use standard R data structures. The text2vec and quanteda packages both rely on the sparse matrices provided by the rock solid Matrix package. The tidytext package is idiosyncratic (in the best possible way!) for doing all of its work in data frames rather than matrices, but a data frame is about as standard as you can get. For a long time when I would recommend R to people, I had to add the caveat that they should use Python if they were primarily interested in text analysis. But now I no longer feel the need to hedge.

Still there is a lot of duplicated effort between these packages on the one hand and a lot of incompatibilities between the packages on the other. The R ecosystem for text analysis is not exactly coherent or consistent at the moment.

My small contribution to the new text analysis ecosystem is the tokenizers package, which was recently accepted into rOpenSci after a careful peer review by Kevin Ushey. A new version of the package is on CRAN. (Also check out Jeroen Ooms's hunspell package, which is also part of rOpenSci.)

One of the basic tasks in any NLP pipeline is turning texts (which humans can read) into tokens (which machines can compute with). For example, you might break a text into words or into n-grams. Here is an example using the former slave interviews from the Great Depression era Federal Writers' Project. (A data package with those interviews is in development here).

```
# devtools::install_github("lmullen/WPAnarratives")
# install.packages("tokenizers")
library(WPAnarratives)
library(tokenizers)
text <- head(wpa_narratives$text, 5)
class(text)
## [1] "character"
words <- tokenize_words(text, lowercase = TRUE)
str(words)
## List of 5
## $ : chr [1:1141] "_he" "loved" "young" "marster" ...
## $ : chr [1:1034] "_old" "joe" "can" "keep" ...
## $ : chr [1:824] "_jesus" "has" "my" "chillun" ...
## $ : chr [1:779] "charity" "anderson" "who" "believes" ...
## $ : chr [1:350] "dat" "was" "one" "time" ...
ngrams <- tokenize_ngrams(text, n_min = 3, n = 5)
str(ngrams)
## List of 5
## $ : chr [1:3414] "_he loved young" "_he loved young marster" "_he loved young marster john_" "loved young marster" ...
## $ : chr [1:3093] "_old joe can" "_old joe can keep" "_old joe can keep his" "joe can keep" ...
## $ : chr [1:2463] "_jesus has my" "_jesus has my chillun" "_jesus has my chillun counted_" "has my chillun" ...
## $ : chr [1:2328] "charity anderson who" "charity anderson who believes" "charity anderson who believes she" "anderson who believes" ...
## $ : chr [1:1041] "dat was one" "dat was one time" "dat was one time when" "was one time" ...
```

Practically all text analysis packages provide their own functions for tokenizing text, so why do R users need this package?

First, these tokenizers are reasonably fast. The basic string operations are handled by the stringi package, which is quick while also doing the correct thing across encodings and locales. And Dmitriy Selivanov (author of the text2vec package) has written the n-gram and skip n-gram tokenizers in C++ so that those are fast too. It is probably possible to write tokenizers with better performance, but these are fast enough for even large scale text mining efforts.

The second and more important reason is that these tokenizers are consistent. They all take either a character vector of any length, or a list where each element is a character vector of length one. The idea is that each element of the input comprises a text. Then each function returns a list with the same length as the input vector, where each element in the list contains the tokens generated by the function. If the input character vector or list is named, then the names are preserved, so that the names can serve as identifiers.
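A tiny sketch (my own illustration, with made-up document names) of that name-preserving contract:

```
library(tokenizers)

# A named character vector: one element per document
docs <- c(doc1 = "The quick brown fox.",
          doc2 = "Jumped over the lazy dog.")

words <- tokenize_words(docs)
length(words)  # one list element per input text
names(words)   # the names "doc1" and "doc2" are preserved
```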

And third, the tokenizers are reasonably comprehensive, including functions for characters, lines, words, word stems, sentences, paragraphs, n-grams, skip n-grams, and regular expressions.
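For example (again my own illustration, not from the original post):

```
library(tokenizers)

txt <- "Good night, good night! Parting is such sweet sorrow."

tokenize_sentences(txt)     # one vector of sentences per input text
tokenize_characters(txt)    # individual characters
tokenize_ngrams(txt, n = 2) # bigrams
```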

My hope is that developers of other text analysis packages for R will rely on this package to provide tokenizers. (So far only tidytext has taken me up on that, but I also have to re-write my own textreuse package now.) But even if natural language packages do not take the package as a formal dependency, most packages let you pass in your own tokenizing functions. So users can reap the benefits of a consistent set of tokenizers by using the functions in this package. The success of the "tidyverse" has shown the power of buying into a convention for the structure of data and the inputs and outputs of functions. My hope is that the tokenizers package is a step in that direction for text analysis in R.

To **leave a comment** for the author, please follow the link and comment on their blog: ** rOpenSci Blog - R**.


(This article was first published on ** DataCamp Blog**, and kindly contributed to R-bloggers)

Are you teaching a course this semester that makes use of R? Now you can integrate DataCamp’s free interactive R courses and tutorials with all major learning management systems at no cost. Learn More.

Why is this exciting? Well, DataCamp’s autograding features can save you a considerable amount of time by tracking your students’ progress and automatically sending their scores to your LMS environment. In addition, you can use DataCamp’s group dashboard to set assignments, manage deadlines, download detailed stats on student progress, create leaderboards, and more!

Integration is easy and takes just 5 minutes to set up. Some of the available free courses are:

**Click here to get started with a free integration!**

PS: Interested in creating your own courses? LMS integration is also supported for your self-developed interactive R courses using https://www.datacamp.com/teach

To **leave a comment** for the author, please follow the link and comment on their blog: ** DataCamp Blog**.
