First mlverse survey results – software, applications, and beyond

Posted on February 16, 2021 by Sigrid Keydana in R bloggers

[This article was first published on RStudio AI Blog, and kindly contributed to R-bloggers].

Thank you to everyone who participated in our first mlverse survey!

Wait: What even is the mlverse?

The mlverse originated as an abbreviation of multiverse¹, which, for its part, came into being as a deliberate allusion to the well-known tidyverse. As such, although mlverse software aims for seamless interoperability with the tidyverse, or even integration when feasible (see our recent post featuring a wholly tidymodels-integrated torch network architecture), the priorities are probably a bit different: Often, mlverse software’s raison d’être is to allow R users to do things that are more commonly done in other languages, such as Python.

As of today, mlverse development takes place mainly in two broad areas: deep learning, and distributed computing / ML automation. By its very nature, though, it is open to changing user interests and demands. Which leads us to the topic of this post.

The survey

GitHub issues and community questions are valuable feedback, but we wanted something more direct. We wanted a way to find out how you, our users, employ the software, and what for; what you think could be improved; what you wish existed but is not there (yet). To that end, we created a survey. Complementing software- and application-related questions for the above-mentioned broad areas, the survey had a third section, asking about how you perceive ethical and social implications of AI as applied in the “real world”.

A few things upfront:

Firstly, the survey was completely anonymous, in that we asked for neither identifiers (such as e-mail addresses) nor things that render one identifiable, such as gender or geographic location. In the same vein, we had collection of IP addresses disabled on purpose.

Secondly, just as GitHub issues are a biased sample, so must this survey’s participants be. The main venues of promotion were rstudio::global, Twitter, LinkedIn, and RStudio Community. As this was the first time we did such a thing (and under significant time constraints), not everything was planned to perfection, neither wording-wise nor distribution-wise. Nevertheless, we got a lot of interesting, helpful, and often very detailed answers, and for the next time we do this, we’ll have our lessons learned!

Thirdly, all questions were optional, naturally resulting in different numbers of valid answers per question. On the other hand, not having to select a bunch of “not applicable” boxes freed respondents to spend time on topics that mattered to them.

As a final pre-remark, most questions allowed for multiple answers.

In sum, we ended up with 138 completed² surveys. Thanks again to everyone who participated, and especially, thank you for taking the time to answer the many free-form questions!

Deep learning

Areas and applications

Our first goal was to find out in which settings, and for what kinds of applications, deep-learning software is being used.

Overall, 72 respondents reported using DL in their jobs in industry. The other settings were academia (23), studies (21), spare time (43), and not-actually-using-but-wanting-to (24).

Of those working with DL in industry, more than twenty said they worked in consulting, finance, and healthcare (each). IT, education, retail, pharma, and transportation were each mentioned more than ten times:

Figure 1: Number of users who reported using DL in industry. Smaller groups not displayed.

In academia, dominant fields (as per survey participants) were bioinformatics, genomics, and IT, followed by biology, medicine, pharmacology, and social sciences:

Figure 2: Number of users who reported using DL in academia. Smaller groups not displayed.

What application areas matter to larger subgroups of “our” users? Nearly a hundred (of 138!) respondents said they used DL for some kind of image-processing application (including classification, segmentation, and object detection). Next up was time-series forecasting, followed by unsupervised learning.

The popularity of unsupervised DL was a bit unexpected; had we anticipated this, we would have asked for more detail here. So if you’re one of the people who selected this – or if you didn’t participate, but do use DL for unsupervised learning – please let us know a bit more in the comments!

Next, NLP was about on par with the former; followed by DL on tabular data, and anomaly detection. Bayesian deep learning, reinforcement learning, recommendation systems, and audio processing were still mentioned frequently.

Figure 3: Applications deep learning is used for. Smaller groups not displayed.

Frameworks and skills

We also asked what frameworks and languages participants were using for deep learning, and what they were planning on using in the future. Single-time mentions (e.g., deeplearning4J) are not displayed.

Figure 4: Framework / language used for deep learning. Single mentions not displayed.

An important thing for any software developer or content creator to investigate is the level of proficiency and expertise present in their audience. It (nearly) goes without saying that expertise is very different from self-reported expertise. I’d like to be very cautious, then, in interpreting the results below.

While with regard to R skills³, the aggregate self-ratings look plausible (to me), I would have guessed a slightly different outcome re DL. Judging from other sources (like, e.g., GitHub issues), I tend to suspect more of a bimodal distribution (a far stronger version of the bimodality we’re already seeing, that is). To me, it seems like we have rather many users who know a lot about DL. In agreement with my gut feeling, though, is the bimodality itself – as opposed to, say, a Gaussian shape.

But of course, sample size is moderate, and sample bias is present.

Figure 5: Self-rated skills re R and deep learning.

Wishes and suggestions

Now, to the free-form questions. We wanted to know what we could do better.

I’ll address the most salient topics in order of frequency of mention.⁴ For DL, this is surprisingly easy (as opposed to Spark, as you’ll see).

“No Python”

The number one concern with deep learning from R, for survey respondents, clearly has to do not with R but with Python. This topic appeared in various forms, the most frequent being frustration over how hard it can be, depending on the environment, to get Python dependencies for TensorFlow/Keras right. (It also appeared as enthusiasm for torch, which we are very happy about.)

Let me clarify and add some context.

TensorFlow is a Python framework (nowadays subsuming Keras, which is why I’ll be addressing both of those as “TensorFlow” for simplicity) that is made available from R through the packages tensorflow and keras. As with other Python libraries, objects are imported and made accessible via reticulate. While tensorflow provides the low-level access, keras brings idiomatic-feeling, nice-to-use wrappers that let you forget about the chain of dependencies involved.

On the other hand, torch, a recent addition to mlverse software, is an R port of PyTorch that does not delegate to Python. Instead, its R layer directly calls into libtorch, the C++ library behind PyTorch. In that way, it is like a lot of heavy-duty R packages that make use of C++ for performance reasons.
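As a small illustration of what "no Python" means in practice, here is a minimal sketch using the torch package. It assumes torch and its libtorch backend are installed; the values are made up for the example, and nothing in it goes through reticulate.

```r
library(torch)

# Create a tensor that tracks gradients. This is handled by libtorch,
# with no Python interpreter involved.
x <- torch_tensor(c(1, 2, 3), requires_grad = TRUE)

# A scalar function of x: the sum of squares.
y <- (x^2)$sum()

# Backpropagate; the gradient of sum(x^2) with respect to x is 2 * x.
y$backward()

# Convert the gradient back to a plain R vector.
grad <- as_array(x$grad)
grad
```

The same session would work on a machine without any Python installation, which is the property survey respondents were asking for.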

Now, this is not the place for recommendations. Here are a few thoughts though.

Clearly, as one respondent remarked, as of today the torch ecosystem does not offer functionality on par with TensorFlow, and for that to change, time and (hopefully! more on that below) your, the community’s, help is needed. Why? Because torch is so young, for one; but there is also a “systemic” reason! With TensorFlow, as we can access any symbol via the tf object, it is always possible, if inelegant, to do from R what you see done in Python. With the respective R wrappers nonexistent⁵, quite a few blog posts (see, e.g., https://blogs.rstudio.com/ai/posts/2020-04-29-encrypted_keras_with_syft/, or A first look at federated learning with TensorFlow) relied on this! With torch, which does not go through Python, no such fallback exists.
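To make the point about the tf object concrete, here is a minimal sketch. It assumes the tensorflow R package and a working TensorFlow 2.x Python installation; the choice of tf.math.cumsum is arbitrary and only stands in for "a Python function you saw in some tutorial".

```r
library(tensorflow)

# `tf` is the reticulate handle to the Python `tensorflow` module;
# R's `$` plays the role of Python's `.`.
x <- tf$constant(c(1, 2, 3, 4))

# No dedicated R wrapper is needed: this calls Python's tf.math.cumsum.
y <- tf$math$cumsum(x)

# In eager mode, numpy() brings the result back as an R array.
result <- y$numpy()
result
```

The code reads almost like its Python counterpart, which is both the convenience and the inelegance mentioned above.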

Switching to the topic of tensorflow’s Python dependencies causing problems with installation, my experience (from GitHub issues, as well as my own) has been that difficulties are quite system-dependent. On some OSes, complications seem to appear more often than on others; and low-control (to the individual user) environments like HPC clusters can make things especially difficult. In any case though, I have to (unfortunately) admit that when installation problems appear, they can be very tricky to solve.

tidymodels integration

The second most frequent mention clearly was the wish for tighter tidymodels integration. Here, we wholeheartedly agree. As of today, there is no automated way to accomplish this for torch models generically, but it can be done for specific model implementations.

Last week’s post, “torch, tidymodels, and high-energy physics”, featured the first tidymodels-integrated torch package. And there’s more to come. In fact, if you are developing a package in the torch ecosystem, why not consider doing the same? Should you run into problems, the growing torch community will be happy to help.

Documentation, examples, teaching materials

Thirdly, several respondents expressed the wish for more documentation, examples, and teaching materials. Here, the situation is different for TensorFlow than for torch.

For tensorflow, the website has a multitude of guides, tutorials, and examples. For torch, reflecting the discrepancy in respective lifecycles, materials are not that abundant (yet). However, after a recent refactoring, the website has a new, four-part Get started section addressed to both beginners in DL and experienced TensorFlow users curious to learn about torch. After this hands-on introduction, a good place to get more technical background would be the section on tensors, autograd, and neural network modules.

Truth be told, though, nothing would be more helpful here than contributions from the community. Whenever you solve even the tiniest problem (which is often how things appear to oneself), consider creating a vignette explaining what you did. Future users will be thankful, and a growing user base means that over time, it’ll be your turn to find that some things have already been solved for you!

Community, community, community

The remaining items discussed didn’t come up quite as often (individually), but taken together, they all have something in common: They all are wishes we happen to have, as well!

This definitely holds in the abstract – let me cite:

“Develop more of a DL community”

“Larger developer community and ecosystem. Rstudio has made great tools, but for applied work is has been hard to work against the momentum of working in Python.”

We wholeheartedly agree, and building a larger community is exactly what we’re trying to do. I like the formulation “a DL community” insofar as it is framework-independent. In the end, frameworks are just tools, and what counts is our ability to usefully apply those tools to problems we need to solve.

Concrete wishes include

More paper/model implementations (such as TabNet).
Facilities for easy data reshaping and pre-processing (e.g., in order to pass data to RNNs or 1d convnets in the expected 3-d format).
Probabilistic programming for torch (analogously to TensorFlow Probability).
A high-level library (such as fast.ai) based on torch.
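To illustrate the kind of reshaping the second wish refers to, here is a minimal base-R sketch. The helper make_windows() and the toy data are hypothetical, not part of any mlverse package; the target layout (samples, timesteps, features) is the one Keras RNN and 1d convolution layers expect, while other frameworks may order the dimensions differently.

```r
# Hypothetical helper: cut a (time x features) matrix into overlapping
# windows and stack them into a (samples, timesteps, features) array.
make_windows <- function(series, timesteps) {
  stopifnot(is.matrix(series), nrow(series) >= timesteps)
  n_samples <- nrow(series) - timesteps + 1
  out <- array(NA_real_, dim = c(n_samples, timesteps, ncol(series)))
  for (i in seq_len(n_samples)) {
    # Window i covers rows i .. i + timesteps - 1 of the original series.
    out[i, , ] <- series[i:(i + timesteps - 1), , drop = FALSE]
  }
  out
}

# Toy data: 6 time steps, 2 features.
series <- cbind(temp = 1:6, humidity = 11:16)

x <- make_windows(series, timesteps = 3)
dim(x)
```

With 6 time steps and a window length of 3, this gives 4 windows of 3 timesteps over 2 features. Even this small helper shows why respondents asked for ready-made facilities: the index bookkeeping is easy to get wrong.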

In other words, there is a whole cosmos of useful things to create; and no small group alone can do it. This is where we hope we can build a community of people, each contributing what they’re most interested in, and to whatever extent they wish.