I’ve Tried R for the First Time - How Bad Was It?

[This article was first published on r – Better Data Science, and kindly contributed to R-bloggers].

It’s not a secret that I’m a heavy Python user. Just take a look at my profile and you’ll find over 100 articles on Python itself, or Python in data science. Lately, I’ve been trying out a lot of new languages and technologies, with R being the one I resisted the most. Below you’ll find my discoveries, comparisons with Python, and overall opinion on the language itself.

The biggest reason I’ve ignored R for so long is my lack of information on the language. Everyone I know who used it presented it strictly as a statistical language. Statistics is essential for data science, but what’s the point of building a model if you can’t present it through a dashboard and deploy it as a REST API?

Those were my thoughts up to recently, but since then I discovered Shiny and Plumber, which basically solve the issues I had with R in the first place. That being said, there’s no point in avoiding the language anymore, and this article is the first of many in the R series.

Today we’ll compare R and Python in the process of exploratory data analysis and data visualization, both through code and through final outputs. I’m heavily Python-biased, but my conclusions may still surprise you. Keep reading to find out.

Anyhow, let’s start with the comparisons, shall we?

Exploratory data analysis

EDA is where data scientists spend the majority of their time, so an easy-to-write and easy-to-understand language is a must. I’m using external libraries in both languages — Pandas in Python and Tidyverse in R.

Dataset loading

We’ll use the MPG dataset for this article. It ships with R’s ggplot2 package, but Python has no built-in equivalent. To accommodate, I’ve exported the dataset from R as a CSV so we can start clean with both languages.

Here’s how to read a CSV file with R:

mpg <- read.csv('mpg.csv')
head(mpg)

The head function is used in R to see the first 6 rows, and the end result looks like this:

Let’s do the same with Python:

import pandas as pd
mpg = pd.read_csv('mpg.csv')

Great! Looks like we have an extra column (X in R and Unnamed: 0 in Python), so let’s remove those next.
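The extra column is most likely the row index that R’s CSV export wrote into the file. As a rough sketch, assuming that index sits in the first, unnamed column, pandas can read it as the index instead of as data by passing index_col=0. The inline CSV below is a small made-up stand-in for mpg.csv, not the real data:

```python
import io
import pandas as pd

# Made-up stand-in for mpg.csv: the first, unnamed column holds row numbers.
csv_text = '''"","displ","year","cyl","hwy","class"
"1",1.8,1999,4,29,"compact"
"2",2.8,1999,6,26,"compact"
"3",5.3,2008,8,20,"suv"
'''

# Default read: the unnamed first column shows up as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats that column as the row index instead of a data column.
mpg_sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(mpg_sample.columns))
```

If you read the file this way, there is nothing left to drop afterwards.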

Removing attributes

Here’s how to remove the unwanted X column in R:

library(tidyverse)  # provides %>% and select()
mpg <- mpg %>%
  select(-X)

And here’s the Python variant:

mpg.drop('Unnamed: 0', axis=1, inplace=True)

Specifying column names like variables (without quotation marks) is not something I’m most comfortable with, but it is what it is.

Filtering data

Let’s continue with something a bit more interesting — data filtering or subsetting. We’ll see how to select only those records where the number of cylinders cyl is 6.

With R:

head(mpg %>%
  filter(cyl == 6))

Keep in mind that the head function is only here so we don’t get a ton of output in the console. It is not part of the data filtering process.

And the same with Python:

mpg[mpg['cyl'] == 6].head()
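For readers coming from R, pandas also has a query method that takes the condition as a string, which feels a bit closer to filter(cyl == 6). This is only a sketch on a few hypothetical rows (mpg_sample is a stand-in, not the real dataset):

```python
import pandas as pd

# Hypothetical rows standing in for the mpg data.
mpg_sample = pd.DataFrame({
    'displ': [1.8, 2.8, 3.1, 5.3],
    'cyl': [4, 6, 6, 8],
    'hwy': [29, 26, 27, 20],
})

# Boolean-mask filtering, as used in the article.
by_mask = mpg_sample[mpg_sample['cyl'] == 6]

# query() takes the condition as a string, so the column name is written
# without indexing into the DataFrame.
by_query = mpg_sample.query('cyl == 6')

print(by_query)
```

Both forms select the same rows, so the choice is mostly a matter of taste.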

Awesome! Let’s see what else we can do.

Creating derived columns

We’ll create a boolean attribute is_newer, which is True if the car was made in 2005 or after, and False otherwise.

Here’s the R syntax:

head(mpg %>%
  mutate(is_newer = year >= 2005))

And here’s the same thing with Python:

mpg['is_newer'] = mpg['year'] >= 2005
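One difference worth noting: the Python line above modifies mpg in place, while mutate returns a new data frame. A closer pandas counterpart is probably assign, sketched here on a tiny hypothetical frame (mpg_sample and with_flag are illustrative names):

```python
import pandas as pd

# Hypothetical rows standing in for the mpg data.
mpg_sample = pd.DataFrame({'year': [1999, 2008, 2005], 'cyl': [4, 6, 8]})

# assign() returns a new DataFrame with the extra column;
# mpg_sample itself is left unchanged.
with_flag = mpg_sample.assign(is_newer=lambda df: df['year'] >= 2005)

print(with_flag)
```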

And that’s it for the EDA. Let’s make a brief conclusion on it next.

EDA final thoughts

It’s hard to pick a winner here since both languages are great. I repeat, it’s very strange for me not to put quotation marks around the column names, but that’s just something I’ll have to get used to.

Furthermore, I absolutely love the ease of chaining things in R. Here’s an example:

mpg <-
  read.csv('mpg.csv') %>%
  select(-X) %>% 
  filter(cyl == 6) %>%
  mutate(is_newer = year >= 2005) %>%
  select(displ, year, cyl, is_newer)

Here we basically did everything from above and more, all in a single command. Let’s proceed with the data visualization part.
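For comparison, pandas supports a similar style through method chaining. The sketch below mirrors the R pipeline step by step, with an inline made-up CSV standing in for mpg.csv (the rows and the mpg_six name are assumptions for illustration):

```python
import io
import pandas as pd

# Made-up stand-in for mpg.csv, including the unnamed index column.
csv_text = '''"","displ","year","cyl","hwy","class"
"1",1.8,1999,4,29,"compact"
"2",2.8,1999,6,26,"compact"
"3",3.1,2008,6,27,"compact"
"4",5.3,2008,8,20,"suv"
'''

mpg_six = (
    pd.read_csv(io.StringIO(csv_text))                 # read.csv(...)
    .drop(columns='Unnamed: 0')                        # select(-X)
    .query('cyl == 6')                                 # filter(cyl == 6)
    .assign(is_newer=lambda df: df['year'] >= 2005)    # mutate(...)
    .loc[:, ['displ', 'year', 'cyl', 'is_newer']]      # select(...)
)

print(mpg_six)
```

Wrapping the chain in parentheses lets each step sit on its own line, which is what makes it read like the %>% version.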

Data visualization

When it comes to data visualization, one thing is certain — Python doesn’t stand a chance! Well, at least if we’re talking about the default options for both languages. The following libraries were used for this comparison:

  • ggplot2 — for R
  • matplotlib — for Python

Let’s start with a simple scatter plot of engine displacement on the X-axis and highway MPG on the Y-axis.

Here’s the R syntax and results:

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

And for Python:

import matplotlib.pyplot as plt
plt.scatter(x=mpg['displ'], y=mpg['hwy'])

Neither of these looks particularly good out of the box, but R is miles ahead in this department, at least with the default styles.

Let’s now add some colors. The points should be colored according to the class attribute, so we can easily know where each type of car is located. 

Here’s the syntax and result for R:

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

It doesn’t get much easier than that, and the Python example below is a clear indicator. I haven’t managed to find an easy way to map out a categorical variable as colors (at least with Matplotlib), so here’s what I ended up with:

def get_color(car_class):
    # Map each car class to a fixed, hand-picked color
    colors = {
        'compact'   : 'brown',
        'midsize'   : 'green',
        'suv'       : 'pink',
        '2seater'   : 'red',
        'minivan'   : 'teal',
        'pickup'    : 'blue',
        'subcompact': 'purple'
    }
    return colors[car_class]

# One color per row, looked up from the car's class
colors = mpg['class'].apply(get_color)

plt.scatter(x=mpg['displ'], y=mpg['hwy'], c=colors)

All of that work for a not-so-appealing chart. Point for R.
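One possible shortcut, offered as a sketch rather than the canonical Matplotlib way: group the data by class and draw one scatter call per group. Matplotlib then picks the next color from its default cycle for each call, and the label feeds a legend. The mpg_sample rows are hypothetical stand-ins for the real data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rows standing in for the mpg data.
mpg_sample = pd.DataFrame({
    'displ': [1.8, 2.8, 5.3, 5.7, 2.0],
    'hwy': [29, 26, 20, 17, 31],
    'class': ['compact', 'compact', 'suv', 'suv', 'midsize'],
})

fig, ax = plt.subplots()

# One scatter call per class: each call takes the next color from the
# default color cycle, and the label is picked up by the legend.
for car_class, group in mpg_sample.groupby('class'):
    ax.scatter(group['displ'], group['hwy'], label=car_class)

ax.legend(title='class')
```

This drops the hand-written color dictionary, at the cost of not controlling which class gets which color.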

Let’s now finalize the chart by adding a title and labels for axes. Here’s how to do it in R:

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) + 
  geom_point(size = 3) + 
  labs(title = 'Engine displacement vs. Highway MPG',
       x = 'Engine displacement (liters)',
       y = 'Highway miles per gallon')

Again, fairly straightforward syntax, and the chart looks amazing (well, kind of).

Here’s how to do the same with Python:

plt.scatter(x=mpg['displ'], y=mpg['hwy'], c=colors, s=75)
plt.title('Engine displacement vs. Highway MPG')
plt.xlabel('Engine displacement (liters)')
plt.ylabel('Highway miles per gallon')

It’s up to you to decide which looks better, but R is a clear winner in my opinion. Visualizations can be tweaked, of course, but I deliberately wanted to use the default libraries for both languages. I know that Seaborn looks better, so there’s no point in telling me that in the comment section.

And that about does it for this article. Let’s wrap things up in the next section.

Final thoughts

This was a rather quick comparison of R and Python in the realm of data science. Choosing one over the other isn’t a simple task, as both are great. Among the two, Python is considered to be a general-purpose language, so it’s the only viable option if you want to build software with data science, and not work directly in data science.

You can’t go wrong with either, especially now that I know R supports dashboards, web scraping, and API development. More articles like this one are to come, guaranteed.

Thanks for reading.
