This is not the usual R vs Python post you can find online; in fact, I won’t discuss whether one is better than the other. I will show you why someone who wants to learn data science has an advantage when starting with R.

Vectors

What are vectors? If you know matrices, you know vectors. They can be seen as the rows or columns of a matrix, so what we have is a one-dimensional “list” of numbers. Vectors are usually used as the columns of data frames, because that guarantees that all the data in a column has the same type.

Float, integer, string, categorical, and so on: a vector always has exactly one type. This matters because it can make our code faster and clearer: the interpreter only has to check the type once for the whole vector instead of once per record. As you may know, vectors are native in R; in fact, even a scalar is a vector of length one.

vec <- c(5, 3, 4)

class(vec)
[1] "numeric"

class(3)
[1] "numeric"
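
For contrast, a plain Python list does not enforce a single type, while the standard library’s array module does. The sketch below is only an illustration of the “one type per vector” idea; the variable names are made up for this example.

```python
from array import array

# A plain list happily mixes types, so each element has to be checked on its own.
mixed = [5, "3", 4.0]
mixed_types = [type(x).__name__ for x in mixed]

# A typed array fixes the element type up front ('d' = double-precision float).
typed = array("d", [5, 3, 4])

# Appending a value of the wrong type is rejected with a TypeError.
try:
    typed.append("6")
    rejected = False
except TypeError:
    rejected = True

print(mixed_types)     # ['int', 'str', 'float']
print(typed.typecode)  # d
print(rejected)        # True
```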

Vectorization

When performing data analysis or machine learning, I will often work with data in a tabular format, or at a lower level, with a series of vectors. If I want to multiply every record in a vector by 2 it’s pretty natural to do:

vec * 2
[1] 10  6  8

In Python you can use lists to store your vectors, so let’s try the same with Python 3 (having to worry about 2 vs 3 is a whole other issue):

>>> [5, 3, 4] * 2
[5, 3, 4, 5, 3, 4]

WAT…

It turns out that, in native Python, getting the same result means looping over the elements yourself, for instance with a for loop:

>>> for num in [5, 3, 4]:
...     num * 2
...
10
6
8

You may want to store the result in a list, like the input, so you have to initialize an empty list outside the loop and append the results to it:

>>> res = []
>>> for num in [5, 3, 4]:
...     res.append(num * 2)
...
>>> print(res)
[10, 6, 8]
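
The explicit loop can also be written as a list comprehension, which is the more idiomatic Python form. It is still a loop over the elements rather than a vectorized operation; this is a minimal sketch using the same numbers as above.

```python
nums = [5, 3, 4]

# A list comprehension builds the result list in one expression,
# but Python still visits the elements one at a time.
res = [num * 2 for num in nums]

print(res)  # [10, 6, 8]
```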

The same code in R would be:

vec <- c(5, 3, 4) * 2
vec
[1] 10  6  8

I would stress that this is less about typing and more about forming the “right” mental model. Many people complain that their R code is slow; most of the time this is because they did not vectorize it and coded “Python style” instead, with loops, either hidden or explicit.
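
To make that mental model concrete from the Python side, here is a hypothetical sketch of a list subclass whose * operator works elementwise. The Vec class is invented for this illustration and is not part of any library; it only shows what “the operator applies to the whole vector” means, and unlike R it refuses vectors of different lengths instead of recycling them.

```python
class Vec(list):
    """Hypothetical list subclass with elementwise multiplication (illustration only)."""

    def __mul__(self, other):
        # Scalar case: multiply every element by the same number.
        if isinstance(other, (int, float)):
            return Vec(x * other for x in self)
        # Vector case: multiply matching positions; lengths must agree.
        if len(self) != len(other):
            raise ValueError("length mismatch")
        return Vec(x * y for x, y in zip(self, other))


vec = Vec([5, 3, 4])
print(vec * 2)               # [10, 6, 8]
print(vec * Vec([1, 0, 2]))  # [5, 0, 8]
print([5, 3, 4] * 2)         # [5, 3, 4, 5, 3, 4]  (plain list: repetition)
```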

Random Walk Example

We will perform a random walk in R and Python; for the latter, the examples are taken from the book “From Python to NumPy”.

Let’s start from the most basic approach by looping:

>>> import random # random module needed

>>> def random_walk(n):
...     position = 0  # initialize the position variable
...     walk = [position]  # initialize a list
...     for i in range(n):
...         position += 2*random.randint(0, 1)-1 # update position value
...         walk.append(position)  # append results to walk list
...     return walk
...

This code can get slow for very long walks; we can improve it by using the itertools module:

>>> from itertools import accumulate
>>> import random

>>> def random_walk_faster(n=1000):
...     steps = random.choices([1, -1], k=n)  # n independent +1/-1 steps (Python 3.6+)
...     return list(accumulate(steps))
...

Anyway, this isn’t vectorized yet. It’s just a more efficient way to loop. To reach full vectorization we need NumPy:

>>> import numpy as np

>>> def random_walk_fastest(n=1000):
...     steps = 2*np.random.randint(0, 2, size=n) - 1
...     return np.cumsum(steps)
...

Take a close look at the functions that come from NumPy: np.random.randint and np.cumsum each work on the whole array at once.

The same in R:

rw <- cumsum(sample(c(-1, 1), 1000, TRUE))

No imports, no real need to define a function or a method, and the code fits in one line.

Conclusion

If you want to be a data “something”, or if you want to teach someone, start with R. Once you are confident with R, move on to Python.

The post Learn R before Python appeared first on rDisorder.