R or Python for Data Science?

January 16, 2017
(This article was first published on R – Greetz to Geeks, and kindly contributed to R-bloggers)

The answer to the question "R or Python for data science?" depends mainly on the problem to be solved, the tools required to solve it, and your personal preference.

Python is a general-purpose programming language created by Guido van Rossum in 1991. R was created a few years later by Ross Ihaka and Robert Gentleman, with statisticians in mind.

R has a steep learning curve, which makes it somewhat difficult for beginners, but once the basics are clear the advanced material is easy to pick up. Python's simplicity and readability, on the other hand, give it a relatively gentle learning curve and make it a good choice for beginners.

In R, the same functionality can often be written in several different ways. Python, by contrast, tends to favor a single obvious way of doing things.

RStudio is the most popular IDE for R. Spyder, IPython Notebook and Eric are some of the IDEs for Python. Both R and Python have a huge number of reliable libraries. CRAN is the biggest repository of R packages, while PyPI is the Python repository.

Popular libraries in R include caret, dplyr, data.table, zoo, ggplot2, ggvis, stringr and lattice. Libraries like pandas, scikit-learn, SciPy, NumPy and matplotlib make Python more attractive. Both R and Python have good support and documentation.

When it comes to data visualization, R has the upper hand over Python. ggplot2 and ggvis are two excellent visualization packages in R.

Here are a few code examples from both languages that produce the same results.

To import a .csv dataset,
R:
dataset_name <- read.csv("dataset_name.csv")

Python:
import pandas
dataset_name = pandas.read_csv("dataset_name.csv")

To find the dimensions of the dataset,
R:
dim(dataset_name)

Python:
dataset_name.shape

To obtain the first n observations of a data frame (if n is omitted, R returns 6 rows and pandas returns 5),
R:
head(dataset_name, n)

Python:
dataset_name.head(n)
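
The Python steps so far (import, dimensions, first rows) can be tried without a file on disk. The following is only a minimal sketch, assuming pandas is installed. The CSV text, its column names and the variable csv_text are invented for illustration, and io.StringIO stands in for the real dataset_name.csv.

```python
import io

import pandas

# Made-up CSV content standing in for "dataset_name.csv" (illustrative only).
csv_text = """id,height,weight
1,150,52
2,160,58
3,170,65
4,180,80
5,175,72
6,165,60
7,155,55
"""

# read_csv accepts a file path or any file-like object.
dataset_name = pandas.read_csv(io.StringIO(csv_text))

# shape is an attribute (no parentheses): (number of rows, number of columns).
n_rows, n_cols = dataset_name.shape

# head(n) returns the first n rows; with no argument pandas returns 5.
first_three = dataset_name.head(3)
default_head = dataset_name.head()

print(dataset_name.shape)
print(first_three)
```

With a real file, the only change would be passing the file name to read_csv instead of the StringIO object.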

To split the dataset into training and test sets (75% and 25% here),
R:
RowCount <- floor(0.75 * nrow(dataset_name))
set.seed(123)
trainIndex <- sample(1:nrow(dataset_name), RowCount)
train <- dataset_name[trainIndex,]
test <- dataset_name[-trainIndex,]

Python:
train = dataset_name.sample(frac=0.75, random_state=1)
test = dataset_name.loc[~dataset_name.index.isin(train.index)]
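
The Python split above can be checked on a small example. This is a sketch under assumptions: the 20-row data frame and its column names are made up, and pandas is assumed to be installed. It also shows DataFrame.drop as an alternative way of building the test set, which selects the same rows as long as the index labels are unique.

```python
import pandas

# Made-up 20-row dataset (illustrative only).
dataset_name = pandas.DataFrame({"x": range(20), "y": [v * 2 for v in range(20)]})

# Draw 75% of the rows at random; random_state makes the draw repeatable.
train = dataset_name.sample(frac=0.75, random_state=1)

# Every row whose index label is not in the training set goes to the test set.
test = dataset_name.loc[~dataset_name.index.isin(train.index)]

# Alternative: drop the training rows by index label.
test_alt = dataset_name.drop(train.index)

# The two sets should not overlap and should cover the whole dataset.
overlap = set(train.index) & set(test.index)

print(len(train), len(test), len(overlap))
```

Note that both approaches identify rows by index label, so a dataset with duplicate index labels would need dataset_name.reset_index(drop=True) first.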

R is more functional in nature and has a lot of built-in data analysis features. Python, on the other hand, is an object-oriented language that relies mostly on packages for data analysis. For data science both languages are important, and the choice between them is up to the data analyst. If you know both, you are definitely ahead of many others in this field.
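
As a small illustration of the point about packages: R ships with summary() for a quick statistical overview of a data frame, while in Python a comparable overview typically comes from a package such as pandas. The sketch below assumes pandas is installed, and the height column and its values are made up.

```python
import pandas

# Made-up measurements (illustrative only).
dataset_name = pandas.DataFrame({"height": [150, 160, 170, 180]})

# describe() reports count, mean, std, min, quartiles and max for each
# numeric column, roughly what summary() gives out of the box in R.
stats = dataset_name.describe()

print(stats)
```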
