(This article was first published on **R – Giga thoughts …**, and kindly contributed to R-bloggers)

Note: To upload the CSV to Databricks, see the video "Upload Flat File to Databricks Table".

1a. Read CSV – R

[1] 347 12
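The R code that produced the shape above did not survive extraction. A minimal, self-contained sketch of the same read-then-dim pattern, using a tiny made-up CSV as a stand-in for the uploaded tendulkar.csv (the file name, columns and values here are assumptions, not the real data):

```r
# Toy stand-in for the uploaded file: write a tiny CSV, then read it back.
# In the actual notebook this would simply be read.csv("tendulkar.csv").
csv_path <- file.path(tempdir(), "tendulkar.csv")
write.csv(data.frame(Runs = c("15", "DNB", "59"),
                     BF   = c(24, NA, 172)),
          csv_path, row.names = FALSE)
tendulkar_demo <- read.csv(csv_path, stringsAsFactors = FALSE)
dim(tendulkar_demo)   # 3 rows, 2 columns in this toy example
```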

1b. Read CSV – SparkR

[1] 347 12
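The SparkR read step is likewise missing. A sketch under the assumption that the file was uploaded to the Databricks file store (the path is a guess; a Spark session must already be active, as it is on Databricks, or be started with `sparkR.session()`):

```r
library(SparkR)
# read.df returns a distributed SparkDataFrame rather than a local data frame
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv",
                      source = "csv", header = "true", inferSchema = "true")
dim(tendulkar1)
```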

2a. Data frame shape – R

# Get the shape of the dataframe in R
dim(tendulkar)

[1] 347 12

2b. Dataframe shape – SparkR

The same ‘dim’ command works in SparkR too!

dim(tendulkar1)

[1] 347 12

3a . Dataframe columns – R

# Get the names
names(tendulkar)
# Also colnames(tendulkar)

3b. Dataframe columns – SparkR

names(tendulkar1)

4a. Rename columns – R
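No code survives for this section. A minimal base-R sketch on a toy data frame (the column names mirror the dataset's abbreviations, but the values are made up):

```r
# Toy data frame standing in for the Tendulkar dataset
df <- data.frame(Runs = c(15, 59), BF = c(24, 172), Mins = c(28, 254))

# Rename BF -> BallsFaced and Mins -> Minutes with base R
names(df)[names(df) == "BF"]   <- "BallsFaced"
names(df)[names(df) == "Mins"] <- "Minutes"
names(df)
```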

4b. Rename columns – SparkR
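The SparkR code for this section is also missing. The SparkR equivalent would use `withColumnRenamed`, which renames one column at a time and returns a new SparkDataFrame (requires an active Spark session; column names assumed as above):

```r
library(SparkR)
# Rename BF -> BallsFaced and Mins -> Minutes on the SparkDataFrame
tendulkar1 <- withColumnRenamed(tendulkar1, "BF", "BallsFaced")
tendulkar1 <- withColumnRenamed(tendulkar1, "Mins", "Minutes")
names(tendulkar1)
```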

5a. Summary – R

summary(tendulkar)

5b. Summary – SparkR

summary(tendulkar1)

6a. Displaying details of dataframe with str() – R

str(tendulkar)

6b. Displaying details of dataframe with str() – SparkR

str(tendulkar1)

7a. Head & tail -R

head(tendulkar, 3)
tail(tendulkar, 3)

7b. Head – SparkR

head(tendulkar1,3)

8a. Determining the column types with sapply -R

sapply(tendulkar,class)

8b. Determining the column types with printSchema – SparkR

printSchema(tendulkar1)

9a. Selecting columns – R

library(dplyr)
df <- select(tendulkar, Runs, BallsFaced, Minutes)
head(df, 5)

  Runs BallsFaced Minutes
1   15         24      28
2  DNB         NA      NA
3   59        172     254
4    8         16      24
5   41         90     124

9b. Selecting columns – SparkR
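The SparkR select call is missing here. A sketch producing output of the shape shown below, assuming the columns are still named Runs, BF and Mins (as the output suggests; a Spark session is required):

```r
# SparkR::select returns a SparkDataFrame; collect() brings it to the driver
df <- SparkR::select(tendulkar1, "Runs", "BF", "Mins")
head(SparkR::collect(df), 6)
```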

  Runs  BF Mins
1   15  24   28
2  DNB   -    -
3   59 172  254
4    8  16   24
5   41  90  124
6   35  51   74

10a. Filter rows by criteria – R

library(dplyr)
df <- tendulkar %>% filter(Runs > 50)
head(df, 5)

10b. Filter rows by criteria – SparkR

df <- SparkR::filter(tendulkar1, tendulkar1$Runs > 50)
head(SparkR::collect(df))

11a. Unique values -R

unique(tendulkar$Runs)

11b. Unique values – SparkR

head(SparkR::distinct(tendulkar1[,"Runs"]),5)

  Runs
1 119*
2    7
3   51
4  169
5  32*

12a. Aggregate – Mean, min and max – R
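The R aggregation code is missing. A self-contained base-R sketch on a toy data frame (Ground and the numeric Runs values are assumptions; in the real data, non-numeric entries such as "DNB" would first need to be cleaned out):

```r
# Toy data standing in for the dataset
df <- data.frame(Ground = c("Perth", "Perth", "Sydney"),
                 Runs   = c(15, 59, 172))

# Mean, min and max of Runs per Ground with base-R aggregate()
stats <- aggregate(Runs ~ Ground, data = df,
                   FUN = function(x) c(mean = mean(x), min = min(x), max = max(x)))
stats
```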

12b. Aggregate- Mean, Min, Max – SparkR
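A SparkR sketch of the same aggregation (requires an active Spark session; assumes a Ground column exists and Runs has been cleaned to a numeric column):

```r
library(SparkR)
# Group by Ground and aggregate Runs; collect() pulls the small result to R
res <- SparkR::agg(SparkR::groupBy(tendulkar1, "Ground"),
                   meanRuns = mean(tendulkar1$Runs),
                   minRuns  = min(tendulkar1$Runs),
                   maxRuns  = max(tendulkar1$Runs))
head(SparkR::collect(res))
```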

13a. Using SQL with SparkR
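The SQL example itself is missing. The usual pattern is to register the SparkDataFrame as a temporary view and query it with `SparkR::sql` (requires an active Spark session; the view name and the Ground column in the query are assumptions):

```r
library(SparkR)
# Register the SparkDataFrame as a temp view, then query it with SQL
createOrReplaceTempView(tendulkar1, "tendulkar_tbl")
df <- SparkR::sql("SELECT Ground, count(*) AS innings
                   FROM tendulkar_tbl
                   GROUP BY Ground")
head(SparkR::collect(df))
```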

Conclusion

This post discusses some of the key constructs in R and SparkR and how one can transition from R to SparkR fairly easily. I will be adding more constructs later. Do check back!

You may also like

1. Exploring Quantum Gate operations with QCSimulator

2. Deep Learning from first principles in Python, R and Octave – Part 4

3. A Bluemix recipe with MongoDB and Node.js

4. Practical Machine Learning with R and Python – Part 5

5. Introducing cricketr! : An R package to analyze performances of cricketers

To see all posts click Index of posts

This post is a continuation of my earlier post Big Data-1: Move into the big league: Graduate from Python to Pyspark. While the earlier post discussed parallel constructs in Python and Pyspark, this post elaborates similar key constructs in R and SparkR. While this post focuses on the programming side of R and SparkR, it is essential to fully grasp the concepts of Spark, RDDs and how data is distributed across the cluster. Like the earlier post, this one shows that if you already have a good handle on R, you can easily graduate to Big Data with SparkR.

Note 1: This notebook has also been published at the Databricks community site: Big Data-2: Move into the big league: Graduate from R to SparkR.

Note 2: You can download this RMarkdown file from Github at Big Data- Python to Pyspark and R to SparkR.