531 search results for "hadoop"

Data Manipulation with sparklyr on Azure HDInsight

November 8, 2016

by Ali Zaidi, Data Scientist at Microsoft. Apache Spark and a Tale of APIs: Spark is an exceptionally popular processing engine for distributed data. Dealing with data in distributed storage and programming with concurrent systems often requires learning complicated new paradigms and techniques. Statisticians and data scientists familiar with R are unlikely to have much experience with such...

Read more »

sparklyr: a test drive on YARN


sparklyr is a new R front-end for Apache Spark, developed by the good people at RStudio. It offers much more functionality than the existing SparkR interface by Databricks, allowing dplyr-based data transformations as well as access to the machine learning libraries of both Spark and H2O Sparkling Water. Moreover, the latest RStudio IDE v1.0 now offers native...

Read more »

Using rstudio and sparklyr with an apache cluster on Google DataProc

October 23, 2016

Last week, I came across sparklyr. Authored by the folks at RStudio, it allows you to integrate your R workflow (and, more importantly, your dplyr workflow) with Apache Spark. In one of the examples on the sparklyr home page, the author shows how to se...

Read more »

Building Scalable Data Pipelines with Microsoft R Server and Azure Data Factory

October 4, 2016

by Udayan Kumar, Data Scientist at Microsoft. Beginning in 2016, Microsoft rolled out a preview of Microsoft R Server (MRS) for Azure HDInsight clusters. This service provides a preconfigured instance of R Server with Spark/Hadoop that can be provisioned within minutes. Recent blog posts (by Max Kaznady and David Smith) have highlighted how to use and tune this service...

Read more »

GoodReads: Machine Learning (Part 3)

September 30, 2016

In the first installment of this series, we scraped reviews from Goodreads. In the second one, we performed exploratory data analysis and created new variables. We are now ready for the “main dish”: machine learning!

Setup and general data prep

Let’s start by loading the libraries and our dataset.

library(data.table)
library(dplyr)
library(caret)
library(RTextTools)
library(xgboost)
library(ROCR)

Read more »

sparklyr — R interface for Apache Spark

September 27, 2016

We’re excited today to announce sparklyr, a new package that provides an interface between R and Apache Spark. Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include: Interactively manipulate Spark data using

Read more »

Machine Learning for Drug Adverse Event Discovery

September 26, 2016

We can use unsupervised machine learning to identify which drugs are associated with which adverse events. Specifically, machine learning can help us create clusters based on gender, age, outcome of the adverse event, the route by which the drug was administered, the purpose the drug was used for, body mass index, etc. This can help in quickly discovering hidden associations...

Read more »

Learning Statistics on Youtube

September 19, 2016

YouTube.com is the second most accessed website in the world (surpassed only by its parent, google.com). It has a whopping 1 billion unique views a month. It is a force to be reckoned with. On the video-sharing platform, there are many brilliant and hard-working content creators producing high-quality, free educational videos...

Read more »

A few thoughts on the existing code parallelization

September 17, 2016

A few weeks ago I worked on parallelizing some old code. The whole process made me think about how efficient parallelizing existing R code can really be, and what should be considered efficient. There is a lot …

Read more »

GoodReads: Exploratory data analysis and sentiment analysis (Part 2)

September 14, 2016

After scraping reviews from Goodreads in the first installment of this series, we are now ready to do some exploratory data analysis to get a better sense of the data we have. This will also allow us to create features that we will use in future analyses.

Setup and data preparation

We start by loading...

Read more »
