533 search results for "Hadoop"

Visualizing taxi trips between NYC neighborhoods with Spark and Microsoft R Server

December 14, 2016

by Ali Zaidi, Data Scientist at Microsoft. In a previous post we showcased the use of the sparklyr package for manipulating large datasets using a familiar dplyr syntax on top of Spark HDInsight clusters. In this post, we will take a look at the RxSpark API for R, part of the RevoScaleR package and the Microsoft R Server distribution of...


Analysis of software developers in New York, San Francisco, London and Bangalore

December 1, 2016

(Note: Cross-posted with the Stack Overflow Blog.) When I tell someone Stack Overflow is based in New York City, they’re often surprised: many people assume it’s in San Francisco. (I’ve even seen job applications with “I’m in New York but willing to relocate to San Francisco” in the cover letter.) San Francisco is a safe guess of where an...


Data Manipulation with sparklyr on Azure HDInsight

November 8, 2016

by Ali Zaidi, Data Scientist at Microsoft. Apache Spark and a Tale of APIs: Spark is an exceptionally popular processing engine for distributed data. Dealing with data in distributed storage and programming with concurrent systems often requires learning complicated new paradigms and techniques. Statisticians and data scientists familiar with R are unlikely to have much experience with such...


sparklyr: a test drive on YARN


sparklyr is a new R front-end for Apache Spark, developed by the good people at RStudio. It offers much more functionality than the existing SparkR interface by Databricks, allowing dplyr-based data transformations as well as access to the machine learning libraries of both Spark and H2O Sparkling Water. Moreover, the latest RStudio IDE v1.0 now offers native...


Using rstudio and sparklyr with an apache cluster on Google DataProc

October 23, 2016

Last week, I came across sparklyr. Authored by the folks at RStudio, it allows you to integrate your R workflow (and, more importantly, your dplyr workflow) with Apache Spark. In one of the examples on the sparklyr home page, the author shows how to se...


Building Scalable Data Pipelines with Microsoft R Server and Azure Data Factory

October 4, 2016

by Udayan Kumar, Data Scientist at Microsoft. Beginning in 2016, Microsoft rolled out a preview of Microsoft R Server (MRS) for Azure HDInsight clusters. This service provides a preconfigured instance of R Server with Spark/Hadoop that can be provisioned within minutes. Recent blog posts (by Max Kaznady and David Smith) have highlighted how to use and tune this service...


GoodReads: Machine Learning (Part 3)

September 30, 2016

In the first installment of this series, we scraped reviews from Goodreads. In the second one, we performed exploratory data analysis and created new variables. We are now ready for the “main dish”: machine learning! Setup and general data prep: Let’s start by loading the libraries and our dataset: library(data.table), library(dplyr), library(caret), library(RTextTools), library(xgboost), library(ROCR)...


sparklyr — R interface for Apache Spark

September 27, 2016

We’re excited today to announce sparklyr, a new package that provides an interface between R and Apache Spark. Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include: Interactively manipulate Spark data using


Machine Learning for Drug Adverse Event Discovery

September 26, 2016

We can use unsupervised machine learning to identify which drugs are associated with which adverse events. Specifically, machine learning can help us create clusters based on gender, age, outcome of the adverse event, route by which the drug was administered, purpose the drug was used for, body mass index, etc. This can help in quickly discovering hidden associations...


Learning Statistics on Youtube

September 19, 2016

YouTube.com is the second most accessed website in the world (surpassed only by its parent, google.com). It has a whopping 1 billion unique views a month. It is a force to be reckoned with. On the video-sharing platform, there are many brilliant and hard-working content creators producing high-quality and free educational videos...

