Mastering the art of Data Preparation II

September 15, 2016

(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)

It is not the beauty of a building you should look at; it is the construction of the foundation that will stand the test of time. ~ David Allan Coe

This week we focus on getting data from MySQL, HDF5, APIs, and the web.

Extracting Data from MySQL

MySQL is one of the most widely used open-source databases. The data is stored in tabular format (similar to data frames in R), with rows representing records and columns representing variables.

Connecting to MySQL

> install.packages("RMySQL") # install the required package
> library(RMySQL) # load the package into R
> ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu") # connect to the server
> result <- dbGetQuery(ucscDb, "show databases;") # run a query and collect the result
> dbDisconnect(ucscDb) # disconnect from the server
Displaying the top 6 rows extracted from MySQL

Accessing tables and fields within a database

> DbCon <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu", db="ailMel1") # connect to the database named ailMel1
> Tables <- dbListTables(DbCon) # fetch the tables within the database
> Tables # display the tables
List of tables in the database ailMel1
> dbListFields(DbCon, "nestedRepeats") # list all the fields within a table
Fields of table nestedRepeats

Extracting a table directly

> DbCon <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu", db="ailMel1") # connect to the database
> Data <- dbReadTable(DbCon, "nestedRepeats") # read the whole table into a data frame
> head(Data) # display the top 6 rows
Displays the top 6 rows of table nestedRepeats

Using a SELECT clause to fetch rows

> DbCon <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu", db="ailMel1") # connect to the database
> query <- dbSendQuery(DbCon, "select * from chainCanFam2;") # send the query to the server
> Result <- fetch(query) # fetch the result
> dbClearResult(query) # free the result set on the server
> head(Result) # display the top 6 rows
Displays the top 6 rows of table chainCanFam2
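Fetching everything at once can be wasteful for large tables. A minimal sketch of a gentler pattern, letting MySQL limit the result and consuming it in batches with `fetch()`'s `n` argument (the `limit 50` here is illustrative, not part of the original post):

```r
library(RMySQL)

# Connect to the same public UCSC server used above
DbCon <- dbConnect(MySQL(), user = "genome",
                   host = "genome-mysql.cse.ucsc.edu", db = "ailMel1")

# Let MySQL restrict the rows instead of pulling the whole table
query <- dbSendQuery(DbCon, "select * from chainCanFam2 limit 50;")

batch <- fetch(query, n = 10)   # first 10 rows
rest  <- fetch(query, n = -1)   # everything remaining

dbClearResult(query)  # always clear the result set
dbDisconnect(DbCon)   # and close the connection
```

Server-side filtering keeps memory use in R predictable, which matters once tables run into millions of rows.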


Extracting Data from HDF5

HDF5 stands for Hierarchical Data Format v5 and is closely modeled on file systems. In HDF5, a “group” is analogous to a directory, a “dataset” is like a file. HDF5 also uses “attributes” to associate metadata with a particular group or dataset. HDF5 uses ASCII names for these different objects, and objects can be accessed by UNIX-like pathnames, e.g., “/sample1/tempsensor/firsttrial” for a top-level group “sample1”, a subgroup “tempsensor”, and a dataset “firsttrial”.

Getting the required packages

> source("https://bioconductor.org/biocLite.R")
> biocLite("rhdf5") # installs the rhdf5 package

Creating an HDF5 File

> library(rhdf5) # load the package
> sample <- h5createFile("exampleHDF5.h5") # creates an HDF5 file
> sample <- h5createGroup("exampleHDF5.h5", "History") # creates a group called History
> sample <- h5createGroup("exampleHDF5.h5", "English") # creates a group called English
> sample <- h5createGroup("exampleHDF5.h5", "English/English1") # creates a group English1 within English
> h5ls("exampleHDF5.h5") # lists all the components of the HDF5 file
Contents of exampleHDF5.h5

Loading Data into Files

> A <- matrix(1901:1910, nrow = 5, ncol = 2) # generate sample data
> h5write(A, "exampleHDF5.h5", "History/A") # write the matrix into the file as dataset History/A
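`h5write()` can also update just part of an existing dataset through its `index` argument. A short sketch, assuming the `exampleHDF5.h5` file and the 5×2 dataset `History/A` created above already exist:

```r
library(rhdf5)

# Overwrite rows 1-2 of the first column of the existing dataset History/A
h5write(c(2001L, 2002L), "exampleHDF5.h5", "History/A",
        index = list(1:2, 1))

# Read the dataset back to confirm the partial update
h5read("exampleHDF5.h5", "History/A")

H5close()  # release any open HDF5 handles
```

Partial writes like this avoid rewriting a whole dataset when only a slice has changed.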

Reading from HDF5 file

> result_HDF5 <- h5read("exampleHDF5.h5", "History/A") # reads the dataset back into R
> result_HDF5 # displays the result
The HDF5 dataset read into R


Extracting Data from the Web

Web scraping: programmatically extracting data from the HTML code of websites.

The httr package: the aim of httr is to provide a wrapper around the curl package, customised to the demands of modern web APIs.

Key features:

  • Functions for the most important HTTP verbs: GET(), HEAD(), PATCH(), PUT(), DELETE() and POST().
  • Automatic connection sharing across requests to the same website (by default, curl handles are managed automatically), cookies maintained across requests, and an up-to-date root-level SSL certificate store.
> install.packages("XML")
> install.packages("httr")
> library(httr)
> library(XML)
> html2 <- GET("https://www.datazar.com/") # download the page
> content2 <- content(html2, as="text") # extract the body as one string
> parsedHtml <- htmlParse(content2, asText=TRUE) # parse the HTML into a document tree
> xpathSApply(parsedHtml, "//title", xmlValue) # pull out the <title> element
The title of the page scraped
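The parsed document can yield more than the title. A small sketch (using the same httr and XML functions as above, plus httr's `status_code()`) that checks the HTTP status before scraping and then collects every link on the page:

```r
library(httr)
library(XML)

resp <- GET("https://www.datazar.com/")

# Only scrape if the request actually succeeded
if (status_code(resp) == 200) {
  doc <- htmlParse(content(resp, as = "text"), asText = TRUE)

  # xmlGetAttr pulls the href attribute from every <a> node
  links <- xpathSApply(doc, "//a", xmlGetAttr, "href")
  print(head(links))
}
```

Checking the status code first avoids parsing error pages as if they were real content.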


Extracting Data from APIs

An API is a software intermediary that makes it possible for application programs to interact with each other and share data. It’s often an implementation of REST that exposes a specific software functionality while protecting the rest of the application.
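Before the Twitter example, a bare-bones API call looks like this with httr. This sketch is not from the original post; it queries GitHub's public API (which needs no authentication for simple reads) and parses the JSON response with the jsonlite package:

```r
library(httr)
library(jsonlite)

# Request metadata about a public repository
resp <- GET("https://api.github.com/repos/r-lib/httr")

if (status_code(resp) == 200) {
  # Parse the JSON body into an R list
  repo <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  print(repo$full_name)         # repository name
  print(repo$stargazers_count)  # star count at request time
}
```

Most REST APIs follow this same shape: send an HTTP request, check the status, then parse the JSON body.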

Getting Data from Twitter API

Loading the libraries

> library(wordcloud)
> library(ggmap)
> library(stringr)
> library(tm)
> library(RCurl)
> library(dplyr)
> library(plyr)
> library(ROAuth)
> library(twitteR)

Creating a Twitter developer account

Step 1: Go to https://apps.twitter.com/.

Steps 2 and 3: (Illustrated with screenshots in the original post.)

Step 4: Click "Keys and Tokens" in the top navigation bar.

Step 5: Click "Create Access Token"; your access tokens will then be displayed.

Authentication in R

> consumer_key <- "obtained from twitter"
> consumer_secret <- "obtained from twitter"
> access_token <- "obtained from twitter"
> access_secret <- "obtained from twitter"
> setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

If you want to test your authentication, just try to fetch some tweets:

> searchTwitter("iphone")
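searchTwitter() returns a list of status objects; the twitteR package's twListToDF() flattens that list into a data frame, which is the natural shape for further preparation. A short sketch, assuming setup_twitter_oauth() has already succeeded:

```r
library(twitteR)

# Pull a handful of recent tweets matching a search term
tweets <- searchTwitter("rstats", n = 20)

# Flatten the list of status objects into one data frame
tweets_df <- twListToDF(tweets)

# Columns include text, screenName, created, retweetCount, ...
head(tweets_df[, c("text", "screenName", "created")])
```

From here the text column can be handed straight to the tm and wordcloud packages loaded earlier.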

For a detailed understanding of how to use APIs for sentiment analysis, please refer to the project here.

References: https://www.wikipedia.org/, https://leanpub.com/u/rdpeng

