Web data acquisition: from database to dataframe for data analysis and visualization (Part 4)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The previous post described how the deeply nested JSON data on fligths were parsed and stored in an R-friendly database structure. However, looking into the data, the information is not yet ready for statistical analysis and visualization and some further processing is necessary before extracting insights and producing nice plots.
In the parsed batch, it is clearly visible the redundant structure of the data with the flight id repeted for each segment of each flight. This is also confirmed with the following simple check as the rows of the dataframe are more than the unique counts of the elements in the id column.
dim(data_items) [1] 397 15 length(unique(data_items$id)) 201 # real time changes of data could produce different results
This implies that the information of each segment of each flight has to be aggregated and merged in a dataset as single observations of a statistical analysis between, for example, price and distance. First, a unique primary key for each observation has to be used as reference variable to uniquely identify each element of the dataset.
library(plyr) # sql like functions library(readr) # parse numbers from strings data_items
The aggregation of mileage and price using the unique primary key allows to set up a dataframe ready for statistical analysis and data visualization. Current data tells us that there is a maximum of three segments in the connection between FCO and LHR with a minimum price of around EUR 122 and a median around EUR 600.
# descriptive statistics summary(data) # histogram price & distance g1
Of course, plenty of other analysis and graphical representations using flights features are possible given the large set of variables available in QPX Express API and the availability of data in real time.
To conclude the 4-step (flight) trip from data acquisition to data analysis, let’s recap the most important concepts described in each of the post:
3) Data parsing and structuring
4) Data analysis and visualization
That’s all folks!
#R #rstats #maRche #json #curl #qpxexpress #Rbloggers
This post is also shared in www.r-bloggers.com
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.