If you have not done so already, make sure you have R and Oracle setup as described in the previous article. If so, you should be able to create a connection from R and list the tables that are available in the HR schema.
The HR schema contains Human Resources data for a fictional company. The data consists of employee data, geographic areas, organizational departments and job position. An entity relationship diagram (ERD) of the schema is as follows:
Employees have salary data, so lets start with some simple univariate analysis.
It is possible to retrieve an entire table using sqlFetch. This seems to be the rather R way of thinking about work – a table corresponds with a data frame, get a data frame into R and manipulate the data from there.
Of course, this means that a large amount of unused data was accessed and retrieved over the network. This data was stored in memory by R. For simple examples using demonstration schemas, this is not an issue. In real world situations where the amount of data is growing significantly, it makes sense to reduce the amount of data being accessed and returned. It is better to request only the data that is of specific interest using an SQL query.
Are you surprised by the result? I was! Can you figure out the change required to the code above to sell the plot of salary against the index as in the previous chart? The solution – specifically reference the field being plotted.
The class of r$SALARY is a class of type integer. The class of r itself is a data frame. R functions often handle objects of various types without a problem – they simply do what you would expect. This is one small example of the benefit of paying closer attention to the object that you are manipulating. You also need to pay attention to the function in use. Consider the ggplot2 library qplot function.