Data analysis deals with different kinds of data.
For instance we can have supermarket sales with
– a transactional table, with customer ID, item ID, date of purchase
– an item table, with the item ID and its price
– a customer table, with customer ID and its anagraphic details (age, gender)
In this example data are tables with different structures.
In order to have a structured analysis framework, we can define how to treat each kind of data. They have different requirements and can be visualized in different ways.
Examples of requirements are positive prices for the item table and unique customer IDs in the customer table. The visualization of a transactional table can be a chart with the total daily number of purchases. For customers, it can be an histogram showing the age distribution.
There are also some relationships between tables. For instance, each customer ID in a transactional table should be in the customers table.
A good solution is using OOP (Object Oriented Programming). R has a particular structure for OOP, thought to ease data exploration.
There are some generic methods that can be applied to any class, like “plot”. It generates a different chart depending on the type of data. It displays the values of a numeric array in a simple chart. Instead, if applied to a data.frame with at least 3 columns, it generates a multiplot for any combination, like “pairs” method. Just type plot(iris) to see this example.
OOP allows to define a class for any type of table. These classes inherit all from data.table (or data.frame) and may be S3 or S4 classes (see the documentation). Each class may have conditions to check when data are loaded. In addition, it’s possible to redefine some methods, like “plot”. In the example, there are 3 objects: “transactions”, “items”, “customers”.
It’s also possible to put different tables together. S4 classes allow to create an object, similar to a list, containing different data. In the example, the object has 3 slots, containing the 3 tables. It’s possible to define how to check conditions between tables and visualizing all of them together.
There are different improvements that come from OOP. Code is more structured and easier to develop, understand, and share.
My next article will describe how to define the classes.