The field of data science has progressed from simple linear regression models to complex ensembling techniques, but the most widely used models are still among the simplest and most interpretable: linear regression, logistic regression, decision trees and Naive Bayes. The Naive Bayes algorithm in particular is a probability-based technique that is simple yet powerful enough to outperform more complex algorithms on very large datasets. It is a common technique in medical science, where it is used for tasks such as cancer detection. This article explains the underlying logic behind the Naive Bayes algorithm and walks through an example implementation.
We calculate the probability of an event as the proportion of cases in which it happens. Just as a single event has a probability, a group of events has a probability: the proportion of cases in which they occur together. Another concept in probability is conditional probability: if it is known that something has already happened, what is the probability that another event happens after it? Logically, we are narrowing our scope to only those cases in which the first event has happened, and then calculating the proportion of those cases in which the second event occurs. Mathematically, if A is the first event and B the second, then P(B|A) is the probability of event B given that event A has occurred, and P(A ∩ B) is the probability of the two events occurring together.
P(B | A) = P(B) * P(A | B) / P(A)
This is the foundational pillar of the Naive Bayes algorithm. It lets us distinguish different kinds of events by comparing the marginal probability P(B) with the conditional probability P(B|A). If the two probabilities are equal, the occurrence of event A had no effect on event B, and the events are known as independent events. If the conditional probability is zero, the occurrence of event A implies that event B cannot occur. If the reverse is also true, the events are known as mutually exclusive events, and only one of them can occur at a time. All other cases are classified as dependent events, where the conditional probability can be either lower or higher than the marginal one. In real life, every coin toss is independent of all previous coin tosses, so coin tosses are independent. The outcome of a single coin toss is composed of mutually exclusive events: we cannot have heads and tails at the same time. When we consider runs of multiple coin tosses, we are talking about joint events: for a combination of three coin tosses, the final outcome depends on the first, second and third toss.
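These definitions can be checked by brute-force enumeration. The short sketch below (base R only, not from the original article) lists all equally likely outcomes of two fair coin tosses and verifies that conditioning on the first toss leaves the probability of heads on the second toss unchanged:

```r
# Enumerate all equally likely outcomes of two fair coin tosses
outcomes <- expand.grid(first = c("H", "T"), second = c("H", "T"))

# Unconditional probability of heads on the second toss
p_second_heads <- mean(outcomes$second == "H")

# Probability of heads on the second toss, given heads on the first
given_first_heads <- outcomes[outcomes$first == "H", ]
p_cond <- mean(given_first_heads$second == "H")

p_second_heads  # 0.5
p_cond          # 0.5 -- equal, so the two tosses are independent
```

Since P(second = H | first = H) equals P(second = H), the two tosses are independent events in the sense defined above.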
How do we Calculate these Probabilities?
It is easy to calculate the probability of a single event: the number of cases in which the event occurs divided by the total number of possible cases. For instance, the probability of a 6 in a single roll of a six-faced die is ⅙, assuming all sides are equally likely to come up. However, one needs to be careful when calculating probabilities of two or more events. Simply knowing the probability of each event separately is not enough to calculate the probability of multiple events happening. If we additionally know that the events are independent, then the probability of them occurring together is the product of the individual probabilities.
We denote this mathematically as follows:
P(A and B)=P(A)*P(B) – For independent events
As I already described, each coin toss is independent of other coin tosses. So the probability of having a Heads and a Heads combination in two coin tosses is
P(Heads-Heads Combo)=P(Heads in first throw)*P(Heads in second throw)=½ * ½ = ¼
If the events are not independent, we can use the probability of one event multiplied by the probability of the second event given that the first has happened:
P(A and B)=P(A)*P(B|A) – For dependent events
An example of dependent events is drawing cards without replacement. If we want the probability that two cards drawn are a King and then a Queen, the first draw is made from 52 cards whereas the second is made from the remaining 51 cards.
Thus, P(King and Queen)=P(King)*P(Queen|King)
Here, P(King) is 4/52. After a King is drawn, there are 4 queens out of 51 cards.
So, P(Queen|King) is 4/51
P(King and Queen) = 4/52 * 4/51 ≈ 0.6%
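The arithmetic above is easy to reproduce; a minimal check in R (not part of the original article):

```r
# Dependent events: drawing a King and then a Queen without replacement
p_king <- 4 / 52               # 4 kings in a full deck of 52
p_queen_given_king <- 4 / 51   # 4 queens remain among the 51 leftover cards
p_both <- p_king * p_queen_given_king

p_both                   # ~0.00603
round(100 * p_both, 1)   # 0.6 -- i.e. about 0.6%
```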
This is known as the general multiplication rule. It also applies to the independent-events scenario, but since the events are independent, P(B|A) becomes equal to P(B).
The third case is for mutually exclusive events. If the events are mutually exclusive, we know that only one of them can occur at a time, so the probability of the two events occurring together is zero. We are sometimes interested in the probability of one of the events occurring, and in this scenario it is the sum of the individual probabilities.
P(A OR B)=P(A)+P(B) – for mutually exclusive events
If we’re talking about a single throw of a fair six-faced die, the probability of any two numbers occurring together is zero. In this case the probability of any prime number occurring is the sum of the probabilities of each prime number: P(2) + P(3) + P(5).
Had the events not been mutually exclusive, summing the individual probabilities would have counted the probability of both events occurring together twice. Hence we subtract that probability once.
P(A OR B)=P(A)+P(B)-P(A AND B) – for events which are not mutually exclusive
In a single fair six-faced die throw, the probability of throwing a multiple of 2 or 3 describes a scenario of events which are not mutuallyexclusive, since 6 is a multiple of both 2 and 3 and would otherwise be counted twice.
Thus,
P(multiple of 2 or 3)=P(Multiple of 2)+P(Multiple of 3)- P(Multiple of 2 AND 3)
=P(2,4,6)+P(3,6)-P(6)=3/6 + 2/6 -1/6 = 4/6 =2/3
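The addition rule can be verified by enumerating the six faces directly; a small base-R check (an illustration added here, not from the original article):

```r
# General addition rule on a single fair die:
# P(multiple of 2 OR multiple of 3)
faces <- 1:6
a <- faces %% 2 == 0        # multiples of 2: 2, 4, 6
b <- faces %% 3 == 0        # multiples of 3: 3, 6

p_a       <- mean(a)        # 3/6
p_b       <- mean(b)        # 2/6
p_a_and_b <- mean(a & b)    # only 6 qualifies -> 1/6
p_a_or_b  <- p_a + p_b - p_a_and_b

p_a_or_b                    # 0.6666..., i.e. 2/3
mean(a | b) == p_a_or_b     # TRUE -- direct counting gives the same answer
```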
This is known as the general addition rule. Similar to the multiplication rule, it also applies to the mutually exclusive scenario, but in that case P(A AND B) is zero.
This is all we need to understand how Naive Bayes algorithm works. It takes into account all such scenarios and learns accordingly. Let’s get our hands dirty with a sample dataset.
The reason the Naive Bayes algorithm is called "naive" is not that it is simple or stupid. It is that the algorithm makes a very strong assumption: that the features of the data are independent of each other, while in reality they may be dependent in some way. In other words, it assumes that the presence of one feature in a class is completely unrelated to the presence of every other feature. If this assumption of independence holds, Naive Bayes performs extremely well and often better than other models. Naive Bayes can also be used with continuous features but is better suited to categorical variables. If all the input features are categorical, Naive Bayes is recommended. In the case of numeric features, however, it makes another strong assumption: that each numerical variable is normally distributed.
R provides a package called ‘e1071’ which contains the Naive Bayes training function. For this demonstration, we will use the classic Titanic dataset and find out which cases Naive Bayes identifies as survived.
The Titanic dataset in R is a table of about 2,200 passengers summarised according to four factors – economic status (1st class, 2nd class, 3rd class or crew), gender (male or female), age category (child or adult) and whether the passenger survived. For each combination of Age, Gender, Class and Survived status, the table gives the number of passengers who fall into that combination. We will use the Naive Bayes technique to classify such passengers and check how well it performs.
As we know, Bayes theorem is based on conditional probability and uses the formula
P(A | B) = P(A) * P(B | A) / P(B)
We now know how this conditional probability arises from the multiplication of events. Using the general multiplication rule, P(A AND B) = P(A) * P(B | A), we can obtain the value of the conditional probability: P(B | A) = P(A AND B) / P(A), which is a variation of Bayes' theorem.
Since P(A AND B) also equals P(B) * P(A | B), we can substitute it and get back the original formula
P(B | A) = P(B) * P(A | B) / P(A)
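The rearrangement can be verified numerically. The probabilities below are made up for illustration (they are not from the Titanic data); the point is only that the two routes to P(B | A) agree:

```r
# Verify Bayes' theorem on an illustrative set of probabilities
p_a         <- 0.4                 # P(A)
p_b         <- 0.25                # P(B)
p_a_given_b <- 0.6                 # P(A | B)

p_a_and_b   <- p_b * p_a_given_b   # multiplication rule: P(A AND B) = 0.15
p_b_given_a <- p_a_and_b / p_a     # definition of conditional probability

bayes <- p_b * p_a_given_b / p_a   # Bayes' theorem directly
all.equal(p_b_given_a, bayes)      # TRUE -- both give 0.375
```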
Using this for each of the features – Age, Gender and economic status – the Naive Bayes algorithm calculates the conditional probability of survival for each combination.
#Getting started with Naive Bayes
#Install the package
#install.packages("e1071")
#Loading the library
library(e1071)
?naiveBayes #The documentation also contains an example implementation of Titanic dataset
#Next load the Titanic dataset
data("Titanic")
#Save into a data frame and view it
Titanic_df=as.data.frame(Titanic)
We see that there are 32 observations which represent all possible combinations of Class, Sex, Age and Survived with their frequency. Since it is summarised, this table is not suitable for modelling purposes. We need to expand the table into individual rows. Let’s create a repeating sequence of rows based on the frequencies in the table
#Creating data from table
repeating_sequence=rep.int(seq_len(nrow(Titanic_df)), Titanic_df$Freq) #This will repeat each combination equal to the frequency of each combination
#Create the dataset by repeating the rows
Titanic_dataset=Titanic_df[repeating_sequence,]
#We no longer need the frequency, drop the feature
Titanic_dataset$Freq=NULL
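The rep.int() trick can be hard to visualize at first. Here is the same expansion on a tiny made-up frequency table (the column values are hypothetical, chosen just to show the mechanics):

```r
# A tiny summarised table: two combinations with frequencies 3 and 2
toy <- data.frame(Sex = c("Male", "Female"),
                  Survived = c("No", "Yes"),
                  Freq = c(3, 2))

# Repeat each row index Freq times, then index back into the table
idx <- rep.int(seq_len(nrow(toy)), toy$Freq)  # 1 1 1 2 2
expanded <- toy[idx, ]
expanded$Freq <- NULL  # the frequency column is no longer needed

nrow(expanded)  # 5 -- one row per individual passenger
```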
The data is now ready for Naive Bayes to process. Let’s fit the model
#Fitting the Naive Bayes model
Naive_Bayes_Model=naiveBayes(Survived ~., data=Titanic_dataset)
#What does the model say? Print the model summary
Naive_Bayes_Model
Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      No      Yes
0.676965 0.323035

Conditional probabilities:
     Class
Y            1st        2nd        3rd       Crew
  No  0.08187919 0.11208054 0.35436242 0.45167785
  Yes 0.28551336 0.16596343 0.25035162 0.29817159

     Sex
Y           Male     Female
  No  0.91543624 0.08456376
  Yes 0.51617440 0.48382560

     Age
Y          Child      Adult
  No  0.03489933 0.96510067
  Yes 0.08016878 0.91983122
The model computes the conditional probabilities for each feature separately. We also have the a-priori probabilities, which indicate the class distribution of our data. Let's check how we perform on the data.
#Prediction on the dataset
NB_Predictions=predict(Naive_Bayes_Model,Titanic_dataset)
#Confusion matrix to check accuracy
table(NB_Predictions,Titanic_dataset$Survived)

NB_Predictions   No  Yes
           No  1364  362
           Yes  126  349
We have the results! We classify 1364 out of 1490 "No" cases correctly and 349 out of 711 "Yes" cases correctly. This means the ability of the Naive Bayes algorithm to predict "No" cases is about 91.5%, but it falls to only about 49% for the "Yes" cases, resulting in an overall accuracy of 77.8%.
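These percentages follow directly from the confusion matrix above; a quick check in R (added here for illustration):

```r
# Accuracy figures recovered from the confusion matrix above
conf <- matrix(c(1364, 126, 362, 349), nrow = 2,
               dimnames = list(predicted = c("No", "Yes"),
                               actual    = c("No", "Yes")))

recall_no  <- conf["No", "No"]   / sum(conf[, "No"])   # 1364 / 1490
recall_yes <- conf["Yes", "Yes"] / sum(conf[, "Yes"])  # 349 / 711
accuracy   <- sum(diag(conf)) / sum(conf)              # 1713 / 2201

round(c(recall_no, recall_yes, accuracy), 3)  # 0.915 0.491 0.778
```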
Naive Bayes is a deterministic algorithm: as long as the data remains the same, repeated runs cannot perform differently. We will, however, try another implementation of the Naive Bayes algorithm using the ‘mlr’ package. Assuming the same session is still open, I will install and load the package and fit a model.
#Getting started with Naive Bayes in mlr
#Install the package
#install.packages("mlr")
#Loading the library
library(mlr)
The mlr package provides a large collection of models and works by creating tasks and learners which are then trained. Let's create a classification task using the Titanic dataset and fit a model with the Naive Bayes algorithm.
#Create a classification task for learning on Titanic Dataset and specify the target feature
task = makeClassifTask(data = Titanic_dataset, target = "Survived")
#Initialize the Naive Bayes classifier
selected_model = makeLearner("classif.naiveBayes")
#Train the model
NB_mlr = train(selected_model, task)
The model summary that was printed by the e1071 package is stored in the learner model. Let's print it and compare.
#Read the model learned
NB_mlr$learner.model

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      No      Yes
0.676965 0.323035

Conditional probabilities:
     Class
Y            1st        2nd        3rd       Crew
  No  0.08187919 0.11208054 0.35436242 0.45167785
  Yes 0.28551336 0.16596343 0.25035162 0.29817159

     Sex
Y           Male     Female
  No  0.91543624 0.08456376
  Yes 0.51617440 0.48382560

     Age
Y          Child      Adult
  No  0.03489933 0.96510067
  Yes 0.08016878 0.91983122
The a-priori probabilities and the conditional probabilities are identical to those calculated by the e1071 package, as expected. This means that our predictions will also be the same.
#Predict on the dataset without passing the target feature
predictions_mlr = as.data.frame(predict(NB_mlr, newdata = Titanic_dataset[,1:3]))
##Confusion matrix to check accuracy
table(predictions_mlr[,1],Titanic_dataset$Survived)

        No  Yes
  No  1364  362
  Yes  126  349
As we see, the predictions are exactly the same. The only way to improve is to have more features or more data. Perhaps if we had more features, such as the exact age, size of family, and number of parents and siblings on the ship, we might arrive at a better model using Naive Bayes.
In essence, Naive Bayes has the advantage of a strong probabilistic foundation and is very robust. The ‘caret’ package also provides a Naive Bayes function, but it would give us the same predictions and probabilities.
This article was contributed by Perceptive Analytics. Madhur Modi, Chaitanya Sagar, Vishnu Reddy and Saneesh Veetil contributed to this article.
Here is the Complete Code (used in this article):
#Getting started with Naive Bayes
#Install the package
#install.packages("e1071")
#Loading the library
library(e1071)
?naiveBayes #The documentation also contains an example implementation of Titanic dataset
#Next load the Titanic dataset
data("Titanic")
#Save into a data frame and view it
Titanic_df=as.data.frame(Titanic)
#Creating data from table
repeating_sequence=rep.int(seq_len(nrow(Titanic_df)), Titanic_df$Freq) #This will repeat each combination equal to the frequency of each combination
#Create the dataset by repeating the rows
Titanic_dataset=Titanic_df[repeating_sequence,]
#We no longer need the frequency, drop the feature
Titanic_dataset$Freq=NULL
#Fitting the Naive Bayes model
Naive_Bayes_Model=naiveBayes(Survived ~., data=Titanic_dataset)
#What does the model say? Print the model summary
Naive_Bayes_Model
#Prediction on the dataset
NB_Predictions=predict(Naive_Bayes_Model,Titanic_dataset)
#Confusion matrix to check accuracy
table(NB_Predictions,Titanic_dataset$Survived)
#Getting started with Naive Bayes in mlr
#Install the package
#install.packages("mlr")
#Loading the library
library(mlr)
#Create a classification task for learning on Titanic Dataset and specify the target feature
task = makeClassifTask(data = Titanic_dataset, target = "Survived")
#Initialize the Naive Bayes classifier
selected_model = makeLearner("classif.naiveBayes")
#Train the model
NB_mlr = train(selected_model, task)
#Read the model learned
NB_mlr$learner.model
#Predict on the dataset without passing the target feature
predictions_mlr = as.data.frame(predict(NB_mlr, newdata = Titanic_dataset[,1:3]))
##Confusion matrix to check accuracy
table(predictions_mlr[,1],Titanic_dataset$Survived)
In a recent post, I offered a definition of the distinction between data science and machine learning: that data science is focused on extracting insights, while machine learning is interested in making predictions. I also noted that the two fields greatly overlap:
I use both machine learning and data science in my work: I might fit a model on Stack Overflow traffic data to determine which users are likely to be looking for a job (machine learning), but then construct summaries and visualizations that examine why the model works (data science). This is an important way to discover flaws in your model, and to combat algorithmic bias. This is one reason that data scientists are often responsible for developing machine learning components of a product.
I’d like to further explore how data science and machine learning complement each other, by demonstrating how I would use data science to approach a problem of image classification. We’ll work with a classic machine learning challenge: the MNIST digit database.
The challenge is to classify a handwritten digit based on a 28-by-28 black and white image. MNIST is often credited as one of the first datasets to prove the effectiveness of neural networks.
In a series of posts, I’ll be training classifiers to recognize digits from images, while using data exploration and visualization to build our intuitions about why each method works or doesn’t. Like most of my posts I’ll be analyzing the data through tidy principles, particularly using the dplyr, tidyr and ggplot2 packages. In this first post we’ll focus on exploratory data analysis, to show how you can better understand your data before you start training classification algorithms or measuring accuracy. This will help when we’re choosing a model or transforming our features.
The default MNIST dataset is somewhat inconveniently formatted, but Joseph Redmon has helpfully created a CSV-formatted version. We can download it with the readr package.
This dataset contains one row for each of the 60,000 training instances, and one column for each of the 784 pixels in a 28 x 28 image. The data as downloaded doesn't have column labels, but the pixels are arranged as "row 1 column 1, row 1 column 2, row 1 column 3" and so on. This is a useful enough representation for machine learning. But as Jenny Bryan often discusses, we shouldn't feel constrained by our current representation of the data, and for exploratory analysis we may want to make a few changes.
If your data's huge, analyze a small subset first to check approach. Like navigating the route on a bike before trying your 18-wheeler truck
— David Robinson (@drob) September 4, 2015
With that in mind, we’ll gather the data, do some arithmetic to keep track of x and y within an image, and keep only the first 10,000 training instances.
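The original post's code for this step isn't reproduced here, so the sketch below uses a toy 2x2 "image" (real MNIST images are 28x28) and base R's reshape in place of tidyr's gather; the column names are hypothetical. The point is the arithmetic that recovers x and y from the flat pixel index:

```r
# Wide format: one row per instance, columns label, pixel1..pixel4
pixels <- data.frame(instance = 1:2, label = c(3, 7),
                     pixel1 = c(0, 255), pixel2 = c(128, 0),
                     pixel3 = c(255, 64), pixel4 = c(0, 0))

# Gather to long format: one row per pixel per image
long <- reshape(pixels, direction = "long",
                varying = paste0("pixel", 1:4), v.names = "value",
                timevar = "pixel", times = 1:4, idvar = "instance")

# Arithmetic to recover x and y from the pixel index
# (28 would replace 2 for real MNIST images)
width <- 2
long$x <- (long$pixel - 1) %% width
long$y <- (long$pixel - 1) %/% width

nrow(long)  # 8 -- 2 instances x 4 pixels
```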
We now have one row for each pixel in each image. This is a useful format because it lets us visualize the data along the way. For example, we can visualize the first 12 instances with a couple lines of ggplot2.
We’ll still often return to the one-row-per-instance format (especially once we start training classifiers in future posts), but this is a fast way to understand and appreciate how the data and the problem is structured. In the rest of this post we’ll also polish this kind of graph (like making it black and white rather than a scale of blues).
Let’s get more comfortable with the data. From the legend above, it looks like 0 represents blank space (like the edges of the image), and a maximum around 255 represents the darkest points of the image. Values in between may represent different shades of “gray”.
How much gray is there in the set of images?
Most pixels in the dataset are completely white, along with another set of pixels that are completely dark, with relatively few in between. If we were working with black-and-white photographs (like of faces or landscapes), we might have seen a lot more variety. This gives us a hint for later feature engineering steps: if we wanted to, we could probably replace each pixel with a binary 0 or 1 with very little loss of information.
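That binarization step is simple to express; the sketch below uses a midpoint threshold of 128, which is a judgment call on my part rather than a value from the post:

```r
# Binarizing pixel intensities at the midpoint of the 0-255 range
values <- c(0, 12, 130, 255, 64, 250)    # toy pixel intensities
binary <- as.integer(values >= 128)      # 1 = dark, 0 = light
binary  # 0 0 1 1 0 1
```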
I’m interested in how much variability there is within each digit label. Do all 3s look like each other, and what is the “most typical” example of a 6? To answer this, we can find the mean value for each position within each label, using dplyr’s group_by and summarize.
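On a toy long-format frame the same aggregation looks like this (base R's aggregate stands in for the dplyr group_by/summarize pipeline the post uses; the numbers are made up):

```r
# Toy long-format pixel data: two labels, two pixel positions each
pixels_long <- data.frame(
  label = c(0, 0, 0, 0, 1, 1, 1, 1),
  x     = c(0, 1, 0, 1, 0, 1, 0, 1),
  value = c(10, 250, 30, 230, 200, 0, 220, 20)
)

# Mean intensity at each (label, x) position -- the "centroid" images
centroids <- aggregate(value ~ label + x, data = pixels_long, FUN = mean)
centroids
#   label x value
#       0 0    20
#       1 0   210
#       0 1   240
#       1 1    10
```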
We visualize these average digits as ten separate facets.
These averaged images are called centroids. We’re treating each image as a 784-dimensional point (28 by 28), and then taking the average of all points in each dimension individually. One elementary machine learning method, nearest centroid classifier, would ask for each image which of these centroids it comes closest to.
Already we have some suspicions about which digits might be easier to separate. Distinguishing 0 and 1 looks pretty straightforward: you could pick a few pixels at the center (always dark in 1 but not 0), or at the left and right edges (often dark in 0 but not 1), and you’d have a pretty great classifier. Pairs like 4/9, or 3/8, have a lot more overlap and will be a more challenging problem.
Again, one of the aspects I like about the tidy approach is that at all stages of your analysis, your data is in a convenient form for visualization, especially a faceted graph like this one. In its original form (one-row-per-instance) we’d need to do a bit of transformation before we could plot it as images.
So far, this machine learning problem might seem a bit easy: we have some very “typical” versions of each digit. But one of the reasons classification can be challenging is that some digits will fall widely outside the norm. It’s useful to explore atypical cases, since it could help us understand why the method fails and help us choose a method and engineer features.
In this case, we could consider the Euclidean distance (square root of the sum of squares) of each image to its label’s centroid.
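For a single flattened image that distance is one line of R; the vectors below are toy 4-pixel stand-ins for the real 784-dimensional images:

```r
# Euclidean distance of one flattened image to its label's centroid
image_vec    <- c(0, 255, 30, 220)    # toy 4-pixel "image"
centroid_vec <- c(20, 240, 20, 210)   # toy centroid for the same label

distance <- sqrt(sum((image_vec - centroid_vec)^2))
distance  # sqrt(825), about 28.72
```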
Measured this way, which digits have more variability on average?
It looks like 1s have especially low distances to their centroid: for the most part there's not a ton of variability in how people draw that digit. The most variability by this measure is in 0s and 2s. But every digit has at least a few cases with an unusually large distance from its centroid. I wonder what those look like?
To discover this, we can visualize the six digit instances that had the least resemblance to their central digit.
This is a useful way to understand what kinds of problems the data could have. For instance, while most 1s looked the same, they could be drawn diagonally, or with a flat line and a flag on top. A 7 could be drawn with a bar in the middle. (And what is up with that 9 on the lower left?)
This also gives us a realistic sense of how accurate our classifier can get. Even humans would have a hard time classifying some of these sloppy digits, so we can’t expect a 100% success rate. (Conversely, if one of our classifiers does get a 100% success rate, we should examine whether we’re overfitting!).
So far we’ve been examining one digit at a time, but our real goal is to distinguish them. For starters, we might like to know how easy it is to tell pairs of digits apart.
To examine this, we could try overlapping pairs of our centroid digits, and taking the difference between them. If two centroids have very little overlap, this means they’ll probably be easy to distinguish.
We can approach this with a bit of crossing() logic from tidyr:
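The post's crossing() code isn't reproduced here, so this is a base-R sketch of just the pairing step (expand.grid playing the role of tidyr::crossing); each pair's centroid images would then be joined on (x, y) and subtracted:

```r
# All distinct pairs of digit labels for the centroid-difference maps
labels <- 0:9
pairs <- expand.grid(first = labels, second = labels)
pairs <- pairs[pairs$first < pairs$second, ]  # keep each unordered pair once

nrow(pairs)  # 45 pairs, i.e. 10 choose 2
# For each pair, join the two centroid images on (x, y) and plot
# first_value - second_value as the red/blue difference map.
```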
Pairs with very red or very blue regions will be easy to classify, since they describe features that divide the datasets neatly. This confirms our suspicion about 0/1 being easy to classify: it has substantial regions that are deeply red or blue.
Comparisons that are largely white may be more difficult. We can see 4/9 looks pretty challenging, which makes sense (a handwritten 4 and 9 really differ only by a small region at the top). 7/9 shows a similar challenge.
MNIST has been so heavily studied that we’re unlikely to discover anything novel about the dataset, or to compete with the best classifiers in the field. (One paper used deep neural networks to achieve about 99.8% accuracy, which is about as well as humans perform).
But a fresh look at a classic problem is a great way to develop a case study. I've been exploring this dataset recently, and I'm excited by how it can be used to illustrate a tidy approach to fundamental concepts of machine learning. In the rest of this series I'd like to show how we can train classifiers such as logistic regression, decision trees, nearest neighbor, and neural networks. In each case we'll use tidy principles to demonstrate the intuition and insight behind the algorithm.
In any model development exercise, a considerable amount of time is spent in understanding the underlying data, visualizing relationships and validating preliminary hypotheses (broadly categorized as Exploratory Data Analysis, or EDA). A key element of EDA involves visually analyzing the data to glean valuable insights and understand underlying relationships and patterns in the data.
While EDA is defined more as a philosophy rather than a defined set of procedures and techniques, there is a certain set of standard analysis that you would most likely perform as part of EDA to gain an initial understanding of the data.
This post provides an overview of the RtutoR package that I developed some time back and to which I have recently added a few new functionalities to automate some elements of EDA. In a nutshell, the package helps you generate an exploratory analysis report (a PowerPoint deck created using the ReporteRs package) containing all the plots and summary tables produced by invoking a single function.

To understand the package functionalities, let's look at a simple example. We will consider the Titanic dataset (most of you should be familiar with this dataset; it is a commonly used practice problem on Kaggle and can be downloaded from here). The problem statement is to predict the likelihood of a passenger surviving the Titanic disaster given a set of attributes such as passenger age, gender, fare price etc.
Before we proceed with building a model, we first try to gain a better understanding of the underlying data. This would generally include analysis such as:
To perform this analysis, simply invoke the generate_exploratory_analysis_ppt function from the package as follows:
# The function takes 3 mandatory arguments
# df - A dataframe object to be analyzed
# target_var - The name of the target (or dependent) variable
# output_file_name - Output file name and destination path (the output file is a
# PowerPoint deck; if only a file name is provided, the file is saved in the
# current working directory)
library(RtutoR)
df = read.csv("train.csv") # Load the Titanic dataset
res = generate_exploratory_analysis_ppt(df, target_var = "Survived",
                                        output_file_name = "titanic_exp_report.pptx")
# If you wish to quickly test the functionality without downloading any external
# dataset, you could test it on one of the inbuilt datasets in R.
# For eg., using the iris dataset
res = generate_exploratory_analysis_ppt(df = iris, target_var = "Species",
                                        output_file_name = "iris_report.pptx")
This function generates an exploratory analysis PowerPoint report (titled titanic_exp_report.pptx) comprising various plots and related summary tables (univariate and bi-variate analysis output). The demo example output report is available in the Github repository for this project and can be downloaded from here.
An output object (named res in the example above) is also generated; it can be used to view the plots and tables on the R console and for any subsequent analysis.
To view the contents of the res object, simply type:
names(res)
# This will return "univar" "bivar"
Basically, the output object contains the univariate and bi-variate analysis output. The analysis output is further categorized into plots and tables (as can be seen by typing names(res$univar) or names(res$bivar)).
With the iris dataset example, all available univariate plots can be listed as follows:
names(res$univar$plots)
# "Histogram - Sepal.Length" "Histogram - Sepal.Width"
# "Histogram - Petal.Length" "Histogram - Petal.Width"
If you try to run the example above, you may encounter a couple of possible errors:
Error : .onLoad failed in loadNamespace() for ‘rJava’, details...

The ReporteRs library that the RtutoR package uses to generate the PowerPoint deck relies on rJava, and this error is caused by a mismatched Java version (e.g. you are running 32-bit Java on a 64-bit machine). You can read more about the issue here. In my case, I was able to resolve it by downloading and installing a 64-bit version of Java (compatible with my 64-bit machine).
You may also get an Error: Invalid column specification error. This may be due to an older version of tidyr installed on your machine. The error should be resolved if you upgrade to the latest version of tidyr; you could also simply remove the package and perform a fresh install. (This is actually a bug caused by not specifying the correct version of the dependent package; I will be fixing it in the next release so that a manual upgrade is not required.)
As can be seen from the output report, the key idea behind this function is to automate some of the standard exploratory and visual analysis that's performed as part of EDA. The function needs only 3 mandatory arguments to run (there are other arguments with default values that you can alter, which we will come to later):

- df –> The dataset name
- target_var –> The target variable (or the dependent variable)
- output_file_name –> The output file name for the PowerPoint deck generated by the function. If only the file name is provided, the output PowerPoint file is saved in the current working directory; to save to a specific folder, provide the full file path.

The specific plots and summary analysis generated by the function depend on the datatypes of the feature and target variables: if the feature is numeric, a histogram is plotted along with a five-number summary table; if the feature is categorical, a bar plot is plotted indicating the count of each unique value (or factor level) for that feature.
The specific plot that is used to plot the relationship between the Target variable and the feature again depends on the data types:
Depending on the data type(s), the function uses commonly used or recommended plot types with default settings (e.g. the default bin size in the case of histograms). Depending on the nature of your data and specific requirements, additional analysis and plots may be required – for example you may wish to change the bin size for histograms, change the default smoothing function (in the case of scatter plots), or use a different plot to visualize a relationship (for example, overlapping histograms or density plots instead of a box plot). The current version of the package does not support changing the default plots and settings, so custom plots, if required, would need to be created separately. However, the package also contains a plotting app that provides an automated interface for some of the most commonly used plots and functionalities in ggplot2. I had previously blogged about this functionality here & here.
Now that we have seen how the basic version of the function works, let’s look at some of the other arguments with default values that you can override. These are:
n_plots_per_slide: You can control the number of plots (with corresponding tables) to be included in each slide. There are 2 options – “1” or “2”, with the default value being 2. If this is changed to 1, only 1 plot (and related table) is displayed on each slide.

plot_theme: ggplot2 is used for generating all the plots that the function generates. ggplot2 has a few different plot themes to choose from; additionally, the ggthemes package provides a whole set of further plot themes. By default, the function uses theme_fivethirtyeight() from ggthemes. (I will talk more about the different themes available and how to choose one when we discuss the Shiny app version of this function.)

top_k_features: If there are a lot of features in your dataset, the resulting analysis and the output PowerPoint report can be massive. For example, if there are 100 features in your dataset (and one target variable), the final output will include 200 plots and related tables (100 for univariate and 100 for bivariate analysis). To reduce the number of features considered for analysis, a simple feature screening can be performed. By default this argument is set to NULL. If feature screening is required, simply provide an integer value to this argument, indicating the number of features to be used for analysis.

f_screen_model: This argument is relevant only if a non-null value is provided to the top_k_features argument. There are various techniques for feature screening, broadly classified under (a) filter methods and (b) wrapper methods. The function uses a few of the filter methods available as part of the FSelector package. There are 4 different filtering methods provided – chi.squared, information.gain, gain.ratio and symmetrical.uncertainty (you can go through the FSelector documentation for more information on these filtering methods).

max_levels_cat_var: For categorical predictors, if the number of factor levels (or unique values for character class) is greater than this argument, the variable is omitted from the analysis (the default value is 10). In our Titanic dataset example, you will see that the passenger name variable is not included in the output because its number of factor levels is > 10. The default value of 10 can be changed, if required, by specifying a different value to this argument.

group_names: Oftentimes, while plotting the relationship between the target variable and a feature, we would also like to visualize the relationship by some additional grouping variable. For example, we may wish to visualize the relationship between survival rates and age grouped by the gender of the passenger. This can be done by specifying an additional grouping variable (or multiple grouping variables) to this argument. Please note that this would generate all possible combinations of target variable and features with the grouping variable (for example, if there are 10 features, excluding the grouping variable, 10 new plots would be created). If the number of such combinations is large, you may wish to do an initial feature selection to reduce the number of features considered for analysis.

Now let’s re-invoke the function we saw previously, altering some of the default argument values.
library(RtutoR)
res = generate_exploratory_analysis_ppt(df, target_var = "Survived", top_k_features = 5,
                                        f_screen_model = "information.gain", group_names = "Sex",
                                        output_file_name = "titanic_exp_report_2.pptx")
Compared to the previous version, there are 2 key changes that we have made:
You can download the output PowerPoint report that this function generates, here (A screenshot is provided below for easy reference)
The package also provides a Shiny App version of the function. You can view a demo of the app here.
To launch the app simply invoke the following function:
gen_exploratory_report_app(df)
Only the dataframe needs to be passed as an argument to the function. All other argument values (target_var, plot_themes etc) are provided as Input fields in the app. Once the argument values are selected or entered (and default argument values changed, if required) the Generate Report button should be clicked to generate the exploratory analysis ppt.
The app version of the function makes it easier to view the list of valid options for each argument – for example, you can view the list of all available plot themes.
However, the key advantage of using the app version to generate the report, is the Interactive plot output that the app provides. Once the report is generated, individual plots can be displayed on the Plot Output pane. Dropdown options are provided to select any particular plot from the set of output plots.
However, instead of the static plots that the PowerPoint report contains, the plots displayed on the app are interactive in nature (using the plotly
library for interactivity). This makes it ideal for exploratory analysis – For e.g. you can zoom in on a section of the plot, switch on/off specific levels in a categorical variable and so on.
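The same kind of interactivity can be reproduced outside the app with plotly's ggplotly() wrapper (a minimal, self-contained sketch using a built-in dataset, not the app's own code):

```r
library(ggplot2)
library(plotly)

# Any ggplot object can be made interactive with ggplotly():
# zoom, hover tooltips, and toggling of legend levels come for free
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()
ggplotly(p)
```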
I hope that the package and the functionalities that it provides help meet the desired objective of enhancing the speed and efficiency with which data analysis can be performed so that more time can be spent on value adding & hard-to-automate tasks such as Feature Engineering, Insights generation, Story-telling etc.
Do give the package a try, and in case of any issues, feel free to highlight them in the Issues section of the GitHub repository for this project.
Another Rblpapi release, now at version 0.3.8, arrived on CRAN yesterday. Rblpapi provides a direct interface between R and the Bloomberg Terminal via the C++ API provided by Bloomberg Labs (but note that a valid Bloomberg license and installation is required).
This is the eighth release since the package first appeared on CRAN in 2016. This release wraps up a few smaller documentation and setup changes, but also includes an improvement to the (less frequently used) subscription mode which Whit cooked up on the weekend. Details below:
Changes in Rblpapi version 0.3.8 (2018-01-20)
- The 140 day limit for intra-day data histories is now mentioned in the getTicks help (Dirk in #226 addressing #215 and #225).
- The Travis CI script was updated to use run.sh (Dirk in #226).
- The install_name_tool invocation under macOS was corrected (@spennihana in #232).
- The blpAuthenticate help page has additional examples (@randomee in #252).
- The blpAuthenticate code was updated and improved (Whit in #258 addressing #257).
- The jump in version number was an oversight; this should have been 0.3.7.
And only while typing up these notes do I realize that I fat-fingered the version number. This should have been 0.3.7. Oh well.
Courtesy of CRANberries, there is also a diffstat report for this release. As always, more detailed information is on the Rblpapi page. Questions, comments etc. should go to the issue tickets system at the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
Making it even easier to download and organize stock prices from Yahoo Finance –
I just released a long-due update to package BatchGetSymbols. The files are under review in CRAN and you should get the update soon.
Meanwhile, you can install the new version from Github:
if (!require(devtools)) install.packages('devtools')
devtools::install_github('msperlin/BatchGetSymbols')
The main innovations are:
- Clever cache system: By default, every new download of data is saved in a local file located in a directory chosen by the user. Every new request for data is compared to the available local information; if data is missing, the function only downloads the piece of data that is missing. This makes the call to function BatchGetSymbols a lot faster! When updating an existing dataset of prices, the function only downloads the new available data that is missing from the local files.
- Returns calculation: The function now returns a return vector in df.tickers. Returns are used a lot more than prices in research; no reason why they should be kept out of the output.
- Wide format: Added a function for converting data to the wide format. In some situations, such as portfolio analysis, the wide format makes a lot of sense and is required for some methodologies.
- Ibovespa composition: Added a function for downloading the current Ibovespa composition directly from the Bovespa website.
In the next chunks of code I show some of the innovations:
library(BatchGetSymbols)
## Loading required package: rvest
## Loading required package: xml2
# download SP500 stocks
my.tickers <- GetSP500Stocks()$tickers[1:10] # lets keep it light
# set dates
first.date <- '2016-01-01'
last.date <- '2018-01-01'
# set folder for cache system
my.temp.cache.folder <- 'BGS_CACHE'
# get data and time it
time.nocache <- system.time({
my.l <- BatchGetSymbols(tickers = my.tickers, first.date, last.date,
cache.folder = my.temp.cache.folder, do.cache = FALSE)
})
##
## Running BatchGetSymbols for:
## tickers = MMM, ABT, ABBV, ACN, ATVI, AYI, ADBE, AMD, AAP, AES
## Downloading data for benchmark ticker
## MMM | yahoo (1|10) - You got it!
## ABT | yahoo (2|10) - Nice!
## ABBV | yahoo (3|10) - Looking good!
## ACN | yahoo (4|10) - Got it!
## ATVI | yahoo (5|10) - Good job!
## AYI | yahoo (6|10) - Looking good!
## ADBE | yahoo (7|10) - Fells good!
## AMD | yahoo (8|10) - Good job!
## AAP | yahoo (9|10) - Youre doing good!
## AES | yahoo (10|10) - Well done!
time.withcache <- system.time({
my.l <- BatchGetSymbols(tickers = my.tickers, first.date, last.date,
cache.folder = my.temp.cache.folder, do.cache = TRUE)
})
##
## Running BatchGetSymbols for:
## tickers = MMM, ABT, ABBV, ACN, ATVI, AYI, ADBE, AMD, AAP, AES
## Downloading data for benchmark ticker | Found cache file
## MMM | yahoo (1|10) | Found cache file - You got it!
## ABT | yahoo (2|10) | Found cache file - Mais faceiro que guri de bombacha nova!
## ABBV | yahoo (3|10) | Found cache file - OK!
## ACN | yahoo (4|10) | Found cache file - Looking good!
## ATVI | yahoo (5|10) | Found cache file - Youre doing good!
## AYI | yahoo (6|10) | Found cache file - Good job!
## ADBE | yahoo (7|10) | Found cache file - Boa!
## AMD | yahoo (8|10) | Found cache file - Youre doing good!
## AAP | yahoo (9|10) | Found cache file - Nice!
## AES | yahoo (10|10) | Found cache file - Well done!
cat('\nTime with no cache:', time.nocache['elapsed'])
##
## Time with no cache: 5.721
cat('\nTime with cache:', time.withcache['elapsed'])
##
## Time with cache: 0.419
Now let’s check the default output with data in the long format:
dplyr::glimpse(my.l)
## List of 2
## $ df.control:'data.frame': 10 obs. of 6 variables:
## ..$ ticker : Factor w/ 10 levels "MMM","ABT","ABBV",..: 1 2 3 4 5 6 7 8 9 10
## ..$ src : Factor w/ 1 level "yahoo": 1 1 1 1 1 1 1 1 1 1
## ..$ download.status : Factor w/ 1 level "OK": 1 1 1 1 1 1 1 1 1 1
## ..$ total.obs : int [1:10] 503 503 503 503 503 503 503 503 503 503
## ..$ perc.benchmark.dates: num [1:10] 1 1 1 1 1 1 1 1 1 1
## ..$ threshold.decision : Factor w/ 1 level "KEEP": 1 1 1 1 1 1 1 1 1 1
## $ df.tickers:'data.frame': 5030 obs. of 10 variables:
## ..$ price.open : num [1:5030] 148 147 146 143 141 ...
## ..$ price.high : num [1:5030] 148 148 146 143 142 ...
## ..$ price.low : num [1:5030] 145 146 143 141 140 ...
## ..$ price.close : num [1:5030] 147 147 144 141 140 ...
## ..$ volume : num [1:5030] 3277200 2688100 2997100 3553500 2664000 ...
## ..$ price.adjusted : num [1:5030] 140 140 137 134 134 ...
## ..$ ref.date : Date[1:5030], format: "2016-01-04" ...
## ..$ ticker : chr [1:5030] "MMM" "MMM" "MMM" "MMM" ...
## ..$ ret.adjusted.prices: num [1:5030] NA 0.00436 -0.02014 -0.02436 -0.0034 ...
## ..$ ret.closing.prices : num [1:5030] NA 0.00436 -0.02014 -0.02436 -0.0034 ...
And change the format of the long dataframe to wide:
l.wide <- reshape.wide(my.l$df.tickers)
Now we check the matrix of prices:
print(head(l.wide$price.adjusted))
## ref.date AAP ABBV ABT ACN ADBE AES AMD
## 1 2016-01-04 151.7778 53.13200 40.73167 97.84000 91.97 8.676987 2.77
## 2 2016-01-05 150.7410 52.91066 40.72218 98.34921 92.34 8.796606 2.75
## 3 2016-01-06 146.7531 52.91988 40.38061 98.15706 91.02 8.492956 2.51
## 4 2016-01-07 148.3781 52.76309 39.41285 95.27461 89.11 8.281322 2.28
## 5 2016-01-08 145.1181 51.32436 38.58740 94.35222 87.85 8.400942 2.14
## 6 2016-01-11 146.6036 49.69193 38.64433 95.34187 89.38 8.308928 2.34
## ATVI AYI MMM
## 1 37.08925 231.8706 139.7082
## 2 36.61602 234.8136 140.3171
## 3 36.27096 228.6194 137.4910
## 4 35.75830 222.5544 134.1415
## 5 35.20620 214.2424 133.6848
## 6 35.69915 205.1450 133.6562
and matrix of returns:
print(head(l.wide$ret.adjusted.prices))
## ref.date AAP ABBV ABT ACN
## 1 2016-01-04 NA NA NA NA
## 2 2016-01-05 -0.006831368 -0.004165926 -0.0002329146 0.005204589
## 3 2016-01-06 -0.026455014 0.000174256 -0.0083877625 -0.001953793
## 4 2016-01-07 0.011073327 -0.002962743 -0.0239660045 -0.029365662
## 5 2016-01-08 -0.021971161 -0.027267849 -0.0209438784 -0.009681414
## 6 2016-01-11 0.010236201 -0.031806010 0.0014754559 0.010488932
## ADBE AES AMD ATVI AYI
## 1 NA NA NA NA NA
## 2 0.004022997 0.01378578 -0.007220217 -0.012759168 0.01269226
## 3 -0.014294987 -0.03451900 -0.087272727 -0.009423798 -0.02637922
## 4 -0.020984356 -0.02491877 -0.091633466 -0.014134199 -0.02652876
## 5 -0.014139861 0.01444455 -0.061403509 -0.015439800 -0.03734795
## 6 0.017416039 -0.01095282 0.093457944 0.014001682 -0.04246338
## MMM
## 1 NA
## 2 0.0043588215
## 3 -0.0201408824
## 4 -0.0243618296
## 5 -0.0034048077
## 6 -0.0002134424
In war, truth is the first casualty (Aeschylus)
I am not a pessimistic person. On the contrary, I always try to look at the bright side of life. I also believe that living conditions are better now than years ago, as these plots show. But reducing the complexity of our world to just six graphs is riskily simplistic. Our world is quite far from being a fair place, and one example is the immigration drama.
Last year there were 934 incidents around the world involving people looking for a better life, in which more than 5,300 people lost their lives or went missing, 60% of them in the Mediterranean. Around 8 out of 100 were children.
The Missing Migrants Project tracks deaths of migrants, including refugees and asylum-seekers, who have gone missing along mixed migration routes worldwide. You can find a huge amount of figures, plots and information about this scourge on their website. You can also download a historical dataset there with information on all these fatal journeys, including location, number of dead or missing people and the information source, from 2015 until today.
In this experiment I read the dataset and make some plots using highcharter; you can find a link to the R code at the end of the post.
This is the evolution of the amount of deaths or missing migrants in the process of migration towards an international destination from January 2015 to December 2017:
The Mediterranean is the zone with the most incidents. To see it more clearly, this plot compares Mediterranean with the rest of the world, grouping previous zones:
Is there any pattern in the time series of Mediterranean incidents? To see it, I have done a LOESS decomposition of the time series:
Good news: the trend has been decreasing for the last 12 months. Regarding the seasonal component, incidents increase in April and May. Why? I don’t know.
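A LOESS (STL) decomposition like the one above can be sketched with base R's stl() (a minimal example on a simulated monthly count series; the post itself uses the actual Mediterranean incident counts):

```r
# Simulated monthly counts, Jan 2015 - Dec 2017 (36 months)
set.seed(1)
counts <- ts(rpois(36, lambda = 50) + rep(c(0, 10, 5), each = 12),
             start = c(2015, 1), frequency = 12)

# Seasonal-trend decomposition using LOESS
dec <- stl(counts, s.window = "periodic")
plot(dec)  # panels: data, seasonal, trend, remainder
```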
This is a map of the location of all incidents in 2017. Clicking on markers you will find information about each incident:
Every one of us should try to make our world a better place. I don’t really know how to do it, but I will try to run some experiments this year to show that we have tons of work in front of us. Meanwhile, I hope this experiment is useful in giving visibility to this humanitarian disaster. If someone wants to use the code, the complete project is available on GitHub.
Last week, Marc Cohen from Google Cloud was on campus to give a hands-on workshop on image classification using TensorFlow. Consequently, I spent most of my time thinking about how I can incorporate image classifiers in my work. As my research is primarily on forecasting armed conflict duration, it’s not really straightforward to make a connection between the two. I mean, what are you going to do, analyse portraits of US presidents to see whether you can predict military use of force based on their facial features? Also, I’m sure someone, somewhere has already done that, given this.
For the purposes of this blog post, I went ahead with the second most ridiculous idea that popped into my mind: why don’t I generate images from my research and use them to answer my own research question? This is a somewhat double-edged sword situation; I want to post cool stuff here, but at the same time I’m not looking forward to explaining my supervisor how a bunch of images can predict conflict duration better than my existing models and why it took me three and a half years to figure this out. Academia.
But fret not; if this were a journal article, the abstract would be short and sweet: No. As in, literally. Expect no glory at the end of this post, you can’t predict conflict duration using images. Well, I can’t, anyway. Consider this an exercise in data science with R. We are going to use the keras library, which in turn (amongst others) utilises TensorFlow.
Today’s undertaking is a bit convoluted—no, I’m not setting you up for an eventual neural network joke—we first need to construct an image dataset, and then basically de-construct it into a tensor. Tensors are multidimensional arrays, which may not be an immensely helpful definition if you’re like me and thought all arrays are multidimensional (i.e. scalar > vector > matrix > array). But I digress; I’m sure it’s a case of my maths failing me. How do we go about creating images to train our model? Well, the simplest option I could think of was getting hold of an event-count dataset and extracting density kernels for each conflict. Say, the number of incidents of political violence over time.
I will use the Uppsala Conflict Data Program (UCDP) Geo-Referenced Dataset (GED) for this task. We don’t need the geo-spatial variables but this is one of the datasets I’m most familiar with. The temporal coverage is 1989-2016. We will first filter for state-based violence, one of the three categories:
ucdp <- read.csv("ged171.csv", stringsAsFactors = FALSE)
dim(ucdp)
## [1] 135181 32
#Filter state-based conflicts
ucdp <- ucdp[ucdp$type_of_violence == 1, ]
Instead of using the whole dataset, which has around 135K observations, we will use a much smaller subset consisting of conflict episodes. These will be ‘active streaks’ of violence, meaning they have been going on for more than a calendar year and had at least 25 battle-related deaths. This is important, primarily because we don’t want the model to learn the characteristics of ‘finished’ conflicts and only be able to predict ex-post. What we want instead is to identify patterns present around the time of onset, so that we can make predictions as close to the onset of conflict as possible. We can identify such consecutive occurrences using the data.table library by passing a rleid argument:
#Get active periods
library(dplyr)
library(data.table)
active <- ucdp %>%
group_by(dyad_new_id, year) %>%
dplyr::summarise(active = first(active_year))
dim(active)
## [1] 2091 3
setDT(active)
active <- active[, if (first(active) == 1) .(year = first(year), duration = .N),
by = .(dyad_new_id, cons = rleid(dyad_new_id, active))][, !"cons"]
head(active)
## dyad_new_id year duration
## 1: 406 1990 1
## 2: 406 1993 1
## 3: 406 1996 1
## 4: 411 1989 7
## 5: 411 1997 1
## 6: 411 1999 18
For example, we see that dyad #411 had three conflict episodes in the dataset: first in 1989 that lasted 7 years (including 1989 so active until 1995), a single-year in 1997, and a final one that began in 1999 and was active as of 2016. The newly created duration variable is our outcome; we want to predict duration (of the conflict episode) based on some characteristics in year (of onset). This is why I didn’t want to call our predictions ex-ante; we will still need to wait a year to collect the required information. At this point, we should also decide whether we want to tackle a binary classification, a multiple-classification, or a regression problem:
table(active$duration)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17 18 25
## 209 100 46 33 18 15 11 5 4 4 4 4 3 3 3 1 3 1
## 27
## 2
I don’t think anything other than binary classification is feasible given our n, class distribution, and the quality of the data. And that assumes binary classification is feasible in the first place. Let’s try to predict whether the conflict will go on after its first year and recode the outcome variable:
active$duration <- ifelse(active$duration == 1, 0, 1)
You might be thinking, what’s the value of predicting second-year continuation if we can only make predictions after the first year? Well, as you can see, single-year conflicts make up about half of our cases. Given that the years are calendar years and not full twelve-month periods—as in, if there are 25 battle-related deaths in December, that dyad is active for the whole year—it would be useful to forecast whether the conflict will go on or not. Moving on, let’s split our data into training and test sets with an 80/20 ratio using caret and assign the outcomes:
library(caret)
library(keras)
trainIndex <- createDataPartition(active$duration, p = .8, list = FALSE, times = 1)
dataTrain <- active[ trainIndex, ]
dataTest <- active[-trainIndex, ]
y_train <- dataTrain$duration
y_test <- dataTest$duration
#Using keras library, transform into two categories
y_train <- to_categorical(y_train, 2)
y_test <- to_categorical(y_test, 2)
Before going any further, I want to illustrate what we will be using to train our model:
#Parse dates with lubridate
library(lubridate)
ucdp$date_start <- dmy(ucdp$date_start)
ggplot(aes(x = date_start), data = ucdp[ucdp$dyad_new_id == 406 & ucdp$year == 1990, ]) +
geom_density(fill = "black") +
theme_void()
Just the (shadow?) event count density during the onset year. No annotations, no axes. Only a black and white square image (more on that in a bit). The challenge is whether the peaks and curves and all contain enough information to differentiate single-year conflicts from their multi-year counterparts. Now, we will create these plots programmatically:
for (i in 1:nrow(active)) {
p <- ggplot(aes(x = date_start), data = ucdp[ucdp$dyad_new_id == active$dyad_new_id[i] & ucdp$year == active$year[i], ]) +
geom_density(fill = "black") + theme_void()
ggsave(p, file = paste0(paste(active$dyad_new_id[i], active$year[i], active$duration[i], sep = "_"), ".png"),
width = 1, height = 1, units = "in", path = "dens")
}
Running the above chunk will create a folder called ‘dens’ in your current working directory and populate it with 469 plots. The naming convention is dyad.id_onset.year_duration.png. The size is set to 1 x 1 inches, which is a lot (matrix multiplication, people). You should be able to call image_resize_array via keras, however that didn’t work for me so I resorted to Photoshop. You can record key strokes in Photoshop and process a whole folder full of images just like ours. So I resized all plots to 28 x 28 pixels and converted them to greyscale. The latter saved us three dimensions: three RGB channels plus one alpha channel are reduced to a single grey channel. The whole process took around 45 seconds on my machine; however, YMMV. Our mini attempt at creating modern art using R will look like this:
Or this, if we used geom_bar instead:
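As an aside, the Photoshop resizing step could likely be scripted in R with the magick package instead (a sketch, assuming magick and its ImageMagick backend are installed; the folder path mirrors the 'dens' directory created above):

```r
library(magick)

files <- list.files("dens", pattern = "\\.png$", full.names = TRUE)
for (f in files) {
  img <- image_read(f)
  img <- image_resize(img, "28x28!")               # force exact 28 x 28 pixels
  img <- image_convert(img, colorspace = "gray")   # collapse RGB + alpha to one grey channel
  image_write(img, f)
}
```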
Okay, now we have to get the images into a dataframe so that we have their numerical representation. We can just reverse-engineer our plot-saving solution:
library(png)
x <- list()
for (i in 1:nrow(active)) {
t <- readPNG(paste0("hist/", paste(active$dyad_new_id[i], active$year[i], active$duration[i], sep = "_"), ".png"))
t <- t[,,1]
x[[i]] <- (t)
}
We first create an empty list outside the loop. Then, similar to the first loop, we go through every element of our active episodes dataframe and read in the .png files using the readPNG function of the png library. It would have been easier to just construct the names with a counter such as seq() earlier, but I wanted to be able to verify quickly whether the loop worked or not.
#Read images from the loop list
load("x.RData")
images <- image_to_array(x)
dim(images)
## [1] 469 28 28
#Reshape array into nrow times 784
images <- array_reshape(images, c(nrow(images), 784))
dim(images)
## [1] 469 784
#Split train/test
x_train <- images[trainIndex, ]
x_test <- images[-trainIndex, ]
We have finally reached the fun part, Batman. I’m not the most NN-savvy person around, so I will not pretend and try to lecture you. If you are a beginner, RStudio has a pretty neat guide and a cheatsheet to get you started. In a nutshell, we initialise our model by calling keras_model_sequential(), and construct the network structure by specifying layers in order. The first layer must feature the input_shape argument, which in turn must match the dimensions of our array. We also need to specify activation functions, of which there are about ten to choose from. The units represent the dimensionality of each layer’s output space. The dropout layers in between minimise the risk of overfitting by excluding the specified fraction of units from training so that the network does not co-adapt too much. We don’t use them here, but you can also add regularisation, pooling, and convolution layers that go from 1d to 3d. When you have the default arguments filled, Keras automatically connects the layers:
model <- keras_model_sequential()
model %>%
layer_dense(units = 1024, activation = "sigmoid", input_shape = c(784)) %>%
layer_dropout(rate = .4) %>%
layer_dense(units = 512, activation = "sigmoid") %>%
layer_dropout(rate = .3) %>%
layer_dense(units = 256, activation = "sigmoid") %>%
layer_dropout(rate = .2) %>%
layer_dense(units = 2, activation = "softmax")
We can get the current structure of our model by calling summary(model):
summary(model)
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense_1 (Dense) (None, 1024) 803840
## ___________________________________________________________________________
## dropout_1 (Dropout) (None, 1024) 0
## ___________________________________________________________________________
## dense_2 (Dense) (None, 512) 524800
## ___________________________________________________________________________
## dropout_2 (Dropout) (None, 512) 0
## ___________________________________________________________________________
## dense_3 (Dense) (None, 256) 131328
## ___________________________________________________________________________
## dropout_3 (Dropout) (None, 256) 0
## ___________________________________________________________________________
## dense_4 (Dense) (None, 2) 514
## ===========================================================================
## Total params: 1,460,482
## Trainable params: 1,460,482
## Non-trainable params: 0
## ___________________________________________________________________________
Seven lines of code equals nearly 1.5M parameters. Whoa. To be honest, we don’t need three layers here at all, but because our n is so small, we might as well try our luck. Before running the model, you should also supply an optimiser, a loss function to minimise, and a metric to quantify performance (accuracy):
model %>% compile(
optimizer = optimizer_adamax(),
loss = loss_binary_crossentropy,
metrics = metric_binary_accuracy)
As with activation functions, there are several options for each of the above; refer to the RStudio guide cited earlier to get a sense of what’s what. One should be able to set the seed for Keras in R with use_session_with_seed(seed), as I have at the beginning of this post; however, I can definitely tell you that it does not consistently work (also see issue#42, issue#119, and issue#120). So no promise of perfect reproducibility. Running the below will result in 100 runs over the whole training data, using 256 samples simultaneously in each iteration, with a 70/30 train/validation split for in-sample validation:
history <- model %>% fit(
x_train, y_train,
epochs = 100, batch_size = 256,
validation_split = .3)
#Default is ggplot so we can tweak it easily
plot(history) + theme(legend.position = c(.9, .9))
What’s up with the gatekeeping jargon? If you run the above chunk live, you’ll find that (in RStudio at least) you get a nice plot that automatically updates itself at the end of each epoch. I can tell you anecdotally that cheering for your neural network adds around ±3% accuracy on average. Finally, we can evaluate our model using the test data and extract out-of-sample predictions:
model %>% evaluate(x_test, y_test)
## $loss
## [1] 0.6687286
##
## $binary_accuracy
## [1] 0.6021505
model %>% predict_classes(x_test)
## [1] 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1
Apparently, a bunch of pixelated greyscale images can predict with 60% accuracy whether a conflict will be active next year. Note that 60% is nothing if you are used to the MNIST data, on which anything can get 99% accuracy without breaking a sweat. However, the more social-sciency the issues you deal with, the lower the accuracy. Up until a couple of years ago, the best predictive work in conflict research had around ~67% accuracy. With that said, we see that the model more or less predicted 1s across the board, so it could be that we just got a lazy model that looks a bit more organic than it actually is. I would have liked to finish on a ‘more research is needed’ note, but probably not.
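One quick sanity check for a suspected lazy model is to compare its accuracy against the majority-class baseline and inspect the confusion matrix (a sketch reusing the objects built earlier in this post):

```r
# Out-of-sample predictions vs. true labels
preds <- model %>% predict_classes(x_test)
truth <- dataTest$duration

# Confusion matrix: how often are we simply predicting 1?
table(predicted = preds, actual = truth)

# Majority-class baseline: the accuracy of always predicting the modal class;
# the model is only useful if it beats this number
max(prop.table(table(truth)))
```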
One of the great features of R is the possibility to quickly access web-services. While some companies have the habit and policy of documenting their APIs, there is still a large chunk of undocumented but great web-services that help the regular data scientist.
In the following short post, I will show how we can turn a simple web-service into a nice R function.
The example I am going to use is the translation service from Linguee’s creators: DeepL.
Just like Google Translate, DeepL features a simple text field. When a user types in text, the translation appears in a second text box. Users can choose between languages.
In order to see how the service works in the backend, let’s have a quick look at the network traffic.
For that we open the browser’s developer tools and jump to the network tab. Next, we type in a sentence and see which requests (XHR) are made. The interface repeatedly sends JSON requests to the following endpoint: “https://www.deepl.com/jsonrpc”.
Looking at a single request, we can quickly identify the parameters that we typed in (grey area, in the lower right corner). We copy these into R and assign them to a variable.
Using a service to format the JSON (e.g. https://jsonformatter.curiousconcept.com/), we can turn the blob into a well-readable JSON file. Next, we convert the JSON string into an R object (a nested list) by using a simple JSON-to-R language translation:
Finally, we evaluate the string as R code; this gives us the DeepL web-service’s parameters as a nested R list.
All we have to do now is wrap the parameters in an R function and use variables to change the important ones:
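The post's original code is not reproduced here, but a wrapper along these lines could be built with httr and jsonlite (a hedged sketch: the jsonrpc body below is an assumption based on the inspected network traffic, and the endpoint is undocumented and may change without notice):

```r
library(httr)
library(jsonlite)

# Hypothetical wrapper around the undocumented DeepL jsonrpc endpoint;
# the method name and parameter structure are assumptions from the XHR traffic
deepl_translate <- function(text, from = "EN", to = "DE") {
  body <- list(
    jsonrpc = "2.0",
    method  = "LMT_handle_jobs",
    params  = list(
      jobs = list(list(kind = "default", raw_en_sentence = text)),
      lang = list(source_lang_user_selected = from,
                  target_lang = to)
    )
  )
  res <- POST("https://www.deepl.com/jsonrpc",
              body = toJSON(body, auto_unbox = TRUE),
              content_type_json())
  content(res, as = "parsed")
}
```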
I hope the post helps you turn more web-services into R-functions/packages.
If you are looking for other translation services have a look at the translate or translateR packages.