The Russell 2000 Small-Cap Index, ticker symbol: ^RUT, is the hottest index of 2016 with YTD gains of over 18%. The index components are interesting not only because of recent performance, but because the top performers either grow to become mid-cap stocks or are bought by large-cap companies at premium prices. This means selecting the best components can result in large gains. In this post, I’ll perform a quantitative stock analysis on the entire list of Russell 2000 stock components using the R programming language. Building on the methodology from my S&P500 Analysis Post, I develop screening and ranking metrics to identify the top stocks with amazing growth and the most consistency. I use R for the analysis, including the rvest library for web scraping the list of Russell 2000 stocks, quantmod to collect historical prices for all 2000+ stock components, purrr to map modeling functions, and various other tidyverse libraries such as tidyr to visualize and manage the data workflow. Last, I use plotly to create an interactive visualization used in the screening process. Whether you are familiar with quantitative stock analysis, just beginning, or just interested in the R programming language, you’ll gain both knowledge of data science in R and immediate insights into the best Russell 2000 stocks, quantitatively selected for future returns!
In part 1 of the analysis, we screen the entire stock list using a reward-to-risk metric. Here’s a sneak peek at the plotly interactive visualization, which aids in screening the stocks. The best stocks from the algorithm are those with the highest reward.metric. The color and size vary with the value of the reward.metric. You can pan, zoom in, and hover over the points to gain information about the stocks.
In part 2 of the analysis, we review the top 15 stocks from part 1, developing a new, growth-to-consistency metric, to programmatically select the best of the best. Here’s a sneak peek at the top six stocks from the Russell 2000 index, with performance adjusted to remove stock splits.
Table of Contents
- Brief Overview
- Russell 2K Analysis: Part 1
- Russell 2K Analysis: Part 2
- Questions About the Analysis
- Download the .R File
- Further Reading
The S&P500 Analysis Post covered the fundamentals of quantitative stock analysis. I’ll spare you the details, but if interested I strongly recommend going through that post to get up to speed. The methodology leverages the fact that stock returns are approximately normally distributed and uncorrelated. Because of this, we can model the behavior of stock prices within a confidence interval using the mean and standard deviation of the stock returns. The general process is to collect the historical stock prices, calculate the daily log returns (we use log returns for structural reasons), then calculate the mean and standard deviation of the log returns. The mean characterizes the growth rate (reward) and the standard deviation characterizes the volatility (risk).
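As a small worked example of the process just described, here is a minimal sketch in base R. The prices are made up for illustration; only the steps (log returns, then mean and standard deviation) come from the text.

```r
# Hypothetical closing prices, for illustration only
prices <- c(100, 101, 99.5, 102, 103.5)

# Daily log return: log(P_t / P_{t-1})
log_returns <- diff(log(prices))

mu    <- mean(log_returns)  # characterizes the growth rate (reward)
sigma <- sd(log_returns)    # characterizes the volatility (risk)

# Log returns are additive, so their sum recovers the total growth factor
total_growth <- exp(sum(log_returns))  # last price divided by first price
```

The additivity shown in the last line is one of the structural conveniences of log returns: summing them over a period gives the log of the overall price ratio.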
In this post, we build on where we left off in the S&P500 Analysis Post, this time taking the analysis to a new level using a new stock index: the Russell 2000 Small Cap Index. The Russell 2000 index is a perfect candidate because it’s 4X the size of the S&P500 index, it contains only small-cap stocks (median market cap of $528M), and it’s not as well known, meaning it’s full of hidden gems and takeover targets. Plus, it’s up over 18% this year!
In part 1 of this analysis, we analyze the Russell 2000 stock list, homing in on the relationship between the mean and standard deviation of daily log returns. From there, we develop a reward-to-risk metric based on how the market tends to treat stocks. The end result is a plotly interactive graph that enables visualizing the attributes of the best and worst stocks.
In part 2, we switch focus to the top 15 stocks from part 1, this time evaluating how consistently each stock performs. We develop a new metric, growth-to-consistency, which enables programmatically selecting the best stocks. We end by selecting the top 6 stocks with the unique combination of amazing growth, low volatility, and consistent returns.
The full code for the tutorial can be downloaded as a .R file here.
For those following along in R, you’ll need to load the following packages:
If you don’t have these installed, run install.packages(pkg_names) with the package names as a character vector (pkg_names <- c("rvest", "quantmod", ...)). I also recommend the open-source RStudio IDE, which makes R programming easy and efficient.
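A minimal setup sketch, assuming the package list is the set named in this post (the full list may differ). It only reports what is missing and leaves installation to you.

```r
# Packages named in this post; treat the list as an assumption, not the full set
pkg_names <- c("rvest", "quantmod", "purrr", "tidyr", "ggplot2", "plotly")

# Find which of them are not yet installed
missing_pkgs <- setdiff(pkg_names, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install
}
```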
In part 1, the goal is to gain an overall understanding of the Russell 2000 Index. We’ll perform the following:
- Get the Russell 2K Stocks: Web Scraping with rvest
- Get Historical Prices and Log Returns: Function Mapping with quantmod and purrr
- Visualize the Relationship between Std Dev and Mean
- Develop a Screening Metric: Reward-to-Risk Metric
- Visually Screen with Plotly
It turns out that it is rather difficult to find the list of Russell 2000 stocks. The best website I found was www.marketvolume.com. The list is spread across tables on nine HTML pages, each containing roughly 250 stock components. We’ll collect the components using the
rvest package. To start, we get the base path and row numbers for each of the nine webpages.
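A minimal sketch of this step. The base path below is a placeholder, not the real address, and the step of 250 rows per page is an assumption based on the page sizes described above.

```r
# Placeholder only: substitute the real base path ending in "row="
base_path <- "http://example.com/r2000_components?row="

# Nine pages of roughly 250 rows each, starting at row = 0 (assumed step size)
row_num <- seq(from = 0, by = 250, length.out = 9)

# One URL per page
page_urls <- paste0(base_path, row_num)
```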
Next, we create a function that we can map() using the purrr package. The function, get_stocklist(), takes the base_path and the row_num and, using rvest functions, produces a table of stocks.
As an example, we can apply get_stocklist() to the first page of nine, which is “row=0” in the html path.
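Here is a hedged sketch of what such a function might look like. It assumes the stock list is the first table on each page, which may not hold for the real site. The column names in the demonstration are hypothetical, and the demonstration parses an inline HTML snippet so it runs without network access.

```r
library(rvest)

# Hypothetical scraper: reads one page and returns its first HTML table
get_stocklist <- function(base_path, row_num) {
  page <- read_html(paste0(base_path, row_num))
  page %>% html_element("table") %>% html_table()
}

# Offline demonstration of the same parsing step on an inline HTML snippet
demo_page <- minimal_html('
  <table>
    <tr><th>Symbol</th><th>Company</th></tr>
    <tr><td>FLWS</td><td>1-800 FLOWERS.COM</td></tr>
    <tr><td>MMS</td><td>MAXIMUS</td></tr>
  </table>')

demo_table <- demo_page %>% html_element("table") %>% html_table()
```

On a real page the selector usually needs to be more specific than "table", since layout tables often come first.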
Finally, we create a data frame of row numbers using the row_num vector. Using the purrr::map() function, we iterate the get_stocklist() function across each of the row numbers. The result is a nested data frame with two levels. Using tidyr::unnest(), we get the full list of Russell 2000 stocks on one level. The rest of the pipe (%>%) operations after unnest() just tidy the data.
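The nest-then-unnest pattern can be sketched as follows. To keep the example runnable offline, a hypothetical fake_stocklist() stands in for get_stocklist() and returns made-up rows.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Stand-in for get_stocklist(): returns a small fake table per page (hypothetical data)
fake_stocklist <- function(row_num) {
  tibble(symbol  = paste0("SYM", row_num + 1:2),
         company = paste("Company", row_num + 1:2))
}

stocklist_demo <- tibble(row_num = c(0, 250)) %>%
  mutate(stock_table = map(row_num, fake_stocklist)) %>%  # nested: one table per page
  unnest(stock_table) %>%                                 # flatten to one level
  select(symbol, company)                                 # tidy up
```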
The end result is a data frame of the Russell 2000 stocks:
Now that we have a list of the Russell 2000 stocks, we can collect some information. We need:
Historical Stock Prices: The daily stock prices are used to calculate the daily log returns. The function quantmod::getSymbols() returns the stock prices. We use a wrapper function, get_stock_prices(), to return the stock prices as a data frame in a consistent format needed for the mapping process.
Daily Log Returns: Log returns are the basis for quantitative stock analysis, which enables statistical prediction of future stock prices using the mean and standard deviation of the log returns. The mean drives the growth rate, and the standard deviation drives the stock volatility. The function quantmod::periodReturn() returns the logarithmic daily returns by setting period = "daily" and type = "log". We use a wrapper function, get_log_returns(), to return the log returns from the historical stock prices as a data frame in a consistent format needed for the mapping process.
The code for the wrapper functions is provided below:
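A minimal sketch of what such wrappers might look like. The argument names, the output column names, and the assumption that getSymbols() returns six columns in Open/High/Low/Close/Volume/Adjusted order are mine, so adjust them to your data source. The price wrapper needs network access and is only defined here, while the returns wrapper is checked on made-up prices.

```r
library(quantmod)
library(tibble)

# Hypothetical wrapper: download prices (needs network access) and return a tibble
get_stock_prices <- function(ticker, ...) {
  prices_xts <- getSymbols(Symbols = ticker, auto.assign = FALSE, ...)
  # Assumes six columns in this order
  colnames(prices_xts) <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")
  as_tibble(data.frame(Date = index(prices_xts), coredata(prices_xts)))
}

# Hypothetical wrapper: daily log returns computed from the Adjusted column
get_log_returns <- function(x, period = "daily", ...) {
  adj_xts     <- xts(x$Adjusted, order.by = x$Date)
  returns_xts <- periodReturn(adj_xts, period = period, type = "log", ...)
  tibble(Date = index(returns_xts), Log.Returns = as.numeric(returns_xts))
}

# Offline check on made-up prices
demo_prices  <- tibble(Date     = as.Date("2016-01-04") + 0:4,
                       Adjusted = c(10, 10.2, 10.1, 10.4, 10.5))
demo_returns <- get_log_returns(demo_prices)
```

Returning plain data frames from both wrappers keeps the later map() steps uniform, since every list element then has the same shape.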
An example usage of get_stock_prices():
And, an example usage of get_log_returns():
We can now get the mean() and sd() of the log returns. We can also count the number of trade days.
As we’ll see in the next section, these features are important to the risk-reward trade-off. For now, we need to collect these values for the list of stocks. We do this using purrr::map() to apply functions to lists stored inside data frames. The next code chunk is the most complex of the post. Basically, we use the functions created previously to iteratively download the stock prices and compute the log returns. I added the proc.time() functions to time the code. It will take about 15 minutes to run.
Warning: The following script stores the stock prices and log returns for the entire list of 2000+ Russell 2000 stock components. It takes my laptop about 15 minutes to run the script.
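The mapping pattern can be sketched offline as follows. A hypothetical simulate_log_returns() replaces the download step so the example runs in well under a second. The list-column layout and the summary column names follow the text, and everything else is an assumption.

```r
library(dplyr)
library(purrr)

set.seed(1)

# Stand-in for the download step: simulated log returns (hypothetical data)
simulate_log_returns <- function(symbol, n = 250) {
  tibble(Date        = as.Date("2016-01-01") + seq_len(n),
         Log.Returns = rnorm(n, mean = 0.0005, sd = 0.02))
}

start <- proc.time()

stocklist_demo <- tibble(symbol = c("AAA", "BBB", "CCC")) %>%
  mutate(
    log.returns      = map(symbol, simulate_log_returns),            # list-column
    mean.log.returns = map_dbl(log.returns, ~ mean(.x$Log.Returns)),
    sd.log.returns   = map_dbl(log.returns, ~ sd(.x$Log.Returns)),
    n.trade.days     = map_dbl(log.returns, nrow)
  )

elapsed <- proc.time() - start  # timing, as done with proc.time() in the post

# Nested data frames are accessed like list elements
first_returns <- stocklist_demo$log.returns[[1]]
```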
And, a peek at the contents of stocklist? It’s the historical stock prices and log returns for every stock in the Russell 2000 index. The stock prices and log returns are stored as nested lists inside the top-level data frame. We can access them like a list. Here are the stock prices for the first observation in the list, 1-800 FLOWERS.COM, ticker symbol FLWS:
Now that we have the mean daily log returns (MDLR) and the standard deviation of daily log returns (SDDLR), we can start to visualize the data. The next plot shows an important trend: the relationship between SDDLR and MDLR. We first filter out stocks with the number of trade days (n.trade.days) less than 2494 so each stock retained has the same large number of samples. Each year has approximately 250 trade days, so this filters out stocks with less than ten years of data to trend. Next, we limit stocks to those with SDDLR below 0.075. This allows us to zoom in on the vast majority of stocks. Plotting the trend using ggplot2 shows an interesting phenomenon: stocks with a high SDDLR tend to perform worse than those with a low SDDLR.
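A sketch of the filter-and-plot step on made-up summary statistics. The negative relationship is built into the fake data, so the plot illustrates the mechanics rather than the finding. The linear smoother is also an assumption.

```r
library(dplyr)
library(ggplot2)

set.seed(42)

# Made-up summary statistics standing in for the real stocklist
demo <- tibble(
  sd.log.returns   = runif(200, 0.01, 0.10),
  mean.log.returns = 0.001 - 0.02 * sd.log.returns + rnorm(200, sd = 0.0005),
  n.trade.days     = sample(c(1200, 2494), 200, replace = TRUE)
)

screened <- demo %>%
  filter(n.trade.days >= 2494,   # keep stocks with about ten years of data
         sd.log.returns < 0.075) # zoom in on the bulk of the stocks

trend_plot <- ggplot(screened, aes(x = sd.log.returns, y = mean.log.returns)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(x = "SDDLR", y = "MDLR")
```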
The important point is that, while volatile stocks may have one or two good years, over the long haul the less volatile stocks are where you want to put your money. We can develop a screening metric using this rationale.
The screening metric we will use is a reward-to-risk metric. We want to reward stocks with high MDLR (growth rate). We want to penalize stocks with high SDDLR (volatility), since these stocks tend to perform worse over time. The ratio is multiplied by a constant, 2500, to yield values generally in the range of -100 to 100. The equation then becomes R = 2500 * mu / sigma, where:
- R is the reward-to-risk metric
- mu is the MDLR
- sigma is the SDDLR
Now we can add the reward-to-risk metric (reward.metric) to our data frame. We remove stocks with less than ten years of trading data, then add the reward-to-risk metric.
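A short worked example, assuming the metric is the ratio of MDLR to SDDLR scaled by 2500, as the description suggests. The three tickers and their statistics are made up.

```r
library(dplyr)

# Made-up summary statistics for three hypothetical tickers
demo <- tibble(
  symbol           = c("AAA", "BBB", "CCC"),
  mean.log.returns = c(0.0008, 0.0004, -0.0002),
  sd.log.returns   = c(0.020, 0.040, 0.050),
  n.trade.days     = c(2494, 2494, 1000)
)

screened <- demo %>%
  filter(n.trade.days >= 2494) %>%  # drop stocks with under ten years of data
  # Assumed form: R = 2500 * mu / sigma
  mutate(reward.metric = 2500 * mean.log.returns / sd.log.returns)
```

AAA and BBB differ by a factor of two in both growth and volatility, so AAA scores four times higher (100 versus 25), while CCC is removed for lack of history.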
Let’s use the reward.metric to generate a visualization we can use for screening the stocks and understanding the index. Similar to the S&P500 post, we generate an interactive visualization using plotly. However, this time we use the reward.metric to drive the color and size of the markers, which enables us to visually see which stocks are scoring high on reward-to-risk. The best stocks have a green color, and the worst stocks have a brown color. We can pan, zoom, and hover over stocks to gain additional insights.
The code chunk to generate this visualization:
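A minimal sketch of such a chart on made-up data. The "BrBG" palette (brown to green), the hover text, and the axis titles are assumptions chosen to match the description, and the reward metric uses the assumed ratio form.

```r
library(plotly)

set.seed(7)

# Made-up data for 50 hypothetical tickers
demo <- data.frame(
  symbol           = paste0("SYM", 1:50),
  sd.log.returns   = runif(50, 0.01, 0.06),
  mean.log.returns = rnorm(50, 0.0004, 0.0004)
)
demo$reward.metric <- 2500 * demo$mean.log.returns / demo$sd.log.returns

screen_plot <- plot_ly(
  data = demo,
  x = ~sd.log.returns, y = ~mean.log.returns,
  type = "scatter", mode = "markers",
  color = ~reward.metric, colors = "BrBG",  # brown = low, green = high
  size  = ~reward.metric,                   # larger markers score higher
  text  = ~paste0(symbol, "<br>Reward metric: ", round(reward.metric, 1)),
  hoverinfo = "text"
) %>%
  layout(xaxis = list(title = "SDDLR"), yaxis = list(title = "MDLR"))
```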
The end goal is to find the best stocks, and we don’t want to simply trust the reward-to-risk metric. Rather, we want to review the characteristics of the top stocks so we can select those with the most consistent growth. In this section, we perform the following:
- Visualize Top 15 Stocks to Understand Consistent Growth
- Compute the Three Attributes of High Performing Stocks
- Develop a Ranking Metric: Growth-to-Consistency
- Visualize Performance of Top Six Stocks
We begin by filtering the stocklist, first ranking by the reward.metric then selecting the top 15.
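One way to sketch the rank-then-filter step, using made-up scores. The use of min_rank() is my choice, and other ranking helpers would work equally well.

```r
library(dplyr)

set.seed(3)

# Made-up reward metrics for 40 hypothetical tickers
demo <- tibble(symbol        = paste0("SYM", 1:40),
               reward.metric = rnorm(40, mean = 20, sd = 30))

top_15 <- demo %>%
  mutate(rank = min_rank(desc(reward.metric))) %>%  # 1 = highest metric
  filter(rank <= 15) %>%
  arrange(rank)
```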
Next, we create a means_by_year() function to take a data frame of log.returns and return a data frame of MDLRs by year. We then map the means_by_year() function to iterate over the full data frame of log returns. Last, we unnest() the high performers to get a one-level data frame. Voila, we have mean.log.returns by year for each stock.
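A hedged sketch of this step on simulated returns. The column names Date and Log.Returns, and the helper make_returns(), are assumptions for illustration.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Mean daily log return per calendar year for one stock
means_by_year <- function(log_returns) {
  log_returns %>%
    mutate(year = as.integer(format(Date, "%Y"))) %>%
    group_by(year) %>%
    summarise(mean.log.returns = mean(Log.Returns), .groups = "drop")
}

set.seed(11)

# Hypothetical stand-in for real log returns: two years of simulated data
make_returns <- function(symbol) {
  dates <- seq(as.Date("2014-01-01"), as.Date("2015-12-31"), by = "day")
  tibble(Date = dates, Log.Returns = rnorm(length(dates), 0.0005, 0.02))
}

hp_demo <- tibble(symbol = c("AAA", "BBB")) %>%
  mutate(log.returns   = map(symbol, make_returns),
         means.by.year = map(log.returns, means_by_year))

# One row per stock and year
hp_unnested <- hp_demo %>%
  select(symbol, means.by.year) %>%
  unnest(means.by.year)
```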
Finally, we can visualize the results in ggplot2 using a facet plot.
When reviewing the facet plot, we want to select stocks with the following attributes that the market tends to reward:
Above zero MDLR: Every time a stock’s MDLR drops below zero, the stock loses money for that year. All of the stocks drop below zero at least once. Those that drop below zero multiple times become bad investments in those respective years. We want good investments over the long haul, or in other words stocks with consistent, above-zero MDLR.
Flat or Upward Growth Trends: Remember, we are viewing MDLR, which is the growth rate. A flat trend means the stock is consistently growing. An upward trend means the stock’s growth rate is accelerating. A downward trend means the stock’s growth rate is slowing. We want flat or upward growth.
Low Standard Deviation of MDLR by Year: Again, the market loves consistency. Less volatility makes for a more profitable investment.
We have two options now:
- We can manually review each chart to decide which stocks we want to invest in, or
- We can develop a method to programmatically rank the stocks.
Always opt for programmatic review! Programmatic review is less prone to errors and can be applied to a much larger set of stocks (while 15 stocks may be easy to review by hand, 1,500 would be very difficult).
For the programmatic review, we need to compute the three desired attributes of high performing stocks:
- The number of times the stock’s MDLR by year drops below zero (bad)
- The slope of the trend line of MDLR by year
- The standard deviation of MDLR by year
Attribute 1: Number of Times MDLR by Year Drops Below Zero
First, the number of times the stock drops below zero. We create a function means_below_zero() that takes a data frame of means.by.year for one stock and returns the number of MDLRs by year that are less than zero. We then map the function using map_dbl(). Note that we use the map_dbl() version of map() because map() returns a list and map_dbl() returns a number. We want a number, not a list with the number in it.
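The difference between map() and map_dbl() can be seen in a small sketch. The yearly means below are made up, and the function body is one plausible implementation rather than the original.

```r
library(purrr)

# Count the years in which the mean daily log return is negative
means_below_zero <- function(means_by_year) {
  sum(means_by_year$mean.log.returns < 0)
}

# Made-up yearly means for two hypothetical stocks
demo_means <- list(
  data.frame(year = 2013:2016,
             mean.log.returns = c(0.001, -0.0002, 0.0005, 0.0003)),
  data.frame(year = 2013:2016,
             mean.log.returns = c(-0.001, -0.0002, 0.0005, 0.0003))
)

as_list   <- map(demo_means, means_below_zero)      # a list holding one count each
as_vector <- map_dbl(demo_means, means_below_zero)  # a plain numeric vector
```

Inside mutate(), the numeric vector from map_dbl() becomes an ordinary column that can be sorted and used in arithmetic, which is why it is preferred here.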
Attribute 2: Slope of MDLR by Year
Next, we need to get the slope of the linear trend. The method we use is slightly more complex because we need to get the second coefficient of the linear model, but it is extremely powerful because we are applying models to data frames. We’ll follow the process outlined in R for Data Science, Chapter 25: Many Models.
We create a means_by_year_model() function to apply a linear model to a single stock. The function takes a data frame of means.by.year for the stock, and returns the model from the lm() function.
Let’s test it out on the first stock, MAXIMUS, ticker symbol MMS:
We are interested in the coefficient for year, so let’s make one more function to extract the slope coefficient.
Again, let’s test it out on MMS to validate the workflow:
Now we are ready to apply the modeling and slope functions to the data frame:
Great! We now have the linear models and the slope of the linear trend line.
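The model-then-extract workflow can be sketched as follows. The helper name get_slope() and the two made-up stocks are assumptions, and the pattern follows the many-models approach of storing fitted models in a list-column.

```r
library(dplyr)
library(purrr)

# Fit a linear trend of yearly MDLR against year for one stock
means_by_year_model <- function(means_by_year) {
  lm(mean.log.returns ~ year, data = means_by_year)
}

# Hypothetical helper: the second coefficient is the slope on year
get_slope <- function(model) unname(coef(model)[2])

demo <- tibble(symbol = c("UP", "FLAT")) %>%
  mutate(
    means.by.year = list(
      tibble(year = 2012:2016,
             mean.log.returns = c(0.0002, 0.0004, 0.0006, 0.0008, 0.0010)),
      tibble(year = 2012:2016,
             mean.log.returns = c(0.0005, 0.0004, 0.0005, 0.0004, 0.0005))
    ),
    model = map(means.by.year, means_by_year_model),  # list-column of lm objects
    slope = map_dbl(model, get_slope)                 # numeric column
  )
```

The first stock's growth rate rises by 0.0002 per year, while the second oscillates around a constant, so its fitted slope is zero.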
Attribute 3: Standard deviation of MDLR by year
Finally, to drive home consistency, we need the standard deviation of the MDLR by year. To do this we create a sd_of_means_by_year() function that simply computes the sd() of the MDLR by year. The function is then mapped over the data frame using map_dbl(), which returns the numeric value.
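A minimal sketch with two made-up stocks, one steady and one choppy. The column name mean.log.returns is assumed to match the yearly data frame built earlier.

```r
library(purrr)

# Standard deviation of the yearly mean daily log returns for one stock
sd_of_means_by_year <- function(means_by_year) {
  sd(means_by_year$mean.log.returns)
}

# Made-up yearly means for two hypothetical stocks
demo_means <- list(
  steady = data.frame(year = 2013:2016,
                      mean.log.returns = c(0.0005, 0.0006, 0.0005, 0.0006)),
  choppy = data.frame(year = 2013:2016,
                      mean.log.returns = c(0.002, -0.001, 0.003, -0.002))
)

sd.of.means.by.year <- map_dbl(demo_means, sd_of_means_by_year)
```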
To assist in the final ranking process we’ll use a growth-to-consistency ranking metric, one that incorporates the number of times the MDLR by year drops below zero, the slope of the linear trend line, and sd.of.means.by.year for MDLR by year. We develop the following measure that rewards stocks with positive growth rate and that penalizes stocks with high volatility year-to-year and multiple years of negative returns.
- m is the slope of the linear trend line
- n is the number of times the MDLR goes below zero
- s is the standard deviation of the MDLR by year
Now we add the new, growth-to-consistency metric (growth.metric) to our data frame, and view the results.
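A worked sketch under a stated assumption: the formula used here, G = m / ((n + 1) * s), is one plausible form consistent with the symbol definitions above, and the "+ 1" is my addition to avoid dividing by zero when a stock never has a negative year. The input values are made up.

```r
library(dplyr)

# Made-up attributes for three hypothetical tickers
demo <- tibble(
  symbol              = c("AAA", "BBB", "CCC"),
  slope               = c(0.0002, 0.0002, -0.0001),  # m
  means.below.zero    = c(1, 3, 2),                  # n
  sd.of.means.by.year = c(0.0005, 0.0010, 0.0008)    # s
)

ranked <- demo %>%
  # Assumed form: G = m / ((n + 1) * s)
  mutate(growth.metric = slope / ((means.below.zero + 1) * sd.of.means.by.year)) %>%
  arrange(desc(growth.metric))
```

AAA and BBB share the same slope, but BBB has more negative years and more year-to-year variation, so it ranks lower. CCC ranks last because its growth rate is declining.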
Finally, we are ready to visualize the performance of the top six stocks. The code chunk below ranks high performance stocks by the growth-to-consistency metric (growth.metric), then filters to the top six. From that point, the stock.prices are selected and unnested to return the historical stock prices for each symbol in the top six performers. Last, a facet plot is made using the historical stock prices adjusted for stock splits. These are the top performers!
Should you simply invest in the stocks screened? Why or why not? (Hint: What else about the stocks and/or companies should you investigate?)
Are there other factors not considered in the metrics that should be included? (Hint: What characteristics do some of the top 15 performers have that are not included in the final metric, G?)
Can you combine the two metrics, R and G, into one metric that can be computed for the entire data frame of all Russell 2000 stocks?
The full code for the tutorial can be downloaded as a .R file here. The code will take approximately 15 minutes to run.
As shown in the previous S&P500 Analysis Post, quantitative analysis is a powerful tool. By applying data science techniques using R programming and the various packages (e.g. purrr, etc.), we can evaluate massive data sets and quickly screen the stocks using reward-to-risk and growth-to-consistency metrics. However, a word of caution before jumping into any investments: Selecting investments on statistical analysis alone is never a good idea. The statistical analysis allows us to screen stocks as potential investments, but a thorough analysis of the company and the stock should be performed. Evaluation of stock and company fundamentals such as asset valuation (forward and trailing P/E ratio), industry analysis, diversification, etc. should be pursued prior to making any investment decision. With that said, the screening process described herein is an excellent first step. Once you have investments worth investigating, I recommend reading articles on a website such as Seeking Alpha from experts with experience covering the stocks of interest.
Well done if you made it this far! This post covered a vast array of R programming functions, data management workflows, and data science techniques that can be used regardless of the application. The Russell 2000 Post extended the investment screening analysis from the previous S&P500 Analysis Post by using a variety of data science and modeling techniques.
We built custom functions to retrieve stock symbols, historical stock prices, and daily log returns. We used a custom function to web scrape stock symbols, leveraging the rvest package. We also used custom functions as wrappers for the getSymbols() and periodReturn() functions from the quantmod library. We mapped the custom functions to data frames using the purrr::map() function. The result was a data frame of stock prices and daily log returns for every stock in the Russell 2000 index.
We visualized the relationship between MDLR and SDDLR using ggplot2, which gave us insight into how stocks function within the market. We used this information to develop a reward-to-risk metric that served as a useful way to screen stocks when applied to an interactive screening visualization using plotly.
We pared down the list to the top 15 stocks, manipulating the data to visualize the level of consistency of our high performers using ggplot2. We then developed three desirable attributes of high performers. The attributes were added using the data modeling workflow, which consists of creating functions that return data frames, lists, or values and mapping the functions using purrr.
We developed a final metric that measured growth-to-consistency of our high performers. This measure enabled us to rank the high performers, and select the top 6 for further review. The share price performance of the top 6 was visualized using ggplot2.
Great work! If you understand this post, you now have many tools at your disposal that apply to much more than just investment analysis.
R For Data Science, Chapter 25: Many Models: Chapter 25 covers the workflow for modeling many models using packages such as modelr. This is a very powerful resource that helps you extend a single model to many models. The entire R for Data Science book is free and online.
Seeking Alpha: A website for the investing community focused on providing investment analysis and insight. Once you have screened stocks, a next logical step is to collect information. The analysis provided on SA can help you finalize your investment decisions.