Trade Classification in R with PINstimation

[This article was first published on Stories by PINstimation on Medium, and kindly contributed to R-bloggers.]

The use of high-frequency data has gained widespread attention and popularity in contemporary financial research thanks to its potential to capture minute fluctuations and provide detailed insights into the behavior of financial markets. However, analyzing such data can pose significant challenges due to the high level of activity observed in current markets, which are largely dominated by high-frequency trading. Typically, high-frequency data is aggregated into discrete intraday periods or daily data using trade classification algorithms. In this article, we will explore how to classify and aggregate high-frequency data using the PINstimation package.

Trade classification algorithms

The PINstimation package implements four algorithms for trade classification¹: Tick, Quote, LR, and EMO. Let’s take a closer look at each of these:

  • Tick: Classifies a trade as a buy (sell) if its price is above (below) the closest different price of a previous trade.
  • Quote: Classifies a trade as a buy (sell) if its price is above (below) the midpoint of the bid and ask quotes. Trades executed at the mid-spread are not classified.
  • LR²: Classifies a trade as a buy (sell) if its price is above (below) the mid-spread (quote algorithm), and uses the tick algorithm if the trade price is at the mid-spread.
  • EMO³: Classifies trades at the bid (ask) as sells (buys) and uses the tick algorithm to classify trades within the then-prevailing bid-ask spread.
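
The four rules above can be sketched in a few lines of base R. This is only an illustration on made-up prices, not the code PINstimation runs: the helper names (tick_rule, quote_rule, lr_rule, emo_rule) are hypothetical, quotes are taken at the time of the trade (no time lag), and prices are compared with exact equality, which is only safe for toy values like these.

```r
# Toy sketch of the four rules; NOT the PINstimation implementation.
# Each function returns TRUE for a buy, FALSE for a sell, NA if undecided.

tick_rule <- function(price) {
  isbuy <- rep(NA, length(price))
  for (i in seq_along(price)[-1]) {
    change <- price[i] - price[i - 1]
    # Zero tick: reuse the direction of the last non-zero price change
    isbuy[i] <- if (change != 0) change > 0 else isbuy[i - 1]
  }
  isbuy
}

quote_rule <- function(price, bid, ask) {
  mid <- (bid + ask) / 2
  # Trades exactly at the midpoint stay unclassified (NA)
  ifelse(price > mid, TRUE, ifelse(price < mid, FALSE, NA))
}

lr_rule <- function(price, bid, ask) {
  q <- quote_rule(price, bid, ask)
  # Fall back to the tick rule only for trades at the midpoint
  ifelse(is.na(q), tick_rule(price), q)
}

emo_rule <- function(price, bid, ask) {
  # At the ask: buy; at the bid: sell; anywhere else: tick rule
  ifelse(price == ask, TRUE,
         ifelse(price == bid, FALSE, tick_rule(price)))
}

# Made-up trades and quotes (values chosen to be exact in binary)
price <- c(10.00, 10.50, 10.50, 10.25, 10.25)
bid   <- c( 9.75, 10.00, 10.25,  9.75, 10.25)
ask   <- c(10.25, 10.50, 10.75, 10.50, 10.75)

res <- data.frame(price, bid, ask,
                  tick  = tick_rule(price),
                  quote = quote_rule(price, bid, ask),
                  lr    = lr_rule(price, bid, ask),
                  emo   = emo_rule(price, bid, ask))
print(res)
```

In this toy series the fourth trade shows how the rules can disagree: it prints above the midpoint but on a downtick, so LR labels it a buy while EMO, which defers to the tick rule inside the spread, labels it a sell.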

Package functions

The package offers two functions that are specifically designed for the classification and aggregation of intraday trades:

★ classify_trades() classifies high-frequency trades using one of the aforementioned algorithms. It has the arguments:

  • data: a dataframe with four variables in the following order (timestamp, price, bid, ask).
  • algorithm: the trade classification algorithm used to determine the trade initiator. It takes one of four values: “Tick”, “Quote”, “LR”, and “EMO”. The default value is “Tick”.
  • timelag: the time lag in milliseconds used to calculate the lagged mid-quote for the methods “Quote”, “EMO”, and “LR”. The default value is 0.
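
To make the timelag argument more concrete, here is one way a lagged quote could be looked up in base R. It is a sketch under an assumption: that "lagged" means the quote in force timelag milliseconds before each trade. The helper lagged_quotes() is hypothetical and not part of the package, and the package may handle this step differently.

```r
# Hypothetical helper, not part of PINstimation: for each trade, return the
# quote that was in force `timelag` milliseconds earlier.
# Assumes `timestamp` is sorted in increasing order.
lagged_quotes <- function(timestamp, bid, ask, timelag = 0) {
  secs   <- as.numeric(timestamp)      # seconds since the epoch
  target <- secs - timelag / 1000      # the instant to look back to
  # Index of the last observation at or before each target (0 if none)
  idx <- findInterval(target, secs)
  idx[idx == 0] <- NA                  # no quote available that early
  data.frame(lagbid = bid[idx],
             lagask = ask[idx],
             lagmid = (bid[idx] + ask[idx]) / 2)
}

# Made-up observations at 0, 0.25, 1 and 1.5 seconds after midnight
timestamp <- as.POSIXct("2018-10-18 00:00:00", tz = "UTC") + c(0, 0.25, 1, 1.5)
bid <- c(10.00, 10.25, 10.25, 10.50)
ask <- c(10.50, 10.75, 10.50, 11.00)

lag500 <- lagged_quotes(timestamp, bid, ask, timelag = 500)
lag0   <- lagged_quotes(timestamp, bid, ask, timelag = 0)
print(lag500)
```

With a lag of 500 milliseconds the first two trades have no earlier quote to refer to, so their lagged values are NA; with a lag of 0 each trade simply uses its own quote.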

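★ aggregate_trades() classifies high-frequency trades using one of the aforementioned algorithms and aggregates them into counts of buyer-initiated and seller-initiated trades per period. In addition to the arguments of classify_trades(), it takes two further arguments: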

  • frequency: the frequency used to aggregate intraday data. It takes one of the values: “sec”, “min”, “hour”, “day”, “week”, “month”. The default value is “day”.
  • unit: an integer giving the size of the aggregation window. The default value is 1. For example, when frequency is set to “min” and unit is set to 15, the intraday data is aggregated into 15-minute windows.
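
The effect of frequency and unit can be illustrated with a small base R sketch that counts buys and sells per 15-minute window. The data and the helper count_by_window() are made up for illustration; this mimics the shape of the aggregated output (columns b and s) and is not the package's own code.

```r
# Made-up classified trades: timestamps plus a TRUE/FALSE buy indicator
trades <- data.frame(
  timestamp = as.POSIXct("2018-10-18 09:00:00", tz = "UTC") +
    60 * c(0, 4, 14, 15, 29, 31),        # minutes after 09:00
  isbuy = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Hypothetical helper: count buys (b) and sells (s) per window of `unit`
# minutes. Windows without any trade are simply absent from the result.
count_by_window <- function(trades, unit = 15) {
  width  <- unit * 60                                      # window length in seconds
  window <- floor(as.numeric(trades$timestamp) / width)    # window index per trade
  data.frame(
    b = as.vector(tapply(trades$isbuy, window, sum)),
    s = as.vector(tapply(!trades$isbuy, window, sum))
  )
}

agg <- count_by_window(trades, unit = 15)
print(agg)
```

Here the six trades fall into three windows (09:00, 09:15 and 09:30), each summarized by one row of buy and sell counts.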

Usage examples

We use a dataset called hfdata, included in the package, as the raw data to aggregate. It is a simulated dataset containing the timestamp, price, volume, bid, and ask for 100,000 high-frequency transactions.

  • To begin, we delete the variable volume and store the remaining data in the variable xdata.
library(PINstimation)
xdata <- hfdata
xdata$volume <- NULL
  • Use the EMO algorithm with a timelag of 500 milliseconds to classify
    high-frequency trades in xdata.
ctrades <- classify_trades(xdata, algorithm = "EMO", timelag = 500,
                           verbose = TRUE)
  • Display the first two rows of the dataframe ctrades:
head(ctrades, 2)
##             timestamp   price     bid     ask isbuy
## 38 2018-10-18 00:13:10 15.4754 15.4568 15.4754  TRUE
## 49 2018-10-18 00:17:52 15.5143 15.5143 15.5236  TRUE
  • Use the LR algorithm with a timelag of 1 second to aggregate intraday data in xdata at a frequency of 15 minutes.
lrtrades <- aggregate_trades(xdata, algorithm = "LR", timelag = 1000,
                              frequency = "min", unit = 15, verbose = TRUE)
  • Use the Quote algorithm with a timelag of 1 second to aggregate intraday data in xdata at a daily frequency.
qtrades <- aggregate_trades(xdata, algorithm = "Quote", timelag = 1000,
                              frequency = "day", unit = 1, verbose = TRUE)
  • Display the first two rows of qtrades:
head(qtrades, 2)
##    b   s
## 1 873 746
## 2 823 793
  • The output qtrades consists of a pair of daily sequences: the numbers of buyer-initiated (b) and seller-initiated (s) trades. Such data are frequently used as input for estimating research models, particularly in market microstructure. One prominent example is the Probability of Informed Trading (PIN) model⁴, which typically requires quarterly datasets of daily buyer-initiated and seller-initiated trades. To estimate the PIN model, PINstimation provides the function pin_ea(), which computes the maximum-likelihood estimate of PIN using the initial parameter sets of Ersan and Alıcı (2016)⁵. As such, this function can directly use the output qtrades for the estimation of the PIN model.
model_ea <- pin_ea(qtrades)
show(model_ea)
## ----------------------------------
## PIN estimation completed successfully
## ----------------------------------
## [...]
## ==========  ===========
## Variables   Estimates  
## ==========  ===========
## alpha       0.739135   
## delta       0          
## mu          247.46     
## eps.b       548.72     
## eps.s       717.26     
## ----                   
## Likelihood  (1232.001) 
## PIN         0.126237   
## ==========  ===========
## 
## -------
## Running time: 0.789 seconds
  • Display the optimal parameter estimates and the PIN value:
model_ea@parameters
## alpha     delta      mu          eps.b       eps.s 
## 0.7499975 0.1333342 1193.5179655 357.2659099 328.6291793

model_ea@pin
## [1] 0.5661721
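
As a quick sanity check, the PIN value can be recomputed by hand from the parameter estimates using the standard formula of the PIN model, PIN = alpha * mu / (alpha * mu + eps.b + eps.s). The snippet below copies the values printed by model_ea@parameters above; delta does not enter the formula. Because the inputs are the rounded, displayed values, the result should match the reported PIN only up to rounding.

```r
# Parameter estimates copied from the model_ea@parameters output above
alpha <- 0.7499975      # probability of an information event
mu    <- 1193.5179655   # arrival rate of informed trades
eps.b <- 357.2659099    # arrival rate of uninformed buys
eps.s <- 328.6291793    # arrival rate of uninformed sells

# Share of expected order flow that comes from informed traders
pin <- (alpha * mu) / (alpha * mu + eps.b + eps.s)
print(pin)
```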

Conclusion

The PINstimation package offers a convenient way to classify and aggregate high-frequency data. With just a few lines of code, you can classify trades using a range of algorithms and time lags, and aggregate them at virtually any desired frequency. The classification process is also fast, which makes the package well suited to researchers working with large datasets. More information about the package and its functions can be found in the package documentation and on the dedicated website.


References

  1. Aktas O, Kryzanowski L (2014) Trade classification accuracy for the BIST. Journal of International Financial Markets, Institutions and Money 33:259–282
  2. Lee CM, Ready MJ (1991) Inferring trade direction from intraday data. The Journal of Finance 46(2):733–746
  3. Ellis K, Michaely R, O’Hara M (2000) The accuracy of trade classification rules: Evidence from Nasdaq. The Journal of Financial and Quantitative Analysis 35(4):529
  4. Easley D, Kiefer NM, O’Hara M, Paperman JB (1996) Liquidity, information, and infrequently traded stocks. The Journal of Finance 51(4):1405
  5. Ersan O, Alıcı A (2016) An unbiased computation methodology for estimating the probability of informed trading (PIN). Journal of International Financial Markets, Institutions and Money 43:74–94