The use of high-frequency data has gained widespread attention in contemporary financial research thanks to its potential to capture minute fluctuations and provide detailed insights into the behavior of financial markets. However, analyzing such data can be challenging because of the high level of activity in current markets, which are largely dominated by high-frequency trading. Typically, high-frequency trades are first classified as buyer- or seller-initiated using trade classification algorithms and then aggregated into discrete intraday periods or daily data. In this article, we will explore how to classify and aggregate high-frequency data using the PINstimation package.
Trade classification algorithms
The PINstimation package implements four algorithms for trade classification¹: Tick, Quote, LR, and EMO. Let’s take a closer look at each of these:
- Tick: A trade is classified as a buy (sell) if the price of the trade to be classified is above (below) the closest different price of a previous trade.
- Quote: Classifies a trade as a buy (sell) if its price is above (below) the mid-point of the bid and ask quotes. Trades executed at the mid-spread are not classified.
- LR²: Classifies a trade as a buy (sell) if its price is above (below) the mid-spread (quote algorithm), and uses the tick algorithm if the trade price is at the mid-spread.
- EMO³: Classifies trades at the bid (ask) as sells (buys) and uses the tick algorithm to classify trades within the then prevailing bid-ask spread.
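To make the four rules concrete, here is a minimal base-R sketch applied to five made-up trades with a constant bid and ask. The function names (tick_rule, quote_rule, lr_rule, emo_rule) are hypothetical helpers written for this illustration; they are not the package's internals, and the sketch ignores any time lag applied to the quotes.

```r
# Illustrative helpers only (hypothetical names, not PINstimation code).
# Each returns TRUE for a buy, FALSE for a sell, NA when unclassified.

# Tick rule: compare each price with the closest different earlier price.
tick_rule <- function(price) {
  direction <- sign(c(NA, diff(price)))  # +1 uptick, -1 downtick, 0 unchanged
  for (i in seq_along(direction)) {
    # An unchanged price inherits the direction of the last price change.
    if (i > 1 && !is.na(direction[i]) && direction[i] == 0) {
      direction[i] <- direction[i - 1]
    }
  }
  direction > 0
}

# Quote rule: compare the price with the mid-point of bid and ask.
quote_rule <- function(price, bid, ask) {
  mid <- (bid + ask) / 2
  ifelse(price > mid, TRUE, ifelse(price < mid, FALSE, NA))
}

# LR: quote rule first, tick rule for trades at the mid-point.
lr_rule <- function(price, bid, ask) {
  by_quote <- quote_rule(price, bid, ask)
  ifelse(is.na(by_quote), tick_rule(price), by_quote)
}

# EMO: trades at the ask are buys, at the bid are sells, tick rule otherwise.
emo_rule <- function(price, bid, ask) {
  ifelse(price == ask, TRUE,
         ifelse(price == bid, FALSE, tick_rule(price)))
}

# Made-up trades; bid, ask and prices are chosen to be exact in floating point.
price <- c(10.00, 10.50, 10.375, 10.25, 10.25)
bid   <- rep(10.00, 5)
ask   <- rep(10.50, 5)

by_tick  <- tick_rule(price)
by_quote <- quote_rule(price, bid, ask)
by_lr    <- lr_rule(price, bid, ask)
by_emo   <- emo_rule(price, bid, ask)

print(data.frame(price, by_tick, by_quote, by_lr, by_emo))
```

On this toy data, LR and EMO disagree only on the third trade: it lies inside the spread but above the mid-point, so LR follows the quote rule (buy) while EMO falls back on the tick rule (sell).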
The package offers two functions that are specifically designed for the classification and aggregation of intraday trades:
★ classify_trades() classifies high-frequency trades using one of the aforementioned algorithms. It has the arguments:
- data: a dataframe with four variables in the following order (timestamp, price, bid, ask).
- algorithm: the trade classification algorithm used to determine the trade initiator. It takes one of four values: “Tick”, “Quote”, “LR”, and “EMO”. The default value is “Tick”.
- timelag: the time lag in milliseconds used to calculate the lagged mid-quote for the methods “Quote”, “EMO”, and “LR”. The default value is 0.
★ aggregate_trades() aggregates high-frequency trades after classifying them with one of the aforementioned algorithms. It has two additional arguments:
- frequency: the frequency used to aggregate intraday data. It takes one of the values: “sec”, “min”, “hour”, “day”, “week”, “month”. The default value is “day”.
- unit: an integer giving the size of the aggregation window. The default value is 1. For example, when frequency is set to “min” and unit is set to 15, the intraday data is aggregated every 15 minutes.
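To illustrate what the frequency and unit arguments mean, the following base-R sketch counts buys and sells in 15-minute windows for six made-up, already classified trades. It is only an illustration of the windowing idea under these assumptions (toy timestamps, a hypothetical data frame named trades); it does not reproduce how aggregate_trades() works internally.

```r
# Six made-up classified trades (TRUE = buy, FALSE = sell).
trades <- data.frame(
  timestamp = as.POSIXct("2018-10-18 09:00:00", tz = "UTC") +
    c(60, 300, 840, 960, 1500, 1860),
  isbuy = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# frequency = "min" and unit = 15 correspond to windows of 15 * 60 seconds.
unit <- 15
window_secs <- unit * 60

# Index of the window each trade falls into (windows aligned to the epoch).
window_id <- floor(as.numeric(trades$timestamp) / window_secs)

# Number of buyer-initiated (b) and seller-initiated (s) trades per window.
counts <- data.frame(
  window = as.POSIXct(sort(unique(window_id)) * window_secs,
                      origin = "1970-01-01", tz = "UTC"),
  b = as.integer(tapply(trades$isbuy, window_id, sum)),
  s = as.integer(tapply(!trades$isbuy, window_id, sum))
)

print(counts)
```

The three windows start at 09:00, 09:15 and 09:30, and each row holds one pair of buy and sell counts for that window.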
We use hfdata, a dataset included in the package, as the raw data to aggregate. It is a simulated dataset containing the timestamp, price, volume, bid, and ask for 100,000 high-frequency transactions.
- To begin, we delete the variable volume and store the remaining data in the variable xdata.
library(PINstimation)
xdata <- hfdata
xdata$volume <- NULL
- Use the EMO algorithm with a timelag of 500 milliseconds to classify high-frequency trades in xdata.
ctrades <- classify_trades(xdata, algorithm = "EMO", timelag = 500, verbose = TRUE)
- Display the first two rows of the dataframe ctrades:
head(ctrades, 2)
##              timestamp   price     bid     ask isbuy
## 38 2018-10-18 00:13:10 15.4754 15.4568 15.4754  TRUE
## 49 2018-10-18 00:17:52 15.5143 15.5143 15.5236  TRUE
- Use the LR algorithm with a timelag of 1 second to aggregate intraday data in xdata at a frequency of 15 minutes.
lrtrades <- aggregate_trades(xdata, algorithm = "LR", timelag = 1000, frequency = "min", unit = 15, verbose = TRUE)
- Use the Quote algorithm with a timelag of 1 second to aggregate intraday data in xdata at a daily frequency.
qtrades <- aggregate_trades(xdata, algorithm = "Quote", timelag = 1000, frequency = "day", unit = 1, verbose = TRUE)
- Display the first two rows of qtrades:
head(qtrades, 2)
##     b   s
## 1 873 746
## 2 823 793
The output qtrades consists of two daily sequences: the numbers of buyer-initiated and seller-initiated trades. Such trading data is frequently used as input for estimating various research models, particularly in the field of market microstructure. One prominent model is the Probability of Informed Trading (PIN) model⁴, which typically requires quarterly datasets of daily buyer-initiated and seller-initiated trades. To estimate the PIN model, PINstimation provides the function pin_ea(), which computes the maximum-likelihood estimator of PIN using the initial parameter sets of Ersan O, Alıcı A (2016)⁵. As such, this function can directly use the output qtrades for the estimation of the PIN model.
model <- pin_ea(qtrades)
show(model)
## ----------------------------------
## PIN estimation completed successfully
## ----------------------------------
## [...]
## ==========  ===========
## Variables   Estimates
## ==========  ===========
## alpha       0.739135
## delta       0
## mu          247.46
## eps.b       548.72
## eps.s       717.26
## ----
## Likelihood  (1232.001)
## PIN         0.126237
## ==========  ===========
##
## -------
## Running time: 0.789 seconds
- Display the optimal parameter estimates and the PIN value:
model@parameters
##        alpha        delta           mu        eps.b        eps.s
##    0.7499975    0.1333342 1193.5179655  357.2659099  328.6291793
model@pin
## [1] 0.5661721
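As a quick sanity check on how the PIN value relates to the parameter estimates, the sketch below applies the standard PIN formula from the Easley et al. (1996) model, PIN = alpha * mu / (alpha * mu + eps.b + eps.s), to the parameter values printed just above. The function name pin_value is a hypothetical helper written for this illustration and is not part of the package.

```r
# Hypothetical helper: PIN from the model parameters.
# alpha * mu is the expected arrival rate of informed trades;
# eps_b and eps_s are the arrival rates of uninformed buys and sells.
# delta does not enter the formula.
pin_value <- function(alpha, mu, eps_b, eps_s) {
  informed <- alpha * mu
  informed / (informed + eps_b + eps_s)
}

# Parameter values copied from the printed parameter vector above.
pin_check <- pin_value(
  alpha = 0.7499975,
  mu    = 1193.5179655,
  eps_b = 357.2659099,
  eps_s = 328.6291793
)

print(round(pin_check, 4))
```

The result rounds to 0.5662, in line with the PIN value printed alongside those parameters.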
The PINstimation package is an efficient tool for classifying and aggregating high-frequency data. With just a few lines of code, it lets you classify trades using a range of algorithms and time lags, and aggregate them at virtually any desired frequency. Moreover, the classification process is fast, making it a good option for researchers who work with large datasets. More information about the package and its functions can be found in the package documentation and on the dedicated website.
- Aktas O, Kryzanowski L (2014) Trade classification accuracy for the BIST. Journal of International Financial Markets, Institutions and Money 33:259–282
- Lee CM, Ready MJ (1991) Inferring trade direction from intraday data. The Journal of Finance 46(2):733–746
- Ellis K, Michaely R, O’Hara M (2000) The accuracy of trade classification rules: Evidence from Nasdaq. The Journal of Financial and Quantitative Analysis 35(4):529
- Easley D, Kiefer NM, O’Hara M, Paperman JB (1996) Liquidity, information, and infrequently traded stocks. The Journal of Finance 51(4):1405
- Ersan O, Alıcı A (2016) An unbiased computation methodology for estimating the probability of informed trading (PIN). Journal of International Financial Markets, Institutions and Money 43:74–94