Using R to download high frequency trade data directly from Bovespa

Using package GetHFData

Recently, Bovespa, the Brazilian financial exchange, opened external access to its ftp site. There one can find a wealth of information regarding the Brazilian financial system, including datasets with high frequency (tick by tick) trading data for three different markets: equity, options and BMF.

Downloading and processing these files, however, can be exhausting. The dataset is composed of zip files containing the complete trading data, separated by day and market. These files are huge, and processing or aggregating them in a useful manner requires specific knowledge of the structure of the dataset.

The package GetHFData makes it easy to access this dataset directly by allowing its straightforward importation and aggregation. With this package the user can:

  • Access the contents of the Bovespa ftp using function ghfd_get_ftp_contents (see the short sketch after this list)
  • Get the list of available tickers in the trading data using ghfd_get_available_tickers_from_ftp
  • Download individual files using ghfd_download_file
  • Download and process a batch of dates and asset codes with ghfd_get_HF_data
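
As a quick orientation, here is a minimal sketch of the first two functions in action. The call to ghfd_get_available_tickers_from_ftp mirrors the one used later in this post; the argument name passed to ghfd_get_ftp_contents is assumed from the package documentation and may differ between package versions.

library(GetHFData)

# list the files currently available in the Bovespa ftp for the equity market
# (argument type.market is assumed; check ?ghfd_get_ftp_contents)
ftp.contents <- ghfd_get_ftp_contents(type.market = 'equity')
head(ftp.contents)

# list the available tickers and their number of trades for a given date
df.tickers <- ghfd_get_available_tickers_from_ftp(my.date = as.Date('2016-09-30'),
                                                  type.market = 'equity')
head(df.tickers)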

More details about the usage of the package can be found in my SSRN paper. Next, we present an example of how to use the package in an empirical application. This particular example is based on Perlin and Ramos (2016).

Liquidity and the time of the day

To illustrate the usage of the software with aggregated data, the chosen problem is the analysis of the intraday U-shaped pattern of liquidity in the equity market. This issue has been documented and discussed in several papers in the literature, such as Admati and Pfleiderer (1988), Back and Pedersen (1998), Engle and Russell (1998) and Groß-Klußmann and Hautsch (2011), among many others.

The data used in this empirical study covers the six most traded assets over the fifteen trading days from 2016-09-12 to 2016-09-30. The use of a short time period is not accidental: fifteen days keeps the download time manageable and makes it easier to replicate the example.

The first step is to select the liquid assets for the empirical exercise. To do that, we pick the six most traded assets on the last date of the study (2016-09-30) by checking the available tickers from the ftp site on that date. The following code executes this procedure.

library(GetHFData)

## Thank you for using GetHFData! More details about the package can be found in:
## 
##  https://ssrn.com/abstract=2824058 
## 
##  If applicable, please use the following citations in your research report. Thanks! 
## 
## APA:
##  Perlin, M., Ramos, H. (2016). GetHFData: A R Package for Downloading and Aggregating High Frequency Trading Data from Bovespa. Available at SSRN. 
## 
## bibtex:
##  @article{perlin2016gethfdata,
##   title={GetHFData: A R Package for Downloading and Aggregating High Frequency Trading Data from Bovespa},
##   author={Perlin, Marcelo and Henrique, Ramos},
##   journal={Available at SSRN},
##   year={2016}
## }

n.assets <- 6
my.date <- as.Date('2016-09-30')
type.market <- 'equity'

df.tickers <- ghfd_get_available_tickers_from_ftp(my.date = my.date, 
                                                  type.market = type.market)

## 
## Reading ftp contents for equity (attempt = 1|10) Attempt 1 - File exists, skipping dl

As explained before, function ghfd_get_available_tickers_from_ftp outputs a dataframe with the number of trades for each ticker found in the dataset. As a sanity check, we can use package ggplot2 (Wickham 2009) to create a figure illustrating the number of trades for each of the 25 most traded stocks on 2016-09-30, as shown next.

library(ggplot2)

temp.df <- df.tickers[1:25, ]

p <- ggplot(temp.df, aes(x = reorder(tickers, -n.trades), y = n.trades))
p <- p + geom_bar(stat = "identity")
p <- p + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p <- p + labs(x = 'Tickers', y = 'Number of trades')
print(p)

From the previous figure we can see that the six most traded assets on 2016-09-30 are ITSA4, PETR4, ITUB4, BBDC4, ABEV3 and BBSE3. A particular feature of the high frequency data from Brazil is that liquidity is dispersed and drops rapidly across assets, as we can see in the previous graphic. Even though we are only looking at trading data for a single day, we can expect the number of trades to drop just as quickly in other time periods.

From the programming side, the dataframe df.tickers is already sorted by the number of trades, so in order to select the six most traded assets we simply take the first six elements of df.tickers$tickers.

my.assets <- df.tickers$tickers[1:n.assets]

And now we can print it to check its content:

print(my.assets)

## [1] ITSA4 PETR4 ITUB4 BBDC4 ABEV3 BBSE3
## 462 Levels: AALC34 AAPL34 ABCB10 ABCB4 ABEV3 AEFI11 AELP3 AFLT3 ... XTED11
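
Notice from the Levels line that my.assets is stored as a factor. This is not a problem for the next steps, but if plain character strings are preferred, a small optional conversion with base R is:

# optional: convert the factor returned above into a plain character vector
my.assets <- as.character(my.assets)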

We continue the empirical example by using the package GetHFData to download and aggregate the desired information for later analysis. The first step in this stage is to set the options for downloading the dataset. Notice that it is good policy to set the object my.folder to the path of a folder on the computer's hard disk where the user has write permission, since the files will be downloaded there (a small check that the folder exists is sketched after the next code chunk). We set an example path as PATH TO YOUR FOLDER HERE; the user has to modify this object for the code to run without error. Users on the Windows platform should be aware that the folder path has to be written with forward slashes (/) or escaped backslashes (\\), not the single backslashes that Windows uses by default.

As for the intraday time window, we set the first time to 10:30:00 and the last to 16:30:00 in order to avoid the trading noise from the opening and closing of the market, which could bias our results. The options used with GetHFData are set as follows.

my.folder <- 'PATH TO YOUR FOLDER HERE'
# setwd(my.folder)

first.time <- '10:30:00'
last.time <- '16:30:00'

first.date <- as.Date('2016-09-12')
last.date <- as.Date('2016-09-30')
type.output <- 'agg'
agg.diff <- '15 min'

my.assets <- c("ITSA4", "PETR4", "ITUB4", "BBDC4", "ABEV3", "BBSE3")
type.market <- 'equity'
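
Since the downloaded files will be written to my.folder, an optional check with base R can make sure the folder actually exists before the download starts:

# optional: create the download folder (set above) if it does not exist yet
if (!dir.exists(my.folder)) dir.create(my.folder, recursive = TRUE)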

After setting the inputs, we now use function ghfd_get_HF_data to download and aggregate the financial data.

df.out <- ghfd_get_HF_data(my.assets = my.assets,
                           type.market = type.market,
                           first.date = first.date,
                           last.date = last.date,
                           first.time = first.time,
                           last.time = last.time,
                           type.output = type.output,
                           agg.diff = agg.diff)

## 
## Running ghfd_get_HF_Data() for:
##    type.market = equity
##    my.assets = ITSA4, PETR4, ITUB4, BBDC4, ABEV3, BBSE3
##    type.output = agg
##       agg.diff = 15 min
## Reading ftp contents for equity (attempt = 1|10)
##    Found  517  files in ftp
##    First available date in ftp:  2014-12-22
##    Last available date in ftp:   2017-01-02
##    First date to download:  2016-09-12
##    Last date to download:   2016-09-30
## Downloading ftp files/NEG_20160912.zip (1|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  861097 lines, 463 unique tickers
##    -> Processing file - Found 136851 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160913.zip (2|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  1173408 lines, 427 unique tickers
##    -> Processing file - Found 236805 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160914.zip (3|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  836391 lines, 426 unique tickers
##    -> Processing file - Found 178667 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160915.zip (4|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  721937 lines, 436 unique tickers
##    -> Processing file - Found 189977 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160916.zip (5|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  726891 lines, 409 unique tickers
##    -> Processing file - Found 127384 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160919.zip (6|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  666277 lines, 457 unique tickers
##    -> Processing file - Found 114975 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160920.zip (7|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  733741 lines, 427 unique tickers
##    -> Processing file - Found 134744 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160921.zip (8|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  924202 lines, 443 unique tickers
##    -> Processing file - Found 177271 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160922.zip (9|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  858705 lines, 451 unique tickers
##    -> Processing file - Found 146025 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160923.zip (10|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  799861 lines, 440 unique tickers
##    -> Processing file - Found 131548 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160926.zip (11|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  564303 lines, 478 unique tickers
##    -> Processing file - Found 90066 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160927.zip (12|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  778917 lines, 434 unique tickers
##    -> Processing file - Found 177992 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160928.zip (13|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  764733 lines, 481 unique tickers
##    -> Processing file - Found 136236 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160929.zip (14|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  790120 lines, 432 unique tickers
##    -> Processing file - Found 155809 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows
## Downloading ftp files/NEG_20160930.zip (15|15) Attempt 1 - File exists, skipping dl
##    -> Reading files - Imported  790261 lines, 462 unique tickers
##    -> Processing file - Found 172620 lines for 6 selected tickers
##    -> Aggregation resulted in dataframe with 144 rows

We point out that the previous code will take some time to finish, as it has to download and read several large files from the Bovespa ftp site. Once it is done, we can check the output of ghfd_get_HF_data by calling function str on the object df.out, which shows a compact textual representation of the object in the R environment.

str(df.out)

## 'data.frame':    2160 obs. of  13 variables:
##  $ InstrumentSymbol: chr  "ABEV3" "ABEV3" "ABEV3" "ABEV3" ...
##  $ SessionDate     : Date, format: "2016-09-12" "2016-09-12" ...
##  $ TradeDateTime   : POSIXct, format: "2016-09-12 10:30:00" "2016-09-12 10:45:00" ...
##  $ n.trades        : int  531 1143 441 1168 603 618 617 512 492 379 ...
##  $ last.price      : num  19.8 19.7 19.8 19.7 19.8 ...
##  $ weighted.price  : num  19.7 19.7 19.7 19.7 19.8 ...
##  $ period.ret      : num  -0.000506 -0.001013 0.001014 -0.002532 0.003553 ...
##  $ period.ret.volat: num  0.000388 0.000175 0.000261 0.000319 0.000282 ...
##  $ sum.qtd         : num  314000 584500 1210500 281800 189100 ...
##  $ sum.vol         : num  6200565 11522865 23873576 5553286 3736272 ...
##  $ n.buys          : int  304 170 199 405 230 283 181 273 348 170 ...
##  $ n.sells         : int  227 973 242 763 373 335 436 239 144 209 ...
##  $ Tradetime       : chr  "10:30:00" "10:45:00" "11:00:00" "11:15:00" ...

As described earlier, the object returned by ghfd_get_HF_data is a dataframe with several columns calculated from the raw data. Notice that the columns already have the correct classes, which facilitates further manipulation of the data.
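
As a small illustration of how these columns can be manipulated directly, the sketch below aggregates the total number of trades by asset using base R only; the column names are those shown in the str() output above.

# total number of trades per asset over the whole sample period
agg.trades <- aggregate(n.trades ~ InstrumentSymbol, data = df.out, FUN = sum)
print(agg.trades)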

Once the data is available, we proceed to the analysis of the intraday pattern of liquidity. To do so, we use the number of trades as a proxy for liquidity. The analysis is based on the visual examination of a figure that relates the distribution of the number of trades to the time of the day. Since the number of trades is not comparable across assets, we plot a separate panel of the same figure for each stock. Next, we show the R code that creates the figure with the ggplot2 package.

p <- ggplot(df.out, aes(x =  Tradetime, y = n.trades))
p <- p + geom_boxplot() + coord_cartesian(ylim = c(0, 3000))
p <- p  + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p <- p + facet_wrap(~InstrumentSymbol)
p <- p + labs(y='Number of Trades', x = 'Time of Day')
print(p)

The previous figure shows the number of trades as a function of the time of day. As expected, we find that the intraday shape of liquidity follows a U pattern: the number of trades is higher at the beginning and end of the trading day, with the smallest values around 13:15:00. This pattern is found for the great majority of the assets.
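
The visual impression of the U shape can also be checked numerically. The sketch below, using base R only, averages the number of trades within each 15-minute bucket across assets and days and reports the bucket with the lowest average activity; from the figure we expect it to fall around 13:15:00, although the exact bucket may vary with the sample.

# average number of trades per intraday time bucket (across assets and days)
avg.by.time <- aggregate(n.trades ~ Tradetime, data = df.out, FUN = mean)

# time bucket with the lowest average trading activity
avg.by.time[which.min(avg.by.time$n.trades), ]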

This result is supported by previous findings in the literature (Engle and Russell 1998; Groß-Klußmann and Hautsch 2011). At the beginning of the trading day, a significant volume of overnight information is priced into the market, which explains the higher number of trades. As for the end of the day, the higher trading volume can be explained by the inventory strategies of investors and market makers, who aim to finish the day with null portfolio positions in order to avoid overnight risk. Since reducing positions requires additional trades, we see a significant increase in trading activity at the end of the day. Interestingly, this liquidity pattern is correlated with a pattern in intraday volatility (Andersen and Bollerslev 1997).

References

Admati, Anat R., and Paul Pfleiderer. 1988. “A Theory of Intraday Patterns: Volume and Price Variability.” Review of Financial Studies 1 (1): 3–40.

Andersen, Torben G., and Tim Bollerslev. 1997. “Intraday Periodicity and Volatility Persistence in Financial Markets.” Journal of Empirical Finance 4 (2): 115–58.

Back, Kerry, and Hal Pedersen. 1998. “Long-Lived Information and Intraday Patterns.” Journal of Financial Markets 1 (3): 385–402.

Engle, Robert F., and Jeffrey R. Russell. 1998. “Autoregressive Conditional Duration: A New Model for Irregularly Spaced Transaction Data.” Econometrica 66 (5): 1127–62.

Groß-Klußmann, Axel, and Nikolaus Hautsch. 2011. “When Machines Read the News: Using Automated Text Analytics to Quantify High Frequency News-Implied Market Reactions.” Journal of Empirical Finance 18 (2): 321–40.

Perlin, Marcelo, and Henrique Ramos. 2016. “GetHFData: A R Package for Downloading and Aggregating High Frequency Trading Data from Bovespa.” Available at SSRN.

Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
