Download TAQ data via WRDS inside R | Onno Kleen

Download TAQ data via WRDS inside R

Welcome to my first blog post! I wanted to set up a blog for a while now and I am very excited that I finally found the time to do so. In the future, I plan to share some of my approaches for wrangling data and implementing real-time forecasting.

Todays blog post is about using the NYSE Trade and Quote (TAQ) database which holds intraday transactions data (trades and quotes) for the entire cross-section of stocks listed on the New York Stock Exchange (NYSE), the American Stock Exchange (AMEX), and the Nasdaq National Market System (NMS). For research in financial econometrics, it may be described as the de-facto reference data set if a paper is about intraday price movements on financial markets.

WRDS and TAQ via Postgres

For research purposes, TAQ files are typically accessed via a WRDS subscription (https://wrds-www.wharton.upenn.edu/). After signing up to WRDS via your institution and logging in, TAQ data can be accessed via a web query. However, due to large files sizes even for one single day of transaction data the web query only allows very restricted access per web query. Hence, if you want to look at a large cross-section of stocks over a longer period of time it is best to access the TAQ data via PostgreSQL-queries that can be automated; for example, in my preferred statistical language R.

As a prerequisite, please follow the WRDS documentation for setting up your local R session such that it can access TAQ data on WRDS via Postgres.

The following code snippet

  • establishes the connection to WRDS,
  • searches for available trading days in TAQ (df_dates),
  • selects one trading day of interest (dd), and
  • checks the column names of that day (colnames_tay_daily_file).

Technical note: In this blog post, I only consider the “Millisecond Trade and Quote” data set in which the first observations are recorded in 2003.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(RPostgres)
library(highfrequency)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
wrds <- dbConnect(Postgres(),
                  host = 'wrds-pgdata.wharton.upenn.edu',
                  port = 9737,
                  dbname = 'wrds',
                  sslmode = 'require',
                  user = your_username)

res <- dbSendQuery(wrds, "select distinct table_name
                   from information_schema.columns
                   where table_schema='taqmsec'
                   order by table_name")
df_dates <- dbFetch(res, n = -1)
dbClearResult(res)

# df_dates contains also quotes table names and indices. However, we are only 
# interested in the trades tables for now which are called ctm_DATE. 
dates_trades <-
  df_dates %>%
  filter(grepl("ctm",table_name), !grepl("ix_ctm",table_name)) %>%
  mutate(table_name = substr(table_name, 5, 12)) %>%
  unlist()

# First and last day that are available in TAQ
min(dates_trades)
## [1] "_ctm_201"
max(dates_trades)
## [1] "20240418"

# Choose one exemplary trading day for our example
dd <- dates_trades[4000]
dd
## table_name4000 
##     "20190708"

# What are the names of the columns accessible via Postgres?
res <-
  dbSendQuery(wrds, 
              paste0("select column_name",
                     " from information_schema.columns",
                     " where table_schema='taqmsec'",
                     " and table_name = 'ctm_", dd, "'",
                     " order by column_name")) 
colnames_tay_daily_file <- dbFetch(res, n = -1)$column_name
dbClearResult(res) 
colnames_tay_daily_file
##  [1] "date"           "ex"             "part_time"      "part_time_nano"
##  [5] "price"          "size"           "sym_root"       "sym_suffix"    
##  [9] "time_m"         "time_m_nano"    "tr_corr"        "tr_id"         
## [13] "tr_rf"          "tr_scond"       "tr_seqnum"      "tr_source"     
## [17] "tr_stop_ind"    "trf_time"       "trf_time_nano"  "tte_ind"

With this knowledge in mind, we now download the transaction data for one exemplary stock, in this case Apple Inc. which has the ticker symbol AAPL. In the following, we do not need all columns so we select the ones relevant for us in the SQL-statement. We restrict ourselves to observations on the NYSE, AMEX, and Nasdaq (where (ex = 'N' or ex = 'T' or ex = 'Q' or ex = 'A')) which will become important in a possible future blog post about merging TAQ data with CRSP data. In this part, we only consider uncorrected trades as a first cleaning step (tr_corr = '00').

stock <- "AAPL"
# Fetch the data, create a DT column which pastes date and time together.
# In a future blog post we will merge our data with CRSP data, so we will 
# restrict ourselves to NYSE, AMEX, and NASDAQ. Prices should be nonzero 
# and we only consider non-corrected trades.
res <- 
  dbSendQuery(wrds,
              paste0("select concat(date, ' ',time_m) as DT,", 
                     " ex, sym_root, sym_suffix, price, size, tr_scond",
                     " from taqmsec.ctm_", dd,
                     " where (ex = 'N' or ex = 'T' or ex = 'Q' or ex = 'A')",
                     " and sym_root = '", stock, "'",
                     " and price != 0 and tr_corr = '00'"))
df_aapl <- dbFetch(res, n = -1)
dbClearResult(res)

df_aapl %>%
  head()
##                           dt ex sym_root sym_suffix  price size tr_scond
## 1 2019-07-08 04:00:47.678376  Q     AAPL       <NA> 203.15    3     @ TI
## 2 2019-07-08 04:01:11.592717  Q     AAPL       <NA> 203.10    3     @ TI
## 3 2019-07-08 04:01:16.476405  Q     AAPL       <NA> 203.05    3     @ TI
## 4 2019-07-08 04:02:06.346591  Q     AAPL       <NA> 203.00    3     @ TI
## 5 2019-07-08 04:02:19.347772  Q     AAPL       <NA> 202.95    3     @ TI
## 6 2019-07-08 04:03:11.572726  Q     AAPL       <NA> 202.90    3     @FTI

We follow the cleaning procedure in Barndorff-Nielsen et al. (2009) and rely on the R package highfrequency for which I am a co-author. Non-zero trades were already filtered in the Postgres-query. Next, we filter for valid sales conditions (see the documentation of highfrequency::salesCondition) and select a single exchange; in the case of Apple Inc. we choose NASDAQ. Thereafter, trades on the same time stamp are merged via highfrequency::mergeTradesSameTimestamp. Because TAQ data is huge, the highfrequency package assumes input to be data.tables for memory management purposes.

# Rename data for use in highfrequency package and clean the data
dt_aapl <- 
  df_aapl %>%
  rename(DT = dt, PRICE = price, SYM_ROOT = sym_root, SYM_SUFFIX = sym_suffix, 
         SIZE = size, EX = ex, COND = tr_scond) %>%
  mutate(DT = lubridate::ymd_hms(DT, tz = "UTC")) %>%
  data.table::as.data.table() %>%
  exchangeHoursOnly() %>% # only observations from 9:30 to 16:00
  tradesCondition() %>% # from highfrequency package
  selectExchange(c("T", "Q")) %>% # select NASDAQ
  mergeTradesSameTimestamp() %>% # merge trades from same time stamp
  rmOutliersTrades() # remove outliers
# The last step can also be implemented using additional quote data.
# However, merging additional quote data is beyond the scope of this blog post.

# First 6 observations of cleaned high-frequency data
head(dt_aapl)
##                     DT SYMBOL  PRICE NUMTRADES  SIZE     EX SYM_SUFFIX   COND
##                 <POSc> <char>  <num>     <int> <int> <char>     <char> <char>
## 1: 2019-07-08 09:30:00   AAPL 200.79         1    71      Q       <NA>   @  I
## 2: 2019-07-08 09:30:00   AAPL 200.78         1    34      Q       <NA>   @F I
## 3: 2019-07-08 09:30:00   AAPL 200.78         1    26      Q       <NA>   @F I
## 4: 2019-07-08 09:30:00   AAPL 200.78         1    36      Q       <NA>   @F I
## 5: 2019-07-08 09:30:00   AAPL 200.78         1     2      Q       <NA>   @F I
## 6: 2019-07-08 09:30:00   AAPL 200.78         1     2      Q       <NA>   @F I
##          DATE
##        <Date>
## 1: 2019-07-08
## 2: 2019-07-08
## 3: 2019-07-08
## 4: 2019-07-08
## 5: 2019-07-08
## 6: 2019-07-08

# Create a data.table with aggregated price information
dt_aapl_five_minutes <-
  dt_aapl %>%
  aggregatePrice(alignPeriod = 5)

# First 6 observations of aggregated five-minute data
head(dt_aapl_five_minutes)
##                     DT SYMBOL   PRICE NUMTRADES  SIZE     EX SYM_SUFFIX   COND
##                 <POSc> <char>   <num>     <int> <int> <char>     <char> <char>
## 1: 2019-07-08 09:30:00   AAPL 200.790         1    71      Q       <NA>   @  I
## 2: 2019-07-08 09:35:00   AAPL 201.180         1     9      Q       <NA>   @  I
## 3: 2019-07-08 09:40:00   AAPL 199.730         1    10      Q       <NA>   @  I
## 4: 2019-07-08 09:45:00   AAPL 199.585         1   100      Q       <NA>      @
## 5: 2019-07-08 09:50:00   AAPL 198.990         1   170      Q       <NA>      @
## 6: 2019-07-08 09:55:00   AAPL 198.870         1   100      Q       <NA>      @
##          DATE
##        <Date>
## 1: 2019-07-08
## 2: 2019-07-08
## 3: 2019-07-08
## 4: 2019-07-08
## 5: 2019-07-08
## 6: 2019-07-08

We now are able to plot the cleaned and aggregated time series. The aggregated time series is marked by blue dots.

# Plot the aggregated price series
dt_aapl %>%
  ggplot() +
  geom_line(aes(x = DT, y = PRICE)) +
  geom_point(aes(x = DT, y = PRICE), 
             data = dt_aapl_five_minutes, 
             colour = "blue", size = 2, na.rm = TRUE) +
  scale_x_datetime(date_breaks = "30 mins", # nicer breaks
                   date_labels = "%H:%M") +
  scale_y_continuous(limits = c(198,202)) +
  theme_minimal() + 
  xlab("") +
  ylab("Intraday price in USD") +
  ggtitle(paste0("AAPL on ", ymd(dd))) 

My main application of high-frequency data is to calculate realized measures on a daily frequency. Given our five-minute data set, we can construct a small tibble that holds open and close prices, the number of intraday observations, the realized variance, the realized skewness, and the realized kurtosis.

# Calculate some realized measures
dt_aapl_five_minutes %>%
  mutate(ret = 100 * (log(PRICE) - log(lag(PRICE, order_by = DT)))) %>%
  filter(is.na(ret) == FALSE) %>%
  {tibble(open_taq = first(dt_aapl_five_minutes$PRICE),
           close_taq = last(dt_aapl_five_minutes$PRICE),
           n_intraday = dim(dt_aapl)[1],
           rv = sum(.$ret^2),
           rskew = rSkew(.$ret),
           rkurt = rKurt(.$ret))}
## # A tibble: 1 × 6
##   open_taq close_taq n_intraday    rv rskew rkurt
##      <dbl>     <dbl>      <int> <dbl> <dbl> <dbl>
## 1     201.      200.      60188  1.22 -2.21  16.2

In a future blog post, I will extend this approach to multiple stocks and days via the foreach package which enables us to parallelize the computation of realized measures.

References

Barndorff-Nielsen, O. E., P. R. Hansen, A. Lunde, and N. Shephard (2009). Realized kernels in practice: Trades and quotes. Econometrics Journal 12, C1-C32.