Lecture 4 – Regression and Pairs Trading

Regression Analysis

Regression is a very important topic. It is a widely used statistical tool in economics, finance and trading. R provides pre-written functions that perform linear regressions in a very straightforward manner. There exist multiple add-on packages that allow for more advanced functionality. In this class, we will only utilize the lm() function which is available in the base installation of R. The following example demonstrates the use of this function:

 x    <- rnorm(1000)
 y    <- (x - 2) + rnorm(1000)
 outR <- lm(y ~ x)

What we did above was creat an independent variable x and a dependent variable y. The call to the function lm() performed an OLS (Ordinary Least Square’s) fit to the function: y = b0 + b1x + e, where e was distributed as N(mu, sigma^2)

The ~ is used to separate the independent from the dependent variables. The expression y ~ x is a formula that specifies the linear model with one independent variable and an intercept. If we wanted to fit the same model, but without the intercept, we would specify the formula as y ~ x – 1. This tells R to omit the intercept (force it to zero). The following graph illustrates the best fit lines for a model with and without an intercept. If you know that your model should contain an intercept term, then include it. Otherwise, do not include the intercept. For the subsequent pairs trading example, we are going to assume that the intercept term is equal to 0.

Graph of regression in R

Whenever a regression is performed, it is very important to analyze the residuals (e) of the fitted model. If everything goes according to plan, the residuals will be normally distributed with no visible pattern in the data, no auto-correlation and no heteroskedasticity. The residuals can be extracted from the regression object by using the residuals keyword.

 res <- outR$residuals
 par(mfrow = c(2,1))
 plot(res, col = "blue", main = "Residual Plot")

Here is a graph of both the residuals themselves and their auto-correlation.
residual plot in R and autocorrelation

The summary keyword can be used to obtain the results of the linear regression model fit.


The p-values and t-statistics can be used to evaluate the statistical significance of the coefficients. The smaller the p-value, the more certain we are that the coefficient estimate is close to the actual population coefficient. Notice how both the intercept and the independent variable coefficient is significant in this example. The extraction of the coefficients can be accomplished via
the coefficients keyword.


Pairs Trading

What follows is a really simple version of a pairs trade between two equities. The main motivation for the following example is to expose you to obtaining data via the quantmod package and in using the lm() function that we covered above.

We will explore a simple two-legged spread between AAPL and QQQQ. Once we download and transform the timeseries for both stocks, we will define a simple trading rule and explore the trades that our signal generates. Various trade statistics and graphs will be presented.

Step 1: Obtain data via quantmod
 #Utilize quantmod to load the security symbols
 symbols <- c("AAPL", "QQQQ")

Now that our data frames for AAPL and QQQQ are loaded into memory, let’s extract some prices.

Step 2: Extract prices and time ranges
 #define training set
 startT  <- "2007-01-01"
 endT    <- "2009-01-01"
 rangeT  <- paste(startT,"::",endT,sep ="")
 tAAPL   <- AAPL[,6][rangeT]
 tQQQQ   <- QQQQ[,6][rangeT]
 #define out of sample set
 startO  <- "2009-02-01"
 endO    <- "2010-12-01"
 rangeO  <- paste(startO,"::",endO,sep ="")
 oAAPL   <- AAPL[,6][rangeO]
 oQQQQ   <- QQQQ[,6][rangeO]

Notice how we defined and in-sample and out-of-sample range. We will use the in-sample data to compute a simple hedge ratio and then we will apply this hedge ratio to the out of sample data.

Step 3: Compute returns and find hedge ratio
 #compute price differences on in-sample data
 pdtAAPL <- diff(tAAPL)[-1]
 pdtQQQQ <- diff(tQQQQ)[-1]
 #build the model
 model  <- lm(pdtAAPL ~ pdtQQQQ - 1)
 #extract the hedge ratio
 hr     <- as.numeric(model$coefficients[1])
Step 4: Construct the spread
 #spread price (in-sample)
 spreadT <- tAAPL - hr * tQQQQ
 #compute statistics of the spread
 meanT    <- as.numeric(mean(spreadT,na.rm=TRUE))
 sdT      <- as.numeric(sd(spreadT,na.rm=TRUE))
 upperThr <- meanT + 1 * sdT
 lowerThr <- meanT - 1 * sdT
 #visualize the in-sample spread + stats
 plot(spreadT, main = "AAPL vs. QQQQ spread (in-sample period)")
 abline(h = meanT, col = "red", lwd =2)
 abline(h = meanT + 1 * sdT, col = "blue", lwd=2)
 abline(h = meanT - 1 * sdT, col = "blue", lwd=2)

trading in R pairs

Let’s look at the distribution of the spread.

 hist(spreadT, col = "blue", breaks = 100, main = "Spread Histogram (AAPL vs. QQQQ)")
 abline(v = meanT, col = "red", lwd = 2)

distribution of spread in R

Step 5: Define the trading rule

Once the spread exceeds our upper threshold, we sell AAPL and buy QQQQ. Once the spread drops below our lower threshold, we buy AAPL and sell QQQQ.

 indSell <- which(spreadT >= meanT + sdT)
 indBuy  <- which(spreadT <= meanT - sdT)
Step 6: Figure out the trades
 spreadL  <- length(spreadT)
 pricesB  <- c(rep(NA,spreadL))
 pricesS  <- c(rep(NA,spreadL))
 sp       <- as.numeric(spreadT)
 tradeQty <- 100
 totalP   <- 0
 for(i in 1:spreadL) {
     spTemp <- sp[i]
     if(spTemp < lowerThr) {
        if(totalP <= 0){
           totalP     <- totalP + tradeQty
           pricesB[i] <- spTemp
     } else if(spTemp > upperThr) {
       if(totalP >= 0){
          totalP <- totalP - tradeQty
          pricesS[i] <- spTemp
Step 7: Visualize trades
 plot(spreadT, main = "AAPL vs. QQQQ spread (in-sample period)")
 abline(h = meanT, col = "red", lwd =2)
 abline(h = meanT + 1 * sdT, col = "blue", lwd = 2)
 abline(h = meanT - 1 * sdT, col = "blue", lwd = 2)
 points(xts(pricesB,index(spreadT)), col="green", cex=1.9, pch=19)
 points(xts(pricesS,index(spreadT)), col="red", cex=1.9, pch=19)

R spread prices

, , ,

No comments yet.

Leave a Reply