A Sneak Peak at the ioregression Package

Michael Kane

Yale University and Phronesis LLC

What am I going to talk about?

  • Provide an example illustrating numerical (in)stability
  • Show that if you give up a little stability you can easily implement large (big n small p/big m small n) regressions out of core
  • Introduce ioregression, a package that builds on iotools, a package for quickly streaming data from disk to R in chunks

Who else is involved with this?

Simon Urbanek - iotools

Taylor Arnold - iotools and ioregression

> x1 = 1:100
> x2 = x1 * 1.01 + rnorm(100, 0, sd=7e-6)
> X = cbind(x1, x2)
> y = x1 + x2
> 
> qr.solve(t(X) %*% X) %*% t(X) %*% y
Error in qr.solve(t(X) %*% X) : singular matrix 'a' in solve
> 
> beta = qr.solve(t(X) %*% X, tol=0) %*% t(X) %*% y
> beta
        [,1]
x1 0.9869174
x2 1.0170043
> sd(y - X %*% beta)
[1] 0.1187066
> 
> sd(lm(y ~ x1 + x2 -1)$residuals)
[1] 9.546656e-15

Why would I include x1 and x2 if they are (nearly) colinear?

> X = matrix(x1, ncol=1)
> beta = qr.solve(t(X) %*% X, tol=0) %*% t(X) %*% y
> beta
     [,1]
[1,] 2.01
> sd(y - X %*% beta)
[1] 6.896263e-06

Assume numerical instability is colinearity

What does this buy us?

\hat{\beta} = \left(\sum_{i=1}^r X_i^T X_i \right)^{-1} \left(\sum_{i=1}^r X_i^T Y_i\right)
β^=(i=1rXiTXi)1(i=1rXiTYi)

It is easy to implement OLS (and other regressions) in an embarrassingly parallel way.

Analyses can spend a lot of compute time doing IO

iotools to the rescue

library(iotools)

col_classes = c(rep("integer", 8), "character", "integer", "character",
                rep("integer", 5), "character", "character", rep("integer", 4),
                "character", rep("integer", 6))

system.time({
  nr = chunk.apply("airline.csv",
              function(x) {
                df = dstrsplit(x, col_types=col_classes, sep=",")
                nrow(df)
              }, CH.MERGE=list, parallel=8)
})
Reduce(`+`, nr)

We can build regression routines on iotools

library(ioregression)
col_names = c("Year", "Month", "DayofMonth", "DayOfWeek", "DepTime",
              "CRSDepTime", "ArrTime", "CRSArrTime", "UniqueCarrier",
              "FlightNum", "TailNum", "ActualElapsedTime", "CRSElapsedTime",
              "AirTime", "ArrDelay", "DepDelay", "Origin", "Dest", "Distance",
              "TaxiIn", "TaxiOut", "Cancelled", "CancellationCode", "Diverted",
              "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay",
              "LateAircraftDelay")

data_frame_preprocessor = function(x) {
  colnames(x) = col_names
  # Get rid of the header row.
  if (x$UniqueCarrier[1] == "UniqueCarrier")
    x = x[-1,]
  x
}

system.time({
  fit = iolm(DepDelay ~ ArrDelay, bzfile("2008.csv.bz2", "rb"), data_frame_preprocessor)
})

summary(fit, bzfile("2008.csv.bz2", "rb"), data_frame_preprocessor)
fix_times <- function(t) {
  t <- as.character(t)
  l4 <- nchar(t) == 4
  l3 <- !l4
  ret <- rep(0, length(t))
  ret[l4] <- as.numeric(substr(t[l4], 1, 2)) * 60 +
    as.numeric(substr(t[l4], 3, 4))
  ret[l3] <- as.numeric(substr(t[l3], 1, 1)) * 60 +
    as.numeric(substr(t[l3], 2, 3))
  ret
} 

data_frame_preprocessor = function(x) {
  colnames(x) = col_names
  # Get rid of the header row.
  if (x$UniqueCarrier[1] == "UniqueCarrier")
    x = x[-1,]
  x$DepTime = fix_times(x$DepTime)
  x$DayOfWeek = factor(as.character(x$DayOfWeek), levels=1:7,
                       labels=as.character(1:7))
  x
}

system.time({
  fit = iolm(DepDelay ~ DayOfWeek + DepTime, "airline.csv", 
             data_frame_preprocessor, parallel=8)
})

summary(fit, "airline.csv", data_frame_preprocessor, parallel=8)

Current Status

Under heavily development

  • Routines:
    • OLS  (initial version)
    • LARS (initial version)
    • GLM
    • glmnet
  • Platforms
    • iotools (done)
    • ROctopus (coming)

Bay Area R User Group

By Michael Kane

Bay Area R User Group

  • 1,973