A Sneak Peak at the ioregression Package
Michael Kane
Yale University and Phronesis LLC
What am I going to talk about?
- Provide an example illustrating numerical (in)stability
- Show that if you give up a little stability you can easily implement large (big n small p/big m small n) regressions out of core
- Introduce ioregression, a package that builds on iotools, a package for quickly streaming data from disk to R in chunks
Who else is involved with this?
Simon Urbanek - iotools
Taylor Arnold - iotools and ioregression
> x1 = 1:100
> x2 = x1 * 1.01 + rnorm(100, 0, sd=7e-6)
> X = cbind(x1, x2)
> y = x1 + x2
>
> qr.solve(t(X) %*% X) %*% t(X) %*% y
Error in qr.solve(t(X) %*% X) : singular matrix 'a' in solve
>
> beta = qr.solve(t(X) %*% X, tol=0) %*% t(X) %*% y
> beta
[,1]
x1 0.9869174
x2 1.0170043
> sd(y - X %*% beta)
[1] 0.1187066
>
> sd(lm(y ~ x1 + x2 -1)$residuals)
[1] 9.546656e-15
Why would I include x1 and x2 if they are (nearly) colinear?
> X = matrix(x1, ncol=1)
> beta = qr.solve(t(X) %*% X, tol=0) %*% t(X) %*% y
> beta
[,1]
[1,] 2.01
> sd(y - X %*% beta)
[1] 6.896263e-06
Assume numerical instability is colinearity
What does this buy us?
\hat{\beta} = \left(\sum_{i=1}^r X_i^T X_i \right)^{-1} \left(\sum_{i=1}^r X_i^T Y_i\right)
β^=(∑i=1rXiTXi)−1(∑i=1rXiTYi)
It is easy to implement OLS (and other regressions) in an embarrassingly parallel way.
Analyses can spend a lot of compute time doing IO
iotools to the rescue
library(iotools)
col_classes = c(rep("integer", 8), "character", "integer", "character",
rep("integer", 5), "character", "character", rep("integer", 4),
"character", rep("integer", 6))
system.time({
nr = chunk.apply("airline.csv",
function(x) {
df = dstrsplit(x, col_types=col_classes, sep=",")
nrow(df)
}, CH.MERGE=list, parallel=8)
})
Reduce(`+`, nr)
We can build regression routines on iotools
library(ioregression)
col_names = c("Year", "Month", "DayofMonth", "DayOfWeek", "DepTime",
"CRSDepTime", "ArrTime", "CRSArrTime", "UniqueCarrier",
"FlightNum", "TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "Origin", "Dest", "Distance",
"TaxiIn", "TaxiOut", "Cancelled", "CancellationCode", "Diverted",
"CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay",
"LateAircraftDelay")
data_frame_preprocessor = function(x) {
colnames(x) = col_names
# Get rid of the header row.
if (x$UniqueCarrier[1] == "UniqueCarrier")
x = x[-1,]
x
}
system.time({
fit = iolm(DepDelay ~ ArrDelay, bzfile("2008.csv.bz2", "rb"), data_frame_preprocessor)
})
summary(fit, bzfile("2008.csv.bz2", "rb"), data_frame_preprocessor)
fix_times <- function(t) {
t <- as.character(t)
l4 <- nchar(t) == 4
l3 <- !l4
ret <- rep(0, length(t))
ret[l4] <- as.numeric(substr(t[l4], 1, 2)) * 60 +
as.numeric(substr(t[l4], 3, 4))
ret[l3] <- as.numeric(substr(t[l3], 1, 1)) * 60 +
as.numeric(substr(t[l3], 2, 3))
ret
}
data_frame_preprocessor = function(x) {
colnames(x) = col_names
# Get rid of the header row.
if (x$UniqueCarrier[1] == "UniqueCarrier")
x = x[-1,]
x$DepTime = fix_times(x$DepTime)
x$DayOfWeek = factor(as.character(x$DayOfWeek), levels=1:7,
labels=as.character(1:7))
x
}
system.time({
fit = iolm(DepDelay ~ DayOfWeek + DepTime, "airline.csv",
data_frame_preprocessor, parallel=8)
})
summary(fit, "airline.csv", data_frame_preprocessor, parallel=8)
Current Status
Under heavily development
-
Routines:
- OLS (initial version)
- LARS (initial version)
- GLM
- glmnet
-
Platforms
- iotools (done)
- ROctopus (coming)
Bay Area R User Group
By Michael Kane
Bay Area R User Group
- 2,104