Scalable, Exact Approaches to Fitting Linear Models when n >> p
Michael Kane, Taylor Arnold, and Simon Urbanek
iotools provides
-
A framework for computing on streamed chunks of data
-
Optimized routines for moving data from disk to processor
Let's build a regression library on top of it!
The design of ioregression
-
Build regression preprocessors, kernels, and combiners for different types of linear regressions.
-
Use iotools to stream data to the kernels and aggregate the results to get regression coefficients and diagnostics.
Why not just use foreach?
ioregression is foreach compatible
foreach doesn't give you
-
optimized io routines
-
Doesn't currently handle Hadoop's data locality
Goal is to provide numerical operations required for chunk-wise regressions.
Why linear regressions?
Linear Regressions are Simple
-
Assumptions are known
-
Diagnostics for assessing fit and assumptions
-
Results are interpretable
Linear Regressions are Computationally Efficient
-
Complexity is O(np^2)
-
Model fitting can be parallelized
Linear Regressions are Compatible with Other Analyses
> train_inds = sample(1:150, 130)
> iris_train = iris[train_inds,]
> iris_test = iris[setdiff(1:150, train_inds),]
> lm_fit = lm(Sepal.Length ~ Sepal.Width+Petal.Length,
+ data=iris_train)
> sd(predict(lm_fit, iris_test) - iris_test$Sepal.Length)
[1] 0.3879703
> library(randomForest)
> iris_train$residual = lm_fit$residuals
> iris_train$fitted_values = lm_fit$fitted.values
> rf_fit = randomForest(
+ residual ~ Sepal.Width+Petal.Length+fitted_values,
+ data=iris_train)
>
> lm_pred = predict(lm_fit, iris_test)
> iris_test$fitted_values = lm_pred
> sd(predict(rf_fit, iris_test)+lm_pred - iris_test$Sepal.Length)
[1] 0.3458326
> rf_fit2 = randomForest(
+ Sepal.Length ~ Sepal.Width+Petal.Length,
+ data=iris_train)
> sd(predict(rf_fit2, iris_test) - iris_test$Sepal.Length)
[1] 0.40927
Calculating the OLS slope coefficients
> x1 = 1:100
> x2 = x1 * 1.01 + rnorm(100, 0, sd=7e-6)
> X = cbind(x1, x2)
> y = x1 + x2
>
> qr.solve(t(X) %*% X) %*% t(X) %*% y
Error in qr.solve(t(X) %*% X) : singular matrix 'a' in solve
>
> beta = qr.solve(t(X) %*% X, tol=0) %*% t(X) %*% y
> beta
[,1]
x1 0.9869174
x2 1.0170043
> sd(y - X %*% beta)
[1] 0.1187066
>
> sd(lm(y ~ x1 + x2 -1)$residuals)
[1] 9.546656e-15
What if we think of x1 and x2 as being highly correlated instead?
> X = matrix(x1, ncol=1)
> beta = qr.solve(t(X) %*% X, tol=0) %*% t(X) %*% y
> beta
[,1]
[1,] 2.01
> sd(y - X %*% beta)
[1] 6.896263e-06
What does this buy us?
The regression calculation becomes
\hat{\beta} = \left(\sum_{i=1}^r X_i^T X_i \right)^{-1} \left(\sum_{i=1}^r X_i^T Y_i\right)
β^=(∑i=1rXiTXi)−1(∑i=1rXiTYi)
It is easy to implement OLS (and other regressions) in an embarrassingly parallel way.
How do I use it?
library(ioregression)
file_name = "2008.csv"
data = adf(file_name, sep=",", header=TRUE)
system.time({iofit = iolm(DepDelay ~ Distance, data=data)})
## user system elapsed
## 124.884 2.810 127.802
How fast is lm?
system.time({
df = read.table(file_name, header=TRUE, sep=",")
lmfit = lm(DepDelay ~ Distance, data=df)
})
## user system elapsed
## 134.436 3.577 138.258
Status
Under development
- Routines:
- OLS
- Ridge
- LARS
- GLM
- glmnet (coming)
- Platforms
- iotools (done)
- others?
Scalable, Exact Approaches to Fitting Linear Models when n >> p
By Michael Kane
Scalable, Exact Approaches to Fitting Linear Models when n >> p
- 2,100