Michael Kane
Assistant Professor of Biostatistics
Yale University
A truth concerned with actually doing something that serves as a foundation for a behavior*.
*Shamelessly taken from https://dl.dropboxusercontent.com/u/20315677/r-meetup-2016-slides.pdf
* David H. Freedman (August 1, 2010). "The Streetlight Effect". Discover magazine.
A dialect of a programming language or a data exchange language is a (relatively small) variation or extension of the language that does not change its intrinsic nature. - Wikipedia
create_summary(analyze(clean(load(file_name))))
library(magrittr)
file_name %>% load %>% clean %>%
analyze %>% create_summary
pirls = function(X, y, lambda, alpha=1, family=binomial, maxit=500,
tol=1e-8, beta=as(spMatrix(nrow=ncol(X), ncol=1), "dgCMatrix"),
beta_update=coordinate_descent) {
converged = FALSE
for(i in 1:maxit) {
# Note that there aren't sparse family functions.
eta = as.vector(X %*% beta)
g = family()$linkinv(eta)
gprime = family()$mu.eta(eta)
z = eta + (y - g) / gprime
W = as.vector(gprime^2 / family()$variance(g))
beta_old = beta
beta = beta_update(X, W, z, lambda, alpha, beta, maxit)
if(sqrt(as.vector(Matrix::crossprod(beta-beta_old))) < tol) {
converged = TRUE
break
}
}
list(beta=beta, iterations=i, converged=converged)
}
template <typename ModelMatrixType, typename WeightMatrixType,
typename RealVectorType, typename BetaMatrixType,
typename UIntVectorType>
BetaMatrixType coordinate_descent(const ModelMatrixType &X,
const WeightMatrixType &W, const RealVectorType &z, const double &lambda,
const double &alpha, BetaMatrixType beta,
const UIntVectorType &active_cols, unsigned int maxit)
{
double thresh = accu(W) * lambda * alpha;
double quad_loss = quadratic_loss(X, W, z, lambda, alpha, beta);
double quad_loss_old;
BetaMatrixType beta_old(beta);
for (unsigned int i=0; i < maxit; ++i) {
beta_old = beta;
beta = arma::sp_mat(beta_old.n_rows, beta_old.n_cols);
quad_loss_old = quad_loss;
for (auto it=active_cols.begin(); it != active_cols.end(); ++it)
beta(*it, 0) = update_coordinate(*it, X, W, z, lambda, alpha,
beta_old, thresh);
quad_loss = quadratic_loss(X, W, z, lambda, alpha, beta);
if (quad_loss >= quad_loss_old) {
beta = beta_old;
break;
}
}
return beta;
}
mycd = function(X, W, z, lambda, alpha, beta, active_cols, maxit = 1000L) {
W = as(diag(length(W)), "dgCMatrix")
beta = as(beta, "dgCMatrix")
z = matrix(z, nrow=length(z))
c_coordinate_descent(X, W, z, lambda, alpha, beta,
as.integer(0:(ncol(X)-1)), maxit)
}
fit = pirls(x, y, lambda=lambda, family=gaussian, beta_update=mycd)
C++ is hard to program in
R makes it easy for me to
- Build a reference implementation
- Profile my code
- Call C++
Use R for algorithm development, optimization, building, testing, and benchmarking.
The Consortium for Neuropsychiatric Phenomics (CNP) study includes imaging of a large group of healthy individuals from the community (138 subjects), as well as samples of individuals diagnosed with schizoprenia (58), bipolar disorder (49), and ADHD (45).
The participants, ages 21-50, were recruited by community advertisements from the Los Angeles area and completed extensive neuropsychogical testing, in addition to fMRI scanning. (From OpenFMRI)
In the Balloon Analogue Risk Task (BART) a participant is presented with a balloon and offered the chance to earn money by pumping the balloon up by clicking a button. Each click causes the balloon to incrementally inflate and money to be added to a counter up until some threshold, at which point the balloon is over inflated and explodes. Thus, each pump confers greater risk, but also greater potential reward. If the participant chooses to cash-out prior to the balloon exploding then they collect the money earned for that trail, but if balloon explodes earnings for that trial are lost. Participants are not informed about the balloons breakpoints. (From Impulsivity)
onset,duration,reaction_time,trial_number,trial_type,button_pressed,action,explode_trial,trial_value,trial_cumulative_value,payoff_level,onset_noTriggerAdjust,pid
2.03216603999999,1.20839207400013,1.20839207400013,1,BALOON,5,ACCEPT,1,5,5,5,0.0321660399999928,sub-10159
5.87107120700011,0.625477546000184,0.625477546000184,1,BALOON,5,ACCEPT,1,5,10,5,3.87107120700011,sub-10159
8.76679230200034,0.561760724999658,0.561760724999658,1,BALOON,5,ACCEPT,1,5,15,5,6.76679230200034,sub-10159
11.1826955680003,0,0,1,BALOON,NA,EXPLODE,1,0,15,5,9.18269556800033,sub-10159
14.9222235540001,0.470330939000178,0.470330939000178,2,BALOON,5,ACCEPT,0,5,5,5,12.9222235540001,sub-10159
- We want to see if we can diagnose a subject
- Different number of samples per subject
- Not clear how to normalize patients
- Not clear which features we should extract
Figure from Explaining and Harnessing Adversarial Examples by Goodfellow et al.
library(foreach)
library(iterators)
library(doMC)
registerDoMC(cores = 1024)
ans = foreach(it=make_data_gen()) %dopar% {
process_data(it)
}
- Develop intuition
- Find surprises in data
- Formalize a hypothesis
- Fit and evaluate a model
- (Fail to) Reject a hypothesis
- Provide a product
Data understanding before models
Principles before procedures
Problem-solving skills before technology