Practical Principles for analyzing data in R

Michael Kane

Assistant Professor of Biostatistics

Yale University

What do I mean by a "Practical Principle"?

A truth concerned with actually doing something that serves as a foundation for a behavior*.

*Synthesized from here and here.

Practical Principles for this talk

Programming with Data
Finding Structure in Data
Computing with Data
Productizing Data Analyses

Practical Principles For programming with Data

Part 1: Use R

IEEE Top Programming languages*

*http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages

What about R works when computing with data?

It's not the performance benchmarks

Jan vitek's python/R benchmarks

*Shamelessly taken from https://dl.dropboxusercontent.com/u/20315677/r-meetup-2016-slides.pdf

The Streetlight Effect*: a type of observational bias where people only look for whatever they are searching by looking where it is easiest.

* David H. Freedman (August 1, 2010). "The Streetlight Effect". Discover magazine.

R's syntax values development time over run time.

R makes it easy to create dialects. Some of them are even useful.

A dialect of a programming language or a data exchange language is a (relatively small) variation or extension of the language that does not change its intrinsic nature. - Wikipedia

Borrowing Pipes

create_summary(analyze(clean(load(file_name))))

library(magrittr)

file_name %>% load %>% clean %>% 
  analyze %>% create_summary

Practical Principles For programming with Data

Part 1: Use R

Part 2: don't always use R

On a given day I'll program in

R, bash, sed, awk, c++, python, and julia

...but i'm often calling other languages from R


pirls = function(X, y, lambda, alpha=1, family=binomial, maxit=500, 
                 tol=1e-8, beta=as(spMatrix(nrow=ncol(X), ncol=1), "dgCMatrix"),
                      beta_update=coordinate_descent) {
  converged = FALSE
  for(i in 1:maxit) {
    # Note that there aren't sparse family functions.
    eta      = as.vector(X %*% beta)
    g        = family()$linkinv(eta)
    gprime   = family()$mu.eta(eta)
    z        = eta + (y - g) / gprime
    W        = as.vector(gprime^2 / family()$variance(g))
    beta_old = beta
    beta = beta_update(X, W, z, lambda, alpha, beta, maxit)
    if(sqrt(as.vector(Matrix::crossprod(beta-beta_old))) < tol) {
      converged = TRUE
      break
    }
  }
  list(beta=beta, iterations=i, converged=converged)
}


template <typename ModelMatrixType, typename WeightMatrixType, 
          typename RealVectorType, typename BetaMatrixType, 
          typename UIntVectorType>
BetaMatrixType coordinate_descent(const ModelMatrixType &X, 
  const WeightMatrixType &W, const RealVectorType &z, const double &lambda,
  const double &alpha, BetaMatrixType beta, 
  const UIntVectorType &active_cols, unsigned int maxit)
{
  double thresh = accu(W) * lambda * alpha;
  double quad_loss = quadratic_loss(X, W, z, lambda, alpha, beta);
  double quad_loss_old;
  BetaMatrixType beta_old(beta);
  for (unsigned int i=0; i < maxit; ++i) {
    beta_old = beta;
    beta = arma::sp_mat(beta_old.n_rows, beta_old.n_cols);
    quad_loss_old = quad_loss;
    for (auto it=active_cols.begin(); it != active_cols.end(); ++it)
      beta(*it, 0) = update_coordinate(*it, X, W, z, lambda, alpha,
                                       beta_old, thresh);
    quad_loss = quadratic_loss(X, W, z, lambda, alpha, beta);
    if (quad_loss >= quad_loss_old) {
      beta = beta_old;
      break;
    }
  }
  return beta;
}

In practice: Rcpp


mycd = function(X, W, z, lambda, alpha, beta, active_cols, maxit = 1000L) {
  W = as(diag(length(W)), "dgCMatrix")
  beta = as(beta, "dgCMatrix")
  z = matrix(z, nrow=length(z))
  c_coordinate_descent(X, W, z, lambda, alpha, beta,
                       as.integer(0:(ncol(X)-1)), maxit)
}

fit = pirls(x, y, lambda=lambda, family=gaussian, beta_update=mycd)

observations

C++ is hard to program in

R makes it easy for me to

- Build a reference implementation

- Profile my code

- Call C++

Use R for algorithm development, optimization, building, testing, and benchmarking.

Practical Principles For Analyzing Data with R

Part 1: Exploration is critical

"The greatest value of a picture is when it forces us to notice what we never expected to see." --John Tukey

Practical Principles For Analyzing Data with R

Part 1: Exploration is critical

Part 2: Think carefully about what you want to solve before thinking about the analysis

An example in neruo-psychology

The Consortium for Neuropsychiatric Phenomics (CNP) study includes imaging of a large group of healthy individuals from the community (138 subjects), as well as samples of individuals diagnosed with schizoprenia (58), bipolar disorder (49), and ADHD (45).

The participants, ages 21-50, were recruited by community advertisements from the Los Angeles area and completed extensive neuropsychogical testing, in addition to fMRI scanning. (From OpenFMRI)

Experiment

In the Balloon Analogue Risk Task (BART) a participant is presented with a balloon and offered the chance to earn money by pumping the balloon up by clicking a button. Each click causes the balloon to incrementally inflate and money to be added to a counter up until some threshold, at which point the balloon is over inflated and explodes. Thus, each pump confers greater risk, but also greater potential reward. If the participant chooses to cash-out prior to the balloon exploding then they collect the money earned for that trail, but if balloon explodes earnings for that trial are lost. Participants are not informed about the balloons breakpoints. (From Impulsivity)

We want to see if we can diagnose from BART Telemetry

onset,duration,reaction_time,trial_number,trial_type,button_pressed,action,explode_trial,trial_value,trial_cumulative_value,payoff_level,onset_noTriggerAdjust,pid

2.03216603999999,1.20839207400013,1.20839207400013,1,BALOON,5,ACCEPT,1,5,5,5,0.0321660399999928,sub-10159

5.87107120700011,0.625477546000184,0.625477546000184,1,BALOON,5,ACCEPT,1,5,10,5,3.87107120700011,sub-10159

8.76679230200034,0.561760724999658,0.561760724999658,1,BALOON,5,ACCEPT,1,5,15,5,6.76679230200034,sub-10159

11.1826955680003,0,0,1,BALOON,NA,EXPLODE,1,0,15,5,9.18269556800033,sub-10159

14.9222235540001,0.470330939000178,0.470330939000178,2,BALOON,5,ACCEPT,0,5,5,5,12.9222235540001,sub-10159

Different number of samples per patient

- We want to see if we can diagnose a subject

- Different number of samples per subject

- Not clear how to normalize patients

- Not clear which features we should extract

Quantify a subject as a set of samples and define a distance between sets

Practical Principles For Analyzing Data with R

Part 1: Exploration is critical

Part 2: Think carefully about what you want to solve before thinking about the analysis

Part 3: The less you understand an analysis the less you should trust it

Breaking convnets on imagenet

Figure from Explaining and Harnessing Adversarial Examples by Goodfellow et al.

Practical Principles For Computing with Data in R

Part 1: compute cycles are cheaper than brain cycles

Practical Principles For Computing with Data in R

Part 1: compute cycles are cheaper than brain cycles

Part 2: horizontal scalability beats vertical when compute is a commodity

library(foreach)
library(iterators)
library(doMC)
registerDoMC(cores = 1024)

ans = foreach(it=make_data_gen()) %dopar% {
  process_data(it)
}

Practical Principles For productizing analyses in R

Part 1: Exploration is not analysis

Exploration is where you...

Analysis is where you...

- Develop intuition

- Find surprises in data

- Formalize a hypothesis

- Fit and evaluate a model

- (Fail to) Reject a hypothesis

- Provide a product

Practical Principles For productizing analyses in R

Part 1: Exploration is not analysis

Part 2: an Analysis may make a good web application

An example in Targeted Cancer Therapy

There is no panacea

Data understanding before models

Principles before procedures

Problem-solving skills before technology

Practical Principles for analyzing data in R

What do I mean by a "Practical Principle"?

Practical Principles for this talk

Practical Principles For programming with Data

Part 1: Use R

IEEE Top Programming languages*

What about R works when computing with data?

It's not the performance benchmarks

Jan vitek's python/R benchmarks

The Streetlight Effect*: a type of observational bias where people only look for whatever they are searching by looking where it is easiest.

R's syntax values development time over run time.

R makes it easy to create dialects. Some of them are even useful.

Borrowing Pipes

Practical Principles For programming with Data

Part 1: Use R

Part 2: don't always use R

On a given day I'll program in

R, bash, sed, awk, c++, python, and julia

...but i'm often calling other languages from R

In practice: Rcpp

observations

Practical Principles For Analyzing Data with R

Part 1: Exploration is critical

"The greatest value of a picture is when it forces us to notice what we never expected to see." --John Tukey

Practical Principles For Analyzing Data with R

Part 1: Exploration is critical

Part 2: Think carefully about what you want to solve before thinking about the analysis

An example in neruo-psychology

Experiment

We want to see if we can diagnose from BART Telemetry

Different number of samples per patient

Quantify a subject as a set of samples and define a distance between sets

Practical Principles For Analyzing Data with R

Part 1: Exploration is critical

Part 2: Think carefully about what you want to solve before thinking about the analysis

Part 3: The less you understand an analysis the less you should trust it

Breaking convnets on imagenet

Practical Principles For Computing with Data in R

Part 1: compute cycles are cheaper than brain cycles

Practical Principles For Computing with Data in R

Part 1: compute cycles are cheaper than brain cycles

Part 2: horizontal scalability beats vertical when compute is a commodity​

Practical Principles For productizing analyses in R

Part 1: Exploration is not analysis

Exploration is where you...

Analysis is where you...

Practical Principles For productizing analyses in R

Part 1: Exploration is not analysis

Part 2: an Analysis may make a good web application

An example in Targeted Cancer Therapy

There is no panacea

Part 2: horizontal scalability beats vertical when compute is a commodity