Practical Principles for Scalable Statistical Analysis

 

Michael Kane

Yale University and Phronesis LLC

Workshop on Distributed Computing in R

"The goal of this workshop is to standardize the API for exposing distributed computing in R, learn from the experiences of attendees in using R for large scale analysis, and collaborate in open source."

My Experience in Scalable Statistical Analysis

What's this talk about?

1. Compute cycles are cheaper than brain cycles.

"Because you're a C++ programmer, there's an above-average chance you're a performance freak. If you're not, you're still probably sympathetic to their point of view. (If you're not at all interested in performance, shouldn't you be in the Python room down the hall?)"

-- Scott Meyers, Effective Modern C++

Performance freaks value task execution time over their free time.

Horizontal scalability beats vertical scalability when compute is a commodity.

When performing an analysis, brain cycles are better spent on analysis, not implementation.

In Practice: Use the foreach and iterators libraries

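library(foreach)
library(iterators)
library(doMC)

# Register a multicore backend. doMC forks workers on a single machine,
# so set cores to what the hardware actually provides.
registerDoMC(cores = 1024)

# make_data_gen() returns an iterator over chunks of data and
# process_data() handles one chunk; both are user-defined.
ans = foreach(it=make_data_gen()) %dopar% {
  process_data(it)
}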
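A minimal, self-contained sketch of the same pattern may help make it concrete. The bodies of `make_data_gen()` and `process_data()` below are hypothetical stand-ins (the talk does not define them), `mtcars` is used only as toy data, and the sequential backend `registerDoSEQ()` is an assumption chosen so the example runs anywhere. Swapping in `registerDoMC()` or another backend should not require changing the loop.

```r
library(foreach)
library(iterators)

# Hypothetical stand-in: an iterator that yields one chunk of rows at a time.
make_data_gen <- function(df = mtcars, chunk_rows = 8) {
  rows <- seq_len(nrow(df))
  idx <- split(rows, ceiling(rows / chunk_rows))
  iter(lapply(idx, function(i) df[i, , drop = FALSE]))
}

# Hypothetical stand-in: reduce a chunk to partial sums that can be recombined.
process_data <- function(chunk) {
  c(n = nrow(chunk), mpg_sum = sum(chunk$mpg))
}

# Sequential backend so the sketch runs on any machine.
registerDoSEQ()

# Each chunk is processed independently; rbind stacks the partial results.
partial <- foreach(it = make_data_gen(), .combine = rbind) %dopar% {
  process_data(it)
}

# Recombine the partial sums into the overall mean.
totals <- colSums(partial)
mean_mpg <- totals[["mpg_sum"]] / totals[["n"]]
```

Because each chunk returns sufficient statistics rather than raw data, the recombined `mean_mpg` should match `mean(mtcars$mpg)` regardless of how many workers or chunks are used.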

What if my challenge isn't embarrassingly parallel?

  1. Get a bigger machine

  2. Talk to Bryan Lewis

2. Data Exploration is Critical

"The greatest value of a picture is when it forces us to notice what we never expected to see." -- John Tukey

Scaling data exploration

Manage complexity/abundance with interactivity 

Organize data by:

  1. Resolution
  2. Division
    • Conditioning variable division
    • Replicate division
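The two kinds of division listed above can be sketched in base R. This is only an illustration under stated assumptions: `mtcars` is toy data, `cyl` is an arbitrary choice of conditioning variable, and `k = 4` replicate subsets is an arbitrary size.

```r
set.seed(1)

# Conditioning-variable division: one subset per level of a variable.
by_cyl <- split(mtcars, mtcars$cyl)

# Replicate division: random subsets of (roughly) equal size.
k <- 4
replicate_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))
by_replicate <- split(mtcars, replicate_id)

# Apply the same summary to every subset, then compare across subsets.
cyl_means <- sapply(by_cyl, function(d) mean(d$mpg))
rep_means <- sapply(by_replicate, function(d) mean(d$mpg))
```

Conditioning subsets are meant to differ from one another, so `cyl_means` is read as a comparison across groups. Replicate subsets are meant to look alike, so the spread of `rep_means` gives a rough sense of sampling variability.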

How do I navigate an abundance of visualizations?

"It seems natural to call such computer guided diagnostics cognostics. We must learn to choose them, calculate them, and use them. Else we drown in a sea of many displays." -- John Tukey

In Practice: Use Trelliscope
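The idea behind cognostics can be sketched without any display library: compute a few numeric diagnostics per panel, then sort or filter panels by them to decide which displays to look at first. The sketch below uses base R only and does not use Trelliscope's API. The panels (subsets of `mtcars` by `cyl`), the model `mpg ~ wt`, and the chosen metrics are all illustrative assumptions.

```r
# One panel per subset; each would correspond to one plot in a large display.
panels <- split(mtcars, mtcars$cyl)

# Compute per-panel diagnostics ("cognostics") from a simple fit.
cognostics <- do.call(rbind, lapply(names(panels), function(key) {
  d <- panels[[key]]
  fit <- lm(mpg ~ wt, data = d)
  data.frame(
    panel = key,
    n = nrow(d),
    slope = unname(coef(fit)["wt"]),
    resid_sd = sd(residuals(fit)),
    stringsAsFactors = FALSE
  )
}))

# Rank panels so the steepest relationships are viewed first.
ranked <- cognostics[order(-abs(cognostics$slope)), ]
```

With three panels the ranking is trivial, but the same table scales to thousands of panels, where sorting and filtering on a handful of metrics is what keeps the set of displays navigable.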

3. There is still no panacea

Figure from Explaining and Harnessing Adversarial Examples by Goodfellow et al.

These images are classified with >99.6% confidence as the shown class by a Convolutional Network.

Working on a variety of different problems still requires a variety of tools and approaches.

In Practice: Understand the problem domain and use CRAN/GitHub

Thanks