Practical Principles for Scalable Statistical Analysis
Michael Kane
Yale University and Phronesis LLC
Workshop on Distributed Computing in R
"The goal of this workshop is to standardize the API for exposing distributed computing in R, learn from the experiences of attendees in using R for large scale analysis, and collaborate in open source."
My Experience in Scalable Statistical Analysis
What's this talk about?
1. Compute cycles are cheaper than brain cycles.
"Because you're a C++ programmer, there's an above-average chance you're a performance freak. If you're not, you're still probably sympathetic to their point of view. (If you're not at all interested in performance, shouldn't you be in the Python room down the hall?)"
-- Scott Meyers, Effective Modern C++
Performance freaks value task execution time over their free time.
Horizontal scalability beats vertical scalability when compute is a commodity
When performing an analysis, brain cycles are better spent on analysis, not implementation.
In Practice: Use the foreach and iterators libraries
library(foreach)
library(iterators)
library(doMC)
registerDoMC(cores = 1024)
ans = foreach(it = make_data_gen()) %dopar% {
  process_data(it)
}
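A minimal, self-contained sketch of the same pattern follows. The slide does not define make_data_gen() or process_data(), so the versions here are hypothetical stand-ins: the generator yields 50-row chunks of the built-in iris data and the processing step computes column means. The sequential backend (registerDoSEQ) is also an assumption, chosen so the sketch runs anywhere. Swapping in a parallel backend should only require changing the registration line.

```r
library(foreach)
library(iterators)

# Assumption: run sequentially so the sketch works on any machine.
# Replace with e.g. doMC::registerDoMC(cores = n) to run in parallel.
registerDoSEQ()

# Hypothetical generator: an iterator over fixed-size row chunks of a data frame.
make_data_gen <- function(df = iris, chunk_size = 50) {
  chunk_id <- ceiling(seq_len(nrow(df)) / chunk_size)
  iter(split(df, chunk_id))
}

# Hypothetical per-chunk work: means of the four numeric columns.
process_data <- function(chunk) {
  colMeans(chunk[, 1:4])
}

# The loop body is unchanged no matter which backend is registered.
ans <- foreach(it = make_data_gen(), .combine = rbind) %dopar% {
  process_data(it)
}
```

The analysis code never mentions the number of workers, so the brain cycles go into process_data() rather than into scheduling.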
What if my challenge isn't embarrassingly parallel?
- Get a bigger machine
- Talk to Bryan Lewis
2. Data Exploration is Critical
"The greatest value of a picture is when it forces us to notice what we never expected to see." -- John Tukey
Scaling data exploration
Manage complexity/abundance with interactivity
Organize data by:
- Resolution
- Division
  - Conditioning variable division
  - Replicate division
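The two kinds of division listed above can be sketched in base R. This is an illustration only, under assumptions not on the slide: the built-in iris data stands in for a large data set, Species is the conditioning variable, and k = 3 replicate subsets are drawn at random. The helper subset_means() is a hypothetical name.

```r
set.seed(1)

# Conditioning-variable division: one subset per level of a variable.
by_species <- split(iris, iris$Species)

# Replicate division: rows assigned at random to k subsets of equal size.
k <- 3
replicate_id <- sample(rep(seq_len(k), length.out = nrow(iris)))
by_replicate <- split(iris, replicate_id)

# Apply the same summary to every subset, then recombine into one matrix
# with one row per subset.
subset_means <- function(subsets) {
  t(sapply(subsets, function(d) colMeans(d[, 1:4])))
}

cond_summary <- subset_means(by_species)
rep_summary <- subset_means(by_replicate)
```

Subsets from a conditioning variable are meaningful on their own (one per species here), while replicate subsets are interchangeable random samples, so their summaries should look alike.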
How do I navigate an abundance of visualizations?
"It seems natural to call such computer guided diagnostics cognostics. We must learn to choose them, calculate them, and use them. Else we drown in a sea of many displays." -- John Tukey
In Practice: Use Trelliscope
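Independent of Trelliscope's own API, which is not shown here, the idea of a cognostic can be sketched in base R: compute a small set of numbers per panel, then sort or filter panels by them to decide which displays to look at first. The panel definition (one per iris species) and the chosen metrics are assumptions made for illustration.

```r
# One "panel" per subset; each would normally be drawn as its own plot.
panels <- split(iris, iris$Species)

# Compute cognostics: one row of summary metrics per panel.
cognostics <- do.call(rbind, lapply(names(panels), function(nm) {
  d <- panels[[nm]]
  data.frame(
    panel = nm,
    n = nrow(d),
    cor_sepal = cor(d$Sepal.Length, d$Sepal.Width),
    stringsAsFactors = FALSE
  )
}))

# Rank panels so the strongest length/width relationship is viewed first.
ranked <- cognostics[order(-cognostics$cor_sepal), ]
```

With thousands of panels instead of three, the ranked table is what lets a viewer open the handful of displays most likely to show something unexpected.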
3. There is still no panacea
Figure from Explaining and Harnessing Adversarial Examples by Goodfellow et al.
These images are classified with >99.6% confidence as the shown class by a Convolutional Network.
Working on a variety of different problems still requires a variety of tools and approaches
In Practice: Understand the problem domain and use CRAN/GitHub
Thanks
My talk from the April NYC R Conference