Distributed Data Structures in R for General, Large-Scale Computing

Michael J. Kane

Phronesis, LLC and Yale University


  • Simon Urbanek and AT&T Research Labs

  • A portion of this research is based on research sponsored by DARPA under award FA8750-12-2-0324. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

Deconstructing Distributed Computing

  • move expression to a parallel process
  • evaluate the expression
  • return the result
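
The three steps can be sketched in a single R session. This is only an illustration: serialization stands in for the network hop, and the names `expr`, `message_out`, and `message_back` are made up for the example.

```r
# Coordinator: describe the computation as an unevaluated expression.
expr <- quote(sum(sqrt(1:100)))

# 1. Move: turn the expression into raw bytes, as if sending it to a worker.
message_out <- serialize(expr, connection = NULL)

# 2. Evaluate: the "worker" rebuilds the expression and evaluates it.
worker_value <- eval(unserialize(message_out))

# 3. Return: the value travels back to the coordinator the same way.
message_back <- serialize(worker_value, connection = NULL)
result <- unserialize(message_back)
print(result)
```

In a real cluster the two `serialize()` calls would sit on opposite ends of a connection; everything else stays the same.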

What do we need to facilitate this?

  • coordinator to describe the computation
  • worker to evaluate the expression
  • mechanism for getting data from the coordinator to the worker
  • possibly a mechanism for moving data between workers
  • mechanism for getting results to coordinator

Creating a distributed computing system is about defining a communication protocol


  • data movement for a computation is defined up-front
  • workers (slaves) communicate directly through "mailboxes"
  • performance is excellent
  • not very fault-tolerant

Configuration service

R coordinator session

> library(rredis)
> redisConnect()
> redisLPush("worker_queue", "hello worker session")
[1] 1
> redisBLPop("coordinator_queue")
[1] "hello coordinator session" 

R worker session

> library(rredis)
> redisConnect()
> redisBRPop("worker_queue")
[1] "hello worker session"
> redisLPush("coordinator_queue", "hello coordinator session")
[1] 1 


  • workers block on queues (Redis lists)


  • buys us communication orthogonality
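
To see why pairing a left push with a right pop delivers messages in first-in, first-out order, here is a small in-memory stand-in for one Redis list. It does not talk to Redis: `new_queue`, `lpush`, and `rpop` are hypothetical helpers defined only for this sketch, and unlike a blocking pop, `rpop` returns NULL on an empty queue instead of waiting.

```r
# In-memory stand-in for one Redis list; element 1 is the head (left end).
new_queue <- function() {
  items <- list()
  list(
    # Push on the left and return the new length, as a list push reports it.
    lpush = function(x) {
      items <<- c(list(x), items)
      length(items)
    },
    # Pop from the right; a blocking pop would wait here instead of
    # returning NULL when the queue is empty.
    rpop = function() {
      if (length(items) == 0) return(NULL)
      x <- items[[length(items)]]
      items <<- items[-length(items)]
      x
    }
  )
}

worker_queue <- new_queue()
worker_queue$lpush("task 1")
worker_queue$lpush("task 2")

first  <- worker_queue$rpop()   # oldest message comes out first
second <- worker_queue$rpop()
```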


Build a communication framework that:

    1. Allows you to put computations into a cluster to be consumed
    2. Can return the result of a computation or a handle to the result
    3. Is elastic
    4. Does not require that data move through a centralized configuration manager
    5. Is simple and lightweight


A phylum of simple marine animals that are regenerative

How does it work?

How is it used?

  • pull( <resource location>, <expression>, broadcast=FALSE ) - executes an expression at a resource location and returns the result
  • push( <resource location>, <expression>, resource_name=guid(), broadcast=FALSE) - executes an expression at a resource location and returns the name of a new resource

 push( "A", "A %*% pull('B', 'B')", "C" )
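
A toy, single-session mock can make the semantics of pull and push concrete. It is not the package's implementation: the "cluster" is just an environment of named lists, `locate` is a hypothetical helper, the worker names `w1` and `w2` are invented, and `broadcast` is left out.

```r
# Each entry of `cluster` holds the resources of one simulated R session.
cluster <- new.env()
cluster$w1 <- list(A = matrix(1:4, nrow = 2))
cluster$w2 <- list(B = diag(2))

# Hypothetical helper: name of the simulated worker that holds a resource.
locate <- function(resource) {
  for (w in ls(cluster)) {
    if (resource %in% names(cluster[[w]])) return(w)
  }
  stop("unknown resource: ", resource)
}

# Evaluate an expression where the resource lives and return the value.
pull <- function(location, expr) {
  w <- locate(location)
  # Resources of worker `w` are visible by name; enclos lets the
  # expression call pull() again.
  eval(parse(text = expr), envir = cluster[[w]], enclos = environment())
}

# Evaluate an expression where the resource lives, keep the value there
# under a new name, and return only that name (a handle).
push <- function(location, expr, resource_name) {
  w <- locate(location)
  held <- cluster[[w]]
  held[[resource_name]] <- pull(location, expr)
  assign(w, held, envir = cluster)
  resource_name
}

handle <- push("A", "A %*% pull('B', 'B')", "C")
C_local <- pull(handle, "C")
```

The product is computed next to "A", only "B" moves, and the coordinator receives a handle until it asks for the value.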

A Generative Communication Framework (cf Gelernter 1985)

  • space uncoupling (distributed naming)
  • time uncoupling
  • distributed sharing
  • support for continuation

Bottom line is that you can support a large class of distributed communication patterns.

Space Uncoupling (distributed naming)

A resource identifier can be used by any member of the cluster

push( "B", "B %*% pull('A', 'A')", "C" )
# "C" is now available to any member of the 
# cluster

push( "D", "pull('C', 'C')", "C")

Time uncoupling

A resource is available to the cluster until it is explicitly removed

 pull( "C", "rm(C)", broadcast=TRUE )

Distributed sharing

Data can be shared across R sessions

  1. Resources can be stored redundantly
  2. Provides multi-map capabilities
  3. Consistency implemented as a policy

Continuation passing

Expression arguments to pull and push can themselves call pull and push

How do we handle deadlocks?

Handling deadlocks with new R functionality

 pull("A","pull('B','pull(\"A\", \"A\")')")
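
The call above can deadlock if each worker handles one request at a time: A waits on B, and B waits on A, which is still busy. The sketch below simulates that assumption in one session and reports the cycle instead of hanging. The `busy` registry and this `pull` are illustrative stand-ins, not the package's code.

```r
# Tracks which simulated workers are in the middle of an evaluation.
busy <- new.env()

# Toy pull(): `worker` stands in for a resource location, and the
# expression is evaluated in this same session.
pull <- function(worker, expr) {
  if (isTRUE(busy[[worker]])) {
    stop("deadlock: worker ", worker, " is already evaluating a request")
  }
  assign(worker, TRUE, envir = busy)
  on.exit(assign(worker, FALSE, envir = busy))
  # Evaluate where pull() was defined so nested pull() calls resolve.
  eval(parse(text = expr), envir = parent.env(environment()))
}

outcome <- tryCatch(
  pull("A", "pull('B', 'pull(\"A\", \"A\")')"),
  error = function(e) conditionMessage(e)
)
print(outcome)
```

A real worker cannot raise an error this cheaply, which is why evaluating requests outside the worker's main loop is attractive.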

The new mcparallel functionality

 > ?mcparallel


mcparallel(expr, name, mc.set.seed = TRUE, silent = FALSE,
                mc.affinity = NULL, mc.interactive = FALSE,
                detached = FALSE)

detached: logical, if ‘TRUE’ then the job is detached from
the current session and cannot deliver any results back - it
is used for the code side-effect only.
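
For contrast with `detached = TRUE`, a minimal sketch of the ordinary (attached) use: the expression runs in a forked child and the parent collects the value later. It assumes a Unix-alike system, since mcparallel relies on fork.

```r
library(parallel)  # mcparallel() forks, so this needs a Unix-alike OS

# Start the expression in a forked child; the call returns immediately.
job <- mcparallel({
  Sys.sleep(1)
  6 * 7
})

# The parent session is free to do other work here.

# Block until the child finishes and fetch its value.
result <- mccollect(job)[[1]]
print(result)
```

A detached job skips the collection step entirely, so it is only useful for side effects.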

The background package

require(background)
future = function(expr) {
  p = parallel::mcparallel(expr)
  async.add(p$fd[1], function(h, p) {
    }, p)
}

future({Sys.sleep(5); "done!" })
# in 5 seconds you'll see the output 

Conventions for distributed data structures (DDS's)

    1. A DDS should look and feel like analogous local data structures
    2. An operation involving a DDS should return a DDS by default
    3. A DDS with dimension attributes can be "emerged" on a local machine with "[]"
    4. You should be able to stream values held by a DDS with an iterator


dv <- distribute.vector(rnorm(1000), 79)
iv <- distribute.vector(sample(1:length(dv), 35, replace=TRUE), 14)
a <- dv[iv]
expect_that(dv[][iv[]], equals(a[])) 

  • implemented in vector blocks
  • index with vectors or distributed vectors
  • currently working on binary infix operators
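
A local sketch of block storage and indexing, under stated assumptions: the blocks live in an ordinary list rather than on workers, `get_elements` is a hypothetical helper, and 79 is treated as a block size only to echo the call above (the source does not say whether that argument is a block size or a block count).

```r
set.seed(1)
x <- rnorm(1000)
block_size <- 79

# Cut the vector into consecutive blocks; the last block is shorter.
blocks <- split(x, ceiling(seq_along(x) / block_size))

# Map each global index to (block number, offset within block), then fetch.
get_elements <- function(blocks, block_size, i) {
  block  <- (i - 1) %/% block_size + 1
  offset <- (i - 1) %% block_size + 1
  vapply(seq_along(i),
         function(k) blocks[[block[k]]][offset[k]],
         numeric(1))
}

iv <- sample(seq_along(x), 35, replace = TRUE)
picked <- get_elements(blocks, block_size, iv)
```

In a distributed setting each fetch would go only to the worker holding that block, so an index vector touches just the blocks it needs.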


tickerInfo <- getReturns("DORM")$full[["DORM"]]
ddf <- distribute.data.frame(tickerInfo, 83)
expect_that(ddf[], equals(tickerInfo))
it <- ibdf(ddf, chunkSize=47)
isi <- isplitIndices(nrow(tickerInfo), chunkSize=47)
expect_that(nextElem(it), equals(tickerInfo[nextElem(isi),]))

  • implemented in row blocks
  • indexed with vectors or distributed vectors
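
Streaming a data frame in row blocks can be sketched in base R with a closure. `row_block_iterator` is a hypothetical helper written for this example, and the built-in `iris` data set replaces the ticker data; the chunk size of 47 echoes the test above.

```r
# Hypothetical helper: returns a function that yields successive row blocks
# and NULL once the data frame is exhausted.
row_block_iterator <- function(df, chunk_size) {
  start <- 1
  function() {
    if (start > nrow(df)) return(NULL)
    rows <- start:min(start + chunk_size - 1, nrow(df))
    start <<- start + chunk_size
    df[rows, , drop = FALSE]
  }
}

next_block <- row_block_iterator(iris, chunk_size = 47)

block_sizes <- integer(0)
rebuilt <- NULL
while (!is.null(block <- next_block())) {
  block_sizes <- c(block_sizes, nrow(block))
  rebuilt <- rbind(rebuilt, block)   # stitch the blocks back together
}
```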


  • Still in early development
  • Underlying representation is a matrix of resource handles
  • Will support matrix multiplies
  • Will support sparse, dense, and mixed matrices

> a <- matrix(letters[1:4], nrow=2)
> a
     [,1] [,2]
[1,] "a"  "c"
[2,] "b"  "d"
> char_prod(a, a)
         [,1]                [,2]
result.1 "a %*% a + c %*% b" "a %*% c + c %*% d"
result.2 "b %*% a + d %*% b" "b %*% c + d %*% d"
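
One way to produce such symbolic block products is sketched below. `char_prod_sketch` is a hypothetical re-creation written for illustration, not the function used above, and it does not reproduce the row names in that output.

```r
# Build, for each output cell, the expression string for row i of x times
# column j of y, treating every entry as the name of a matrix block.
char_prod_sketch <- function(x, y) {
  stopifnot(ncol(x) == nrow(y))
  out <- matrix("", nrow = nrow(x), ncol = ncol(y))
  for (i in seq_len(nrow(x))) {
    for (j in seq_len(ncol(y))) {
      terms <- paste(x[i, ], "%*%", y[, j])
      out[i, j] <- paste(terms, collapse = " + ")
    }
  }
  out
}

a <- matrix(letters[1:4], nrow = 2)
prod_exprs <- char_prod_sketch(a, a)
print(prod_exprs)
```

If each letter names a resource handle, every cell is an expression that could be handed to a worker for evaluation.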

Future directions

Persistence model for DDS's

Streaming data worker topologies


Further Information:

In R type: 

  • Email me with questions at michael dot kane at yale dot edu.

