Distributed Data Structures in R for General, Large-Scale Computing

Michael J. Kane

Phronesis, LLC and Yale University


Acknowledgements


  • Simon Urbanek and AT&T Research Labs

  • A portion of this research is based on research sponsored by DARPA under award FA8750-12-2-0324. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.



Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.


Current Prototypical Communication For Distributed Computing

(Diagram labels: Communication, Performance, Orthogonality)

Redis

  • Workers block on queues (lists)
  • Consume computations
  • Return results on another queue

MPI

  • Data movement is defined up-front
  • Processes communicate directly
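
The Redis-style worker pattern listed on this slide can be sketched without a Redis server. The following is a minimal, single-process illustration only: the `queues` environment stands in for two Redis lists, and `enqueue`, `dequeue`, and `run_worker` are hypothetical helper names, not part of any package. A real worker would block on the task list rather than stop when it is empty.

```r
# In-memory stand-ins for two Redis lists (hypothetical; no server involved).
queues <- new.env()
queues$tasks <- list()
queues$results <- list()

# Append an item to the tail of the named queue.
enqueue <- function(queue, item) {
  queues[[queue]] <- c(queues[[queue]], list(item))
  invisible(NULL)
}

# Remove and return the head of the named queue, or NULL if it is empty.
dequeue <- function(queue) {
  if (length(queues[[queue]]) == 0L) return(NULL)
  item <- queues[[queue]][[1L]]
  queues[[queue]] <- queues[[queue]][-1L]
  item
}

# Consume computations from "tasks" and return results on "results".
run_worker <- function() {
  repeat {
    task <- dequeue("tasks")
    if (is.null(task)) break  # a real worker would block here instead
    value <- eval(parse(text = task$expression))
    enqueue("results", list(id = task$id, value = value))
  }
}

enqueue("tasks", list(id = 1L, expression = "sum(1:10)"))
enqueue("tasks", list(id = 2L, expression = "sqrt(16)"))
run_worker()
first <- dequeue("results")
```

Note that the submitter and the worker never address each other; they only agree on queue names, which is what keeps the pieces independent of one another.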


How does this work?





How is it used?


  • pull( <resource>, <expression> ) - executes an expression at a resource location and returns the result
  • push( <resource>, <expression> ) - executes an expression at a resource location and returns the name of a new resource


 C <- push( "B", "B %*% pull('A', 'A')" )
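
To make the semantics of the two calls concrete, here is a toy, single-process sketch. It is not the framework's implementation: the `resources` environment is a hypothetical stand-in for remote resource locations, and the naming scheme for new resources is an assumption made for illustration. The function names `pull` and `push` mirror the slide.

```r
# Hypothetical registry standing in for remote resource locations.
resources <- new.env()

# Evaluate `expression` (a string) "at" the resource: the resource is visible
# under its own name, and nested pull()/push() calls resolve via `enclos`.
pull <- function(resource, expression) {
  local_data <- mget(resource, envir = resources)  # named list: name -> value
  eval(parse(text = expression), envir = local_data,
       enclos = parent.env(environment()))
}

# Evaluate as in pull(), but store the value as a new resource and
# return only its name (assumed naming scheme: resource_<n>).
push <- function(resource, expression) {
  value <- pull(resource, expression)
  name <- paste0("resource_", length(ls(resources)) + 1L)
  assign(name, value, envir = resources)
  name
}

# Register two small matrices, then run the example from the slide.
assign("A", diag(2) * 2, envir = resources)
assign("B", matrix(1:4, nrow = 2), envir = resources)

C <- push("B", "B %*% pull('A', 'A')")
product <- pull(C, C)  # fetch the value behind the returned name
```

In this sketch `C` holds a name, not a matrix; the product stays in the registry until something pulls it, which is the distinction between the two calls.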


A Generative Communication Framework (cf. Gelernter 1985)


  • Can be thought of as a "functional" version of tuplespaces

    • space uncoupling
    • time uncoupling
    • distributed sharing
    • support for continuation


The bottom line is that this framework can support a much larger class of distributed communication patterns.
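
For readers unfamiliar with tuplespaces, the following is a minimal in-memory sketch of the classic idea, not of the framework described here. The names `ts_out`, `ts_rd`, and `ts_in` are hypothetical, chosen to echo Linda's `out`, `rd`, and `in`; `NULL` acts as a wildcard in a pattern, and the reads return `NULL` instead of blocking when nothing matches.

```r
# A single shared space holding tuples (plain lists).
space <- new.env()
space$tuples <- list()

# Write a tuple into the space.
ts_out <- function(...) {
  space$tuples[[length(space$tuples) + 1L]] <- list(...)
  invisible(NULL)
}

# A tuple matches a pattern if lengths agree and every non-NULL field is identical.
matches <- function(tuple, pattern) {
  length(tuple) == length(pattern) &&
    all(vapply(seq_along(pattern), function(i) {
      is.null(pattern[[i]]) || identical(tuple[[i]], pattern[[i]])
    }, logical(1)))
}

# Index of the first matching tuple, or 0 if there is none.
find_match <- function(pattern) {
  for (i in seq_along(space$tuples)) {
    if (matches(space$tuples[[i]], pattern)) return(i)
  }
  0L
}

# Read a matching tuple without removing it.
ts_rd <- function(...) {
  i <- find_match(list(...))
  if (i > 0L) space$tuples[[i]] else NULL
}

# Read a matching tuple and remove it from the space.
ts_in <- function(...) {
  i <- find_match(list(...))
  if (i == 0L) return(NULL)
  tuple <- space$tuples[[i]]
  space$tuples[[i]] <- NULL
  tuple
}

ts_out("result", "job-1", 4)                # producer writes and moves on
seen <- ts_rd("result", "job-1", NULL)      # any later reader can inspect it
taken <- ts_in("result", NULL, NULL)        # a consumer removes it
```

The producer never names a consumer (space uncoupling), and the tuple waits in the space until someone asks for it (time uncoupling).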


What's the status?

  • Currently alpha


  • Communication framework is almost complete and fully asynchronous


  • Currently building data structures on top of the framework
    • distributed.vector
    • distributed.data.frame


  • Working on distributed matrices (mixed sparse and dense) this summer

  • We have a persistence model




Further Information:


In R type:
 ?parallel:::mcparallel 







  • Next week at the NYC Data Science Meetup.


  • Email me with questions at michael dot kane at yale dot edu.