Augmenting Knowledge Discovery in PubMed with R


Michael J. Kane

Phronesis LLC and Yale University


Acknowledgements


The Bill and Melinda Gates Foundation - Round 11 Grand Challenges in Global Health


Carson Sievert (Iowa State University) and Kenneth Shirley (AT&T Labs Research)



Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of The Bill and Melinda Gates Foundation or AT&T Labs Research. 



What's the point of this talk?


With respect to text:


Data we can collect >= Data we can search >= Data we can analyze >= Data we can understand

How much data do we have?


As of August 2012, Facebook stored more than 100 petabytes of data

This is 900720000000000000 (9e17) bits

If a bit were an inch wide: 14215900000000 (1.4e13) miles

  • The sun is only about 92000000 (9.2e7) miles away
  • 2.41828705 light years
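The conversion above can be checked with a few lines of R. This is a rough sketch; the constants are assumptions that are not stated on the slide: binary petabytes (2^50 bytes), 63,360 inches per mile, and about 5.8786e12 miles per light year.

```r
# Assumption: 1 PB = 2^50 bytes (binary petabyte).
bytes_per_pb <- 2^50
bits <- 100 * bytes_per_pb * 8   # 100 PB expressed in bits

# One bit per inch, converted to miles (63,360 inches per mile).
miles <- bits / 63360

# Assumption: one light year is about 5.8786e12 miles.
light_years <- miles / 5.8786e12

signif(c(bits = bits, miles = miles, light_years = light_years), 3)
```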


Facebook's daily breakdown


2.7 billion likes made daily on and off of the Facebook site


300 million photos uploaded


70,000 queries executed by people and automated systems


500+ terabytes of new data "ingested"


Who cares about your stupid Facebook posts?


Why are they being stored?


Do they have monetary value?


How are they used?


The Cohen Conjecture


"For increasingly large sets of data, access to individual samples decreases exponentially over time." - David Cohen



The Significance of Search


"The Information Age, also commonly known as the Computer Age or Information Era, is an idea that the current age will be characterized by the ability of individuals to transfer information freely, and to have instant access to knowledge that would have been difficult or impossible to find previously" - Wikipedia



Data + Search


How much data?


Google had indexed about 100 PB as of 2013


"Processing" about 20 PB per day


NSA "touches" 29.21 PB, which is 1.6% of the 1.8 EB transferred over the internet daily.

We can search through PBs of data






What if we don't know what we're looking for?





What if I'm not looking for what everyone else has already found?


The problem with PageRank for research


PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
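To make the link-counting idea concrete, here is a minimal power-iteration sketch in R. The four-page link graph, the `pagerank` function name, and the damping factor of 0.85 are illustrative assumptions, not part of the talk or of any production implementation.

```r
# Hypothetical 4-page web: adj[i, j] = 1 means page i links to page j.
adj <- matrix(c(0, 1, 1, 0,
                0, 0, 1, 0,
                1, 0, 0, 0,
                0, 0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(LETTERS[1:4], LETTERS[1:4]))

pagerank <- function(adj, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(adj)
  out_degree <- rowSums(adj)
  # This sketch assumes every page has at least one outgoing link.
  stopifnot(all(out_degree > 0))
  # Row-normalize so each page splits its vote evenly among its links.
  transition <- adj / out_degree
  rank <- rep(1 / n, n)
  for (i in seq_len(max_iter)) {
    new_rank <- (1 - damping) / n + damping * as.vector(rank %*% transition)
    if (sum(abs(new_rank - rank)) < tol) break
    rank <- new_rank
  }
  setNames(new_rank, rownames(adj))
}

ranks <- pagerank(adj)
round(ranks, 3)
```

Page C receives links from every other page, so it ends up with the highest score. A new article with few inbound links starts near the bottom regardless of its content.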


Articles in PubMed (http://dan.corlan.net/medline-trend.html)


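# Articles in PubMed by year
library(ggplot2)

# Read the raw trend table as lines of text, skipping the first line.
x <- scan("http://dan.corlan.net/cgi-bin/medline-trend?Q=",
          what = character(), sep = "\n", skip = 1)
x <- x[-length(x)]          # drop the trailing line
x <- gsub("^[ ]+", "", x)   # strip leading spaces
x <- gsub("[ ]+", " ", x)   # collapse runs of spaces

# Skip the first remaining line, split on spaces, reshape into three columns.
pm_pubs <- as.data.frame(matrix(
  as.numeric(unlist(strsplit(x[-1], " "))), ncol = 3, byrow = TRUE))
names(pm_pubs) <- c("Number", "Year", "Percent")

# qplot() is deprecated in recent ggplot2 releases; this is the equivalent call.
ggplot(pm_pubs, aes(Year, Number)) + geom_line()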



Why PubMed?


Huge repository of knowledge in health science


Programmatic accessibility


Enormous demand for interdisciplinary research between data scientists and clinical investigators

What do you mean by "augmented knowledge discovery" and what does it have to do with PubMed?


Computers don't do research (yet)


Individually, humans don't understand a lot of the available research


Can we use machine learning (statistics) to help human discovery in scientific literature?


Use search to create a research context



Rather than searching for specific terms, define the domain of interest and explore the landscape of articles returned




That's a nice narrative but what the hell does it mean and how do you implement it?


Document processing


  1. Retrieve documents from PubMed
  2. Create term-document matrices
  3. Apply a learner to the TDM
  4. Visualize the results
  5. Iterate
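Steps 3 and 4 can be sketched on a toy example. The term counts below are made up, and hierarchical clustering with classical MDS is just one possible choice of learner and visualization, not necessarily what the akde-pubmed code does.

```r
# Hypothetical term counts: rows are documents, columns are terms.
tdm <- matrix(c(3, 2, 0, 0,
                2, 3, 0, 1,
                0, 0, 4, 2,
                0, 1, 3, 3),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("brucella", "infection", "learning", "model")))

# Cosine distance between documents: 1 - cosine similarity of their rows.
unit_rows <- tdm / sqrt(rowSums(tdm^2))
cos_dist <- as.dist(1 - unit_rows %*% t(unit_rows))

# Step 3: apply a learner (here, average-linkage hierarchical clustering).
tree <- hclust(cos_dist, method = "average")
clusters <- cutree(tree, k = 2)

# Step 4: visualize by projecting the documents to 2-D with classical MDS.
coords <- cmdscale(cos_dist, k = 2)
plot(coords, type = "n", xlab = "MDS 1", ylab = "MDS 2")
text(coords, labels = rownames(tdm), col = clusters)
```

Step 5 then amounts to changing the query, the features, or the number of clusters and looking again.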

Retrieving documents from PubMed programmatically


Code is available at https://github.com/kaneplusplus/akde-pubmed


> source("entrez.r")
> query <- 'brucellosis OR "machine learning"'
> max_ids <- 2000
> doc <- pm_doc_info(query, max_ids)
> names(doc)
[1] "title" "author" "date"
[4] "journal" "publication_type" "url"
[7] "title_and_abstract" 

> plot_article_activity(query, max_ids)
> journal_hist(query, max_ids)

brucellosis OR "machine learning"


brucellosis OR "machine learning"



Term document matrices


Document 1: He walks the dog up the hill                

Document 2: She sees the cat at the top of the hill



        he  walks  dog  up  hill  she  sees  cat  top
Doc1     1      1    1   1     1    0     0    0    0
Doc2     0      0    0   0     1    1     1    1    1
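The matrix for these two documents can be built in base R. This is a minimal sketch: the stop word list ("the", "at", "of") is an assumption chosen to match the terms kept on the slide, and a real pipeline would more likely use a text-mining package.

```r
docs <- c(Doc1 = "He walks the dog up the hill",
          Doc2 = "She sees the cat at the top of the hill")

# Assumption: a tiny hand-picked stop word list that matches the slide.
stop_words <- c("the", "at", "of")

# Lowercase, split on whitespace, and drop stop words.
tokens <- lapply(strsplit(tolower(docs), "\\s+"),
                 function(words) words[!words %in% stop_words])

# Vocabulary in order of first appearance.
vocab <- unique(unlist(tokens))

# Count each vocabulary term in each document (documents in rows).
tdm <- t(sapply(tokens, function(words) {
  tabulate(match(words, vocab), nbins = length(vocab))
}))
colnames(tdm) <- vocab
tdm
```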


A document view





A topic view  (Ken Shirley and Carson Sievert)


User defined topics


    1. Define a subset of documents of interest within the context
    2. Mark those documents as "interesting"
    3. Use a classifier to find other interesting documents
    4. Accuracy improves with the size of the "interesting" document set


Type 1 error (false positives) tends to be small

Type 2 error (false negatives) can be high
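A toy version of this workflow, with the two error counts made explicit, might look like the following. Everything here is hypothetical: the term counts, the "ground truth" labels, the 0.5 threshold, and the choice of a cosine-to-centroid scorer as the classifier. It illustrates the bookkeeping, not the method or results from the talk.

```r
# Hypothetical term counts: rows are documents, columns are terms.
tdm <- matrix(c(3, 2, 0, 0,
                2, 3, 0, 1,
                1, 2, 1, 0,
                0, 0, 4, 2,
                0, 1, 3, 3,
                1, 0, 2, 2),
              nrow = 6, byrow = TRUE,
              dimnames = list(paste0("doc", 1:6),
                              c("brucella", "infection", "learning", "model")))

# Steps 1-2: the user marks a couple of documents as "interesting".
marked <- c("doc1", "doc2")

# Step 3: score each unmarked document by cosine similarity to the
# centroid of the marked documents (a stand-in for a real classifier).
centroid <- colMeans(tdm[marked, , drop = FALSE])
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
candidates <- setdiff(rownames(tdm), marked)
scores <- sapply(candidates, function(d) cosine(tdm[d, ], centroid))
predicted <- scores > 0.5   # assumed decision threshold

# Hypothetical ground truth, used only to count the two kinds of error.
truth <- c(doc3 = TRUE, doc4 = FALSE, doc5 = FALSE, doc6 = TRUE)
false_positives <- sum(predicted & !truth[candidates])   # Type 1
false_negatives <- sum(!predicted & truth[candidates])   # Type 2
c(false_positives = false_positives, false_negatives = false_negatives)
```

In this made-up case doc6 is interesting but shares little vocabulary with the marked documents, so it is missed. That is the kind of false negative a larger "interesting" set would help catch.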



What have we learned?


Document clustering lets us understand the spatial relationships among documents


Document classification looks promising


We need better features


We need different perspectives on a context to understand its constituent documents


What else are we missing?



There are a lot of ways to look at a set of documents


We'd like an integrated research and collaboration platform




Future work



Topic evolution


Integration and user experience


Where's the code?


https://github.com/kaneplusplus/akde-pubmed


https://github.com/cpsievert/LDAvis


http://rcharts.io/


http://www.rstudio.com/shiny/







NYC Stat meetup

By Michael Kane
