Michael J. Kane
Phronesis LLC and Yale University
Disclaimer: The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of The Bill and Melinda Gates Foundation or AT&T Labs Research.
WRT text:
Data we can collect ≥ Data we can search ≥ Data we can analyze ≥ Data we can understand
This is 900,720,000,000,000,000 (~9e17) bits
If a bit were an inch wide, that would stretch 14,215,900,000,000 (~1.4e13) miles
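A quick check of those figures (a sketch; it assumes the bit count corresponds to 100 PB, i.e. 100 × 2^50 bytes, and uses 63,360 inches per mile):

bits <- 100 * 2^50 * 8   # 100 PB in bits: ~9.007e17
bits / 63360             # inches -> miles: ~1.42e13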
2.7 billion likes made daily on and off the Facebook site
300 million photos uploaded
70,000 queries executed by people and automated systems
500+ terabytes of new data "ingested"
Why are they being stored?
Do they have monetary value?
How are they used?
"The Information Age, also commonly known as the Computer Age or Information Era, is an idea that the current age will be characterized by the ability of individuals to transfer information freely, and to have instant access to knowledge that would have been difficult or impossible to find previously" - Wikipedia
Data + Search
Google had indexed about 100 PB as of 2013
"Processing" about 20 PB per day
NSA "touches" 29.21 PB which is 1.6 % of the 1.8 EB transferred over the internet daily.
PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
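A minimal R sketch of this idea using power iteration on a toy three-page link graph (the adjacency matrix and damping factor are illustrative assumptions, not Google's production algorithm):

# Toy link graph: A[i, j] = 1 if page i links to page j.
A <- matrix(c(0, 1, 1,
              1, 0, 1,
              0, 1, 0), nrow=3, byrow=TRUE)
d <- 0.85                         # damping factor
n <- nrow(A)
M <- t(A / rowSums(A))            # column-stochastic link matrix
r <- rep(1/n, n)                  # start from a uniform rank vector
for (i in 1:50) {
  r <- (1 - d)/n + d * (M %*% r)  # power-iteration update
}
r / sum(r)                        # normalized PageRank scores

Page 2, which is linked to by both other pages, ends up with the highest score.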
# Articles in PubMed by year
require(ggplot2)

# Fetch per-year article counts from the Medline trend service.
x <- scan("http://dan.corlan.net/cgi-bin/medline-trend?Q=",
          what=character(), sep="\n", skip=1)

# Drop the trailing line and normalize whitespace.
x <- x[-length(x)]
x <- gsub("^[ ]+", "", x)
x <- gsub("[ ]+", " ", x)

# Parse the remaining rows into a three-column data frame.
pm_pubs <- as.data.frame(matrix(
  as.numeric(unlist(strsplit(x[-1], " "))), ncol=3,
  byrow=TRUE))
names(pm_pubs) <- c("Number", "Year", "Percent")

# Plot article counts by year.
qplot(Year, Number, data=pm_pubs, geom="line")
Huge repository of knowledge in health science
Programmatic accessibility
Enormous demand for interdisciplinary research between data scientists and clinical investigators
Computers don't do research (yet)
Individually, humans can understand only a small fraction of the available research
Can we use machine learning (statistics) to help human discovery in scientific literature?
Rather than searching for specific terms, define the domain of interest and explore the landscape of articles returned
Code is available at https://github.com/kaneplusplus/akde-pubmed
> source("entrez.r")
> query <- 'brucellosis OR "machine learning"'
> max_ids <- 2000
> doc <- pm_doc_info(query, max_ids)
> names(doc)
[1] "title" "author" "date"
[4] "journal" "publication_type" "url"
[7] "title_and_abstract"
> plot_article_activity(query, max_ids)
> journal_hist(query, max_ids)
Document 1: He walks the dog up the hill
Document 2: She sees the cat at the top of the hill
      He walks dog up hill she sees cat top
Doc1   1     1   1  1    1   0    0   0   0
Doc2   0     0   0  0    1   1    1   1   1
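A minimal base-R sketch of this bag-of-words construction (the stop-word list here is an illustrative assumption):

docs <- c(Doc1="He walks the dog up the hill",
          Doc2="She sees the cat at the top of the hill")
stop_words <- c("the", "at", "of")

# Tokenize, lower-case, and drop stop words.
tokens <- lapply(strsplit(tolower(docs), " "),
                 function(d) d[!(d %in% stop_words)])

# Count each vocabulary term in each document.
vocab <- unique(unlist(tokens))
dtm <- t(sapply(tokens, function(d) table(factor(d, levels=vocab))))
dtm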
Type 1 error (false positives) tends to be small
Type 2 error (false negatives) can be high
Document clustering lets us understand the spatial relationships among documents
Document classification looks promising
We need better features
We need different perspectives on a context to understand its constituent documents
There are a lot of ways to look at a set of documents
We'd like an integrated research and collaboration platform
Topic evolution
Integration and user experience
https://github.com/kaneplusplus/akde-pubmed
https://github.com/cpsievert/LDAvis