Augmenting Knowledge Discovery in PubMed with R
Michael J. Kane
Phronesis LLC and Yale University
Acknowledgements
Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of The Bill and Melinda Gates Foundation or AT&T Labs Research.
What's the point of this talk?
With respect to text:
Data we can collect ≥ Data we can search ≥ Data we can analyze ≥ Data we can understand
How much data do we have?
As of August 2012, Facebook stored more than 100 petabytes of data
That is about 900,720,000,000,000,000 (9e17) bits
If each bit were an inch wide, laid end to end they would stretch 14,215,900,000,000 (1.4e13) miles
- The sun is only about 92,000,000 (9.2e7) miles away
- That is about 2.42 light years
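A quick back-of-the-envelope check of these figures in R (a sketch assuming binary petabytes, i.e. 1 PB = 2^50 bytes, and roughly 5.8786e12 miles per light year):

# Rough check of the figures above (assumes 1 PB = 2^50 bytes)
bits <- 100 * 2^50 * 8            # 100 petabytes expressed in bits
inches_per_mile <- 12 * 5280
miles <- bits / inches_per_mile   # one-inch-wide bits laid end to end
miles                             # ~1.42e13 miles
miles / 5.8786e12                 # miles per light year -> ~2.42 light years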
Facebook's daily breakdown
2.7 billion likes made daily on and off of the Facebook site
300 million photos uploaded
70,000 queries executed by people and automated systems
500+ terabytes of new data "ingested"
Who cares about your stupid Facebook posts?
Why are they being stored?
Do they have monetary value?
How are they used?
The Cohen Conjecture
"For increasingly large sets of data, access to individual samples decreases exponentially over time." - David Cohen
The Significance of Search
"The Information Age, also commonly known as the Computer Age or Information Era, is an idea that the current age will be characterized by the ability of individuals to transfer information freely, and to have instant access to knowledge that would have been difficult or impossible to find previously" - Wikipedia
Data + Search
How much data?
Google had indexed about 100 PB as of 2013
"Processing" about 20 PB per day
NSA "touches" 29.21 PB which is 1.6 % of the 1.8 EB transferred over the internet daily.
We can search through PB's of data
What if we don't know what we're looking for?
What if I'm not looking for what everyone else has already found?
The problem with PageRank for research
PageRank works by counting the number and quality of links to a page to estimate how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites. For research, that assumption works against you: ranking by popularity surfaces what everyone else has already found, not the relevant article that few people have cited yet.
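To make the mechanism concrete, here is a toy power-iteration version of PageRank on a hand-made four-page link graph (purely illustrative; the link matrix and damping factor are made up, and this is not Google's implementation):

# Toy PageRank via power iteration (illustrative only)
# A[i, j] = 1 if page j links to page i
A <- matrix(c(0, 0, 1, 1,
              1, 0, 0, 0,
              1, 1, 0, 1,
              0, 1, 0, 0), nrow = 4, byrow = TRUE)
M <- sweep(A, 2, colSums(A), "/")   # column-stochastic link matrix
d <- 0.85                           # damping factor
n <- ncol(M)
r <- rep(1 / n, n)                  # start from a uniform rank vector
for (i in 1:50) {
  r <- d * (M %*% r) + (1 - d) / n  # follow links with prob. d, otherwise jump at random
}
round(as.vector(r), 3)              # pages with more (and better-ranked) in-links score higher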
Articles in PubMed (http://dan.corlan.net/medline-trend.html)
# Articles in PubMed by year, scraped from the medline-trend service
require(ggplot2)
x <- scan("http://dan.corlan.net/cgi-bin/medline-trend?Q=",
          what = character(), sep = "\n", skip = 1)
# Drop the trailing line and normalize whitespace
x <- x[-length(x)]
x <- gsub("^[ ]+", "", x)
x <- gsub("[ ]+", " ", x)
# Parse the three columns of each row, skipping the first remaining line
pm_pubs <- as.data.frame(matrix(
  as.numeric(unlist(strsplit(x[-1], " "))), ncol = 3, byrow = TRUE))
names(pm_pubs) <- c("Number", "Year", "Percent")
qplot(Year, Number, data = pm_pubs, geom = "line")
Why PubMed?
Huge repository of knowledge in health science
Programmatic accessibility
Enormous demand for interdisciplinary research between data scientists and clinical investigators
What do you mean by "augmented knowledge discovery" and what does it have to do with PubMed?
Computers don't do research (yet)
Individually, humans don't understand a lot of the available research
Can we use machine learning (statistics) to help human discovery in scientific literature?
Use search to create a research context
Rather than searching for specific terms, define the domain of interest and explore the landscape of articles returned
That's a nice narrative but what the hell does it mean and how do you implement it?
Document processing
- Retrieve documents from PubMed
- Create term-document matrices
- Apply a learner to the TDM
- Visualize the results
- Iterate
Retrieving documents from PubMed programmatically
Code is available at https://github.com/kaneplusplus/akde-pubmed
> source("entrez.r")
> query <- 'brucellosis OR "machine learning"'
> max_ids <- 2000
> doc <- pm_doc_info(query, max_ids)
> names(doc)
[1] "title" "author" "date"
[4] "journal" "publication_type" "url"
[7] "title_and_abstract"
> plot_article_activity(query, max_ids)
> journal_hist(query, max_ids)
[Figures: article activity over time and a journal histogram for the query brucellosis OR "machine learning"]
Term document matrices
Document 1: He walks the dog up the hill
Document 2: She sees the cat at the top of the hill
        He  walks  dog  up  hill  she  sees  cat  top
Doc 1    1      1    1   1     1    0     0    0    0
Doc 2    0      0    0   0     1    1     1    1    1
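A matrix like this can be built directly in R; the snippet below is a minimal sketch using the tm package, with a hand-picked stop word list chosen to match the toy example (not necessarily the preprocessing used in akde-pubmed):

# Toy term-document matrix with the tm package (illustrative sketch)
library(tm)
toy_docs <- c("He walks the dog up the hill",
              "She sees the cat at the top of the hill")
toy_corpus <- VCorpus(VectorSource(toy_docs))
toy_tdm <- TermDocumentMatrix(toy_corpus,
  control = list(tolower = TRUE,
                 removePunctuation = TRUE,
                 stopwords = c("the", "at", "of"),  # hand-picked stop words for this example
                 wordLengths = c(1, Inf)))          # keep short words such as "he" and "up"
as.matrix(toy_tdm)  # rows are terms, columns are the two documents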
A document view
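The original slide here showed a figure. As a rough illustration of one way such a document view might be produced (my own sketch, not the talk's actual code), one can weight the term-document matrix, compute pairwise document distances, and project the documents into two dimensions with classical multidimensional scaling:

# Sketch of a 2-D document view of the retrieved abstracts (illustrative only)
library(tm)
library(ggplot2)
# `doc` as returned by pm_doc_info() above; title_and_abstract is assumed to be a character vector
corpus <- VCorpus(VectorSource(doc$title_and_abstract))
tdm <- TermDocumentMatrix(corpus,
  control = list(removePunctuation = TRUE, stopwords = TRUE))
m <- t(as.matrix(weightTfIdf(tdm)))   # documents in rows, tf-idf weighted terms in columns
coords <- cmdscale(dist(m), k = 2)    # classical multidimensional scaling of document distances
doc_view <- data.frame(x = coords[, 1], y = coords[, 2])
qplot(x, y, data = doc_view)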
A topic view (Ken Shirley and Carson Sievert)
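The topic view uses LDAvis (linked under "Where's the code?" below). Here is a hedged sketch of feeding it from a fitted topic model, building on the corpus above; the topicmodels package and the choice of k = 10 topics are my own assumptions:

# Sketch: interactive topic view with LDAvis (topicmodels and k = 10 are illustrative choices)
library(topicmodels)
library(LDAvis)
library(tm)
dtm <- DocumentTermMatrix(corpus,
  control = list(removePunctuation = TRUE, stopwords = TRUE))
dtm <- dtm[slam::row_sums(dtm) > 0, ]      # drop documents with no remaining terms
fit <- LDA(dtm, k = 10, control = list(seed = 1))
post <- posterior(fit)
vocab <- colnames(post$terms)
json <- createJSON(phi = post$terms,       # topic-term distributions
                   theta = post$topics,    # document-topic distributions
                   doc.length = slam::row_sums(dtm),
                   vocab = vocab,
                   term.frequency = slam::col_sums(dtm)[vocab])
serVis(json)                               # opens the interactive topic browser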
User defined topics
- Define a subset of documents of interest within the context
- Mark those documents as "interesting"
- Use a classifier to find other interesting documents (see the sketch after this list)
- Accuracy improves with the size of the "interesting" document set
Type 1 error (false positives) tends to be small
Type 2 error (false negatives) can be high
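A hedged sketch of the classification step, building on the tf-idf matrix above; glmnet and the hypothetical `interesting_ids` vector (indices of the user-marked documents) are illustrative choices, not necessarily what akde-pubmed does:

# Sketch: score documents with a penalized logistic regression
# (`interesting_ids` is a hypothetical vector of user-marked document indices)
library(glmnet)
library(tm)
x <- t(as.matrix(weightTfIdf(tdm)))              # documents in rows, tf-idf features
y <- as.integer(seq_len(nrow(x)) %in% interesting_ids)
clf <- cv.glmnet(x, y, family = "binomial")      # cross-validated lasso logistic regression
scores <- predict(clf, newx = x, s = "lambda.min", type = "response")
head(order(scores, decreasing = TRUE), 20)       # highest-scoring documents to review next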
What have we learned?
Document clustering lets us understand the spatial relationships among documents
Document classification looks promising
We need better features
We need different perspectives on a context to understand its constituent documents
What else are we missing?
There are a lot of ways to look at a set of documents
We'd like an integrated research and collaboration platform
Future work
Topic evolution
Integration and user experience
Where's the code?
https://github.com/kaneplusplus/akde-pubmed
https://github.com/cpsievert/LDAvis