Michael J. Kane
Phronesis LLC and Yale University
Disclaimer: The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of The Bill and Melinda Gates Foundation or AT&T Labs Research.
WRT text:
Data we can collect ≥ Data we can search ≥ Data we can analyze ≥ Data we can understand
This is 900,720,000,000,000,000 (~9e17) bits
If a bit were an inch wide, that would stretch 14,215,900,000,000 (~1.4e13) miles
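A quick check of those figures (a sketch; it assumes the bit count corresponds to 100 PB, i.e. 100 × 2^50 bytes, and uses 63,360 inches per mile):

bits <- 100 * 2^50 * 8   # 100 PB in bits: ~9.007e17
bits / 63360             # inches -> miles: ~1.42e13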
2.7 billion likes made daily on and off the Facebook site
300 million photos uploaded
70,000 queries executed by people and automated systems
500+ terabytes of new data "ingested"
Why are they being stored?
Do they have monetary value?
How are they used?
"The Information Age, also commonly known as the Computer Age or Information Era, is an idea that the current age will be characterized by the ability of individuals to transfer information freely, and to have instant access to knowledge that would have been difficult or impossible to find previously" - Wikipedia
Data + Search
Google had indexed about 100 PB as of 2013
"Processing" about 20 PB per day
NSA "touches" 29.21 PB which is 1.6 % of the 1.8 EB transferred over the internet daily.
PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
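A minimal R sketch of this idea using power iteration on a toy three-page link graph (the adjacency matrix and damping factor are illustrative assumptions, not Google's production algorithm):

# Toy link graph: A[i, j] = 1 if page i links to page j.
A <- matrix(c(0, 1, 1,
              1, 0, 1,
              0, 1, 0), nrow=3, byrow=TRUE)
d <- 0.85                         # damping factor
n <- nrow(A)
M <- t(A / rowSums(A))            # column-stochastic link matrix
r <- rep(1/n, n)                  # start from a uniform rank vector
for (i in 1:50) {
  r <- (1 - d)/n + d * (M %*% r)  # power-iteration update
}
r / sum(r)                        # normalized PageRank scores

Page 2, which is linked to by both other pages, ends up with the highest score.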
# Articles in PubMed by year
require(ggplot2)

# Fetch per-year article counts from the Medline trend service.
x <- scan("http://dan.corlan.net/cgi-bin/medline-trend?Q=",
          what=character(), sep="\n", skip=1)

# Drop the trailing line and normalize whitespace.
x <- x[-length(x)]
x <- gsub("^[ ]+", "", x)
x <- gsub("[ ]+", " ", x)

# Parse the remaining rows into a three-column data frame.
pm_pubs <- as.data.frame(matrix(
  as.numeric(unlist(strsplit(x[-1], " "))), ncol=3,
  byrow=TRUE))
names(pm_pubs) <- c("Number", "Year", "Percent")

# Plot article counts by year.
qplot(Year, Number, data=pm_pubs, geom="line")
Huge repository of knowledge in health science
Programmatic accessibility
Enormous demand for interdisciplinary research between data scientists and clinical investigators
Computers don't do research (yet)
Individually, humans can understand only a small fraction of the available research
Can we use machine learning (statistics) to help human discovery in scientific literature?
Rather than searching for specific terms, define the domain of interest and explore the landscape of articles returned
Code is available at https://github.com/kaneplusplus/akde-pubmed
> source("entrez.r")
> query <- 'brucellosis OR "machine learning"'
> max_ids <- 2000
> doc <- pm_doc_info(query, max_ids)
> names(doc)
[1] "title" "author" "date"
[4] "journal" "publication_type" "url"
[7] "title_and_abstract"
> plot_article_activity(query, max_ids)
> journal_hist(query, max_ids)
Document 1: He walks the dog up the hill
Document 2: She sees the cat at the top of the hill
      He walks dog up hill she sees cat top
Doc1   1     1   1  1    1   0    0   0   0
Doc2   0     0   0  0    1   1    1   1   1
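A minimal base-R sketch of this bag-of-words construction (the stop-word list here is an illustrative assumption):

docs <- c(Doc1="He walks the dog up the hill",
          Doc2="She sees the cat at the top of the hill")
stop_words <- c("the", "at", "of")

# Tokenize, lower-case, and drop stop words.
tokens <- lapply(strsplit(tolower(docs), " "),
                 function(d) d[!(d %in% stop_words)])

# Count each vocabulary term in each document.
vocab <- unique(unlist(tokens))
dtm <- t(sapply(tokens, function(d) table(factor(d, levels=vocab))))
dtm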
Type 1 error (false positives) tends to be small
Type 2 error (false negatives) can be high
Document clustering lets us understand the spatial relationships among documents
Document classification looks promising
We need better features
We need different perspectives on a context to understand its constituent documents
There are a lot of ways to look at a set of documents
We'd like an integrated research and collaboration platform
Topic evolution
Integration and user experience
https://github.com/kaneplusplus/akde-pubmed
https://github.com/cpsievert/LDAvis