An Exploratory Analysis of Text Trends in the G77 (and rest of the UN)
Michael Kane - Yale University and Phronesis
What are the goals for this talk?
Contextualize the exploration (and analysis) of text
Demonstrate a preliminary text exploration using announcements from the G77 and UN
Define and demonstrate exploratory concepts pioneered by John Tukey
Technologies Used
Python with the selenium package
R
- tm (Feinerer, Hornik, Artifex Software, Inc.)
- trelliscope/datadr (Hafen)
- irlba (Lewis, Baglama, Reichel)
- foreach (Weston, Revolution Analytics)
- iotools (Urbanek)
- lubridate (Wickham)
How much can we know about a collection of texts without reading the content?
U.N. and G77 Statements
G77 - loose coalition of developing nations promoting its members' collective economic interests
- General Assembly - the main deliberative assembly
- Secretary General - provides studies, information, and facilities needed by the UN
- Security Council - decides on resolutions for peace and security
- Economic and Social Council - promotes international economic and social co-operation and development
The data
The date of a statement
Who the statement came from (G77, General Assembly, etc.)
The text in the statement
Let's look at statement volume over time
Taking a step back
Data spans different periods. (Validation)
We may be seeing a periodic effect in statement releases, or we may be seeing a relationship between the G77 and ECOSOC. (Hypothesis generation)
Let's look at content volume over time
"There's nothing better than a picture for making you think of all the questions you forgot to ask" -John Tukey
Are changes in statement frequency and content volume related to narrative changes?
Defining a change in narrative
A consistent constellation of words over time constitutes normalcy.
When the words in a new statement differ from what is normal, something has changed.
A measure of narrative novelty
The proportion of new words that appear at time t compared to the words at time t-1.
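The novelty measure above can be sketched in a few lines. This is an illustrative implementation, not the talk's actual code; the `novelty` function name and the example statements are invented.

```python
# Sketch of the narrative-novelty measure: the share of word types in the
# statement at time t that did not appear in the statement at time t-1.

def novelty(prev_text, curr_text):
    """Proportion of words at time t that are new relative to t-1."""
    prev_words = set(prev_text.lower().split())
    curr_words = set(curr_text.lower().split())
    if not curr_words:
        return 0.0
    return len(curr_words - prev_words) / len(curr_words)

# Toy sequence of statements (invented for illustration).
statements = [
    "trade development cooperation",
    "trade development cooperation",   # identical vocabulary: novelty 0
    "sanctions security resolution",   # entirely new vocabulary: novelty 1
]

scores = [novelty(a, b) for a, b in zip(statements, statements[1:])]
print(scores)  # [0.0, 1.0]
```

A spike in this score flags a statement whose vocabulary breaks with the recent past, which is exactly the kind of computed diagnostic Tukey's "cognostics" quote below refers to.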
"It seems natural to call such computer guided diagnostics cognostics. We must learn to choose them, calculate them, and use them. Else we drown in a sea of many displays" -John Tukey
What happened in August 2012?
What if we want to compare individual statements?
A toy example
Consider a corpus from a "language" with only two words:
- "He runs."
- "He he he he runs."
- "He he he runs."
- "He runs runs runs."
- "He he runs runs runs."
Create a Term-Document Matrix
doc, he, runs
1, 1, 1
2, 4, 1
3, 3, 1
4, 1, 3
5, 2, 3
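The counts above can be reproduced directly. The talk builds its term-document matrix with R's tm package; this is just a minimal plain-Python sketch of the same construction.

```python
# Build the term-document counts for the toy two-word corpus.
from collections import Counter

corpus = [
    "He runs.",
    "He he he he runs.",
    "He he he runs.",
    "He runs runs runs.",
    "He he runs runs runs.",
]

terms = ["he", "runs"]
tdm = []
for doc in corpus:
    counts = Counter(doc.lower().replace(".", "").split())
    tdm.append([counts[t] for t in terms])

print(tdm)  # [[1, 1], [4, 1], [3, 1], [1, 3], [2, 3]]
```

Each row is a document expressed as a point in the two-dimensional word space, which is what the next slide plots.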
Plot the documents in the word space
Going beyond the toy example
Mathematical tools can reduce the dimensionality while trying to preserve the salient differences between documents.
Other tools find clusters and establish relationships between documents and other values of interest.
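A common such tool is the truncated SVD, which is the computation R's irlba package (listed earlier) performs at scale. Below is a small sketch applied to the toy counts, using NumPy as an assumed dependency.

```python
# Dimensionality reduction via truncated SVD, applied to the toy
# term-document counts (documents as rows, terms "he"/"runs" as columns).
import numpy as np

X = np.array([[1, 1], [4, 1], [3, 1], [1, 3], [2, 3]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 1  # keep only the leading singular dimension
X_reduced = U[:, :k] * s[:k]  # each document as a single coordinate

print(X_reduced.ravel())
```

With real corpora the matrix has thousands of term columns rather than two, and keeping only the leading singular dimensions gives a compact representation in which clustering and comparison become tractable.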
"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." - John Tukey
What comes next?
We've seen how cognostics help us prioritize our investigation of graphs.
We've generated hypotheses.
This is usually where we propose models for testing hypotheses.
If you are interested in the modeling portion...
Unsupervised
- (Probabilistic) Latent Semantic Analysis
- Latent Dirichlet Allocation
Supervised
- Supervised Latent Dirichlet Allocation
- Support Vector Machine
If you are interested in further analysis of the G77 Statements...
Casey King (casey@gophronesis.com) has been looking at the relationship between U.N. voting records, G77 statements, and U.S. Aid.
Thanks!