An Exploratory Analysis of Text Trends in the G77 (and the Rest of the UN)


Michael Kane - Yale University and Phronesis

What are the goals for this talk?

Contextualize the exploration (and analysis) of text

 

Demonstrate a preliminary text exploration using announcements from the G77 and UN

 

Define and demonstrate exploratory concepts pioneered by John Tukey

Technologies Used

Python with the selenium package

 

R

  • tm (Feinerer, Hornik, Artifex Software, Inc.)
  • trelliscope/datadr (Hafen)
  • irlba (Lewis, Baglama, Reichel)
  • foreach (Weston, Revolution Analytics)
  • iotools (Urbanek)
  • lubridate (Wickham)

How much can we know about a collection of texts without reading the content?

U.N. and G77 Statements

G77 - loose coalition of developing nations promoting its members' collective economic interests

 

U.N. organs

  • General Assembly - the main deliberative assembly
  • Secretary General - provides studies, information, and facilities needed by the UN
  • Security Council - decides on resolutions for peace and security
  • Economic and Social Council - promotes international economic and social co-operation and development

The data

The date of a statement

 

Who the statement came from (G77, General Assembly, etc.)

 

The text in the statement

 

Let's look at statement volume over time
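
One way to tabulate statement volume is to count statements per source per month. What follows is only a minimal sketch, not the code behind the talk's figures: the records are made-up placeholders, and the name `monthly_volume` is hypothetical. It assumes each statement is stored as the three fields described above (date, source, text).

```python
from collections import Counter
from datetime import date

# Hypothetical records: (date, source, text). Placeholders, not real statements.
statements = [
    (date(2012, 8, 3), "G77", "placeholder text a"),
    (date(2012, 8, 17), "G77", "placeholder text b"),
    (date(2012, 8, 20), "General Assembly", "placeholder text c"),
    (date(2012, 9, 1), "G77", "placeholder text d"),
]

def monthly_volume(records):
    """Count statements per (source, year, month)."""
    counts = Counter()
    for when, source, _text in records:
        counts[(source, when.year, when.month)] += 1
    return counts

volume = monthly_volume(statements)
for key in sorted(volume):
    print(key, volume[key])
```

Keying on (source, year, month) keeps each source's series separate, which makes it easy to notice that the sources cover different periods.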

Taking a step back

The data span different periods. (Validation)

 

We may be seeing a periodic effect in statement releases, or we may be seeing a relationship between G77 and EASC. (Hypothesis generation)

 

Let's look at content volume over time

"There's nothing better than a picture for making you think of all the questions you forgot to ask" -John Tukey

Are changes in statement frequency and content volume related to narrative changes?

Defining a change in narrative

A consistent constellation of words over time constitutes normalcy.

 

When the words from a new statement differ from what is normal, something has changed.

A measure of narrative novelty

The proportion of words appearing at time t that did not appear at time t-1.

N(X_t | X_{t-1}) = \frac{ |X_t \setminus X_{t-1}| }{ |X_t| }
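
The novelty measure can be sketched in a few lines, treating X_t and X_{t-1} as sets of distinct words. This is an illustration under stated assumptions rather than the talk's implementation: the function names, the crude tokenizer, the example sentences, and the choice to return 0 for an empty period are all hypothetical.

```python
def novelty(current_words, previous_words):
    """Share of distinct words in the current period that are absent from the previous one."""
    current = set(current_words)
    previous = set(previous_words)
    if not current:
        return 0.0  # assumption: an empty period is defined to have zero novelty
    return len(current - previous) / len(current)

def tokenize(text):
    """Crude tokenizer (an assumption): split on whitespace, strip punctuation, lowercase."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]

# Made-up example sentences, not real statements.
previous = tokenize("The group calls for development and cooperation.")
current = tokenize("The group calls for peace and security.")
print(novelty(current, previous))
```

Here two of the seven distinct words in the current text ("peace" and "security") are new, so the measure is 2/7. A value near 0 means the vocabulary is unchanged, and a value near 1 means almost every word is new.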

"It seems natural to call such computer guided diagnostics cognostics. We must learn to choose them, calculate them, and use them. Else we drown in a sea of many displays" -John Tukey

What happened in August 2012?

What if we want to compare individual statements?

A toy example

Consider a corpus from a "language" with only two words:

  • "He runs."
  • "He he he he runs."
  • "He he he runs."
  • "He runs runs runs."
  • "He he runs runs runs."

 

Create a Term-Document Matrix

he, runs
 1,    1
 4,    1
 3,    1
 1,    3
 2,    3
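
The counts above can be reproduced with a short sketch. It assumes a simple tokenizer (lowercase, letters only) and uses hypothetical names such as `term_counts`. As on the slide, rows are documents and columns are terms.

```python
import re
from collections import Counter

documents = [
    "He runs.",
    "He he he he runs.",
    "He he he runs.",
    "He runs runs runs.",
    "He he runs runs runs.",
]

def term_counts(text):
    """Count lowercase alphabetic tokens in one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Column order: the sorted vocabulary of the whole corpus.
vocabulary = sorted({term for doc in documents for term in term_counts(doc)})

# One row per document, one column per term.
matrix = [[term_counts(doc)[term] for term in vocabulary] for doc in documents]

print(vocabulary)
for row in matrix:
    print(row)
```

With only two terms, each row is a point in a two-dimensional word space, which is why the documents can be plotted directly.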

Plot the documents in the word space

Going beyond the toy example

Mathematical tools that can reduce the dimensionality while trying to preserve the salient differences.

 

Other tools find clusters and establish relationships between documents and other values of interest.
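
As a rough illustration of dimensionality reduction, the sketch below projects the toy documents onto the single direction of greatest variation, found by power iteration on the centered count matrix. This is a pure-Python stand-in chosen for brevity, not the algorithm used by irlba and not the code behind the talk; the function names and the starting vector are assumptions.

```python
import math

# Toy counts from the example above: columns are ("he", "runs").
matrix = [[1, 1], [4, 1], [3, 1], [1, 3], [2, 3]]

def center(rows):
    """Subtract each column's mean so that directions describe variation."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def leading_direction(rows, iterations=200):
    """Power iteration on X^T X: approximates the leading right singular vector of X."""
    dims = len(rows[0])
    # Assumed start vector; it must not be orthogonal to the leading direction.
    v = [1.0] + [0.0] * (dims - 1)
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in rows]  # X v
        v = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dims)]  # X^T (X v)
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v

centered = center(matrix)
direction = leading_direction(centered)

# One number per document: its coordinate along the leading direction.
scores = [sum(x * w for x, w in zip(row, direction)) for row in centered]
for row, score in zip(matrix, scores):
    print(row, round(score, 3))
```

In this toy corpus the leading direction contrasts "he"-heavy documents with "runs"-heavy ones, so the two groups receive scores of opposite sign. The overall sign of the direction is arbitrary.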

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." - John Tukey

What comes next?

We've seen how cognostics help us prioritize our investigation of graphs.

 

We've generated hypotheses.

 

This is usually where we propose models for testing hypotheses.

If you are interested in the modeling portion...

Unsupervised

  • (Probabilistic) Latent Semantic Analysis
  • Latent Dirichlet Allocation

Supervised

  • Supervised Latent Dirichlet Allocation
  • Support Vector Machine

 

If you are interested in further analysis of the G77 Statements...

Casey King (casey@gophronesis.com) has been looking at the relationship between U.N. voting records, G77 statements, and U.S. Aid.

 

Thanks!

U.N. Exploratory Data Analysis

By Michael Kane
