DataCamp’s Chief Data Scientist David Robinson set out to use his tidytext package (co-authored with Julia Silge) to identify the author of the anonymous op-ed that appeared in the New York Times last week.
Robinson catalogues the difficult problems he faced in his analysis, such as how to get samples of text written by senior administration officials, as well as the uncertainty of any result.
Robinson writes all about the process on Variance Explained. In his words:
This is a useful opportunity to demonstrate how to use the tidytext package that Julia Silge and I developed, and in particular to apply three methods:
- Using TF-IDF to find words specific to each document (examined in more detail in Chapter 3 of our book)
- Using widyr to compute pairwise cosine similarity
- How to make similarity interpretable by breaking it down by word
Since my goal is R education more than it is political analysis, I show all the code in the post.
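For readers who want a feel for the three methods before opening the full post, here is a minimal sketch in plain Python. It is not Robinson's R code and does not use tidytext or widyr. The toy documents, the account names, and the helper functions (`tf_idf`, `word_contributions`, `cosine_similarity`) are all invented for illustration, and the TF-IDF formula is one common definition (term frequency times the natural log of inverse document frequency), assumed rather than taken from the post.

```python
import math
from collections import Counter
from itertools import combinations

# Toy documents, invented for illustration only (not data from the post).
docs = {
    "op_ed": "the president lacks a steady moral compass",
    "account_a": "the president will deliver steady growth",
    "account_b": "the weather today is sunny and warm",
}


def tf_idf(docs):
    """Return {doc: {word: weight}} with tf = count / doc length, idf = ln(N / df)."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    n_docs = len(counts)
    # Number of documents each word appears in.
    doc_freq = Counter(word for c in counts.values() for word in c)
    weights = {}
    for name, c in counts.items():
        total = sum(c.values())
        weights[name] = {
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in c.items()
        }
    return weights


def word_contributions(a, b):
    """Split the cosine similarity of two weight vectors into per-word terms."""
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return {}
    # Only words present in both documents can contribute.
    return {w: a[w] * b[w] / (norm_a * norm_b) for w in a.keys() & b.keys()}


def cosine_similarity(a, b):
    """Cosine similarity is the sum of the per-word contributions."""
    return sum(word_contributions(a, b).values())


weights = tf_idf(docs)

# Pairwise cosine similarity between every pair of documents.
for left, right in combinations(docs, 2):
    print(left, right, round(cosine_similarity(weights[left], weights[right]), 3))

# Which words drive the similarity between the op-ed and one account?
contributions = word_contributions(weights["op_ed"], weights["account_a"])
for word, share in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(word, round(share, 3))
```

Because a word that appears in every document gets an inverse document frequency of zero, common words such as "the" drop out of the comparison, and the per-word breakdown shows which shared, distinctive words account for a similarity score.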
Robinson explains that he remains skeptical of the results of the text mining exercise because the sample is so small. Nevertheless, he could not help but conclude that the anonymous writer came from the State Department, though we won’t spoil the full results here.