DataCamp’s Chief Data Scientist David Robinson set out to use his tidytext package (co-authored with Julia Silge) to identify the author of the anonymous op-ed that appeared in the New York Times last week.

Robinson catalogues the difficult problems he faced in his analysis, such as how to get samples of text written by senior administration officials, as well as the uncertainty of any result.

Robinson describes the whole process on his blog, Variance Explained, where he writes:

This is a useful opportunity to demonstrate how to use the tidytext package that Julia Silge and I developed, and in particular to apply three methods:

  • Using TF-IDF to find words specific to each document (examined in more detail in Chapter 3 of our book)
  • Using widyr to compute pairwise cosine similarity
  • How to make similarity interpretable by breaking it down by word

Since my goal is R education more than it is political analysis, I show all the code in the post.
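
For readers who want a feel for the arithmetic behind those three methods before diving into the R code, here is a minimal sketch in plain Python. It is not Robinson's code and does not use tidytext or widyr. The toy documents, the function names (`tf_idf`, `word_contributions`, `cosine_similarity`) and the whitespace tokenizer are all assumptions made up for illustration, and the TF-IDF weighting shown is one common definition (term frequency times the natural log of inverse document frequency).

```python
import math
from collections import Counter

# Hypothetical toy documents; these are NOT the texts used in the original post.
docs = {
    "op_ed": "the steady state works to preserve our democratic institutions",
    "official_a": "we will preserve our democratic institutions and alliances",
    "official_b": "great crowds tremendous success great jobs",
}


def tf_idf(documents):
    """Return {document: {word: tf-idf weight}} using tf = n / total and
    idf = ln(number of documents / documents containing the word)."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    counts = {name: Counter(text.lower().split()) for name, text in documents.items()}
    n_docs = len(counts)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for c in counts.values() for word in c)
    weights = {}
    for name, c in counts.items():
        total = sum(c.values())
        weights[name] = {
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in c.items()
        }
    return weights


def norm(vec):
    """Euclidean length of a sparse word -> weight vector."""
    return math.sqrt(sum(v * v for v in vec.values()))


def word_contributions(a, b):
    """Break cosine similarity down by word: each shared word contributes
    a[word] * b[word] / (|a| * |b|), and the contributions sum to the similarity."""
    denom = norm(a) * norm(b)
    if denom == 0:
        return {}
    return {word: a[word] * b[word] / denom for word in a.keys() & b.keys()}


def cosine_similarity(a, b):
    """Cosine similarity as the sum of the per-word contributions."""
    return sum(word_contributions(a, b).values())


weights = tf_idf(docs)
for name in ("official_a", "official_b"):
    sim = cosine_similarity(weights["op_ed"], weights[name])
    top = sorted(
        word_contributions(weights["op_ed"], weights[name]).items(),
        key=lambda kv: kv[1],
        reverse=True,
    )
    print(name, round(sim, 3), top[:3])
```

Writing the similarity as a sum of per-word terms is what makes it interpretable: sorting those terms shows which shared words pull two documents together. With real data the caveats Robinson raises still apply, since a single short document gives very few words to compare.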

Robinson explains that he remains skeptical about the results of the text mining exercise because the sample is so small. Nevertheless, he could not help but conclude that the anonymous writer came from the State Department, though we won’t spoil the details here.