Some Basic Text on the Mueller Report

So this Robert Mueller guy wrote a report.

I may as well analyse it a bit.

First, let me see if I can get hold of the data. I grabbed the report directly from the Department of Justice website; you can follow this link.
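
For reproducibility, here is a sketch of the download step. The URL is my reconstruction of the DOJ link above, and the destination path matches the pdf_text() call below; treat both as assumptions if things have moved.

# Sketch of the download; URL and path are assumptions (see note above)
download.file("https://www.justice.gov/storage/report.pdf",
              destfile = "../data/report.pdf", mode = "wb")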

library(tidyverse)
library(pdftools)
# Read the report .pdf downloaded above, one character string per page
mueller_report_txt <- pdf_text("../data/report.pdf")
# Create a tibble of the text with line numbers and pages
mueller_report <- tibble(
  page = 1:length(mueller_report_txt),
  text = mueller_report_txt) %>% 
  separate_rows(text, sep = "\n") %>% 
  group_by(page) %>% 
  mutate(line = row_number()) %>% 
  ungroup() %>% 
  select(page, line, text)
write_csv(mueller_report, "data/mueller_report.csv")

Now I can work from a saved .csv of the data; reading the .pdf and hacking it up every time takes a while.
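
The chunk below restores an .RData image of the same object; if you are starting from the .csv instead, a minimal sketch (assuming the path written above):

library(readr)
# Rebuild the page/line/text tibble from the .csv written earlier
mueller_report <- read_csv("data/mueller_report.csv")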

library(pdftools)
library(here)
library(tidyverse)
load("MuellerReport.RData")
head(mueller_report)
##   page line
## 1    1    1
## 2    1    2
## 3    1    3
## 4    1    4
## 5    1    5
## 6    1    6
##                                                                                                text
## 1                                                                        U.S. Department of Justice
## 2 AttarAe:,c \\\\'erlc Predtiet // Mtt; CeA1:ttiA Ma1:ertal Prn1:eeted UAder Fed. R. Crhtt. P. 6(e)
## 3                                                                  Report On The Investigation Into
## 4                                                                       Russian Interference In The
## 5                                                                        2016 Presidential Election
## 6                                                                                    Volume I of II

The text is generally pretty good, though there is some garbage. The second line of each page is the culprit: it carries the work-product/redaction header, and although every page contains this same line, it converts to text in a non-uniform fashion.

# Drop line 2 of every page: the garbled redaction header
mueller_ml2 <- mueller_report %>% dplyr::filter(line != 2)
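
Filtering on the line number is the simplest fix. A purely hypothetical alternative, had the header drifted off line 2, would be to key on a fragment that survives the garbling, at the cost of also dropping any body text that cites Rule 6(e):

library(stringr)
# Hypothetical: drop any line containing "6(e)", which survives the mangling
mueller_ml2 <- mueller_report %>%
  dplyr::filter(!str_detect(text, fixed("6(e)")))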

I want to make use of cleanNLP to turn this into something that I can analyze. The first step is to undo the tidiness, of sorts, and paste the lines back into one long string.

Once upon a time, this worked with the Linux tools and others. The spaCy and CoreNLP back ends are not native R, and the Python interface is currently broken, at least on this server.

library(tidyverse)
library(RCurl)
library(tokenizers)
library(cleanNLP)
# cnlp_download_spacy("en")
# Collapse all the lines into a single string of text
MRep <- paste(as.character(mueller_ml2$text), collapse = " ")
# cnlp_init_stringi()
# starttime <- Sys.time()
# stringi_annotate <- MRep %>% as.character() %>% cnlp_annotate(., verbose=FALSE) 
# endtime <- Sys.time()
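
The commented-out lines above are the spaCy/CoreNLP route. The pure-R stringi backend sidesteps Python entirely; if it behaves on this server, the annotation step would look roughly like this (a sketch, not run here):

library(cleanNLP)
# stringi backend: pure R, no Python or Java required
cnlp_init_stringi()
stringi_annotate <- cnlp_annotate(MRep, verbose = FALSE)
# Token-level annotation, one row per token
head(stringi_annotate$token)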

I wanted to find the bigrams while removing stop words. Apparently, the easiest way to do this is quanteda; I got this approach from Stack Overflow.

library(widgetframe)
## Loading required package: htmlwidgets
library(quanteda)
## Package version: 1.5.2
## Parallel computing: 2 of 16 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library(wordcloud)
## Loading required package: RColorBrewer
# Tokenize; drop punctuation and stopwords with padding = TRUE so that
# bigrams cannot span a removed token; then form bigrams and a dfm
myDfm <- tokens(MRep) %>%
    tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
    tokens_remove(stopwords("english"), padding = TRUE) %>%
    tokens_ngrams(n = 2) %>%
    dfm()
# The 150 most frequent bigrams
wc2 <- topfeatures(myDfm, n = 150, scheme = "count")
wc2.df <- data.frame(word = names(wc2), freq = as.numeric(wc2))
wc2.df$word <- as.character(wc2.df$word)
# Trim the few overwhelming bigrams so the cloud stays legible
wc2.df <- wc2.df %>% filter(freq < 300)
# wordcloud(wc2.df$word, wc2.df$freq, scale = c(3, 0.4))
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
frameWidget(hchart(wc2.df, "wordcloud", hcaes(name=word, weight=freq/30)))
## Warning: `parse_quosure()` is deprecated as of rlang 0.2.0.
## Please use `parse_quo()` instead.
## This warning is displayed once per session.
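
If the interactive highcharter widget misbehaves, the wordcloud package loaded above gives a static fallback; a minimal sketch from the same bigram counts:

# Static cloud; the seed just fixes the layout for reproducibility
set.seed(1234)
wordcloud(words = wc2.df$word, freq = wc2.df$freq,
          scale = c(2, 0.5), random.order = FALSE)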

pdfpages: A little plot

I found some instructions for laying an entire document out on a grid, so I pulled the report apart to visualise it that way.

library(pdftools)
library(png)
# Render each page of the report to report_1.png, report_2.png, ...
pdf_convert("data/report.pdf")
 
# Dimensions of 1 page.
imgwidth <- 612
imgheight <- 792
 
# Grid dimensions.
gridwidth <- 30
gridheight <- 15
 
# Total plot width and height.
spacing <- 1
totalwidth <- (imgwidth+spacing) * (gridwidth)
totalheight <- (imgheight+spacing) * gridheight
 
# Plot all the pages and save as PNG.
png("RSMReport.png", round((imgwidth+spacing)*gridwidth/7), round((imgheight+spacing)*gridheight/7))
par(mar=c(0,0,0,0))
plot(0, 0, type='n', xlim=c(0, totalwidth), ylim=c(0, totalheight), asp=1, bty="n", axes=FALSE)
for (i in 1:448) {
    fname <- paste("report_", i, ".png", sep="")
    img <- readPNG(fname)
     
    x <- (i %% gridwidth) * (imgwidth+spacing)
    y <- totalheight - (floor(i / gridwidth)) * (imgheight+spacing)
     
    rasterImage(img, xleft=x, ybottom = y-imgheight, xright = x+imgwidth, ytop=y)
}
dev.off()

A Graphic
