Okay, so I am stoked to report that I can now build them pretty word clouds! I am even more pleased with how easy the process is. There’s a whole array of plots you can play around with, including:
Commonality Cloud: shows the words common to both corpora.
Comparison Cloud: contrasts the two corpora, showing the words that turn up disproportionately in one or the other.
Polarized Plot: a better take on the commonality cloud, letting you tell which corpus has a greater concentration of a particular word.
Visualized Word Network: shows the network of words associated with a main word.
Let’s jump right into it.
Step 1: Load libraries
require("tm") # the text mining package
require("qdap") # for qdap package's cleaning functions
require("twitteR") # to connect to twitter and extract tweets
require("wordcloud") # for wordcloud(), comparison.cloud() and commonality.cloud()
require("plotrix") # for the pyramid plot
Step 2: Read in your choice of tweets
After connecting to Twitter (with twitteR that means authenticating via setup_twitter_oauth() first), I downloaded 5000 tweets each from searches on the keywords “hillary” and “trump”. This was minutes after the 2016 US election results were declared. Twitter has never been so lit!
hillary <- searchTwitter("hillary", n = 5000, lang = "en")
trump <- searchTwitter("trump", n = 5000, lang = "en")
Step 3: Write and apply functions to perform data transformation and cleaning
a) Function to extract the text from the tweets, which get downloaded as a list of status objects. We do this using getText, which is an accessor method.
convert_to_text <- function(x){
  x$getText()
}
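b) Functions to process our tweets: one strips URLs, the other blanks out retweets (tweets starting with "RT"), which are mostly duplicates of other tweets.
replacefunc <- function(x){
  # removes everything from the first "https://" to the end of the tweet
  gsub("https://(.*)", "", x)
}
replace_dup <- function(x){
  # replaces any tweet that starts with rt/RT by an empty string
  gsub("^(rt|RT)(.*)", "", x)
}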
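To see what these two patterns actually do, here is a quick sketch on a few made-up tweets (the sample strings are hypothetical, not from the downloaded data). It also shows a narrower URL pattern, which I am calling strip_urls here; that one is my own assumption and not part of the original workflow.

```r
# Hypothetical sample tweets, made up for illustration
tweets <- c(
  "RT @someone: big night https://t.co/abc123",
  "Watching the results https://t.co/xyz789 with friends",
  "No links here"
)

# The two helpers from this step
replacefunc <- function(x) gsub("https://(.*)", "", x)
replace_dup <- function(x) gsub("^(rt|RT)(.*)", "", x)

no_urls <- replacefunc(tweets)   # cuts from the link to the end of each tweet
no_rts  <- replace_dup(no_urls)  # the retweet becomes an empty string
print(no_rts)

# Narrower alternative (an assumption, not the original code):
# remove only the link itself and keep the words that follow it
strip_urls <- function(x) gsub("https?://[^[:space:]]+", "", x)
print(strip_urls(tweets[2]))
```

Because `(.*)` is greedy, the original pattern also drops any words that come after a link. Since Twitter links usually sit at the end of a tweet, that is mostly harmless, but the narrower pattern is there if you want to keep trailing text.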
c) Function to further clean the character vector: remove bracketed text, replace abbreviations and symbols with their word equivalents, expand contractions, and convert everything to lower case.
clean_qdap <- function(x){
  x <- bracketX(x)
  x <- replace_abbreviation(x)
  x <- replace_contraction(x)
  x <- replace_symbol(x)
  x <- tolower(x)
  return(x)
}
d) Apply the above functions
# Hillary tweets: extract text, strip URLs, blank out retweets, drop the empties
hillary_text <- sapply(hillary, convert_to_text)
hill_remove_url <- replacefunc(hillary_text)
hill_sub <- replace_dup(hill_remove_url)
# filter with a logical condition; x[-which(x == "")] returns nothing when no element is empty
hill_sub_complete <- hill_sub[hill_sub != ""]
# Trump tweets: same steps
trump_text <- sapply(trump, convert_to_text)
trump_remove_url <- replacefunc(trump_text)
trump_sub <- replace_dup(trump_remove_url)
trump_sub_complete <- trump_sub[trump_sub != ""]
# collapse each set of tweets into one string and encode it as UTF-8,
# which is capable of encoding all possible characters defined by Unicode
trump_sub_complete <- paste(trump_sub_complete, collapse = " ")
Encoding(trump_sub_complete) <- "UTF-8"
# sub = '' drops any bytes that are not valid UTF-8
trump_sub_complete <- iconv(trump_sub_complete, "UTF-8", "UTF-8", sub = '')
trump_clean <- clean_qdap(trump_sub_complete)
trump_clean1 <- trump_clean
hill_sub_complete <- paste(hill_sub_complete, collapse = " ")
Encoding(hill_sub_complete) <- "UTF-8"
# sub = '' drops any bytes that are not valid UTF-8
hill_sub_complete <- iconv(hill_sub_complete, "UTF-8", "UTF-8", sub = '')
hillary_clean <- clean_qdap(hill_sub_complete)
hillary_clean1 <- hillary_clean
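Two R details are worth knowing around this step. First, dropping empty strings via negative indexing with which() silently returns nothing when no string is empty. Second, iconv() with sub = '' removes invalid bytes rather than replacing them with a space. A minimal sketch on made-up inputs (the vectors below are hypothetical):

```r
# Made-up vector that happens to contain no empty strings
x <- c("first tweet", "second tweet")

idx <- which(x == "")   # integer(0), because nothing matched
dropped <- x[-idx]      # x[-integer(0)] selects nothing at all
print(length(dropped))

# Filtering with a logical condition keeps everything in that case
kept <- x[x != ""]
print(length(kept))

# A string with a stray Latin-1 byte (\xe9), which is not valid UTF-8
bad <- "caf\xe9 ok"
Encoding(bad) <- "UTF-8"
fixed <- iconv(bad, "UTF-8", "UTF-8", sub = "")
print(validUTF8(bad))
print(validUTF8(fixed))
```

With 5000 tweets per search there will almost always be some retweets to blank out, so the negative-index version usually works by luck; the logical filter is just the safer habit.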
Step 4: Convert the character vectors to VCorpus objects
trump_corpus <- VCorpus(VectorSource(trump_clean1))
hill_corpus <- VCorpus(VectorSource(hillary_clean1))
Step 5: Define and apply function to format the corpus object
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  # drop standard English stopwords plus a few words that swamp both corpora
  corpus <- tm_map(corpus, removeWords,
                   c(stopwords("en"), "supporters", "vote", "election", "like", "even",
                     "get", "will", "can", "amp", "still", "just", "now"))
  return(corpus)
}
trump_corpus_clean <- clean_corpus(trump_corpus)
hill_corpus_clean <- clean_corpus(hill_corpus)
Note: qdap's cleaning functions work directly on character vectors, but tm's functions need a corpus as input.
Step 6: Convert the corpora into TermDocumentMatrix(TDM) objects
Tdmobjecthillary <- TermDocumentMatrix(hill_corpus_clean)
Tdmobjecttrump <- TermDocumentMatrix(trump_corpus_clean)
Step 7: Convert the TDM objects into matrices
Tdmobjectmatrixhillary <- as.matrix(Tdmobjecthillary)
Tdmobjectmatrixtrump <- as.matrix(Tdmobjecttrump)
Step 8: Sum rows and create term-frequency dataframe
Freq <- rowSums(Tdmobjectmatrixhillary)
Word_freq <- data.frame(term = names(Freq), num = Freq)
Freqtrump <- rowSums(Tdmobjectmatrixtrump)
Word_freqtrump <- data.frame(term = names(Freqtrump), num = Freqtrump)
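If the row-sum step feels abstract, here is a small sketch on a toy term-document matrix built by hand (the terms and counts are made up, and the sorting at the end is my addition, not part of the original steps). Rows are terms and columns are documents, so summing across a row gives a term's total count.

```r
# Toy term-document matrix: rows are terms, columns are documents
# (terms and counts are made up for illustration)
toy_tdm <- matrix(
  c(3, 1,
    0, 2,
    5, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("email", "rally", "president"),
                  Docs  = c("1", "2"))
)

# Total count of each term across all documents
freq <- rowSums(toy_tdm)

# Same shape as Word_freq above: one row per term
word_freq <- data.frame(term = names(freq), num = freq)

# Optional: sort so the most frequent terms come first
word_freq <- word_freq[order(word_freq$num, decreasing = TRUE), ]
print(word_freq)
```

In this post each corpus was collapsed into a single document, so the real matrices have just one column and the row sum is simply that column; the same code keeps working if you later keep one document per tweet.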
Step 9: Prep for fancier wordclouds
# unify the corpora
cc <- c(trump_corpus_clean,hill_corpus_clean)
# convert to TDM
all_tdm <- TermDocumentMatrix(cc)
colnames(all_tdm) <- c("Trump","Hillary")
# convert to matrix
all_m <- as.matrix(all_tdm)
# Create common_words
common_words <- subset(all_m, all_m[, 1] > 0 & all_m[, 2] > 0)
# Create difference
difference <- abs(common_words[, 1] - common_words[, 2])
# Combine common_words and difference
common_words <- cbind(common_words, difference)
# Order the matrix from the largest difference to the smallest
common_words <- common_words[order(common_words[, 3], decreasing = TRUE), ]
# Create top25_df
top25_df <- data.frame(x = common_words[1:25, 1],
y = common_words[1:25, 2],
labels = rownames(common_words[1:25, ]))
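Here is the same prep logic on a tiny hand-made matrix, so you can check each intermediate result by eye. The terms and counts are hypothetical, and using head() instead of a fixed 1:25 range is my own tweak so the code also works when fewer than 25 words are shared.

```r
# Toy stand-in for all_m: one row per term, one column per corpus
# (terms and counts are made up for illustration)
all_m_toy <- matrix(
  c(10, 2,
     0, 7,
     4, 5,
     6, 0,
     3, 9),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("win", "news", "president", "night", "results"),
                  c("Trump", "Hillary"))
)

# Keep only the terms that appear in both corpora
common <- all_m_toy[all_m_toy[, 1] > 0 & all_m_toy[, 2] > 0, , drop = FALSE]

# Absolute gap in counts, added as a third column
difference <- abs(common[, 1] - common[, 2])
common <- cbind(common, difference)

# Largest gaps first
common <- common[order(common[, 3], decreasing = TRUE), , drop = FALSE]

# head() returns at most 25 rows and never fails on short input
top_n <- head(common, 25)
top_df <- data.frame(x = top_n[, 1],
                     y = top_n[, 2],
                     labels = rownames(top_n))
print(top_df)
```

Only "win", "president" and "results" appear on both sides, and they come out ordered by how lopsided their counts are, which is exactly what the pyramid plot needs.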
Step 10: It’s word cloud time!
a) The ‘everyday’ cloud
# Hillary tweets
wordcloud(Word_freq$term, Word_freq$num, scale = c(3, 0.5), max.words = 1000,
          random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE,
          colors = brewer.pal(5, "Blues"))
# Trump tweets
wordcloud(Word_freqtrump$term, Word_freqtrump$num, scale = c(3, 0.5), max.words = 1000,
          random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE,
          colors = brewer.pal(5, "Reds"))


b) The Polarized pyramid plot
# Create the pyramid plot
pyramid.plot(top25_df$x, top25_df$y, labels = top25_df$labels,
             gap = 70, top.labels = c("Trump", "Words", "Hillary"),
             main = "Words in Common", laxlab = NULL,
             raxlab = NULL, unit = NULL)

c) The comparison cloud
comparison.cloud(all_m, colors = c("red", "blue"),max.words=100)

d) The commonality cloud
commonality.cloud(all_m, colors = "steelblue1",max.words=100)

We made it! That’s it for this post, folks.
Coming up next: Mining deeper into text.