How to get a free bounty in big data analytics

Here is my sentiment scoring for Donald Trump and Hillary Clinton (HC). I hope it is what you wanted :)

Analyse sentiments TRUMP

library(devtools)
install_github("geoffjentry/twitteR") # We use a workaround for the connection
library(twitteR)

# Replace the placeholders with the credentials of your own Twitter app
api_key = "YOUR_API_KEY"
api_secret = "YOUR_API_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)

DT.tweets=searchTwitter("@realDonaldTrump",n=5000,geocode='40.375,-100,1500mi')

Sentiment Analysis across brands

1. Data identification

Visit http://www.twitter.com/urbanoutfitters

Visit http://www.twitter.com/Abercrombie

Visit http://www.twitter.com/Forever21

2.1 Data extraction (connection)

Load the twitteR, tm, ggplot2, wordcloud and SnowballC packages

DT.tweets=searchTwitter("@realDonaldTrump",n=5000,geocode='40.375,-100,1500mi')
HC.tweets=searchTwitter("@HillaryClinton",n=5000)

2.2 Extract text from lexicons

pos.words = scan('positive-words.txt',what='character', comment.char=';')
neg.words = scan('negative-words.txt',what='character', comment.char=';')
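To see what those scan() calls return, here is a minimal sketch that parses a tiny made-up lexicon from a string instead of a file. It assumes the lexicon files hold one word per line plus ';' comment lines, which is what comment.char=';' suggests. The sample words and the names toy.lexicon and toy.pos.words are hypothetical.

```r
# Hypothetical stand-in for positive-words.txt: ';' starts a comment line
toy.lexicon <- "; toy lexicon for illustration only\ngood\ngreat\nhappy"

# Same arguments as in the post, but reading from text= instead of a file
toy.pos.words <- scan(text = toy.lexicon, what = 'character',
                      comment.char = ';', quiet = TRUE)

# The comment line is skipped, only the words remain
print(toy.pos.words)
```

The result is a plain character vector, which is the shape match() needs later on.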

3.1.2 Write in function to score sentiment

library(plyr)

score.sentiment = function(sentence, pos.words, neg.words, .progress='none')
{
# function to score the sentiments
require(plyr)
require(stringr)

# we got a vector of sentences. plyr will handle a list
# or a vector as an "l" for us
# we want a simple array ("a") of scores back, so we use
# "l" + "a" + "ply" = "laply":
scores = laply(sentence, function(sentence, pos.words, neg.words) {

# clean up sentences with R's regex-driven global substitute, gsub():
sentence = gsub('[[:punct:]]', '', sentence)
sentence = gsub('[[:cntrl:]]', '', sentence)
sentence = gsub('\\d+', '', sentence)
# and convert to lower case:
sentence = tolower(sentence)

# split into words. str_split is in the stringr package
word.list = str_split(sentence, '\\s+')
# sometimes a list() is one level of hierarchy too much
words = unlist(word.list)

# compare our words to the dictionaries of positive & negative terms
pos.matches = match(words, pos.words)
neg.matches = match(words, neg.words)

# match() returns the position of the matched term or NA
# we just want a TRUE/FALSE:
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)

# and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
score = sum(pos.matches) - sum(neg.matches)

return(score)

}, pos.words, neg.words, .progress=.progress )

scores.df = data.frame(score=scores, text=sentence)
return(scores.df)
}
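If you want to check the scoring logic without plyr, stringr or any tweets, here is a rough base-R sketch of the same idea: count the positive matches, subtract the negative matches. The lexicons, the sentences and the helper toy.score are made up for illustration. It swaps laply for sapply and str_split for strsplit, so it is not the function above, only the same arithmetic.

```r
# Tiny made-up lexicons (assumptions, not the real word lists)
toy.pos <- c("good", "great", "win")
toy.neg <- c("bad", "sad", "lose")

# Base-R version of the scoring step:
# (number of positive words) - (number of negative words)
toy.score <- function(sentence, pos.words, neg.words) {
  sentence <- gsub('[[:punct:]]', '', sentence)
  sentence <- gsub('[[:cntrl:]]', '', sentence)
  sentence <- gsub('\\d+', '', sentence)
  words <- unlist(strsplit(tolower(sentence), '\\s+'))
  sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
}

toy.sentences <- c("Great speech, good win!",
                   "So sad, bad night",
                   "No opinion here")

# One score per sentence
toy.scores <- sapply(toy.sentences, toy.score,
                     pos.words = toy.pos, neg.words = toy.neg,
                     USE.NAMES = FALSE)
print(toy.scores)
```

The first sentence has three positive words and scores 3, the second has two negative words and scores -2, and the third matches nothing and scores 0.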

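clean.tweets <- function(tweets.df){
# Function to clean the data; needs stringr (str_replace_all)
# and tm (removeWords, stopwords, stripWhitespace)
require(stringr)
require(tm)
twlist <- twListToDF(tweets.df)
datatemp <- unlist(strsplit(twlist$text, split=", "))
# remove usernames (the whole handle, not just its first character)
datatemp <- gsub("@[[:alnum:]_]+", "", datatemp)
# to ASCII
datatemp <- iconv(datatemp, "latin1", "ASCII", sub="")
# replace non-printable characters with a space
datatemp <- str_replace_all(datatemp, "[^[:graph:]]", " ")
# remove punctuation
datatemp <- gsub("[[:punct:]]", "", datatemp)
# remove http links (punctuation is already gone, so a link is one alphanumeric run)
datatemp <- gsub("http[[:alnum:]]*", "", datatemp)
# remove numbers
datatemp <- gsub("\\d", "", datatemp)
# remove unrecognized chars
datatemp <- gsub("\uFFFD", "", datatemp, fixed=TRUE)
# to lowercase (before stop-word removal, because the stop-word list is lowercase)
datatemp <- tolower(datatemp)
# remove "stop words"
datatemp <- removeWords(datatemp, stopwords('english'))
# Strip whitespace
datatemp <- stripWhitespace(datatemp)
return(datatemp)
}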
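Here is a small base-R sketch of what the regex part of the cleaning does to a single tweet, step by step. The tweet text is made up. The patterns assume the intent is to drop whole @handles and whole links, and the tm steps (stop words, stripWhitespace) are replaced by a plain whitespace tidy-up so the snippet runs without extra packages.

```r
# Hypothetical tweet text, for illustration only
toy.tweet <- "@realDonaldTrump Great rally in Ohio!!! 2016 https://t.co/abc123"

step1 <- gsub("@[[:alnum:]_]+", "", toy.tweet)  # drop the whole @handle
step2 <- gsub("[[:punct:]]", "", step1)         # drop punctuation
step3 <- gsub("http[[:alnum:]]*", "", step2)    # drop what is left of the link
step4 <- gsub("\\d", "", step3)                 # drop digits

# Tidy the whitespace and convert to lower case
step5 <- tolower(gsub("\\s+", " ", trimws(step4)))
print(step5)
```

Removing punctuation before the link matters here: once ':' , '/' and '.' are gone the link is a single run of letters and digits, so one pattern starting at "http" removes all of it.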

4.1 Score tweets' sentiment

DT.score=score.sentiment(clean.tweets(DT.tweets), pos.words, neg.words, .progress='text')
HT.score=score.sentiment(clean.tweets(HC.tweets), pos.words, neg.words, .progress='text')

4.2 Configure columns for further plotting

DT.score$candidate="Donald Trump"
DT.score$code="DT"
HT.score$candidate="Hillary Clinton"
HT.score$code="HC"

4.3 Bind scores for both candidates

brands.score=rbind(DT.score, HT.score)
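As a sanity check on the labelling and binding steps, here is a toy version with made-up scores. The data frames toy.DT and toy.HC are hypothetical stand-ins for the real score tables, and the per-candidate mean at the end is only a quick numeric check I would run before plotting, not part of the original workflow.

```r
# Hypothetical scores standing in for the two real score tables
toy.DT <- data.frame(score = c(1, -1, 0), text = c("a", "b", "c"))
toy.HC <- data.frame(score = c(2, 0), text = c("d", "e"))

# Label each table so the rows stay identifiable after binding
toy.DT$candidate <- "Donald Trump"
toy.HC$candidate <- "Hillary Clinton"

# rbind() stacks the rows; both frames must have the same columns
toy.all <- rbind(toy.DT, toy.HC)

# Mean score per candidate as a quick check
toy.means <- tapply(toy.all$score, toy.all$candidate, mean)
print(toy.means)
```

The candidate column is what the plot later uses for fill and for the facets, so it has to be added before the rbind.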

5 Data visualization

Plot of the score by candidate

library(ggplot2)
g = ggplot(data=brands.score, mapping=aes(x=score, fill=candidate) )
g = g + geom_histogram(binwidth=1) # Do a histogram
g = g + facet_grid(candidate~.) # Have a different plot for each candidate
g = g + theme_bw() + scale_fill_brewer() # Define the colors (blue in a b&w theme)
g
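Because the scores are integers, a histogram with binwidth=1 is just a count of tweets per score value, so the same numbers can be read off a table. A minimal sketch with made-up data (the data frame toy is hypothetical):

```r
# Hypothetical scores: four per candidate
toy <- data.frame(
  score = c(-1, 0, 0, 1, 1, 1, 0, 2),
  candidate = rep(c("Donald Trump", "Hillary Clinton"), each = 4)
)

# Rows are candidates, columns are score values, cells are tweet counts
toy.counts <- table(toy$candidate, toy$score)
print(toy.counts)
```

Each row of the table corresponds to one facet of the plot, and each cell to the height of one bar.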
