- This is a review exercise: apply the ggplot skills introduced in the previous modules and the Deep R modules on Working with Dates and Times and Working with Text to do some text mining of former President Trump’s tweets.
Let’s run the below to get started.
1. In the current format, one row of data is equal to one
2. Create a variable called line. This should be 1, 2, 3, 4, etc.
3. Create a variable called text. This should be an exact copy of
4. Use the unnest_tokens function to reshape the data for better text processing.
5. What format is the data in now (ie, one row is equal to
6. Take a minute to read about the tidytext package at https://www.tidytextmining.com/tidytext.html.
7. What is the most common word used by Trump?
8. Use substr to create a
9. What is the most common word used by Trump each year?
10. Create a variable named
11. What is the most common word used by Trump each month?
12. Create a dataframe with one word per row, and a column called freq saying how many times that word was used.
13. Load up the
14. Subset the dataframe created in number 12 to only include the top 100 words.
15. Create a wordcloud of Trump’s top 100 words.
16. Are you ready to do some sentiment analysis? Great.
17. Create a dataframe named sentiments by running the following:
sentiments <- read_csv('https://raw.githubusercontent.com/databrew/intro-to-data-science/main/data/sentiments.csv')
18. What is the
19. Create another dataset named polarity by running the following:
polarity <- get_sentiments("afinn")
20. Use left_join to combine polarity and sentiments into one dataset named
21. Use left_join to combine the trump data and the
22. Have a look at the simple (Trump) data. What do you see?
23. Get an overall polarity score (using the value variable) for the entire dataset. Is it positive or negative?
24. How many words were emotionally associated with “anger” in 2015?
25. What percentage of words were associated with “fear” by year?
26. What is the average sentiment polarity by year?
27. What is Trump’s most positive tweet?
28. What month was Trump’s most negative month?
29. What percentage of Trump tweets have more sadness than joy by year/month?
30. Read in data on full moons by running the following:
moon <- read_csv('https://raw.githubusercontent.com/databrew/intro-to-data-science/main/data/full-moon.csv')
31. Create a date column with a correctly formatted date.
32. What day of the week has the most full moons?
33. Use left_join to bring the moon data into the Trump data.
34. Does Trump have more negative emotions on full moon days?
35. Read in “stop words” by running the following:
sw <- read_csv('https://raw.githubusercontent.com/databrew/intro-to-data-science/main/data/stopwords.csv')
36. Join the sw data to the simple data, and remove the stop words.
37. Create a new word cloud.
38. Do a new analysis of sentimentality.
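
The general pattern behind several of the steps above (tokenize, count, join a polarity lexicon, drop stop words) can be sketched on a tiny made-up dataset. This is only an illustration, not a solution to the exercise: the toy tweets, the column names date and content, and the small toy_polarity and toy_sw tables are all assumptions invented for the example, and the real data may be shaped differently. The toy lexicon mimics the word and value columns of the AFINN lexicon rather than downloading it.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for the tweet data. The column names `date` and
# `content` are assumptions, not the real dataset's names.
toy <- tibble(
  date = as.Date(c("2015-03-01", "2015-07-15", "2016-01-20")),
  content = c(
    "Great rally tonight, great people",
    "The media is terrible and sad",
    "We will win, win, win"
  )
)

# Number the tweets, pull the year out of the date with substr(),
# then reshape to one row per word (lowercased, punctuation stripped).
words <- toy %>%
  mutate(
    line = row_number(),
    year = substr(as.character(date), 1, 4)
  ) %>%
  unnest_tokens(output = word, input = content)

# Overall word frequencies, most common first.
freq <- words %>%
  count(word, name = "freq", sort = TRUE)

# Most common word within each year.
top_by_year <- words %>%
  count(year, word, name = "freq") %>%
  group_by(year) %>%
  slice_max(freq, n = 1, with_ties = FALSE) %>%
  ungroup()

# Toy polarity lexicon (assumed values, same column names as AFINN).
toy_polarity <- tibble(
  word  = c("great", "terrible", "sad", "win"),
  value = c(3, -3, -2, 4)
)

# Toy stop word list (assumed; the real list is much longer).
toy_sw <- tibble(word = c("the", "and", "is", "we", "will"))

# Drop stop words with anti_join(), then attach polarity with left_join().
# Words missing from the lexicon get NA for `value`.
scored <- words %>%
  anti_join(toy_sw, by = "word") %>%
  left_join(toy_polarity, by = "word")

# Average polarity per year, ignoring words with no score.
polarity_by_year <- scored %>%
  group_by(year) %>%
  summarise(avg_polarity = mean(value, na.rm = TRUE), .groups = "drop")

print(freq)
print(top_by_year)
print(polarity_by_year)
```

Using anti_join to remove stop words keeps the step to a single join; filtering with `!word %in% toy_sw$word` would work equally well. The same group_by and summarise pattern extends to month-level or full-moon comparisons once those columns exist.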