Module 14 Basics of ggplot

Learning goals

  • Understand what ggplot2 is and why it’s used
  • Be able to think conceptually in the framework of the “grammar of graphics”
  • Learn the syntax for creating different plots using using ggplot2

Thinking about data visualization

Let’s start with a video: https://youtu.be/5Zg-C8AAIGg.

And one by Hans Rosling too: https://www.youtube.com/watch?v=FACK2knC08E

What is ggplot?

ggplot2 is an R package. It’s one of the most downloaded packages in the R universe, and has become the gold standard for data visualization. It’s extremely powerful and flexible, and allows for creating lots of visualizations of different types, ranging from maps to bare-bones academic publications, to complex, paneled charts with labeling, etc. Because the syntax is so different from “base” R, it can give the impression of having a somewhat steep learning curve. But in reality, because the principles are so conceptually simple, learning is fairly fast. Generally those who choose to learn it stick with it; that is, once you go gg, you don’t go back.

Note: we will refer heavily to this online guide about ggplot

The name and concept

“GG” stands for “grammar of graphics”, with “grammar” meaning “the fundamental principles or rules of an art or science” (Wickham 2010). The most well-known “grammar of graphics” was written in 2005 and laid out some abstract principles for describing statistical graphics (Wilkinson 2005). The basic idea is that all graphs can be described using a layered grammar, and that all graphs have the same general elements…

  • data
  • aesthetics (mapping) of variables to objects
  • geometric objects

… whereas some graphs have additional elements…

  • statistical transformations
  • labs
  • facets
  • themes

Practice

Let’s learn by doing. First, load the titanic data set:

library(ggplot2)
library(readr)
library(ggthemes)
library(dplyr)
titanic<- read_csv("https://raw.githubusercontent.com/databrew/intro-to-data-science/main/data/deaths.csv")

Scatter plot

Call the plot:

ggplot()

Aesthetics (mapping variables to the x y plane):

ggplot(data = titanic, 
       aes(x = Age, y = Fare))

Geometry:

ggplot(data = titanic, 
       aes(x = Age, y = Fare)) + 
  geom_point()

Statistics:

ggplot(data = titanic, 
       aes(x = Age, y = Fare)) + 
  geom_point() + 
  geom_smooth(method = 'lm')

Labels:

ggplot(data = titanic, 
       aes(x = Age, y = Fare)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  labs(title = 'Relationship between Age and Fare')

Facets:

ggplot(data = titanic, 
       aes(x = Age, y = Fare)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  labs(title = 'Relationship between Age and Fare') +
  facet_wrap(~Sex)

Themes:

ggplot(data = titanic, 
       aes(x = Age, y = Fare)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  labs(title = 'Relationship between Age and Fare') +
  facet_wrap(~Sex) + 
  theme_fivethirtyeight()

Barplot

Get counts for sex:

ggplot(data = titanic, 
       aes(x = Sex)) + 
  geom_bar(stat='count')

Explicitly set y to another variable

titanic_age <- titanic %>% 
  group_by(Sex) %>% 
  summarise(mean_age = mean(Age, na.rm= TRUE))

ggplot(data = titanic_age, 
       aes(x = Sex,y= mean_age)) + 
  geom_bar(stat = 'identity')

Fill

ggplot(data = titanic_age, 
       aes(x = Sex,y= mean_age)) + 
  geom_bar(stat = 'identity', fill = 'blue', alpha = 0.5) 

Labs

ggplot(data = titanic_age, 
       aes(x = Sex,y= mean_age)) + 
  geom_bar(stat = 'identity', fill = 'blue', alpha = 0.5) + 
  labs(y = 'Mean age', title = 'Average age by sex')

Add another variable

titanic_em <- titanic %>% 
  group_by(Sex, Survived) %>% 
  summarise(mean_age= mean(Age, na.rm = TRUE))

titanic_em <- titanic_em %>% 
  mutate(Survived = ifelse(Survived == 1, 'Survived','Dead' ))

ggplot(data = titanic_em, 
       aes(x=Sex, y=mean_age, fill = Survived)) +
  geom_bar(stat='identity')

Dodge

ggplot(data = titanic_em, 
       aes(x=Sex, y=mean_age, fill = Survived)) +
  geom_bar(stat='identity', position = 'dodge') 

Scales

ggplot(data = titanic_em, 
       aes(x=Sex, y=mean_age, fill = Survived)) +
  geom_bar(stat='identity', position = 'dodge', alpha = 0.7) +
  scale_fill_manual(values = c('blue', 'grey'))

Practice 1

library(babynames)
library(dplyr)
library(ggplot2)
bn <- babynames

# subset by the name Francis
francis <- bn %>% filter(name == 'Francis', sex == 'M')

# line plot
ggplot(data=francis, 
       aes(x = year, y =n )) + 
  geom_line() + 
  labs(x = 'Year', y = 'Number of boys named Francis')

# layer with points
ggplot(data=francis, 
       aes(x = year, y =n )) + 
  geom_line() + 
  geom_point() + 
  labs(x = 'Year', y = 'Number of boys named Francis')

Practice exercises 1

  1. Create a line chart showing the number of girls named Mary over time

  2. Change the color of the line to blue

  3. Add a title to the plot “The name Mary over time”

  4. Create a bar chart showing the number of girls named Emma, Olivia, Ava, Sophia, and Emily in 2010

  5. Change the X label to “Names” and the y label to “Total”

  6. Change the color of the bar to grey and make it more transparent

  7. Create a bar chart showing the number of people named Emma, Olivia, Ava, Sophia, and Emily in 2010, colored by sex

  8. Create a cool-looking chart showing your name over time

Practice exercises 2

First, let’s read in some data on health from the World Bank:

library(readr)
library(dplyr)
library(gapminder)
gm <- gapminder::gapminder
  1. How many rows are in the dataset?

  2. How many clumns are in the dataset?

  3. What are the names of the columns?

  4. What is the oldest year in the dataset?

  5. What is the country/year with the greatest population in the dataset?

  6. Get the average GDP per capita for each continent in 1952.

  7. Get the average GDP per capita for each continent for the most recent year in the dataset.

  8. Average GDP is a bit misleading, since it does not take into account the relative size (in population) of the different countries (ie, China is a lot bigger than Cambodia). Look up the function weighted.mean. Use it to get the average life expectancy by continent for the most recent year in the dataset, weighted by population.

  9. Make a barplot of the above table (ie, average life expectancy by continent, weighted by population).

  10. Make a point plot in which the x-axis is country, and the y-axis is GDP. Add the line theme(axis.text.x = element_text(angle = 90)) in order to make the x-axis text vertically aligned. What’s the problem with this plot? How many points are there per country?

  11. Make a new version of the above, but filter down to just the earliest year in the dataset.

  12. Make a scatterplot of life expectancy and GDP per capita, just for 1972.

  13. Make the same plot as above, but for the most recent year in the data.

  14. Make the same plot as the above, but have the size of the points reflect the population.

  15. Make the same plot as the above, but have the color of te points reflect the continent.

  16. Filter the data down to just the most recent year in the data, and make a histogram (geom_histogram) showing the distribution of GDP per capita.

  17. Get the average GDP per capita for each continent/year, weighted by the population of each country.

  18. Using the data created above, make a plot in which the x-axis is year, the y-axis is (weighted) average GDP per capita, and the color of the lines reflects the content.

  19. Make the same plot as the above, but facet the plot by continent.

  20. Make the same plot as the above, but remove the coloring by continent.

  21. Make a plot showing France’s population over time.

  22. Make a plot showing all European countries’ population over time, with color reflecting the name of the country.

  23. Create a variable called status. If GDP per capita is over 20,000, this should be “rich”; if between 5,000 and 20,000, this should be “middle”; if this is less than 5,000, this should be “poor”.

  24. Create an object with the number of rich countries per year.

  25. Create an object with the percentage of countries that were rich each year.

  26. Create a plot showing the percentage of countries which were rich each year.

  27. Create an object with the number of people living in poor countries each year.

  28. Create a chart showing the number of people living in rich, medium, and poor countries per year (line chart, coloring by status).

  29. Create a chart showing the life expectancy in Somalia over time.

  30. Create a chart showing GDP per capita in Somalia over time.

  31. Create a histogram of life expectancy for the most recent year in the data. Facet this chart by continent.

  32. Create a barchart showing average continent-level GDP over time, weighted for population, with one bar for each year, stacked bars with the color of the bars indicating continent (geom_bar(position = 'stack')).

  33. Create the same chart as above, but with bars side-by-side (geom_bar(position = 'dodge'))

  34. Generate 3-5 more charts / tables that show interesting things about the data.

  35. Make the above charts as aesthetically pleasing as possible.

References

Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19 (1): 3–28. https://doi.org/10.1198/jcgs.2009.07098.

Wilkinson, Leland. 2005. The Grammar of Graphics (Statistics and Computing). Berlin, Heidelberg: Springer-Verlag.