Module 18 Dataframe wrangling
Learning goals
- Understand the importance of tidy dataframes
- Understand what the
tidyverse
is and why it is awesome - Feel comfortable working with dataframes using
dplyr
functions.
The dplyr
package
Data scientists largely work in data frames and do things to data. This is what the package dplyr
is optimized for. It consists of a series of “verbs” which cover 95% of what you need to do for most basic data processing tasks.
The dplyr
package contains a set of verbs
: things you do to dataframes. Those verbs are:
filter()
arrange()
select()
rename()
distinct()
mutate()
summarise()
The %>%
pipe
%>%
is a “pipe”. It is a way to write code without so many parentheses. For example, what if I want to find the square root of the sum of the first six elements of a sequence of 10 to 20 by 2?
Here’s what that command would look like in base R
:
Pretty overwhelming, and pretty easy to make errors in writing it out.
But the above could also be written a simpler way:
When you see the %>%
pipe symbol, think of the word “then”.
The above code could be read aloud like so: “First, get a sequence of every second number between 10 and 20. Then, take the first six values. Then, sum those samples together. Then, take the square root of that sum.”
Using the %>%
pipe framework, your code turns from a nonlinear series of parentheses and brackets to a linear progression of steps, which is a closer fit to how we tend to think about working with data. Instead of working from the inside of a command outward, we thinking linearly: take the data, then do things with it, then do more things with it, etc.
Here’s another example:
… could also be written as:
dplyr
verbs
To practice the dplyr
verbs, let’s make a small dataframe named people
:
people <- data.frame(who = c('Joe', 'Ben', 'Xing', 'Coloma'),
sex = c('Male', 'Male', 'Female', 'Female'),
age = c(35, 33, 32, 34))
people
who sex age
1 Joe Male 35
2 Ben Male 33
3 Xing Female 32
4 Coloma Female 34
filter()
The filter()
function is used to subset a dataframe, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE
for all conditions.
You can also filter according to multiple conditions. Here are three ways to acheive the same thing:
Note that when a condition evaluates to NA
, its row will be dropped. This differ from the base subsetting works with [ ... ]
.
arrange()
Arrange means putting things in order. That is, arrange()
orders the rows of a data frame by the values of selected columns.
To reverse the order, use desc()
:
people %>% arrange(desc(age))
who sex age
1 Joe Male 35
2 Coloma Female 34
3 Ben Male 33
4 Xing Female 32
You can also arrange by multiple levels:
select()
Select only certain variables in a data frame, making the dataframe skinnier (fewer columns).
As you select columns, you can rename them like so:
You can also select a set of columns using the :
notation:
rename()
The function rename()
changes the names of individual variables.
This verb takes the syntax <new_name> = <old_name>
syntax.
mutate()
The function mutate()
adds new variables and preserves existing ones.
New variables overwrite existing variables of the same name.
people %>% mutate(agein2020 = age - 1)
who sex age agein2020
1 Joe Male 35 34
2 Ben Male 33 32
3 Xing Female 32 31
4 Coloma Female 34 33
people %>% mutate(is_male = sex == 'Male')
who sex age is_male
1 Joe Male 35 TRUE
2 Ben Male 33 TRUE
3 Xing Female 32 FALSE
4 Coloma Female 34 FALSE
people %>% mutate(average_age = mean(age))
who sex age average_age
1 Joe Male 35 33.5
2 Ben Male 33 33.5
3 Xing Female 32 33.5
4 Coloma Female 34 33.5
You can call mutate()
multiple times in the same pipe:
people %>% mutate(average_age = mean(age)) %>%
mutate(diff_from_avg = age - average_age)
who sex age average_age diff_from_avg
1 Joe Male 35 33.5 1.5
2 Ben Male 33 33.5 -0.5
3 Xing Female 32 33.5 -1.5
4 Coloma Female 34 33.5 0.5
You can also remove variables can be removed by setting their value to NULL.
A similar function, transmute()
, adds new variables and drops existing ones, kind of like a combination of select()
and mutate()
.
group_by()
Most data operations are done on groups defined by variables. The function group_by()
takes an existing table and converts it into a grouped one where operations are performed “by group”.
people %>%
group_by(sex) %>%
mutate(average_age_for_sex = mean(age))
# A tibble: 4 × 4
# Groups: sex [2]
who sex age average_age_for_sex
<chr> <chr> <dbl> <dbl>
1 Joe Male 35 34
2 Ben Male 33 34
3 Xing Female 32 33
4 Coloma Female 34 33
people %>%
group_by(sex) %>%
mutate(average_age_for_sex = mean(age)) %>%
mutate(diff_from_avg_for_sex = age - average_age_for_sex)
# A tibble: 4 × 5
# Groups: sex [2]
who sex age average_age_for_sex diff_from_avg_for_sex
<chr> <chr> <dbl> <dbl> <dbl>
1 Joe Male 35 34 1
2 Ben Male 33 34 -1
3 Xing Female 32 33 -1
4 Coloma Female 34 33 1
Note that a similar verb, ungroup()
, removes grouping.
summarize()
summarize()
or summarize()
creates an entirely new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarizing all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.
people %>%
summarize(average_age = mean(age),
standard_dev_of_age = sd(age),
oldest_age = max(age),
youngest_age = min(age))
average_age standard_dev_of_age oldest_age youngest_age
1 33.5 1.290994 35 32
people %>%
group_by(sex) %>%
summarise(avg_age = mean(age),
oldest_age = max(age),
total_years = sum(age))
# A tibble: 2 × 4
sex avg_age oldest_age total_years
<chr> <dbl> <dbl> <dbl>
1 Female 33 34 66
2 Male 34 35 68
people %>%
group_by(sex) %>%
summarise(sample_size = n())
# A tibble: 2 × 2
sex sample_size
<chr> <int>
1 Female 2
2 Male 2
Note the use of the function, n()
. This simple function counts up the number of records in each group.
Instructor tip:
To illustrate these dplyr
verbs and re-energize the room, ask everyone to stand. Tell the students that they represent a dataframe called people
. Now, write a dplyr
command into your R Console and ask them to act out the command. After each command, give them time to move around to act it out. If they move around too slowly, egg them on: “Come on, you all are like the slowest computer ever!”
people %>% arrange(shoe_size)
people %>% arrange(shoe_size) %>% filter(sex == "female")
people %>% arrange(hair_length)
people %>% arrange(desc(hair_length))
people %>% group_by(sex) %>% arrange(hair_length))
people %>% arrange(country_of_birth, shirt_color, desc(shoe_size))
etc.
Exercises
Answer these questions using the new dplyr
verbs you just learned:
Baby names over time
1. Run the below code to load a dataset about baby names given in the USA since the 1800’s.
2. Check out the first and last six rows of bn
.
3. What are the names of the variables in this dataset?
4. How many rows are in this dataset?
5. What is the earliest year in this dataset?
6. Create a dataframe named turn_of_century
, which contain data only for the year 1900.
7. Create a dataframe named boys
, containing only boys.
8. Create a dataframe named moms_gen
. This should be females born in the year of birth of your mom.
9. Order moms_gen
by n
, in ascending order (i.e., with the least popular name at top). Look at the result; what is the least popular name among women the year your mom was born?
10. Reverse the order and save the result into an object named moms_gen_ordered
.
11. Create an object named boys2k
. This should be all males born in the year 2000.
12. Arrange boys2k
from most to least popular. What was the most popular boys name in 2000?
13. What percentage of boys were named Joseph
in 2000?
14. Were there more Jims or Matthews in 2020?
15. Create an object named tot_names_by_year
, which contains the total counts for boy and girl names assigned in each year of the dataset. You should have four columns: year
, boys
, girls
, and tot
.
16. How many people were born with your name in 2020?
17. Was your name more prevalent in 2020 than it was in the year you were born?
18. What if you account for the changing overall population size? In other words, is the proportional prevalence of your name greater in 2020 or your birth year?
19. In which year was your name the most prevalent?
20. Create a basic plot of the proportional prevalence of your name since the earliest year of this dataset.
21. Update this plot with lines for your parent’s names and your siblings names, if you have any.
22. Format that plot so that it is gorgeous and well-labelled.
23. Screenshot it and email it to your family.
Instructor tip:
After completing the exercises here, it is worthwhile devoting time to the Review modules entitled, “A dplyr
mystery”, “A dplyr
survey”, and “Global health and ggplot
”. Once students become comfortable with working with dplyr
, they will be ready to work independently on projects, using the modules in the Deep R
section for references.