Module 10 Subsetting & filtering
Learning goals
- Understand how to subset / filter data
Subsetting with indices
You have already learned that certain elements of a vector can be called by specifying an index:
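As a minimal sketch, assume a vector named x that holds the integers 55 through 59 (an assumption chosen to match the outputs shown later in this module):

```r
# Assumed example vector (not defined in the text above)
x <- 55:59

# Call the second element by its index
x[2]
## [1] 56
```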
You can also subset an object by calling multiple indices:
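A short sketch of this, again assuming x holds the integers 55 through 59 (an assumption, not a definition from this module):

```r
# Assumed example vector
x <- 55:59

# Combine several indices with c() to pull out multiple elements
x[c(2, 3, 4)]
## [1] 56 57 58

# A range of indices works the same way
x[2:4]
## [1] 56 57 58
```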
Subsetting with booleans
You can also subset objects with ‘booleans’. Recall that boolean / logical data have two possible values: TRUE or FALSE. For example:
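Here is a small sketch of boolean values. The object names is_raining and is_sunny are hypothetical, made up for illustration:

```r
# Store boolean values in objects (hypothetical names)
is_raining <- TRUE
is_sunny <- FALSE

# R calls this type of data 'logical'
class(is_raining)
## [1] "logical"

# A logical test also returns a boolean value
5 > 3
## [1] TRUE
```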
Recall also that you can calculate whether a condition is TRUE or FALSE on multiple elements of a vector. For example:
ages <- c(10, 20, 30, 40, 50, 60)
old_age <- 36
people_are_old <- ages >= old_age
people_are_old
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
Boolean vectors are useful for subsetting. Think of ‘subsetting’ as keeping only those elements of a vector for which a condition is TRUE.
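For instance, here is a sketch that reuses the ages example above and keeps only the ages that pass the test. The use of sum() to count the TRUE values is an addition for illustration:

```r
ages <- c(10, 20, 30, 40, 50, 60)
old_age <- 36
people_are_old <- ages >= old_age

# Keep only the elements of ages where people_are_old is TRUE
ages[people_are_old]
## [1] 40 50 60

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(people_are_old)
## [1] 3
```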
# Now subset to the second, third, and fourth element
x[c(FALSE, TRUE, TRUE, TRUE, FALSE)]
## [1] 56 57 58
That command returned the elements for which the subsetting vector was TRUE.
This is equivalent to…
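One index-based equivalent might look like the sketch below. It assumes x holds the integers 55 through 59, which is consistent with the output above but is not confirmed by the text:

```r
# Assumed example vector
x <- 55:59

# Subsetting with indices...
x[2:4]
## [1] 56 57 58

# ...returns the same result as subsetting with booleans
identical(x[c(FALSE, TRUE, TRUE, TRUE, FALSE)], x[2:4])
## [1] TRUE
```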
You can also get the same result using a logical test, since logical tests return boolean values:
# Develop your logical test
x %in% c(56,57,58)
## [1] FALSE TRUE TRUE TRUE FALSE
# Plug it into the subsetting brackets
x[ x %in% c(56,57,58) ]
## [1] 56 57 58
This method becomes really useful when you are working with bigger datasets, such as this one:
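The outputs in this module show that y has 100 numeric values. As a sketch, you could build a similar vector with runif(). The generator, the range of 0 to 10, and the seed are all assumptions, so your values will differ from the outputs printed in this module:

```r
# Hypothetical seed, used only to make this sketch reproducible
set.seed(123)

# Draw 100 random numbers between 0 and 10 (assumed range)
y <- runif(100, min = 0, max = 10)

# Confirm the size of the vector
length(y)
## [1] 100

# Peek at the first few values
head(y)
```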
With a dataset like this, you can use a boolean filter to figure out how many values are greater than, say, 9.
First, develop your logical test, which will tell you whether each value in the vector is greater than 9:
# Develop your logical test
y > 9
## [1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [73] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE
Now, to get the values corresponding to each TRUE in this vector, plug your logical test into your subsetting brackets.
y[y > 9]
## [1] 9.521493 9.609064 9.190241 9.923868 9.232551 9.253394 9.330497 9.396324
## [9] 9.881984
Here’s another way you can do the same thing:
verdicts <- y > 9
y[verdicts]
## [1] 9.521493 9.609064 9.190241 9.923868 9.232551 9.253394 9.330497 9.396324
## [9] 9.881984
You can use double logical tests too. For example, what if you want all elements between the values 7.0 and 9.0?
verdicts <- y > 7 & y < 9
y[verdicts]
## [1] 7.766709 8.147243 8.831927 8.954981 7.544503 7.149351 8.688917 7.147429
## [9] 8.906385 8.843096 7.805055 8.369364 8.629683 7.914165 7.610241 7.731196
## [17] 8.789806
Review assignment
A. Create a vector named nummies of all numbers from 1 to 100
B. Create another vector named little_nummies which consists of all those numbers which are less than or equal to 30
C. Create a boolean vector named these_are_big which indicates whether each element of nummies is greater than or equal to 70
D. Use these_are_big to subset nummies into a vector named big_nummies
E. Create a new vector named these_are_not_that_big which indicates whether each element of nummies is greater than 30 and less than 70. You’ll need to use the & symbol.
F. Create a new vector named meh_nummies which consists of all nummies which are greater than 30 and less than 70.
G. How many numbers are greater than 30 and less than 70?
H. What is the sum of all those numbers in meh_nummies?