Module 12 Subsetting & filtering
Learning goals
- Understand how to subset / filter data
You were introduced to subsetting and filtering briefly in previous modules, but the concept is so important that we want to devote an entire module to practicing it.
Subsetting with indices
You have already learned that certain elements of a vector can be called by specifying an index:
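As a minimal refresher, here is a sketch using a hypothetical vector `x` holding the values 55 through 59. Those values are an assumption chosen for illustration; only the elements 56, 57, and 58 are confirmed by the output shown later in this module.

```r
# Hypothetical vector for illustration (values assumed, not from the original lesson)
x <- 55:59

# Call the third element by its index (R counts from 1)
third <- x[3]
third
# [1] 57
```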
Remember: brackets indicate that you don’t want everything from a vector; you just want certain elements. ‘I want x, but not all of it.’
You can also subset an object by calling multiple indices:
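A short sketch of what that might look like, again assuming a hypothetical `x` of 55 through 59. The names `picked` and `ranged` are made up for this example.

```r
# Hypothetical vector for illustration (values assumed)
x <- 55:59

# Call the first and fourth elements by wrapping the indices in c()
picked <- x[c(1, 4)]
picked
# [1] 55 58

# A run of consecutive indices can be written with the colon operator
ranged <- x[2:4]
ranged
# [1] 56 57 58
```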
Subsetting with booleans
You can also subset objects with ‘booleans’. This will eventually be your most common way of filtering data, by far.
Recall that boolean / logical data have two possible values: TRUE or FALSE. For example:
# Store Joe's age
joes_age <- 35
# Set the cutoff for old age
old_age <- 36
# Ask whether Joe is old
joes_age >= old_age
[1] FALSE
Recall also that you can calculate whether a condition is TRUE or FALSE on multiple elements of a vector. For example:
# Build a vector of multiple ages
ages <- c(10, 20, 30, 40, 50, 60)
# Set the cutoff for old age
old_age <- 36
# Ask which ages are considered old
ages >= old_age
[1] FALSE FALSE FALSE TRUE TRUE TRUE
Boolean vectors are super useful for subsetting. Think of ‘subsetting’ as keeping only those elements of a vector for which a condition is TRUE.
# Now subset to the second, third, and fourth element
x[c(FALSE, TRUE, TRUE, TRUE, FALSE)]
[1] 56 57 58
That command returned the elements for which the subsetting vector was TRUE.
This is equivalent to…
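The original comparison is not shown here, so the following is only a sketch of one likely equivalent: subsetting by the indices 2, 3, and 4. It assumes `x` holds 55 through 59; only the middle three values are confirmed by the output above. The names `by_boolean`, `by_index`, and `same` are hypothetical.

```r
# Hypothetical vector for illustration (first and last values are assumed)
x <- 55:59

# Subset with a boolean vector: keep positions marked TRUE
by_boolean <- x[c(FALSE, TRUE, TRUE, TRUE, FALSE)]

# Subset with indices: keep positions 2, 3, and 4
by_index <- x[c(2, 3, 4)]

# Check that both approaches return exactly the same vector
same <- identical(by_boolean, by_index)
same
# [1] TRUE
```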
You can also get the same result using a logical test, since logical tests return boolean values:
# Develop your logical test: ask which values of x are in the vector 56:58
x %in% c(56,57,58)
[1] FALSE TRUE TRUE TRUE FALSE
# Now plug that test into the subsetting brackets
x[ x %in% c(56,57,58) ]
[1] 56 57 58
This method gets really useful when you are working with bigger datasets, such as this one:
With a dataset like this, you can use a boolean filter to figure out how many values are greater than, say, 90.
First, develop your logical test, which will tell you whether each value in the vector is greater than 90:
# Develop your logical test
y > 90
[1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
[25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[37] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
[61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[85] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[97] TRUE TRUE TRUE TRUE
Now, to get the values corresponding to each TRUE in this vector, plug your logical test into your subsetting brackets.
y[y > 90]
[1] 530 879 606 460 537 417 792 989 121 478 159 897 890 516 104 526 372 712 370
[20] 909 310 403 967 276 735 975 326 739 610 607 809 431 657 361 728 143 805 261
[39] 912 816 870 497 235 971 854 774 376 936 608 804 216 95 453 347 979 156 826
[58] 286 678 933 223 644 800 385 323 557 838 840 405 996 470 563 371 544 437 113
[77] 192 753 931 180 738 873 451 336 182 585 523 727 137 204 749 843
Here’s another way you can do the same thing:
# Save the result of your logical test in a new vector
verdicts <- y > 90
# Use that vector to subset y
y[verdicts]
[1] 530 879 606 460 537 417 792 989 121 478 159 897 890 516 104 526 372 712 370
[20] 909 310 403 967 276 735 975 326 739 610 607 809 431 657 361 728 143 805 261
[39] 912 816 870 497 235 971 854 774 376 936 608 804 216 95 453 347 979 156 826
[58] 286 678 933 223 644 800 385 323 557 838 840 405 996 470 563 371 544 437 113
[77] 192 753 931 180 738 873 451 336 182 585 523 727 137 204 749 843
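If what you want is the number of values that pass the test rather than the values themselves, one option is to count the TRUEs. The sketch below uses a small hypothetical vector named `scores` (not the dataset above) so that the result is easy to check by eye.

```r
# Hypothetical small dataset for illustration (not the original y)
scores <- c(12, 95, 140, 88, 530, 90, 91)

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
n_big <- sum(scores > 90)
n_big
# [1] 4

# Equivalent: count the elements that survive the filter
n_big_again <- length(scores[scores > 90])
n_big_again
# [1] 4
```

Note that 90 itself is not counted, because the test is strictly greater than 90.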
You can use double logical tests too. For example, what if you want all elements between the values 70 and 90?
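A minimal sketch of a double test, assuming ‘between’ means strictly greater than 70 and strictly less than 90. It uses a hypothetical vector named `scores` rather than the dataset above; the & symbol keeps only the elements for which both tests are TRUE.

```r
# Hypothetical small dataset for illustration (not the original y)
scores <- c(12, 95, 75, 88, 530, 90, 70, 81)

# Combine two logical tests with &: both must be TRUE for an element to pass
in_between <- scores > 70 & scores < 90

# Plug the combined test into the subsetting brackets
kept <- scores[in_between]
kept
# [1] 75 88 81
```

If the endpoints 70 and 90 should be included, swap in >= and <= instead.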
Review assignment
1. Create a vector named nummies of all numbers from 1 to 100.
2. Create another vector named little_nummies which consists of all those numbers which are less than or equal to 30.
3. Create a boolean vector named these_are_big which indicates whether each element of nummies is greater than or equal to 70.
4. Use these_are_big to subset nummies into a vector named big_nummies.
5. Create a new vector named these_are_not_that_big which indicates whether each element of nummies is greater than 30 and less than 70. You’ll need to use the & symbol.
6. Create a new vector named meh_nummies which consists of all nummies which are greater than 30 and less than 70.
7. How many numbers are greater than 30 and less than 70?
8. What is the sum of all those numbers in meh_nummies?