Module 30 Cleaning messy data
Learning goals
- This is a review exercise. Practice the skills introduced in the previous modules by applying them to a universal data science task: cleaning up messy data.
Your mission
In the module on joining datasets, we introduced a dataset of whale diving behaviour:
head(dives)
id species behavior prey.volume prey.depth dive.time surface.time
1 20140811106 HW FEED 6.914610 120.76 351.00 237
2 20140812104 HW FEED 7.854762 79.02 281.00 87
3 20140812107 HW FEED 7.385667 96.92 300.25 80
4 20140812109 FW FEED 6.626298 105.87 366.00 189
5 20140812131 HW OTHER 6.356474 123.95 357.00 112
6 20140812140 FW FEED 3.820782 125.51 408.00 182
blow.interval blow.number
1 26.833 10.000
2 14.412 6.667
3 16.000 6.000
4 16.273 12.000
5 25.250 6.000
6 18.789 11.000
This dataframe is nice and tidy. Here are a few of its many tidy features:
- Each row is a single observation.
- There is not a single missing value anywhere in the dataset.
- The rows are ordered from earliest to most recent, based on the date embedded in the id column.
- Categorical columns have standardized formatting. In the species column, there are two levels: HW (humpback whale) and FW (fin whale). In the behavior column, there are also two levels: FEED and OTHER.
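The tidy features listed above can each be checked in code. The sketch below is only an illustration: it rebuilds three columns of the six rows printed by head(dives) into a small data frame named dives_head (a name made up for this example), and it assumes that id is stored as a number, which the printed output does not confirm.

```r
# First six rows of three columns, copied from the head(dives) output.
# Assumption: id is numeric; the real column type is not shown.
dives_head <- data.frame(
  id = c(20140811106, 20140812104, 20140812107,
         20140812109, 20140812131, 20140812140),
  species = c("HW", "HW", "HW", "FW", "HW", "FW"),
  behavior = c("FEED", "FEED", "FEED", "FEED", "OTHER", "FEED"),
  stringsAsFactors = FALSE
)

# No missing values anywhere in the data frame
no_missing <- !any(is.na(dives_head))

# Rows run from the earliest id to the most recent
in_order <- !is.unsorted(dives_head$id)

# Categorical columns contain only the expected levels
species_ok <- all(dives_head$species %in% c("HW", "FW"))
behavior_ok <- all(dives_head$behavior %in% c("FEED", "OTHER"))

# All four checks should come back TRUE for tidy data
c(no_missing, in_order, species_ok, behavior_ok)
```

Running the same checks on your own version of the data as you clean it is a quick way to see which problems remain.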
But this dataset was not always so pretty. Here is the link to the original data file:
Your task in this review exercise is to write a script that carries out the necessary data cleaning steps to get this dataset from its original form to its tidy form.
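As a rough guide to what such a script might look like, here is a minimal sketch on a made-up toy data frame. The data frame messy, its values, and its particular problems are all hypothetical; the real file may need different or additional steps, and whether rows with missing values should be dropped is for you to decide from the data.

```r
# Hypothetical messy input; the real file's problems may differ
messy <- data.frame(
  id = c(3, 1, 2, 4),
  species = c("hw", " FW", "HW ", "fw"),
  dive.time = c(300, NA, 281, 366),
  stringsAsFactors = FALSE
)

# 1. Standardize categorical formatting: trim whitespace, force upper case
messy$species <- toupper(trimws(messy$species))

# 2. Drop rows that contain any missing value
cleaned <- messy[complete.cases(messy), ]

# 3. Sort rows by id, earliest first
cleaned <- cleaned[order(cleaned$id), ]

# 4. Reset row names so they run 1, 2, 3, ...
rownames(cleaned) <- NULL

cleaned
```

Each step handles one of the tidy features described above, so the script can be built and tested one step at a time.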
Test your work along the way, then demonstrate its completion using the identical() function. If your my_dives version of the dataset is identical to the dives data above, the following logical test will be TRUE:
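The check presumably takes a form like the one sketched below; the exact call is an assumption. The two data frames here are tiny made-up stand-ins for dives and my_dives, used only to show how identical() behaves before and after a cleaning step.

```r
# Toy stand-ins; in the exercise these would be the real data frames
dives <- data.frame(id = c(1, 2, 3),
                    species = c("HW", "HW", "FW"),
                    stringsAsFactors = FALSE)
my_dives <- data.frame(id = c(1, 2, 3),
                       species = c("hw", "HW", "fw"),
                       stringsAsFactors = FALSE)

# Before cleaning, the species formatting differs
before <- identical(my_dives, dives)   # FALSE

# Standardize the species codes to upper case
my_dives$species <- toupper(my_dives$species)

# After cleaning, the two data frames match exactly
after <- identical(my_dives, dives)    # TRUE
```

identical() is strict: values, column types, column names, and row names must all match. If the test returns FALSE and the reason is not obvious, all.equal(my_dives, dives) describes the differences it finds.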
Enjoy!