Module 30 Cleaning messy data

Learning goals

  • This is a review exercise: practice the skills introduced in the previous modules by applying them to a universal data science task, cleaning up messy data.

Your mission

In the module on joining datasets, we introduced a dataset of whale diving behaviour:

whales-dives.csv

This dataframe is nice and tidy. Here are a few of its many tidy features:

  • Each row is a single observation.

  • There is not a single missing value anywhere in the dataset.

  • The rows are organized from earliest to most recent, based on the date embedded in the sit column.

  • Categorical columns have standardized formatting. In the species column, there are two levels: HW (humpback whale) and FW (fin whale). In the behavior column, there are also two levels: FEED and OTHER.
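
Each of the features listed above can be checked in code. The following is a minimal sketch in base R that uses a small made-up data frame (toy) rather than the real file. The sit values are hypothetical and assume the identifier begins with a date that sorts chronologically as text, which may not match the actual format.

```r
# Toy stand-in for the tidy data; all values are invented for illustration.
toy <- data.frame(
  sit      = c("20190601-01", "20190601-02", "20190715-01"),  # hypothetical ID format
  species  = c("HW", "FW", "HW"),
  behavior = c("FEED", "OTHER", "FEED"),
  stringsAsFactors = FALSE
)

# No missing values anywhere in the data frame
no_missing <- !any(is.na(toy))

# Rows run from earliest to most recent
# (assumes sit sorts chronologically when compared as text)
in_order <- !is.unsorted(toy$sit)

# Categorical columns contain only the expected levels
species_ok  <- all(toy$species %in% c("HW", "FW"))
behavior_ok <- all(toy$behavior %in% c("FEED", "OTHER"))

c(no_missing, in_order, species_ok, behavior_ok)
```

Checks like these are handy to rerun after each cleaning step, since they show which features are still unmet.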

But this dataset was not always so pretty. Here is the link to the original data file:

whales-dives-messy.csv

Your task in this review exercise is to write a script that carries out the necessary data cleaning steps to get this dataset from its original form to its tidy form.
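
The specific problems in the messy file are for you to discover, so the following is only a generic sketch of the kinds of steps a cleaning script often contains, run on an invented data frame (messy). The values, the problems shown, and the choice of fixes are all assumptions for illustration, not a description of whales-dives-messy.csv.

```r
# Invented messy data: stray whitespace, inconsistent case,
# a missing value, and rows out of order.
messy <- data.frame(
  sit      = c("20190715-01", "20190601-02", "20190601-01", "20190801-01"),
  species  = c("hw ", "FW", " Hw", NA),
  behavior = c("feed", "Other", "FEED", "other"),
  stringsAsFactors = FALSE
)

cleaned <- messy

# Drop rows that contain any missing value
cleaned <- cleaned[complete.cases(cleaned), ]

# Standardize categorical columns: trim whitespace, convert to upper case
cleaned$species  <- toupper(trimws(cleaned$species))
cleaned$behavior <- toupper(trimws(cleaned$behavior))

# Sort rows by the sit column
# (assumes sit sorts chronologically when compared as text)
cleaned <- cleaned[order(cleaned$sit), ]

# Reset row names so they do not carry over the old row positions
rownames(cleaned) <- NULL

cleaned
```

Whether a row with a missing value should be dropped or repaired depends on the data, so inspect such rows before deciding.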

Test your work along the way, and demonstrate that it is complete, using the identical() function. If your my_dives version of the dataset is identical to the dives data above, the logical test identical(my_dives, dives) will return TRUE.
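
As an illustration of testing along the way, the sketch below compares two small invented data frames that stand in for dives and my_dives. Because identical() returns only a single TRUE or FALSE, all.equal() is used here as one possible way to find out where two objects differ. The toy values are assumptions, not the real data.

```r
# Toy stand-in for the tidy reference data
dives <- data.frame(
  species  = c("HW", "FW"),
  behavior = c("FEED", "OTHER"),
  stringsAsFactors = FALSE
)

# An attempt that still has one formatting problem
my_dives <- dives
my_dives$behavior[2] <- "other"

identical(my_dives, dives)   # FALSE: one value still differs
all.equal(my_dives, dives)   # describes where the two objects differ

# After fixing the remaining problem, the test passes
my_dives$behavior <- toupper(my_dives$behavior)
identical(my_dives, dives)   # TRUE
```

Note that identical() also compares attributes such as column types and row names, so two data frames that look the same when printed can still fail the test.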

Enjoy!