Module 24 Review exercise: Cleaning messy data

Learning goals

  • This is a review exercise: practice the skills introduced in the previous modules by applying them to a universal data science task, cleaning up messy data.

Your mission

In the module on joining datasets, we introduced a dataset of whale diving behaviour:

whales-dives.csv

dives <- read.csv("./data/whales-dives.csv")
head(dives)
           id species behavior prey.volume prey.depth dive.time surface.time
1 20140811106      HW     FEED    6.914610     120.76    351.00          237
2 20140812104      HW     FEED    7.854762      79.02    281.00           87
3 20140812107      HW     FEED    7.385667      96.92    300.25           80
4 20140812109      FW     FEED    6.626298     105.87    366.00          189
5 20140812131      HW    OTHER    6.356474     123.95    357.00          112
6 20140812140      FW     FEED    3.820782     125.51    408.00          182
  blow.interval blow.number
1        26.833      10.000
2        14.412       6.667
3        16.000       6.000
4        16.273      12.000
5        25.250       6.000
6        18.789      11.000

This dataframe is nice and tidy. Here are a few of its many tidy features:

  • Each row is a single observation.

  • There is not a single missing value anywhere in the dataset.

  • The rows are organized from earliest to most recent, based on the date embedded in the id column.

  • Categorical columns have standardized formatting. In the species column, there are two levels: HW (humpback whale) and FW (fin whale). In the behavior column, there are also two levels: FEED and OTHER.
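
The tidy features listed above can each be checked with a line or two of base R. The sketch below is only an illustration: `toy` is a hand-typed copy of three columns from the six rows printed earlier, not the full dataset, and treating the first eight digits of id as a YYYYMMDD date is an assumption based on how the values look.

```r
# Toy copy of three columns from the six rows printed above
toy <- data.frame(
  id       = c(20140811106, 20140812104, 20140812107,
               20140812109, 20140812131, 20140812140),
  species  = c("HW", "HW", "HW", "FW", "HW", "FW"),
  behavior = c("FEED", "FEED", "FEED", "FEED", "OTHER", "FEED"),
  stringsAsFactors = FALSE
)

# Are any rows exact duplicates? (0 means none)
anyDuplicated(toy)

# Are there any missing values?
anyNA(toy)

# Are the rows out of order by id?
is.unsorted(toy$id)

# Assumption: the first eight digits of id encode a date as YYYYMMDD
id_text <- sprintf("%.0f", toy$id)
dates   <- as.Date(substr(id_text, 1, 8), format = "%Y%m%d")
is.unsorted(dates)

# Which codes appear in the categorical columns?
sort(unique(toy$species))
sort(unique(toy$behavior))
```

Running the same checks on your own my_dives as you work is a quick way to see which cleaning steps are still outstanding.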

But this dataset was not always so pretty. Here is the link to the original data file:

whales-dives-messy.csv

Your task in this review exercise is to write a script that carries out the necessary data cleaning steps to get this dataset from its original form to its tidy form.
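
As a starting point, here is a minimal sketch of what such a script might look like. The `messy` data frame is invented for illustration, and its problems (stray whitespace, inconsistent case, a missing value, shuffled rows) are assumptions; the real whales-dives-messy.csv may need different or additional steps, so inspect it before deciding what to do.

```r
# Hypothetical messy data; the real file may have different problems
messy <- data.frame(
  id        = c(20140812104, 20140811106, 20140812107, 20140812109),
  species   = c("hw ", "HW", "Fw", "fw"),
  behavior  = c("feed", "FEED", " other", "FEED"),
  dive.time = c(281, 351, NA, 366),
  stringsAsFactors = FALSE
)

clean <- messy

# Standardize text columns: trim whitespace, then convert to upper case
clean$species  <- toupper(trimws(clean$species))
clean$behavior <- toupper(trimws(clean$behavior))

# Drop rows that contain any missing value
clean <- clean[complete.cases(clean), ]

# Sort rows from the smallest id to the largest
clean <- clean[order(clean$id), ]

# Reset row names so they run 1, 2, 3, ...
rownames(clean) <- NULL

clean
```

Working on a copy (`clean`) rather than overwriting `messy` makes it easy to compare before and after at each step.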

Test your work along the way, then demonstrate its completion, using the identical() function. If your my_dives version of the dataset is identical to the dives data above, the following logical test will be TRUE:

identical(dives, my_dives)
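
Note that identical() is strict: two data frames that print the same can still differ in column types or row names. The sketch below uses two made-up data frames, `a` and `b`, to show one way of tracking down such differences with all.equal() and sapply(); the specific mismatches here are illustrative assumptions, not necessarily the ones you will run into.

```r
# Two toy data frames with the same values but different storage details
a <- data.frame(id = c(1, 2, 3), blow.number = c(10, 6, 12))
b <- data.frame(id = c(3, 1, 2), blow.number = c(12L, 10L, 6L))
b <- b[order(b$id), ]   # sorted, but the old row names are carried along

# Strict test: values, types, and attributes must all match
identical(a, b)

# Gentler test: returns TRUE or a description of the differences
all.equal(a, b)

# Compare column classes side by side
sapply(a, class)
sapply(b, class)

# Fix the differences: reset row names and convert integer to numeric
rownames(b) <- NULL
b$blow.number <- as.numeric(b$blow.number)

identical(a, b)
```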

Enjoy!