Module 44 Joining datasets

Learning goals

  • How to join together two related datasets efficiently in R

 

It’s not unusual to have data for a project spread out across multiple datasets. The data are all related, but the fact that they are ‘packaged’ separately into different files can things difficult. To work with such relational data in R, you eventually need to merge – or join – them into a single dataframe.

For example, say you are studying the economics of professional basketball. In one dataframe, df1, you have the net worth of famous players, in millions USD…

…and in a second dataframe, df2, you have their height.

Is there a correlation between these players’ net worth and their height? To answer this question you need to join these two datasets together. But the two datasets don’t contain the exact same roster of players, and the players are not in the same order. Hmm. Is there an efficient way to join these two datasets?

Yes! Thanks to the family of join() functions from the package dplyr.

Joining with dplyr

These join() functions relate two dataframes together according to a common column name. In order for these functions to work, each dataframe has to have a column of the same name. In the case of df1 and df2, that shared column is player.

There are four join() functions you should know how to use:

left_join() joins the two datasets together, keeping all rows in the first dataframe you feed it.

Notice that none of the players unique to df2 made it into this output; only players in df1 (i.e., the left of the two dataframes listed) remain. Also notice that any player from df1 who was not in df2 has an NA under the column worth. That is how these join() functions work: it fills in the data where it can, and leaves NAs where it can’t.

right_join() keeps all rows in the second dataframe you feed it:

full_join() keeping all rows in both dataframes:

Finally, inner_join() keeps only the rows that are common to both dataframes:

So, to answer our research question about the relationship between player height and net worth, we can now join these data sets and make a nice plot:

Hmm. Looks like we have an outlier!

Review exercise

This review exercise will involve both the joining skills you learned above as well as many of the skills learned in prior modules. But don’t worry: as with every exercise in this book, every puzzle here can be solved using the skills you have learned in prior modules.

Scenario: A sailing expedition conducted a survey of the whales in the fjords of British Columbia, Canada. That research produced two datasets:

whales-environment.csv

whales-dives.csv

Each row in these dataframes represent a close encounter with a whale (either a humpback whale, HW, or a fin whale, FW).

The env dataset provides details about the habitat in which the whale was found, such as seafloor depth and the amount of chlorophyll in the water (a proxy for productivity).

The dive dataset provides measurements of whale foraging behaviors, such as dive.time and the number of breaths at the surface (blow.number), as well as the quality of prey in the area (prey.volume and prey.depth).

Note that these two dataframes are linked by the id column, which is a unique code for each whale encounter.

Also note that some ids can be found in both dataframes. These are the encounters for which we have both foraging behavior data as well as environmental data.

Task 1. Summarize your dataset.

Write the necessary code and fill in the spaces in the data summary statement below:

A total of BLANK whale encounters were recorded on BLANK separate days in the years BLANK - BLANK. BLANK encounters were with humpback whales (BLANK % of all encounters), and BLANK were with fin whales (BLANK %). Foraging behavior was recorded in BLANK % of encounters (BLANK % of humpback whale encounters, and BLANK % of fin whale encounters).

Task 2. Research question: Do these two species prefer different seafloor depths?

First, create a nicely formatted histogram that portrays the data relevant to this question.

Provide the code to test this research question statistically, then write a results statement below.

Task 3. Another research question: Is prey volume correlated to chlorophyll concentration?

Whales eat tiny shrimp-like critters named krill, and krill eat tiny organisms called phytoplankton. Chlorophyll is a proxy measurement for the amount of phytoplankton in the area. Based on these datasets, how good are krill at congregating in areas of high chlorophyll concentration?

First, prepare a plot to visualize what you will be comparing:

hwe <- env[env$species=="HW",] ; head(hwe)
           id species year distance seafloor temperature salinity thermocline
1 20140811103      HW 2014  44.1021  352.881    13.49104 25.75687      9.6325
2 20140811104      HW 2014  44.2729  363.563    13.49104 25.75687      9.6325
3 20140811106      HW 2014  46.7883  361.660    13.97642 25.14305      9.7859
4 20140812102      HW 2014  44.9049  317.864    14.11392 22.26935      9.7689
5 20140812104      HW 2014  44.8370  327.717    14.12843 22.21240      9.7689
6 20140812103      HW 2014  44.6589  353.749    14.17754 22.08326      9.7426
  stratification euphotic.depth chlorophyll
1         5.0603        18.2951     81.8442
2         5.0603        18.2951     81.8442
3         5.5067        17.2179     66.1236
4         5.2032        18.0307     76.0000
5         5.2032        18.0307     76.0000
6         5.2039        18.0093     75.9543
hwd <- dive[dive$species=="HW",] ; head(hwd)
           id species behavior prey.volume prey.depth dive.time surface.time
1 20140811106      HW     FEED    6.914610     120.76    351.00          237
2 20140812104      HW     FEED    7.854762      79.02    281.00           87
3 20140812107      HW     FEED    7.385667      96.92    300.25           80
5 20140812131      HW    OTHER    6.356474     123.95    357.00          112
7 20140813105      HW     FEED    7.472343     104.13    365.00           52
8 20140813107      HW     FEED    3.261088     100.19    338.00           20
  blow.interval blow.number
1        26.833      10.000
2        14.412       6.667
3        16.000       6.000
5        25.250       6.000
7        18.610       4.500
8        33.322       1.000

nrow(hwe)
[1] 446
hwe <- hwe[hwe$id %in% hwd$id,]
nrow(hwe) ; nrow(hwd)
[1] 89
[1] 89

hwe <- hwe[order(hwe$id),]
hwd <- hwd[order(hwd$id),]
hwd$chl <- hwe$chlorophyll

par(mar=c(4.5,4.5,3,1))
plot(hwd$prey.volume ~ hwd$chl,
     xlab="Chlorophyll concentration",
     ylab="Prey volume",
     pch=16,
     cex=.8)
hwlm <- lm(hwd$prey.volume ~ hwd$chl)
abline(hwlm,col="firebrick",lwd=2)

Now carry out your statistical test.