- Understand the difference between vector and raster data, and when to use which.
- Learn how to work with vector and raster data in base maps in
- Learn how to work with vector and raster data in
For some good examples of bad maps – the kind of thing we are trying to help you avoid – check out this Twitter feed.
We don’t often read scatter plots in life, but we read maps all the time. Learning how to work with spatial data and create beautiful maps are essential skills as a data scientist. To get there, you need to become familiar with Geographical Information Systems, or GIS.
In a GIS, spatial data are tied together in a database that makes it (1) easy to map the data and (2) easy to access all sorts of information underlying each spatial feature.
GIS matters because most data, it turns out, have an important geographic component. There are nearly always important geographic questions you can ask about your data.
In your work with spatial data, you are going to encounter two main types of data:
Vectors are discrete shapes built up by individual points.
There are three common types of vectors:
A point is a single location (zero-dimensional).
A line is a set of points connected by straight lines (1-dimensional).
A polygon is a shape formed by a line whose beginning and end are the same location: it is a line that begins and ends at the same spot and thus encloses an area. Polygons have an inside and an outside.
Vectors are best used to represent discrete objects or areas: the location of an observation, a river, a road, a building, a country’s boundary, etc.
Most vectors have two types of data: geographic information, such as coordinates, and tabular data with attributes for the vector. For this reason, vectors are usually saved in shapefiles, which typically require a folder of several different files to use. That’s why most shapefiles are downloaded as zipped folders.
Rasters are grids. They have rows and columns, just like a dataframe, and each grid cell represents a single value. Rasters are best used for continuous phenomena – in other words, things that occur everywhere: temperature, elevation, humidity, etc.
We interact with rasters all the time: think of smartphone screens and digital photos. Such images are simply rasters of pixels. Each pixel has a value representing color and brightness.
R as a GIS, first download a bunch of libraries that are likely to be useful:
We’re going to do some exercises with Tennessee zip codes.
What is a zip code?
What kind of data format is zip code data likely to be in?
Load up some tabular data on zip codes by running the below:
Inspect the data. What is the unit of observation.
Make a histogram of population.
Which zip code is the most populous?
Which city has the most zip codes?
What is the total population of Tennessee?
Make a chart of the 5 most populous cities in Tennessee, showing their population.
Load up some geographic data on Tennessee zip codes by running the below:
Plot the map using the
Run the below in order to convert the format to a ggplot-friendly dataframe.
Make a ggplot of the tng object.
tnz, make a new variable named
percent_waterwhich shows the percentage of each zip code that is water (in area).
Bring data from
tngby running the below:
Modify the above ggplot so that the filled color of each polygon reflects the new variable,
Make a leaflet map with no polygons on it.
Add the Tennessee zip codes to the leaflet map.
Make the width of the lines thinner and make the background totally transparent.
We want to make a map showing population density. We’ll start by making a color palette:
- Make a choropleth of population in leaflet.
Make a choropleth of percentage water in leaflet. Make it blue.
Create a dataframe of the “centroid” of each zip code by running the below:
- Define the location, in longitude, latitude, of Sewanee
geospherepackage to get the distance from Sewanee to each zip code’s centroid.
- Deposit the
distancesobject as a column in the
- Make a leaflet choropleth of distance to Sewanee.
To download some raster data, user the function
getData() from the
This function is great for getting country-level data. Use
ccodes() to get the codes for each country:
NAME ISO3 ISO2 NAME_ISO NAME_FAO 1 Afghanistan AFG AF AFGHANISTAN Afghanistan 2 Akrotiri and Dhekelia XAD <NA> <NA> <NA> 3 Åland ALA AX ÅLAND ISLANDS <NA> 4 Albania ALB AL ALBANIA Albania 5 Algeria DZA DZ ALGERIA Algeria 6 American Samoa ASM AS AMERICAN SAMOA American Samoa NAME_LOCAL SOVEREIGN UNREGION1 UNREGION2 continent 1 Afghanestan Afghanistan Southern Asia Asia Asia 2 Akrotiri and Dhekelia United Kingdom Western Asia Asia Asia 3 Åland Finland Northern Europe Europe Europe 4 Shqiperia Albania Southern Europe Europe Europe 5 Al Jaza'ir Algeria Northern Africa Africa Africa 6 American Samoa United States Polynesia Oceania Oceania
Since the United States consists of several distant territories (Hawaii, Alaska, Samoa, etc.), the object
usa is actually a list containing an object for each territory:
Let’s plot the continental United States:
How do you interpret this map? What type of data does this raster object contain?
Now plot a separate territory:
You can also use the
getData() function to get vectors, such as the boundaries of U.S. states:
Check out this
states object. It’s a new type of data structure: a
head(states) GID_0 NAME_0 GID_1 NAME_1 VARNAME_1 NL_NAME_1 TYPE_1 ENGTYPE_1 1 USA United States USA.1_1 Alabama AL|Ala. <NA> State State 12 USA United States USA.2_1 Alaska AK|Alaska <NA> State State 23 USA United States USA.3_1 Arizona AZ|Ariz. <NA> State State 34 USA United States USA.4_1 Arkansas AR|Ark. <NA> State State 45 USA United States USA.5_1 California CA|Calif. <NA> State State 48 USA United States USA.6_1 Colorado CO|Colo. <NA> State State CC_1 HASC_1 1 <NA> US.AL 12 <NA> US.AK 23 <NA> US.AZ 34 <NA> US.AR 45 <NA> US.CA 48 <NA> US.CO
Subset this object to only the data for Tennessee:
Now plot it:
Now let’s try to combine raster data, such as elevation (stored in the object
usa[]), with vector data, such as state boundaries (stored in the object
To do so, let’s
crop the elevation raster to only the data contained with the Tennessee boundary:
Let’s load in some data specifically for the area of Sewanee, TN:
# Setup a temporary directory for these data destination_directory <- '/tmp' # Download the zipped folder of data destination_file <- file.path(destination_directory, 'sewanee.zip') download.file('https://raw.githubusercontent.com/databrew/intro-to-data-science/main/data/sewanee.zip', destfile = destination_file) # Unzip folder unzip(destination_file, exdir = destination_directory)
This zipped folder has data files for both rasters and vectors.
Read in the raster data for elevation:
Now let’s try to map Sewanee’s raster data onto the
tn vector of boundaries:
Oh no! That didn’t work. Why not? Have a look at coordinates:
It seems like these two objects are using different coordinate systems. Let’s check out what projection these objects are using:
tn coordinates are in classic lat/long, but the
elevation coordinates appear to be in UTM (Easting/Northing coordinates).
elevation’s coordinate system to than of
Great – now let’s try out plot again:
Phew – worked this time (even though it still ain’t pretty).
Now let’s bring in some vectors pertaining to the Sewanee Domain: boundaries, roads, and structures:
boundary <- readOGR(destination_directory, 'Boundary2016') OGR data source with driver: ESRI Shapefile Source: "/tmp", layer: "Boundary2016" with 7 features It has 2 fields structures <- readOGR(destination_directory, 'Domain_Structures') OGR data source with driver: ESRI Shapefile Source: "/tmp", layer: "Domain_Structures" with 1055 features It has 113 fields Integer64 fields read as strings: Total_ft2 ft2_ea_flr roads <- readOGR(destination_directory, 'Roads') OGR data source with driver: ESRI Shapefile Source: "/tmp", layer: "Roads" with 404 features It has 18 fields Integer64 fields read as strings: ID ID2 Column
This time let’s make a zoomed-in plot of Sewanee’s land, with elevation and the Domain boundary:
Uh oh. Same problem as before. We need to “reproject” boundary to latitude longitude.
This time, since we are now transforming a vector, we need to use the function
spTransform() instead of
Okay, let’s retry our plot:
mask to get just the elevation for the domain.
If we wanted more information about this
domain_elev raster, we could use some special exploration functions for rasters:
Now let’s add roads to the plot. This time, we will preemptively re-project the
What is we only want to use the roads that cross the Domain boundary? To do so, we will user the
Now let’s re-do the plot:
Let’s place Sewanee’s elevation raster onto a
First, let’s make a color palette (i.e., a chloropleth) to represent elevation:
We add the raster to the
leaflet map using the