Predicting ranges of species from latitude and longitude coordinates has become increasingly easier with a suite of R packages. This introductory tutorial will show you how to turn your coordinate data into a range map.
Species distribution modeling in becoming an increasingly important tool to understand how organisms might respond to current and future environmental changes. There is an ever-growing number of approaches and literature for species distribution models (SDMs), and you are encouraged to check out the Additional Resources section for a few of these resources. The vignette for the
dismo package is especially useful, and Jeremy Yoder’s introduction is another great place to start. In this tutorial, we’ll use publicly available data to build, evaluate, and visualize a distribution model for the saguaro cactus.
Before we do anything, we will need to make sure we have necessary software, set up our workspace, download example data, and install additional packages that are necessary to run the models and visualize their output.
The packages necessary for species distribution modeling will likely require additional, non-R software to work properly. Which software will depend on the operating system of your computer.
On Debian Linux systems, you will likely need to install the libgdal-dev package. You can do this through the terminal via
sudo apt-get install libgdal-dev.
On Windows machines, you should probably install Rtools. You can find downloads and instructions at https://cran.r-project.org/bin/windows/Rtools/.
To use the raster package on Mac OS, you’ll need to install Xcode Command Line Tools package. You can do this through a terminal via
So, to start, create a pair of folders in your workspace:
dir.create(path = "data") dir.create(path = "output")
It is good practice to keep input (i.e. the data) and output separate. Furthermore, any work that ends up in the
output folder should be completely disposable. That is, the combination of data and the code we write should allow us (or anyone else, for that matter) to reproduce any output.
The data we are working with are observations of the saguaro, Carnegiea gigantea. We are using a subset of records available from GBIF, the Global Biodiversity Information Facility. You can download the data from https://tinyurl.com/saguaro-obs; save it in the
data folder that you created in the step above.
Next, there are five additional R packages that will need to be installed:
To install these, run:
install.packages("dismo") install.packages("maptools") install.packages("rgdal") install.packages("raster") install.packages("sp")
The basic idea behind species distribution models is to take two sources of information to model the conditions in which a species is expected to occur. The two sources of information are:
dismopackage to download these data (see below).
We’ll start our script by loading those five libraries we need. And of course adding a little bit of information at the very top of our script that says what the script does and who is responsible!
# Species distribution modeling for saguaro # Jeff Oliver # email@example.com # 2018-02-27 library("sp") library("raster") library("maptools") library("rgdal") library("dismo")
There is a good chance you might have seen some red messages print out to the screen, especially when loading the
rgdal libraries. This is normal, and as long as none of the messages include “ERROR”, you can just hum right through those messages. If loading the libraries does result in an ERROR message, check to see that the libraries were installed properly.
Now that we have those packages loaded, we can download the bioclimatic variable data with the
bioclim.data <- getData(name = "worldclim", var = "bio", res = 2.5, path = "data/")
getData four critical pieces of information:
name = "worldclim": This indicates the name of the data set we would like to download
var = "bio": This tells
getDatathat we want to download all 19 of the bioclimatic variables, rather than individual temperature or precipitation measurements
res = 2.5: This is the resolution of the data we want to download; in this case, it is 2.5 minutes of a degree. For other resolutions, you can check the documentation by typing
?getDatainto the console.
path = "data/": Finally, this sets the location to which the files are downloaded. In our case, it is the
datafolder we created at the beginning.
Note also that after the files are downloaded to the
data folder, the are read into memory and stored in the variable called
# Read in saguaro observations obs.data <- read.csv(file = "data/Carnegiea-gigantea-GBIF.csv") # Check the data to make sure it loaded correctly summary(obs.data)
## gbifid latitude longitude ## Min. :2.021e+08 Min. :26.78 Min. :-114.0 ## 1st Qu.:1.453e+09 1st Qu.:32.17 1st Qu.:-111.4 ## Median :1.571e+09 Median :32.28 Median :-111.1 ## Mean :1.567e+09 Mean :32.16 Mean :-111.3 ## 3rd Qu.:1.677e+09 3rd Qu.:32.38 3rd Qu.:-111.0 ## Max. :1.806e+09 Max. :34.80 Max. :-109.3 ## NA's :3 NA's :3
Notice that there are three
NA values in the
longitude columns. Those records will not be of any use to us, so we can remove them from our data frame:
# Notice NAs - drop them before proceeding obs.data <- obs.data[!is.na(obs.data$latitude), ] # Make sure those NA's went away summary(obs.data)
## gbifid latitude longitude ## Min. :8.910e+08 Min. :26.78 Min. :-114.0 ## 1st Qu.:1.453e+09 1st Qu.:32.17 1st Qu.:-111.4 ## Median :1.571e+09 Median :32.28 Median :-111.1 ## Mean :1.575e+09 Mean :32.16 Mean :-111.3 ## 3rd Qu.:1.677e+09 3rd Qu.:32.38 3rd Qu.:-111.0 ## Max. :1.806e+09 Max. :34.80 Max. :-109.3
When we look at the
obs.data data frame now there are no
NA values, so we are ready to proceed.
To make species distribution modeling more streamlined, it is useful to have an idea of how widely our species is geographically distributed. We are going to find general latitudinal and longitudinal boundaries and store this information for later use:
# Determine geographic extent of our data max.lat <- ceiling(max(obs.data$latitude)) min.lat <- floor(min(obs.data$latitude)) max.lon <- ceiling(max(obs.data$longitude)) min.lon <- floor(min(obs.data$longitude)) geographic.extent <- extent(x = c(min.lon, max.lon, min.lat, max.lat))
Before we do any modeling, it is also a good idea to run a reality check on your occurrence data by plotting the points on a map.
# Load the data to use for our base map data(wrld_simpl) # Plot the base map plot(wrld_simpl, xlim = c(min.lon, max.lon), ylim = c(min.lat, max.lat), axes = TRUE, col = "grey95") # Add the points for individual observation points(x = obs.data$longitude, y = obs.data$latitude, col = "olivedrab", pch = 20, cex = 0.75) # And draw a little box around the graph box()
Now that our occurrence data look OK, we can use the bioclimatic variables to create a model. The first thing we want to do though is limit our consideration to a reasonable geographic area. That is, for our purposes we are not looking to model saguaro habitat suitability globally, but rather to the general southwest region. So we can restrict the biolimatic variable data to the geographic extent of our occurrence data:
# Crop bioclim data to geographic extent of saguaro bioclim.data <- crop(x = bioclim.data, y = geographic.extent) # Build species distribution model bc.model <- bioclim(x = bioclim.data, p = obs.data)
## Error in .xyValues(x, as.matrix(y), ...): xy should have 2 columns only. ## Found these dimensions: 400, 3
Uh oh. That’s not good. It looks like the data we passed to
bioclim is not in the right format. The clue comes in the second line of the error message:
## Found these dimensions: 400, 3. This is referring to the
obs.data data frame, which does indeed have 400 rows and three columns. From the documentation from
bioclim (see for yourself via
?bioclim in the console):
bioclim(x, p, …)
x Raster* object or matrix
p two column matrix or SpatialPoints* object
… Additional arguments
So whatever we pass to
p should only have two columns. Let’s modify the
obs.data so it only has two columns. The first column is the GBIF identifier, which we will not need, so we drop it using the negation operator (i.e. the minus sign). Then we can run the species distribution model.
# Drop unused column obs.data <- obs.data[, c("latitude", "longitude")] # Build species distribution model bc.model <- bioclim(x = bioclim.data, p = obs.data)
## Error in bioclim(data.frame(m), ...): insufficient records
What the…? OK, this error message is tougher to figure out. But let’s consider what our
obs.data data frame looks like now:
## latitude longitude ## 1 32.33556 -110.8980 ## 2 32.28267 -110.9028 ## 3 30.29105 -110.7213 ## 4 32.05413 -110.6837 ## 5 32.25111 -110.7169 ## 6 32.19404 -111.0198
The first column is latitude and the second column is longitude, which seems fine. That is, until we think about how R generally deals with coordinates. When we plot something, we generally use syntax like this:
The thing to note is that the first argument we pass is data for the x-axis and the second argument is for the y-axis. The
bioclim function is looking for data in the same order. That is, it looks at whatever we passed to
p and assumes the first column is for the x-axis and the second column is for the y-axis. But our data is in the opposite order: the first column is latitude, essentially the y-axis data, and our second column is longitude, corresponding to x-axis data. So we need to reverse the column order before we pass
# Reverse order of columns obs.data <- obs.data[, c("longitude", "latitude")] # Build species distribution model bc.model <- bioclim(x = bioclim.data, p = obs.data)
Woo-hoo! No errors here (hopefully).
There’s one more step we need to take before we plot the model on a map. We need to generate an object that has the model’s probability of occurrence for saguaros. We use the
predict model from the
# Predict presence from model predict.presence <- dismo::predict(object = bc.model, x = bioclim.data, ext = geographic.extent)
You might be wondering about why we use
dismo::predict rather than just
predict. Not surprisingly, different packages sometimes use the same function name to perform very different operations. In the case of
predict, there are at least three packages loaded into memory that have a
stats. Although we probably would have been fine just using
predict (R would have use the
dismo version), specifying the
dismo version explicitly communicates this fact to anyone else reading the code. So, rather than leaving others (or your future self!) guessing, we can use the
Enough! It’s time to plot. We start as we did before, with a blank gray map, add the model, and if we feel like it, add the original observations as points.
# Plot base map plot(wrld_simpl, xlim = c(min.lon, max.lon), ylim = c(min.lat, max.lat), axes = TRUE, col = "grey95") # Add model probabilities plot(predict.presence, add = TRUE) # Redraw those country borders plot(wrld_simpl, add = TRUE, border = "grey5") # Add original observations points(obs.data$longitude, obs.data$latitude, col = "olivedrab", pch = 20, cex = 0.75) box()
This plot shows the probability of occurrence of saguaros across the map. Note the values are all quite below 1.0; in fact, the maximum probability anywhere on the map is only 0.78, according to the model. However, we are pretty sure that saguaros are found across a pretty broad area of the Sonoran Desert - after all, we have the observations to prove that! If we want our map to better reflect this, we will need to re-run our analyses, but this time include some absence points, where saguaros are known to not occur. The problem is, we only have presence data for saguaros.