This brief tutorial will demonstrate how to create a basic plot in R from a text file of data. This introduction provides an entry point for those unfamiliar with R (or a refresher for those who are rusty). We will start with a very minimal piece of code and work our way up to code that automates the creation of 12 different PDF files, each with a different X-Y scatterplot.

Learning objectives

  1. Gain familiarity with R
  2. Read data from a file
  3. Visualize data in a graph
  4. Understand the principle of control flow

In this tutorial, we will be using the ‘gapminder’ dataset, available here: http://tinyurl.com/gapminder-five-year-csv (right-click or Ctrl-click on the link and choose Save As…). Make sure to save it somewhere you can remember (like your Desktop), as we will need to move it later.

Setup

First we need to set up our development environment. Open RStudio and create a new project.

We need to create two folders: ‘data’ will store the data we will be analyzing, and ‘output’ will store the results of our analyses. In the RStudio console:

dir.create(path = "data")
dir.create(path = "output")
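Running dir.create on a folder that already exists produces a warning instead of creating anything. As a minimal sketch (the loop variable name folder is only illustrative), the two calls above could be guarded with base R's dir.exists so the code can be re-run safely:

```r
# Create each folder only if it is not already present
for (folder in c("data", "output")) {
  if (!dir.exists(folder)) {
    dir.create(path = folder)
  }
}

# Both folders should now exist in the working directory
dir.exists(c("data", "output"))
```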

Finally, move the file you downloaded (gapminder-FiveYearData.csv) into the data folder you just created. This is best accomplished outside of RStudio, using your computer’s file management system.

Your first plot

Plot the points!

# Read in data
all_gapminder <- read.csv(file = "data/gapminder-FiveYearData.csv",
                          stringsAsFactors = TRUE)
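The stringsAsFactors = TRUE argument matters because, since R 4.0.0, read.csv leaves text columns as plain character vectors unless told otherwise. The sketch below shows the difference by reading a two-row stand-in table from a string rather than from the real file; the toy_csv name and its contents are assumptions made only for illustration.

```r
# A tiny stand-in for the real CSV file (illustrative rows only)
toy_csv <- "country,continent,lifeExp
Afghanistan,Asia,28.801
Albania,Europe,55.23"

# Read the same text twice, with and without conversion to factors
as_text <- read.csv(text = toy_csv, stringsAsFactors = FALSE)
as_factor <- read.csv(text = toy_csv, stringsAsFactors = TRUE)

class(as_text$continent)     # "character"
class(as_factor$continent)   # "factor"
levels(as_factor$continent)  # "Asia" "Europe"
```

Later steps in this tutorial call levels on the continent column, which only returns something useful when that column is a factor.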

When reading data into R, it is always a good idea to make sure you read in the data correctly. There are a number of ways to investigate your data, but three common methods are:

  • head shows us the first 6 rows in a data frame.
  • str (structure) provides some information about the data stored in the data frame.
  • summary provides even more information about the data, including some summary statistics for numerical data.

# Investigate data
head(all_gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134
str(all_gapminder)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
summary(all_gapminder)
##         country          year           pop               continent      lifeExp        gdpPercap       
##  Afghanistan:  12   Min.   :1952   Min.   :6.001e+04   Africa  :624   Min.   :23.60   Min.   :   241.2  
##  Albania    :  12   1st Qu.:1966   1st Qu.:2.794e+06   Americas:300   1st Qu.:48.20   1st Qu.:  1202.1  
##  Algeria    :  12   Median :1980   Median :7.024e+06   Asia    :396   Median :60.71   Median :  3531.8  
##  Angola     :  12   Mean   :1980   Mean   :2.960e+07   Europe  :360   Mean   :59.47   Mean   :  7215.3  
##  Argentina  :  12   3rd Qu.:1993   3rd Qu.:1.959e+07   Oceania : 24   3rd Qu.:70.85   3rd Qu.:  9325.5  
##  Australia  :  12   Max.   :2007   Max.   :1.319e+09                  Max.   :82.60   Max.   :113523.1  
##  (Other)    :1632
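A few other base R functions give quick sanity checks on a freshly read data frame. The sketch below uses a small stand-in data frame (toy, built from the first rows shown by head above) so that it runs without the data file; applied to all_gapminder, dim should agree with the 1704 observations of 6 variables reported by str.

```r
# Small stand-in data frame; the real data come from read.csv
toy <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Afghanistan"),
  year = c(1952, 1957, 1962),
  lifeExp = c(28.801, 30.332, 31.997)
)

dim(toy)         # number of rows, then number of columns: 3 3
names(toy)       # column names: "country" "year" "lifeExp"
sum(is.na(toy))  # count of missing values: 0
```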

Because we are only interested in graphing data from 2002, we just pull out those data.

# Subset data, retaining only those data from 2002
gapminder <- all_gapminder[all_gapminder$year == 2002, ]
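The subsetting line works in two steps: the comparison produces one TRUE or FALSE per row, and the square brackets keep only the rows marked TRUE. A minimal sketch on a made-up four-row data frame (the toy and keep names are illustrative):

```r
# Four illustrative rows standing in for the full data
toy <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Albania", "Albania"),
  year = c(1997, 2002, 1997, 2002)
)

# Step 1: one logical value per row
keep <- toy$year == 2002
keep  # FALSE TRUE FALSE TRUE

# Step 2: rows go before the comma; leaving the column slot empty
# after the comma keeps every column
toy_2002 <- toy[keep, ]
nrow(toy_2002)  # 2
```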

Last but not least, we use the plot function to draw the plot.

# Plot points
plot(x = gapminder$gdpPercap, 
     y = gapminder$lifeExp)


Make it pretty

Before we get to the next step, we need to start using scripts rather than typing directly into the console. Scripts are the ideal way to save your work. You can create a new script in RStudio via File > New File > R Script (or the shortcut Ctrl+Shift+N / Cmd+Shift+N). Name the script something like “r-graphing.R” (without the quotes); you will have to add the .R extension when you name the file. At the start of every script you write, you should provide at least some basic information as comments: a short description of what the script does, your name, a way to contact you, and the date.

Fix those axis titles

Our original graph was a good start, but those axis labels are pretty unseemly. We can fix those up by setting values in the plot function call. Namely, we will pass values to the main (plot title), xlab (x-axis label), and ylab (y-axis label) parameters.

# Graphing Life Expectancy vs. GDP
# Jeffrey C. Oliver
# jcoliver@email.arizona.edu
# 2016-11-15

# Read in data
all_gapminder <- read.csv(file = "data/gapminder-FiveYearData.csv",
                          stringsAsFactors = TRUE)

# Subset data
gapminder <- all_gapminder[all_gapminder$year == 2002, ]

plot(x = gapminder$gdpPercap, 
     y = gapminder$lifeExp, 
     main = "Life expectancy v. GDP", 
     xlab = "GDP Per capita", 
     ylab = "Life expectancy (years)")


Changing scales

Log-transform data

This does not look like a linear relationship, but we can try a simple log-transformation on the GDP data to see how that looks.

# Graphing Life Expectancy vs. GDP
# Jeffrey C. Oliver
# jcoliver@email.arizona.edu
# 2016-11-15

# Read in data
all_gapminder <- read.csv(file = "data/gapminder-FiveYearData.csv",
                          stringsAsFactors = TRUE)

# Subset data
gapminder <- all_gapminder[all_gapminder$year == 2002, ]

# Make new vector of Log GDP
gapminder$Log10GDP <- log10(gapminder$gdpPercap)

plot(x = gapminder$Log10GDP, 
     y = gapminder$lifeExp, 
     main = "Life expectancy v. GDP", 
     xlab = "Log(GDP Per capita)", 
     ylab = "Life expectancy (years)")
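To see what the transformation does, note that log10 turns each tenfold increase into a step of one. The sketch below uses made-up numbers (gdp and life are illustrative, not taken from the dataset) and also shows an alternative that is an assumption about what you might prefer rather than part of the script above: the log argument of plot, which draws the x-axis on a log scale while keeping the tick labels in the original units.

```r
# Illustrative values only
gdp <- c(100, 1000, 10000, 100000)
life <- c(45, 58, 70, 80)

# Each tenfold increase in GDP becomes one unit on the log10 scale
log10(gdp)  # 2 3 4 5

# Alternative: leave the data untransformed and ask plot for a log x-axis
plot(x = gdp,
     y = life,
     log = "x",
     xlab = "GDP per capita (log scale)",
     ylab = "Life expectancy (years)")
```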


Make it prettier

Color the points

We are not restricted to black and white colors. Here we will color points by the continent each country is located on.

# Graphing Life Expectancy vs. GDP
# Jeffrey C. Oliver
# jcoliver@email.arizona.edu
# 2016-11-15

# Read in data
all_gapminder <- read.csv(file = "data/gapminder-FiveYearData.csv",
                          stringsAsFactors = TRUE)

# Subset data
gapminder <- all_gapminder[all_gapminder$year == 2002, ]

# Make new vector of Log GDP
gapminder$Log10GDP <- log10(gapminder$gdpPercap)

Start by looking at the different values in the gapminder$continent vector.

# What are the possible values for continent?
levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
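One thing to keep in mind is that levels reports every category a factor knows about, including categories that no longer appear after subsetting. The sketch below illustrates this on a small made-up factor (the continent and no_europe names are illustrative); table gives the number of entries per level, so an unused level shows up with a count of zero.

```r
# A small illustrative factor
continent <- factor(c("Africa", "Asia", "Europe", "Asia"))

# Remove every "Europe" entry
no_europe <- continent[continent != "Europe"]

levels(no_europe)              # still "Africa" "Asia" "Europe"
table(no_europe)               # Europe is listed with a count of 0
levels(droplevels(no_europe))  # "Africa" "Asia"
```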

There are five possible values, so we will need five different colors. The first step is to create a new vector in the gapminder data frame; call it colors and fill it with NA values. Then assign colors based on the value in the gapminder$continent vector.

# Create new vector for colors
gapminder$colors <- NA

# Assign colors based on gapminder$continent
gapminder$colors[gapminder$continent == "Africa"] <- "red"
gapminder$colors[gapminder$continent == "Americas"] <- "orange"
gapminder$colors[gapminder$continent == "Asia"] <- "forestgreen"
gapminder$colors[gapminder$continent == "Europe"] <- "darkblue"
gapminder$colors[gapminder$continent == "Oceania"] <- "violet"

# Create main plot
plot(x = gapminder$Log10GDP, 
     y = gapminder$lifeExp, 
     main = "Life expectancy v. GDP", 
     xlab = "Log(GDP Per capita)", 
     ylab = "Life expectancy (years)",
     col = gapminder$colors,
     pch = 18) # A diamond symbol

# We will also need to add a legend, so we know what the colors mean.
# Here we have to be sure the order of the colors matches the order 
# of the different levels of gapminder$continent.
legend("topleft", 
       legend = levels(gapminder$continent), 
       col = c("red", "orange", "forestgreen", "darkblue", "violet"),
       pch = 18)
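Starting the colors column as NA has a practical benefit: any continent that is skipped or misspelled in the assignments leaves NA behind, which is easy to check for. A minimal sketch on a made-up vector, where the "Europe" assignment is left out on purpose:

```r
# Illustrative factor and an all-NA vector of the same length
continent <- factor(c("Asia", "Europe", "Asia", "Africa"))
colors <- rep(NA, length(continent))

# Assign colors for only two of the three continents
colors[continent == "Africa"] <- "red"
colors[continent == "Asia"] <- "forestgreen"

colors              # "forestgreen" NA "forestgreen" "red"
any(is.na(colors))  # TRUE, so at least one point has no color yet
```

Running any(is.na(gapminder$colors)) after the assignments in the script is a quick way to confirm that every row received a color.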


Prevent mistakes

Abstract the code

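When writing code, abstraction helps ensure consistency: store a value such as the plotting symbol or the continent colors in a variable once, then reuse that variable everywhere the value is needed. The script below also adds a dashed regression line to the plot.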

# Graphing Life Expectancy vs. GDP
# Jeffrey C. Oliver
# jcoliver@email.arizona.edu
# 2016-11-15

# Read in data
all_gapminder <- read.csv(file = "data/gapminder-FiveYearData.csv",
                          stringsAsFactors = TRUE)

# Subset data
gapminder <- all_gapminder[all_gapminder$year == 2002, ]

# Make new vector of Log GDP
gapminder$Log10GDP <- log10(gapminder$gdpPercap)

# Store values to use for some plotting parameters
symbol <- 18
sym_size <- 1.2
continents <- levels(gapminder$continent)
continent_colors <- c("red", "orange", "forestgreen", "darkblue", "violet")

# Assign colors for each continent
gapminder$colors <- NA
gapminder$colors[gapminder$continent == continents[1]] <- continent_colors[1]
gapminder$colors[gapminder$continent == continents[2]] <- continent_colors[2]
gapminder$colors[gapminder$continent == continents[3]] <- continent_colors[3]
gapminder$colors[gapminder$continent == continents[4]] <- continent_colors[4]
gapminder$colors[gapminder$continent == continents[5]] <- continent_colors[5]

# Create main plot
plot(x = gapminder$Log10GDP, 
     y = gapminder$lifeExp, 
     main = "Life expectancy v. GDP", 
     xlab = "Log(GDP Per capita)", 
     ylab = "Life expectancy (years)",
     col = gapminder$colors,
     pch = symbol,
     cex = sym_size,
     lwd = 1.5)

# Add a legend - we don't have to worry about getting the colors 
# in the right order because we are using the same vector that we used 
# when assigning colors in the first place.
legend("topleft", 
       legend = continents, 
       col = continent_colors,
       pch = symbol)

# Add a regression line
lifeExp_lm <- lm(gapminder$lifeExp ~ gapminder$Log10GDP)
abline(reg = lifeExp_lm, lty = 2, lwd = 2)
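The last two lines of the script fit a linear model with lm and hand it to abline, which draws the fitted line on the existing plot. As a rough sketch on made-up data where the answer is known in advance (toy and toy_lm are illustrative names), coef shows the intercept and slope that abline uses:

```r
# Illustrative data lying exactly on the line y = 2x + 1
toy <- data.frame(x = c(1, 2, 3, 4, 5))
toy$y <- 2 * toy$x + 1

# Fit the model; the formula reads "y as a function of x"
toy_lm <- lm(y ~ x, data = toy)
coef(toy_lm)  # intercept 1, slope 2

# Draw the points, then overlay the fitted line as a dashed line
plot(x = toy$x, y = toy$y)
abline(reg = toy_lm, lty = 2, lwd = 2)
```

Passing the data frame through the data argument, as done here, is an alternative to writing gapminder$ in front of each variable in the formula.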


Even fewer mistakes via less code o.O

More code abstraction & loops

Continuing with abstraction, we can simplify our code with control flow - here we use a for loop to accomplish the color assignment task.
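As a minimal sketch of how a for loop steps through a vector, the example below prints each position and value of a short made-up vector. It builds the indices with seq_along, a base R function; the comparison at the end shows why that can be safer than 1:length(x) when a vector happens to be empty.

```r
# Illustrative vector to loop over
continents <- c("Africa", "Americas", "Asia")

# i takes the values 1, 2, 3 in turn
for (i in seq_along(continents)) {
  message(i, ": ", continents[i])
}

# With an empty vector, 1:length() counts backwards from 1 to 0,
# while seq_along() gives no indices at all
empty <- character(0)
1:length(empty)   # 1 0
seq_along(empty)  # integer(0)
```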

# Graphing Life Expectancy vs. GDP
# Jeffrey C. Oliver
# jcoliver@email.arizona.edu
# 2016-11-15

# Read in data
all_gapminder <- read.csv(file = "data/gapminder-FiveYearData.csv",
                          stringsAsFactors = TRUE)

# Subset data
gapminder <- all_gapminder[all_gapminder$year == 2002, ]

# Make new vector of Log GDP
gapminder$Log10GDP <- log10(gapminder$gdpPercap)

# Store values to use for some plotting parameters
symbol <- 18
sym_size <- 1.2
continents <- levels(gapminder$continent)
continent_colors <- c("red", "orange", "forestgreen", "darkblue", "violet")

# Establish empty column to store colors
gapminder$colors <- NA
# Loop over continents and assign colors
for (i in seq_along(continents)) {
  gapminder$colors[gapminder$continent == continents[i]] <- continent_colors[i]
}

# Create main plot
plot(x = gapminder$Log10GDP, 
     y = gapminder$lifeExp, 
     main = "Life expectancy v. GDP", 
     xlab = "Log(GDP Per capita)", 
     ylab = "Life expectancy (years)",
     col = gapminder$colors,
     pch = symbol,
     cex = sym_size,
     lwd = 1.5)

# Add a legend
legend("topleft", 
       legend = continents, 
       col = continent_colors,
       pch = symbol)

# Add a regression line
lifeExp_lm <- lm(gapminder$lifeExp ~ gapminder$Log10GDP)
abline(reg = lifeExp_lm, lty = 2, lwd = 2)
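The loop is not the only way to pair continents with colors. One possible alternative, sketched here on made-up data and not part of the script above, is a named vector used as a lookup table: each color is named after its continent, and indexing by the continent names returns one color per row.

```r
# Colors named by continent act as a lookup table
continent_colors <- c(Africa = "red",
                      Americas = "orange",
                      Asia = "forestgreen",
                      Europe = "darkblue",
                      Oceania = "violet")

# Illustrative factor standing in for gapminder$continent
continent <- factor(c("Asia", "Europe", "Asia", "Oceania"))

# Convert to character so the lookup is by name, not by position
point_colors <- unname(continent_colors[as.character(continent)])
point_colors  # "forestgreen" "darkblue" "forestgreen" "violet"
```

With this approach the legend could use names(continent_colors) for its labels, which keeps labels and colors in the same order by construction.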