Visualizing Data

Tutorial Objectives

The objectives of this tutorial are to:

Learn how to make common data visualisations in R using ggplot2, including how to:
- Make a scatterplot
  - Add a line of best fit to a scatterplot
  - Add categorical information to a scatterplot
  - Create a set of related scatterplots using facet_wrap()
- Make a boxplot
  - add additional data to a boxplot
  - add color to a boxplot
- Make a bar chart
  - add color to a bar chart
  - add additional data to a bar chart

Getting started

One particularly powerful aspect of R is that it enables one visualize data in a variety of ways. Additionally, it gives users a wide variety of customization options. We will work below on a number of common data visualizations. But rememeber, once you get the hang of the basics, the sky is the limit! Note that this tutorial is a simplified version of this one. For more details on what is possible, check out the source webpages!

To get started, lets load ggplot and look at one of the datasets that comes with ggplot:

library(ggplot2)
summary(mpg)

##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

As we can see from the summary, this dataset includes a number of characteristics of cars that might affect fuel efficiency (in miles per gallon for city or highway driving) from 1999 to 2008. Remember that we can get more detailed information about the dataset by using the “help” function in R:

help(mpg)

Making our first plot: Scatterplots

A scatterplot is a two-dimensional plot where each point represents two characteristics of an observation. The two characteristics (which we can call variables) are represented by two axes of the plot. The x axis runs horizontally, while the y axis runs vertically.

The scatterplot we will create below, each point will represent information about a particular vehicle. The x axis indicates the size of the vehicle’s engine (measured as engine displacement in liters), while the y axis indicates the vehicle’s fuel efficiency (measured as miles per gallon on the highway).

We will often have hypotheses about the nature of our data prior to analyzing it. My simple prediction is that vehicles with larger engines will tend to get lower fuel efficiency. Now, we will visualize our data and see if my prediction seems to be accurate.

ggplot(data = mpg) + #this tells ggplot which dataset to use
  geom_point(mapping = aes(x = displ, y = hwy)) #this sets the x and y axis and plots each data point

Now, let’s look at the plot and try to make sense of what it is telling us. First, lets look at the x axis, which represents engine displacement. If we look at engines that have around 2 liters of displacement or less, what kind of fuel efficiency are they getting (see the y axis, which represents highway fuel efficiency)? As engine displacement increases, what happens to fuel efficiency?

As we can see from the graph, as engine displacement goes up, highway fuel efficiency tends to go down.

Add a line of best fit

We can also add a line of best fit, which summarizes the data in the scatterplot using a single line. Note that adding layers to plots using ggplot is very simple. We only need to use the “+” symbol.

ggplot(data = mpg) + #this tells ggplot which dataset to use
  geom_point(mapping = aes(x = displ, y = hwy)) + #this sets the x and y axis and plots each data point
  geom_smooth(mapping = aes(x = displ, y = hwy), method = lm) #this creates the line of best fit

## `geom_smooth()` using formula = 'y ~ x'

Add color to the plot based on a categorical variable

To explore this dataset further, we can add more information to the plot. Next, we will make the points on the plot different colors based on the type of car.

ggplot(data = mpg) + #this tells ggplot which dataset to use
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) + #this sets the x and y axis and plots each data point, "color" is used here to differ the color of the dots based on the class of car (suv, etc.)
  geom_smooth(mapping = aes(x = displ, y = hwy), method = lm, color = "black") #here, we set the color of the line explicitly to black

## `geom_smooth()` using formula = 'y ~ x'

Create a set of scatterplots based on a categorical variable

Sometimes it is useful to see how the the relationship between two variables changes based on a third. While the above graph helps us to that, we can also display each class in its own graph.

ggplot(data = mpg) + #this tells ggplot which dataset to use
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) + #this sets the x and y axis and plots each data point, "color" is used here to differ the color of the dots based on the class of car (suv, etc.)
  geom_smooth(mapping = aes(x = displ, y = hwy), method = lm, color = "black") + #here, we set the color of the line explicitly to black
  facet_wrap(~ class, ncol = 2) #this will make multiple graphs based on class. Note that ncol (number of columns) and nrow (number of rows) can be set explicitly (but don't have to be)

## `geom_smooth()` using formula = 'y ~ x'

Working with categorial variables: Box plots

Another way that we commonly analyze data is to compare two or more groups based on particular criteria. Below, we will compare the average fuel efficiency of cars based on their class using boxplots. Note that the center line indicates the median score (in this case, fuel efficiency), the boxes indicate the 25% quantile (lower line) and 75% quantile (upper line), and the whiskers include data points that extend no more than 1.5 times the interquartile range. The dots represent outliers. (We will discuss most of these terms in the next tutorial!)

ggplot(data = mpg) + 
  geom_boxplot(mapping = aes(x = class, y = hwy))

If we want, we can also plot the actual data on top of the boxplots using geom_jitter():

ggplot(data = mpg) + 
  geom_boxplot(mapping = aes(x = class, y = hwy), outlier.shape = NA) +
  geom_jitter(mapping = aes(x = class, y = hwy), width = 0.2)

We can also compare data based on another categorical variable. Below we distinguish each category based on drive type (four wheel drive, front wheel drive, and rear wheel drive).

ggplot(data = mpg) + 
  geom_boxplot(mapping = aes(x = class, y = hwy, color = drv))

Working with frequency data: Bar Charts

For this section, we will be working with another built in dataset that has a number of characteristics of c. 54,000 diamonds. Remember that we can use the “summary()” and “help()” functions to learn more about this dataset. For now, we will make a simple bar chart that will indicate the number of diamonds that have various cut qualities.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut)) #note that in these charts, we only need to declare an x variable

We can, of course, change many visual aspects of this plot. Below, we will vary the bar colors based on cut:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut)) #note that in these charts, we only need to declare an x variable

We can also compare the distribution of other categorical variables within each category on the x axis by changing the “fill” variable. Below, we will see the distribution of clarity within each cut category.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity)) #note that in these charts, we only need to declare an x variable

We can also adjust these plots to look at the proportion of clarity across cut by using the position = “fill” argument:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") #note that in these charts, we only need to declare an x variable

Or, we can look at histograms of clarity for each cut using the position = “dodge” argument:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") #note that in these charts, we only need to declare an x variable

Introduction to Data Visualization

Kristopher Kyle

2019-02-04; revised 2022-01-20