The objectives of this tutorial are to:
One particularly powerful aspect of R is that it enables one visualize data in a variety of ways. Additionally, it gives users a wide variety of customization options. We will work below on a number of common data visualizations. But rememeber, once you get the hang of the basics, the sky is the limit! Note that this tutorial is a simplified version of [this one] (https://r4ds.had.co.nz/data-visualisation.html). For more details on what is possible, check out the source webpages!
To get started, lets load ggplot and look at one of the datasets that comes with ggplot:
library(ggplot2)
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
As we can see from the summary, this dataset includes a number of characteristics of cars that might affect fuel efficiency (in miles per gallon for city or highway driving) from 1999 to 2008. Remember that we can get more detailed information about the dataset by using the “help” function in R:
help(mpg)
A scatterplot is a two-dimensional plot where each point represents two characteristics of an observation. The two characteristics (which we can call variables) are represented by two axes of the plot. The x axis runs horizontally, while the y axis runs vertically.
The scatterplot we will create below, each point will represent information about a particular vehicle. The x axis indicates the size of the vehicle’s engine (measured as engine displacement in liters), while the y axis indicates the vehicle’s fuel efficiency (measured as miles per gallon on the highway).
We will often have hypotheses about the nature of our data prior to analyzing it. My simple prediction is that vehicles with larger engines will tend to get lower fuel efficiency. Now, we will visualize our data and see if my prediction seems to be accurate.
ggplot(data = mpg) + #this tells ggplot which dataset to use
geom_point(mapping = aes(x = displ, y = hwy)) #this sets the x and y axis and plots each data point
Now, let’s look at the plot and try to make sense of what it is telling us. First, lets look at the x axis, which represents engine displacement. If we look at engines that have around 2 liters of displacement or less, what kind of fuel efficiency are they getting (see the y axis, which represents highway fuel efficiency)? As engine displacement increases, what happens to fuel efficiency?
As we can see from the graph, as engine displacement goes up, highway fuel efficiency tends to go down.
We can also add a line of best fit, which summarizes the data in the scatterplot using a single line. Note that adding layers to plots using ggplot is very simple. We only need to use the “+” symbol.
ggplot(data = mpg) + #this tells ggplot which dataset to use
geom_point(mapping = aes(x = displ, y = hwy)) + #this sets the x and y axis and plots each data point
geom_smooth(mapping = aes(x = displ, y = hwy), method = lm) #this creates the line of best fit
## `geom_smooth()` using formula 'y ~ x'
To explore this dataset further, we can add more information to the plot. Next, we will make the points on the plot different colors based on the type of car.
ggplot(data = mpg) + #this tells ggplot which dataset to use
geom_point(mapping = aes(x = displ, y = hwy, color = class)) + #this sets the x and y axis and plots each data point, "color" is used here to differ the color of the dots based on the class of car (suv, etc.)
geom_smooth(mapping = aes(x = displ, y = hwy), method = lm, color = "black") #here, we set the color of the line explicitly to black
## `geom_smooth()` using formula 'y ~ x'
Sometimes it is useful to see how the the relationship between two variables changes based on a third. While the above graph helps us to that, we can also display each class in its own graph.
ggplot(data = mpg) + #this tells ggplot which dataset to use
geom_point(mapping = aes(x = displ, y = hwy, color = class)) + #this sets the x and y axis and plots each data point, "color" is used here to differ the color of the dots based on the class of car (suv, etc.)
geom_smooth(mapping = aes(x = displ, y = hwy), method = lm, color = "black") + #here, we set the color of the line explicitly to black
facet_wrap(~ class, ncol = 2) #this will make multiple graphs based on class. Note that ncol (number of columns) and nrow (number of rows) can be set explicitly (but don't have to be)
## `geom_smooth()` using formula 'y ~ x'
Another way that we commonly analyze data is to compare two or more groups based on particular criteria. Below, we will compare the average fuel efficiency of cars based on their class using boxplots. Note that the center line indicates the median score (in this case, fuel efficiency), the boxes indicate the 25% quantile (lower line) and 75% quantile (upper line), and the whiskers include data points that extend no more than 1.5 times the interquartile range. The dots represent outliers. (We will discuss most of these terms in the next tutorial!)
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy))
If we want, we can also plot the actual data on top of the boxplots using geom_jitter():
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy), outlier.shape = NA) +
geom_jitter(mapping = aes(x = class, y = hwy), width = 0.2)
We can also compare data based on another categorical variable. Below we distinguish each category based on drive type (four wheel drive, front wheel drive, and rear wheel drive).
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy, color = drv))
For this section, we will be working with another built in dataset that has a number of characteristics of c. 54,000 diamonds. Remember that we can use the “summary()” and “help()” functions to learn more about this dataset. For now, we will make a simple bar chart that will indicate the number of diamonds that have various cut qualities.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut)) #note that in these charts, we only need to declare an x variable
We can, of course, change many visual aspects of this plot. Below, we will vary the bar colors based on cut:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut)) #note that in these charts, we only need to declare an x variable
We can also compare the distribution of other categorical variables within each category on the x axis by changing the “fill” variable. Below, we will see the distribution of clarity within each cut category.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity)) #note that in these charts, we only need to declare an x variable
We can also adjust these plots to look at the proportion of clarity across cut by using the position = “fill” argument:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") #note that in these charts, we only need to declare an x variable
Or, we can look at histograms of clarity for each cut using the position = “dodge” argument:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") #note that in these charts, we only need to declare an x variable