
Regression: Predicting dependent variable values based on independent variable values

To fully understand the information on this page, be sure to read the previous tutorial on correlation.

Regression is used for value prediction. For example, let's imagine that we have a set of writing proficiency scores (the dependent variable) and corresponding mean word frequency values (a measure of lexical sophistication; this would be the independent variable). Regression allows us to predict the writing proficiency scores based on the mean frequency values. The stronger the relationship between the two variables, the more accurate the predictions will be.

Conceptually (and mathematically), regression is related to correlation. The main difference is that the results of a regression include more information, namely the characteristics of the line of best fit through the data.
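
To make the link with correlation concrete, here is a minimal sketch using simulated data. The variables `x` and `y` and all of their values are invented for illustration and are not the tutorial data. With a single predictor, the slope of the line of best fit equals the correlation coefficient multiplied by the ratio of the two standard deviations, and the model's R-squared equals the squared correlation.

```r
# Simulated data for illustration only (NOT the tutorial dataset)
set.seed(42)
x <- rnorm(100, mean = 2.7, sd = 0.1)              # hypothetical predictor
y <- 300 - 200 * (x - 2.7) + rnorm(100, sd = 60)   # hypothetical outcome

r <- cor(x, y)      # Pearson correlation
fit <- lm(y ~ x)    # simple linear regression of y on x

# Slope and intercept implied by the correlation and descriptive statistics
slope_from_r <- r * sd(y) / sd(x)
intercept_from_r <- mean(y) - slope_from_r * mean(x)

# Slope and intercept estimated by lm()
slope_from_lm <- unname(coef(fit)["x"])
intercept_from_lm <- unname(coef(fit)["(Intercept)"])

r_squared <- summary(fit)$r.squared

print(c(slope_from_r = slope_from_r, slope_from_lm = slope_from_lm))
print(c(r_squared = r_squared, r_times_r = r^2))
```

The two slopes printed should match, as should the two R-squared values. This is the sense in which regression reports "more information" than correlation: it adds the intercept and slope that define the line.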

In this tutorial, we are going to use average word frequency scores (frequency_CW) to predict the number of words in each essay (nwords), which we are using as a [flawed] proxy for proficiency, in a corpus of learner argumentative essays.

Assumptions

The assumptions for simple (single-predictor) linear regression are almost the same as those for correlation. The main assumptions are:

  • The variables must be continuous (see other tests for ordinal or categorical data)

  • The variables must have a linear relationship

  • There are no outliers (or there are only minimal outliers in large samples)

  • The residuals (i.e., the prediction errors) are normally distributed
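
As a rough sketch of what "residuals" means in the last assumption, the example below fits a model to simulated data (all names and values are invented for illustration) and shows that each residual is simply the observed value minus the value predicted by the line of best fit. A Shapiro-Wilk test and a Q-Q plot are two common ways to inspect the distribution of residuals. They are shown here as one option, not necessarily the check used elsewhere in this tutorial.

```r
# Simulated data for illustration only (NOT the tutorial dataset)
set.seed(1)
x <- runif(80, min = 2.2, max = 3.1)        # hypothetical predictor
y <- 900 - 215 * x + rnorm(80, sd = 70)     # hypothetical outcome

fit <- lm(y ~ x)

predicted <- fitted(fit)        # values predicted by the line of best fit
res_manual <- y - predicted     # residual = observed - predicted
res_model <- residuals(fit)     # the same quantity, extracted from the model

# Shapiro-Wilk test of normality on the residuals
shapiro_result <- shapiro.test(res_model)
print(shapiro_result$p.value)

# Visual check (uncomment to draw a Q-Q plot of the residuals)
# qqnorm(res_model); qqline(res_model)
```

A large p-value from the Shapiro-Wilk test means there is no evidence against normality. With large samples the test can flag trivial departures, so it is usually read alongside the plot.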

Checking assumptions

First, we will load our data:

library(ggplot2) #load ggplot2
library(viridis) #color-friendly palettes

cor_data <- read.csv("data/correlation_sample.csv", header = TRUE) #read the spreadsheet "correlation_sample.csv" into R as a dataframe
summary(cor_data) #get descriptive statistics for the dataset
##      Score        pass.fail            Prompt              nwords     
##  Min.   :1.000   Length:480         Length:480         Min.   : 61.0  
##  1st Qu.:3.000   Class :character   Class :character   1st Qu.:273.0  
##  Median :3.500   Mode  :character   Mode  :character   Median :321.0  
##  Mean   :3.427                                         Mean   :317.7  
##  3rd Qu.:4.000                                         3rd Qu.:355.2  
##  Max.   :5.000                                         Max.   :586.0  
##   frequency_AW    frequency_CW    frequency_FW   bigram_frequency
##  Min.   :2.963   Min.   :2.232   Min.   :3.598   Min.   :1.240   
##  1st Qu.:3.187   1st Qu.:2.656   1st Qu.:3.827   1st Qu.:1.440   
##  Median :3.237   Median :2.726   Median :3.903   Median :1.500   
##  Mean   :3.234   Mean   :2.723   Mean   :3.902   Mean   :1.500   
##  3rd Qu.:3.284   3rd Qu.:2.789   3rd Qu.:3.975   3rd Qu.:1.559   
##  Max.   :3.489   Max.   :3.095   Max.   :4.235   Max.   :1.755
library(psych)
describe(cor_data) #get descriptive statistics
##                  vars   n   mean    sd median trimmed   mad   min    max  range
## Score               1 480   3.43  0.89   3.50    3.43  0.74  1.00   5.00   4.00
## pass.fail*          2 480   1.54  0.50   2.00    1.54  0.00  1.00   2.00   1.00
## Prompt*             3 480   1.50  0.50   1.50    1.50  0.74  1.00   2.00   1.00
## nwords              4 480 317.65 78.28 321.00  316.93 57.82 61.00 586.00 525.00
## frequency_AW        5 480   3.23  0.07   3.24    3.23  0.07  2.96   3.49   0.53
## frequency_CW        6 480   2.72  0.11   2.73    2.72  0.10  2.23   3.10   0.86
## frequency_FW        7 480   3.90  0.11   3.90    3.90  0.11  3.60   4.23   0.64
## bigram_frequency    8 480   1.50  0.09   1.50    1.50  0.09  1.24   1.75   0.52
##                   skew kurtosis   se
## Score             0.00    -0.62 0.04
## pass.fail*       -0.14    -1.98 0.02
## Prompt*           0.00    -2.00 0.02
## nwords            0.12     1.10 3.57
## frequency_AW     -0.10     0.54 0.00
## frequency_CW     -0.14     1.08 0.00
## frequency_FW      0.10    -0.21 0.01
## bigram_frequency  0.02    -0.11 0.00

Our variables of interest, average word frequency (frequency_CW) and number of words (nwords), are both continuous, so we meet the first assumption.
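
One way to confirm the variable types programmatically is to check which columns are numeric. The sketch below uses a small made-up data frame (`toy_data`, with invented values) that mimics three columns of `cor_data`. With the real data you would pass `cor_data` instead.

```r
# Hypothetical stand-in for cor_data; the values are invented for illustration
toy_data <- data.frame(
  pass.fail = c("pass", "fail", "pass"),
  nwords = c(321L, 273L, 355L),
  frequency_CW = c(2.73, 2.66, 2.79),
  stringsAsFactors = FALSE
)

# TRUE for numeric (integer or double) columns, FALSE otherwise
numeric_cols <- sapply(toy_data, is.numeric)
print(numeric_cols)

# Are both variables of interest stored as numbers?
vars_ok <- all(numeric_cols[c("frequency_CW", "nwords")])
print(vars_ok)
```

Note that `is.numeric()` only reports how a column is stored. Whether a numeric variable can reasonably be treated as continuous is still a judgment about what it measures.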

Next, we will check the assumption of linearity by plotting a Loess line (in red) and a line of best fit (in blue). The plot indicates that our data are roughly linear, because the red line closely follows the blue one. Note that these are the same data used in the correlation tutorial.

g1 <- ggplot(cor_data, aes(x = frequency_CW, y = nwords)) +
  geom_point() +
  geom_smooth(method = "loess", color = "red") + #Loess line: a locally weighted smoother that follows local trends in the data
  geom_smooth(method = "lm") + #line of best fit (linear model) based on the entire dataset; blue by default
  scale_color_viridis(discrete = TRUE) +
  theme_minimal()

#print(g1)