
Regression: Predicting dependent variable values based on independent variable values

To fully understand the information on this page, be sure to read the previous tutorial on correlation.

Regression is used for value prediction. For example, let's imagine that we have a set of writing proficiency scores (the dependent variable) and corresponding mean word frequency values (a measure of lexical sophistication; this would be the independent variable). Regression allows us to predict the writing proficiency scores from the mean frequency values. The stronger the relationship between the two variables, the more accurate the predictions will be.
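As a minimal sketch of what prediction looks like in R, the example below fits a regression with lm() and asks for a predicted score with predict(). The data are invented for illustration: the names toy_data, mean_freq, and proficiency are hypothetical and are not part of the tutorial dataset.

```r
# Invented toy data for illustration only (NOT the tutorial dataset)
toy_data <- data.frame(
  mean_freq   = c(2.6, 2.8, 2.9, 3.0, 3.1, 3.3, 3.4, 3.5),
  proficiency = c(4.5, 4.0, 4.0, 3.5, 3.5, 3.0, 2.5, 2.5)
)

# Fit a linear regression: proficiency predicted from mean_freq
toy_model <- lm(proficiency ~ mean_freq, data = toy_data)

# Predict the proficiency score for a new essay with a mean frequency of 3.2
new_essay <- data.frame(mean_freq = 3.2)
predicted_score <- predict(toy_model, newdata = new_essay)
print(predicted_score)
```

The column name in new_essay must match the predictor name used in the model formula; otherwise predict() cannot find the predictor values.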

Conceptually (and mathematically), regression is related to correlation. The main difference is that the results of a regression include more information, namely the characteristics of the line of best fit through the data.
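The link between the two can be checked numerically. In a regression with one predictor, the slope of the line of best fit equals the correlation coefficient multiplied by sd(y) / sd(x), and R-squared equals the squared correlation. The sketch below uses simulated data (x and y are hypothetical variables, not the tutorial data).

```r
set.seed(42)                          # make the simulated data reproducible
x <- rnorm(100, mean = 3, sd = 0.5)   # hypothetical predictor
y <- 10 - 2 * x + rnorm(100)          # linear relationship plus random noise

r   <- cor(x, y)                      # Pearson correlation
fit <- lm(y ~ x)                      # regression with one predictor

# Slope and intercept as reported by lm()
slope_from_lm     <- unname(coef(fit)["x"])
intercept_from_lm <- unname(coef(fit)["(Intercept)"])

# The same quantities rebuilt from the correlation and descriptive statistics
slope_from_r         <- r * sd(y) / sd(x)
intercept_from_means <- mean(y) - slope_from_r * mean(x)

print(all.equal(slope_from_lm, slope_from_r))
print(all.equal(intercept_from_lm, intercept_from_means))
print(all.equal(summary(fit)$r.squared, r^2))
```

The regression output therefore contains everything the correlation does, plus the slope and intercept needed to make predictions.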

In this tutorial, we are going to use average word frequency scores (frequency_CW) to predict essay length in words (nwords), which we are using as a (flawed) proxy for proficiency, in a corpus of learner argumentative essays.


The assumptions for simple (single-predictor) linear regression are almost the same as those for correlation. The main assumptions are:

  • The variables must be continuous (see other tests for ordinal or categorical data)

  • The variables must have a linear relationship

  • There are no outliers (or there are only minimal outliers in large samples)

  • The residuals (i.e., the prediction errors) are normally distributed
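To make the last assumption concrete: a residual is the observed value minus the value the fitted line predicts for that case. The sketch below uses simulated data (the data frame sim and its columns x and y are hypothetical) to compute residuals by hand, confirm that they match residuals(), and run a Shapiro-Wilk test as one possible normality check. It is an illustration under those assumptions, not necessarily the check used with the tutorial data.

```r
set.seed(1)                                   # reproducible simulated data
sim <- data.frame(x = runif(80, min = 1, max = 5))
sim$y <- 2 + 1.5 * sim$x + rnorm(80)          # linear trend with normal errors

sim_fit <- lm(y ~ x, data = sim)

# Residual = observed value - predicted (fitted) value
manual_resid <- sim$y - fitted(sim_fit)
same_resid <- all.equal(unname(manual_resid), unname(residuals(sim_fit)))
print(same_resid)

# One possible normality check on the residuals (Shapiro-Wilk test)
normality_check <- shapiro.test(residuals(sim_fit))
print(normality_check$p.value)   # a small p-value suggests non-normal residuals
```

A histogram or Q-Q plot of residuals(sim_fit) is a common visual complement to a formal test.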

Checking assumptions

First, we will load our data:

library(ggplot2) #load ggplot2
cor_data <- read.csv("data/correlation_sample.csv", header = TRUE) #read the file "correlation_sample.csv" into R as a data frame
summary(cor_data) #get descriptive statistics for the dataset
##      Score            Prompt              nwords     
##  Min.   :1.000   Length:480         Length:480         Min.   : 61.0  
##  1st Qu.:3.000   Class :character   Class :character   1st Qu.:273.0  
##  Median :3.500   Mode  :character   Mode  :character   Median :321.0  
##  Mean   :3.427