To fully understand the information on this page, be sure to read the previous tutorial on correlation.

Regression is used for value prediction. For example, let's imagine that we have a set of writing proficiency scores (the dependent variable) and corresponding mean word frequency values, which are a measure of lexical sophistication (the independent variable). Regression allows us to predict the writing proficiency scores based on the mean frequency values. The stronger the relationship between the two variables, the better (more accurate) the predictions will be.
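As a quick sketch of this idea, the toy data below (hypothetical values, not from our corpus) fit a regression of proficiency on mean word frequency and then predict a score for a new frequency value:

```
#hypothetical toy data: mean word frequency (predictor) and proficiency score (outcome)
freq <- c(2.5, 2.6, 2.7, 2.8, 2.9)
prof <- c(4.8, 4.1, 3.6, 3.0, 2.4)

toy_model <- lm(prof ~ freq) #fit a simple linear regression
predict(toy_model, data.frame(freq = 2.65)) #predict a proficiency score for a new frequency value
```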

Conceptually (and mathematically), regression is related to correlation. The main difference is that the results of a regression include more information, namely the characteristics of the line of best fit through the data.
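The mathematical connection can be seen directly: the slope of the regression line is the correlation coefficient scaled by the ratio of the standard deviations. The simulated data below (an illustration, not our corpus data) demonstrate this:

```
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100) #simulated linear relationship with noise

r <- cor(x, y) #Pearson correlation
b <- unname(coef(lm(y ~ x))[2]) #regression slope

all.equal(b, r * sd(y) / sd(x)) #TRUE: the slope is r scaled by the SD ratio
```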

In this tutorial, we are going to use average word frequency scores (frequency_CW) to predict scores for nwords (which we are using as a [flawed] proxy for proficiency) in a corpus of learner argumentative essays.

The assumptions for (single) linear regression are almost the same as those for correlation. The main assumptions are:

The variables must be continuous (see other tests for ordinal or categorical data)

The variables must have a linear relationship

There are no outliers (or there are only minimal outliers in large samples)

The residuals (i.e., the prediction errors) are normally distributed
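The residuals assumption can be checked after fitting the model. The sketch below uses R's built-in cars dataset as a stand-in for our corpus data; with the real data, the model would be something like lm(nwords ~ frequency_CW, data = cor_data):

```
m <- lm(dist ~ speed, data = cars) #built-in dataset stands in for cor_data here
res <- residuals(m) #the prediction errors

qqnorm(res) #points should fall near the line if residuals are normal
qqline(res)
shapiro.test(res) #formal normality test; p > .05 is consistent with normality
```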

First, we will load our data:

```
library(ggplot2) #load ggplot2
library(viridis) #color-friendly palettes
cor_data <- read.csv("data/correlation_sample.csv", header = TRUE) #read the spreadsheet "correlation_sample.csv" into R as a dataframe
summary(cor_data) #get descriptive statistics for the dataset
```

```
## Score pass.fail Prompt nwords
## Min. :1.000 Length:480 Length:480 Min. : 61.0
## 1st Qu.:3.000 Class :character Class :character 1st Qu.:273.0
## Median :3.500 Mode :character Mode :character Median :321.0
## Mean :3.427 Mean :317.7
## 3rd Qu.:4.000 3rd Qu.:355.2
## Max. :5.000 Max. :586.0
## frequency_AW frequency_CW frequency_FW bigram_frequency
## Min. :2.963 Min. :2.232 Min. :3.598 Min. :1.240
## 1st Qu.:3.187 1st Qu.:2.656 1st Qu.:3.827 1st Qu.:1.440
## Median :3.237 Median :2.726 Median :3.903 Median :1.500
## Mean :3.234 Mean :2.723 Mean :3.902 Mean :1.500
## 3rd Qu.:3.284 3rd Qu.:2.789 3rd Qu.:3.975 3rd Qu.:1.559
## Max. :3.489 Max. :3.095 Max. :4.235 Max. :1.755
```

```
library(psych) #load psych
describe(cor_data) #get more detailed descriptive statistics
```

```
## vars n mean sd median trimmed mad min max range
## Score 1 480 3.43 0.89 3.50 3.43 0.74 1.00 5.00 4.00
## pass.fail* 2 480 1.54 0.50 2.00 1.54 0.00 1.00 2.00 1.00
## Prompt* 3 480 1.50 0.50 1.50 1.50 0.74 1.00 2.00 1.00
## nwords 4 480 317.65 78.28 321.00 316.93 57.82 61.00 586.00 525.00
## frequency_AW 5 480 3.23 0.07 3.24 3.23 0.07 2.96 3.49 0.53
## frequency_CW 6 480 2.72 0.11 2.73 2.72 0.10 2.23 3.10 0.86
## frequency_FW 7 480 3.90 0.11 3.90 3.90 0.11 3.60 4.23 0.64
## bigram_frequency 8 480 1.50 0.09 1.50 1.50 0.09 1.24 1.75 0.52
## skew kurtosis se
## Score 0.00 -0.62 0.04
## pass.fail* -0.14 -1.98 0.02
## Prompt* 0.00 -2.00 0.02
## nwords 0.12 1.10 3.57
## frequency_AW -0.10 0.54 0.00
## frequency_CW -0.14 1.08 0.00
## frequency_FW 0.10 -0.21 0.01
## bigram_frequency 0.02 -0.11 0.00
```

Our variables of interest, average word frequency (frequency_CW) and number of words (nwords), are both continuous, so we meet the first assumption.

Next, we will test the assumption of linearity using a loess line (in red) and a line of best fit (in blue). These indicate that our data are roughly linear (because the red loess line largely mirrors the blue linear one). Note that this is the same data used in the correlation tutorial.

```
g1 <- ggplot(cor_data, aes(x = frequency_CW, y = nwords)) +
  geom_point() +
  geom_smooth(method = "loess", color = "red") + #line of best fit based on a moving average
  geom_smooth(method = "lm") + #line of best fit based on the entire dataset
  scale_color_viridis(discrete = TRUE) +
  theme_minimal()
print(g1) #display the plot
```
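The outlier assumption can also be checked after fitting the model. One common heuristic (among several) flags points whose Cook's distance exceeds 4/n. The sketch below again uses the built-in cars dataset as a stand-in for cor_data:

```
m <- lm(dist ~ speed, data = cars) #cars stands in for cor_data here
cooks <- cooks.distance(m) #influence of each observation on the fitted line

which(cooks > 4 / nrow(cars)) #flag potentially influential points (4/n rule of thumb)
```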