To fully understand the information on this page, be sure to read the previous tutorial on correlation.
Regression is used for value prediction. For example, let's imagine that we have a set of writing proficiency scores (the dependent variable) and corresponding mean word frequency values (a measure of lexical sophistication; this would be the independent variable). Regression allows us to predict the writing proficiency scores from the mean word frequency values. The stronger the relationship between the two variables, the more accurate the predictions will be.
Conceptually (and mathematically), regression is related to correlation. The main difference is that the results of a regression include more information, namely the characteristics of the line of best fit through the data.
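To make the link between correlation and the line of best fit concrete, here is a minimal sketch using simulated data (the variables x and y and their values are invented for illustration and are not the tutorial's dataset). It fits a simple linear regression with lm() and checks two standard identities: the slope equals the correlation multiplied by the ratio of the standard deviations, and R-squared equals the squared correlation.

```r
# Simulated (hypothetical) data, NOT the tutorial's dataset
set.seed(1)
x <- rnorm(100, mean = 3, sd = 0.5)      # stand-in for a predictor
y <- 2 + 1.5 * x + rnorm(100, sd = 1)    # stand-in for an outcome

# Fit a simple linear regression: y predicted from x
fit <- lm(y ~ x)
intercept <- unname(coef(fit)["(Intercept)"])
slope <- unname(coef(fit)["x"])

# Slope of the line of best fit recovered from the correlation
slope_from_r <- cor(x, y) * sd(y) / sd(x)
slope_matches <- isTRUE(all.equal(slope, slope_from_r))

# R-squared from the regression equals the squared Pearson correlation
r2_matches <- isTRUE(all.equal(summary(fit)$r.squared, cor(x, y)^2))

print(c(intercept = intercept, slope = slope))
print(c(slope_matches = slope_matches, r2_matches = r2_matches))
```

Both checks should come out TRUE. In other words, a correlation tells us how strongly two variables are related, while the regression adds the intercept and slope needed to turn that relationship into predictions.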
In this tutorial, we are going to use average word frequency scores (frequency_CW) to predict nwords (the number of words in each essay, which we are using as a [flawed] proxy for proficiency) in a corpus of learner argumentative essays.
The assumptions for (single) linear regression are almost the same as for correlations.
The main assumptions of (single) linear regression are:
The variables must be continuous (see other tests for ordinal or categorical data)
The variables must have a linear relationship
There are no outliers (or there are only minimal outliers in large samples)
The residuals (i.e., the prediction errors) are normally distributed
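As a rough illustration of how the residual assumption might be checked, the sketch below fits a model to simulated data (again invented for illustration, not the tutorial's dataset) and inspects the residuals with a Shapiro-Wilk test and a normal Q-Q plot. These are only one possible set of checks; a scatterplot of the two variables is a common way to look at linearity and outliers.

```r
# Simulated (hypothetical) data, NOT the tutorial's dataset
set.seed(42)
x <- runif(200, min = 0, max = 10)
y <- 1 + 0.8 * x + rnorm(200)

fit <- lm(y ~ x)
res <- residuals(fit)        # prediction errors: observed minus fitted values

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
print(sw)

# Visual check: points should fall close to the reference line
qqnorm(res)
qqline(res)
```

With real data, the same calls would be applied to the residuals of the fitted model rather than to simulated values.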
First, we will load our data:
library(ggplot2) # load ggplot2
cor_data <- read.csv("data/correlation_sample.csv", header = TRUE) # read the spreadsheet "correlation_sample.csv" into R as a data frame
summary(cor_data) # get descriptive statistics for the dataset
##      Score        pass.fail            Prompt              nwords
##  Min.   :1.000   Length:480         Length:480         Min.   : 61.0
##  1st Qu.:3.000   Class :character   Class :character   1st Qu.:273.0
##  Median :3.500   Mode  :character   Mode  :character   Median :321.0
##  Mean   :3.427