Often, particular constructs (such as linguistic proficiency) are multifaceted and are best measured using multiple measures. In such cases, we can use multiple regression to predict a dependent variable (such as linguistic proficiency) using multiple independent variables (such as features of lexical sophistication and syntactic complexity). This tutorial builds on the previous two tutorials on Correlation and Linear Regression, so be sure to check those out. In this tutorial, we are going to predict holistic writing quality scores (Score) using a number of linguistic features related to frequency and an index of syntactic complexity (mean length of clause).
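As a rough illustration of what "multiple independent variables" looks like in practice, the sketch below fits a multiple regression with base R's lm(). The data are simulated stand-ins, not the tutorial's dataset: the column names mirror two of the predictors used in this tutorial, but sim_data, sim_model, and every value and coefficient used to generate them are invented for illustration.

```r
# Simulated stand-in data (NOT the tutorial's dataset); all values are invented.
set.seed(42)
n <- 200
sim_data <- data.frame(
  frequency_CW = rnorm(n, mean = 2.7, sd = 0.1),
  MLC          = rnorm(n, mean = 9.7, sd = 2.0)
)

# Hypothetical data-generating process: Score depends on both predictors plus noise.
sim_data$Score <- 10 - 2.5 * sim_data$frequency_CW + 0.05 * sim_data$MLC +
  rnorm(n, sd = 0.5)

# Multiple regression: one dependent variable, several independent variables.
sim_model <- lm(Score ~ frequency_CW + MLC, data = sim_data)

# The fitted model has one intercept plus one slope per predictor.
coef(sim_model)
```

Each slope estimates the change in the dependent variable for a one-unit change in that predictor while the other predictors are held constant, which is what distinguishes this from running several single regressions.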
The assumptions for multiple regression are very similar to those of Pearson correlations and (single) linear regression. We do, however, add one important assumption: (non) multicollinearity.
The main assumptions of multiple regression are:
The variables must be continuous (see other tests for ordinal or categorical data)
Each predictor variable must have a linear relationship with the dependent variable
There are no outliers (or there are only minimal outliers in large samples)
The residuals must be normally distributed
The predictor variables are not strongly correlated with each other (strong correlation among predictors is referred to as multicollinearity)
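One simple way to screen for multicollinearity is to inspect the pairwise correlations among the predictors. The sketch below does this with base R's cor() on simulated stand-in predictors, not the tutorial's data. The object names (sim_predictors, cor_matrix, flagged) are hypothetical, and the 0.7 cutoff is an assumed rule of thumb rather than a value taken from this tutorial.

```r
# Simulated stand-in predictors (NOT the tutorial's dataset); all values are invented.
set.seed(1)
n <- 200
frequency_CW <- rnorm(n, mean = 2.7, sd = 0.1)
sim_predictors <- data.frame(
  frequency_CW = frequency_CW,
  # Built to track frequency_CW closely, to show what a problem pair looks like.
  frequency_AW = 0.5 + frequency_CW + rnorm(n, sd = 0.02),
  # Generated independently of the other two predictors.
  MLC          = rnorm(n, mean = 9.7, sd = 2.0)
)

# Pairwise Pearson correlations among the predictors only (no dependent variable).
cor_matrix <- cor(sim_predictors)
round(cor_matrix, 2)

# Flag each predictor pair whose absolute correlation exceeds the threshold.
# upper.tri() keeps one copy of each pair and skips the diagonal.
threshold <- 0.7  # assumed rule-of-thumb cutoff
flagged <- which(abs(cor_matrix) > threshold & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_matrix)[flagged[, "row"]],
  var2 = colnames(cor_matrix)[flagged[, "col"]],
  r    = cor_matrix[flagged]
)
```

When a pair is flagged, a common response is to keep only one of the two predictors, typically the one more strongly related to the dependent variable.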
First, we will load our data and then make a series of scatterplots. Note that “Score” represents holistic writing quality scores that range from 1 to 5 in 0.5-point increments. For the purposes of this tutorial, we will treat “Score” as a continuous variable. Also note that we will use geom_jitter() instead of geom_point(): because the scores fall on only a few discrete values, adding a small amount of random noise to each point reduces overlap and makes it easier to judge linearity.
mr_data <- read.csv("data/multiple_regression_sample.csv", header = TRUE)
summary(mr_data)
##      Score        frequency_AW    frequency_CW    frequency_FW
##  Min.   :1.000   Min.   :2.963   Min.   :2.232   Min.   :3.598
##  1st Qu.:3.000   1st Qu.:3.187   1st Qu.:2.656   1st Qu.:3.827
##  Median :3.500   Median :3.237   Median :2.726   Median :3.903
##  Mean   :3.427   Mean   :3.234   Mean   :2.723   Mean   :3.902
##  3rd Qu.:4.000   3rd Qu.:3.284   3rd Qu.:2.789   3rd Qu.:3.975
##  Max.   :5.000   Max.   :3.489   Max.   :3.095   Max.   :4.235
##  bigram_frequency      MLC
##  Min.   :1.240    Min.   : 5.769
##  1st Qu.:1.440    1st Qu.: 8.357
##  Median :1.500    Median : 9.584
##  Mean   :1.500    Mean   : 9.730
##  3rd Qu.:1.559    3rd Qu.:10.794
##  Max.   :1.755    Max.   :20.381
library(ggplot2)

ggplot(mr_data, aes(x = frequency_AW, y = Score)) +
  geom_jitter() +
  geom_smooth(method = "loess", color = "red") + # locally fitted line of best fit (similar to a moving average)
  geom_smooth(method = "lm") # line of best fit based on the entire dataset
ggplot(mr_data, aes(x = frequency_CW, y = Score)) +
  geom_jitter() +
  geom_smooth(method = "loess", color = "red") + # locally fitted line of best fit (similar to a moving average)
  geom_smooth(method = "lm") # line of best fit based on the entire dataset