Often, particular constructs (such as linguistic proficiency) are multifaceted and are best measured using multiple measures. In such cases, we can use multiple regression to predict a dependent variable (such as linguistic proficiency) using multiple independent variables (such as features of lexical sophistication and syntactic complexity). This tutorial builds on the previous two tutorials on Correlation and Linear Regression, so be sure to check those out. In this tutorial, we are going to predict holistic writing quality scores (Score) using a number of linguistic features related to frequency and an index of syntactic complexity (mean length of clause).
The assumptions for multiple regression are very similar to those of Pearson correlations and (single) linear regression. We do, however, add one important assumption: (non) multicollinearity.
The main assumptions of (single) linear regression are:
The variables must be continuous (see other tests for ordinal or categorical data)
The variables must have a linear relationship with the dependent variable
There are no outliers (or there are only minimal outliers in large samples)
The residuals must be normally distributed
The predictor variables are not strongly correlated with each other (this is referred to as multicollinearity)
First, we will load our data, then we will make a series of scatterplots. Note that “Score” represents holistic writing quality scores that range from 1-5 in .5 point increments. For the purposes of this tutorial, we will consider “Score” to be a continuous variable. Also note that we will use geom_jitter() instead of geom_point() t