Often, particular constructs (such as linguistic proficiency) are multifaceted and are best captured using multiple measures. In such cases, we can use multiple regression to predict a dependent variable (such as linguistic proficiency) from multiple independent variables (such as features of lexical sophistication and syntactic complexity). This tutorial builds on the previous two tutorials on Correlation and Linear Regression, so be sure to check those out. In this tutorial, we are going to predict holistic writing quality scores (Score) using a number of linguistic features related to frequency and an index of syntactic complexity (mean length of clause).
The assumptions for multiple regression are very similar to those of Pearson correlations and (single) linear regression. We do, however, add one important assumption: the absence of multicollinearity.
The main assumptions of multiple regression are:
The variables must be continuous (see other tests for ordinal or categorical data)
The predictor variables must have a linear relationship with the dependent variable
There are no outliers (or there are only minimal outliers in large samples)
The residuals must be normally distributed
The predictor variables are not strongly correlated with each other (strong intercorrelation among predictors is referred to as multicollinearity)
First, we will load our data, then we will make a series of scatterplots. Note that “Score” represents holistic writing quality scores that range from 1 to 5 in .5-point increments. For the purposes of this tutorial, we will consider “Score” to be a continuous variable. Also note that we will use geom_jitter() instead of geom_point() to reduce overplotting (Score takes only a limited set of values), which makes it easier to judge the linearity of each relationship.
mr_data <- read.csv("data/multiple_regression_sample.csv", header = TRUE)
summary(mr_data)
## Score frequency_AW frequency_CW frequency_FW
## Min. :1.000 Min. :2.963 Min. :2.232 Min. :3.598
## 1st Qu.:3.000 1st Qu.:3.187 1st Qu.:2.656 1st Qu.:3.827
## Median :3.500 Median :3.237 Median :2.726 Median :3.903
## Mean :3.427 Mean :3.234 Mean :2.723 Mean :3.902
## 3rd Qu.:4.000 3rd Qu.:3.284 3rd Qu.:2.789 3rd Qu.:3.975
## Max. :5.000 Max. :3.489 Max. :3.095 Max. :4.235
## bigram_frequency MLC
## Min. :1.240 Min. : 5.769
## 1st Qu.:1.440 1st Qu.: 8.357
## Median :1.500 Median : 9.584
## Mean :1.500 Mean : 9.730
## 3rd Qu.:1.559 3rd Qu.:10.794
## Max. :1.755 Max. :20.381
frequency_AW
library(ggplot2)
library(viridis) #color-friendly palettes
g1 <- ggplot(mr_data, aes(x = frequency_AW, y = Score)) +
  geom_jitter() +
  geom_smooth(method = "loess", color = "red") + #loess: a locally weighted line of best fit (fit to local subsets of the data)
  geom_smooth(method = "lm") + #lm: a linear line of best fit based on the entire dataset
  theme_minimal()
#print(g1)
frequency_CW
g2 <- ggplot(mr_data, aes(x = frequency_CW, y = Score)) +
  geom_jitter() +
  geom_smooth(method = "loess", color = "red") + #loess: a locally weighted line of best fit (fit to local subsets of the data)
  geom_smooth(method = "lm") + #lm: a linear line of best fit based on the entire dataset
  theme_minimal()
#print(g2)
frequency_FW
g3 <- ggplot(mr_data, aes(x = frequency_FW, y = Score)) +
  geom_jitter() +
  geom_smooth(method = "loess", color = "red") + #loess: a locally weighted line of best fit (fit to local subsets of the data)
  geom_smooth(method = "lm") + #lm: a linear line of best fit based on the entire dataset
  theme_minimal()
#print(g3)
bigram_frequency
g4 <- ggplot(mr_data, aes(x = bigram_frequency, y = Score)) +
  geom_jitter() +
  geom_smooth(method = "loess", color = "red") + #loess: a locally weighted line of best fit (fit to local subsets of the data)
  geom_smooth(method = "lm") + #lm: a linear line of best fit based on the entire dataset
  theme_minimal()
#print(g4)
MLC
g5 <- ggplot(mr_data, aes(x = MLC, y = Score)) +
  geom_jitter() +
  geom_smooth(method = "loess", color = "red") + #loess: a locally weighted line of best fit (fit to local subsets of the data)
  geom_smooth(method = "lm") + #lm: a linear line of best fit based on the entire dataset
  theme_minimal()
#print(g5)
For the purposes of this tutorial, we will suggest that our data roughly meet Assumptions 1-3, though we might argue for a few violations (e.g., there seem to be outliers in the MLC scatterplot).
We will check the assumption of normally distributed residuals with QQ plots after our model is created. We can also screen for multicollinearity before fitting the model, as sketched below.
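One minimal way to screen for multicollinearity is to examine the correlations among the candidate predictors (after a model is fit, variance inflation factors such as car::vif() are another common check). The sketch below uses the predictor columns in this tutorial's dataset; the threshold mentioned in the comment is only a rough rule of thumb, not a fixed cutoff.
#screen for multicollinearity by correlating the candidate predictors with each other
predictor_cols <- c("frequency_AW", "frequency_CW", "frequency_FW", "bigram_frequency", "MLC")
round(cor(mr_data[, predictor_cols]), 3) #very strong correlations (e.g., > .70) may signal multicollinearity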
After checking the assumptions, it is time to run our analysis. Again, R makes this very easy. We will use the lm() function, which is the same function we used for single linear regression. In this case, however, we will add multiple predictor variables (frequency_CW, MLC, and frequency_FW). Note that in a “normal” (sequential-entry) multiple regression, the order of the predictors matters: each variable is added to the model sequentially. We can often sequence variables based on theoretical considerations. In this case, we have none, so the variables are entered in order of the strength of their relationship with Score.
#define model1 as a regression model using frequency_CW, MLC, and frequency_FW to predict Score
model1 <- lm(Score ~ frequency_CW + MLC + frequency_FW, data = mr_data)
summary(model1)
##
## Call:
## lm(formula = Score ~ frequency_CW + MLC + frequency_FW, data = mr_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.40370 -0.51357 -0.07077 0.60695 1.92361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.84766 1.90214 3.074 0.00223 **
## frequency_CW -2.31913 0.42504 -5.456 7.84e-08 ***
## MLC 0.01449 0.02599 0.558 0.57741
## frequency_FW 0.96171 0.39058 2.462 0.01416 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8332 on 476 degrees of freedom
## Multiple R-squared: 0.1233, Adjusted R-squared: 0.1178
## F-statistic: 22.32 on 3 and 476 DF, p-value: 1.55e-13
The summary() output indicates that 12.3% of the variance in essay scores can be explained by the model (r-squared = .1233, adjusted r-squared = .1178) using three variables. However, our Coefficients table also indicates that MLC does not make a statistically significant contribution to the model (p = .577). In this case, we can choose to respecify the model without MLC, which will simplify our model (and may make it slightly more accurate). We do this in model2 below.
#define model2 as a regression model using only frequency_CW and frequency_FW to predict Score
model2 <- lm(Score ~ frequency_CW + frequency_FW, data = mr_data)
summary(model2)
##
## Call:
## lm(formula = Score ~ frequency_CW + frequency_FW, data = mr_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.40074 -0.50240 -0.06091 0.61036 1.94582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8934 1.8990 3.103 0.00203 **
## frequency_CW -2.4295 0.3759 -6.463 2.53e-10 ***
## frequency_FW 1.0631 0.3454 3.078 0.00220 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8326 on 477 degrees of freedom
## Multiple R-squared: 0.1227, Adjusted R-squared: 0.1191
## F-statistic: 33.37 on 2 and 477 DF, p-value: 2.729e-14
As we can see, removing MLC left the multiple r-squared essentially unchanged (.1233 to .1227) while slightly improving the adjusted r-squared (.1178 to .1191), so the simpler model performs at least as well with one fewer predictor.
To predict Score, we can use the following formula: Score = (frequency_CW * -2.4295) + (frequency_FW * 1.0631) + 5.8934
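As a quick sanity check, we can apply this formula by hand to a hypothetical essay and compare the result to R's predict() function. The frequency values below are invented purely for illustration; the two numbers should match (within rounding, since the printed coefficients are rounded).
#a hypothetical essay with made-up predictor values (for illustration only)
new_essay <- data.frame(frequency_CW = 2.70, frequency_FW = 3.90)
#predicted score computed by hand from the regression equation above
(2.70 * -2.4295) + (3.90 * 1.0631) + 5.8934
#the same prediction using predict()
predict(model2, newdata = new_essay)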
In order to plot our results, we have to add the scores predicted by the model to our dataframe. This is demonstrated below.
#To add a column to our dataframe, we just define it
mr_data$pred_Score <- predict(model2)
#We can use summary() to make sure it worked correctly
summary(mr_data)
## Score frequency_AW frequency_CW frequency_FW
## Min. :1.000 Min. :2.963 Min. :2.232 Min. :3.598
## 1st Qu.:3.000 1st Qu.:3.187 1st Qu.:2.656 1st Qu.:3.827
## Median :3.500 Median :3.237 Median :2.726 Median :3.903
## Mean :3.427 Mean :3.234 Mean :2.723 Mean :3.902
## 3rd Qu.:4.000 3rd Qu.:3.284 3rd Qu.:2.789 3rd Qu.:3.975
## Max. :5.000 Max. :3.489 Max. :3.095 Max. :4.235
## bigram_frequency MLC pred_Score
## Min. :1.240 Min. : 5.769 Min. :2.371
## 1st Qu.:1.440 1st Qu.: 8.357 1st Qu.:3.218
## Median :1.500 Median : 9.584 Median :3.432
## Mean :1.500 Mean : 9.730 Mean :3.427
## 3rd Qu.:1.559 3rd Qu.:10.794 3rd Qu.:3.622
## Max. :1.755 Max. :20.381 Max. :4.738
Now, we can easily plot the relationship between Score and the values predicted by the model (pred_Score). As we can see, our model includes quite a lot of error!
library(ggplot2)
g6 <- ggplot(mr_data, aes(y=pred_Score, x = Score)) +
geom_jitter() +
geom_smooth(method="lm") +
theme_minimal()
#print(g6)
<img src="plot6.png" alt="Scatter plot showing black points and a smooth line running through the data points. The line shows a line of best fit. The gray shaded area around the line indicates the confidence interval for that regression.">
If the residuals were perfectly normally distributed, all of the data points in the QQ plot below would fall on the line. As we can see, we have some outliers (particularly on the right side of the graph), but the distribution is roughly normal.
library(car)
qqPlot(model1)
If we have no theoretical basis for presuming a particular order of importance for our variables, we can use a stepwise model to automatically select the optimal number and order of variables. Note that this is common in educational data mining and other related fields, but is sometimes controversial. Basically, it is important to only include variables that you have a good reason to believe (e.g., theoretically or based on previous evidence) are appropriate predictors of the dependent variable.
Below, we will conduct a stepwise regression using the MASS package (don’t forget to install it!). Note that the first argument of stepAIC() is a model (e.g., lm(Score ~ frequency_CW + MLC + frequency_FW, data = mr_data)). Here, we are using our first model (model1).
#install.packages("MASS") #if not already installed
library(MASS)
stepwise_model <- stepAIC(model1, direction="both")
## Start: AIC=-171.18
## Score ~ frequency_CW + MLC + frequency_FW
##
## Df Sum of Sq RSS AIC
## - MLC 1 0.2158 330.68 -172.86
## <none> 330.47 -171.18
## - frequency_FW 1 4.2091 334.67 -167.10
## - frequency_CW 1 20.6682 351.13 -144.06
##
## Step: AIC=-172.86
## Score ~ frequency_CW + frequency_FW
##
## Df Sum of Sq RSS AIC
## <none> 330.68 -172.86
## + MLC 1 0.2158 330.47 -171.18
## - frequency_FW 1 6.5678 337.25 -165.42
## - frequency_CW 1 28.9601 359.64 -134.57
summary(stepwise_model)
##
## Call:
## lm(formula = Score ~ frequency_CW + frequency_FW, data = mr_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.40074 -0.50240 -0.06091 0.61036 1.94582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8934 1.8990 3.103 0.00203 **
## frequency_CW -2.4295 0.3759 -6.463 2.53e-10 ***
## frequency_FW 1.0631 0.3454 3.078 0.00220 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8326 on 477 degrees of freedom
## Multiple R-squared: 0.1227, Adjusted R-squared: 0.1191
## F-statistic: 33.37 on 2 and 477 DF, p-value: 2.729e-14
As we see, the stepwise model dropped MLC, just like we did, and ended up with the same model as we saw in model2.
When predicting scores, it is often important to ensure that our model is generalizable beyond the current dataset. This can be accomplished using training and test sets and/or through cross-validation.
This procedure involves randomly dividing a dataset into two pieces (often via a two-thirds/one-third split). First, a model is created using the larger (training) set; it is then tested on the new data in the smaller (test) set. See below for an example.
Note that this uses the caret package, so don’t forget to install it.
#install.packages("caret") #if not already installed
library(caret)
set.seed(1) #do this to ensure reproducibility - otherwise you will get slightly different results each time
#create random sample based on Score values, training set will be 67% of data
trainIndex = createDataPartition(mr_data$Score, p = .67, list = FALSE, times = 1)
#define training data
mr_dataTrain <- mr_data[trainIndex,]
#define test data
mr_dataTest <- mr_data[-trainIndex,]
#create model using the training data:
train_model <- lm(Score ~ frequency_CW + frequency_FW, data = mr_dataTrain)
#get training set results
summary(train_model)
##
## Call:
## lm(formula = Score ~ frequency_CW + frequency_FW, data = mr_dataTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07644 -0.50199 -0.07987 0.58978 1.87751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3241 2.3813 1.396 0.16371
## frequency_CW -1.9190 0.4620 -4.154 4.2e-05 ***
## frequency_FW 1.3688 0.4311 3.175 0.00164 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8348 on 320 degrees of freedom
## Multiple R-squared: 0.1076, Adjusted R-squared: 0.102
## F-statistic: 19.28 on 2 and 320 DF, p-value: 1.239e-08
Using this smaller training set (67% of the data), we see that the model is slightly less predictive (r-squared = .108) than our original model2 (which used the whole dataset; r-squared = .123).
Now, we can apply this model to the test set, and evaluate its performance.
#predict scores in test data and save results as the variable "test_pred"
mr_dataTest$test_pred <- predict(train_model,mr_dataTest)
#determine the relationship between the actual and predicted scores in the test data
cor.test(mr_dataTest$Score,mr_dataTest$test_pred)
##
## Pearson's product-moment correlation
##
## data: mr_dataTest$Score and mr_dataTest$test_pred
## t = 5.1419, df = 155, p-value = 8.1e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2394076 0.5079940
## sample estimates:
## cor
## 0.3817307
#get R-squared value
cor(mr_dataTest$Score,mr_dataTest$test_pred)^2
## [1] 0.1457183
As we see from the results, the model from the training data actually worked slightly better on the test set (r-squared = .146) than on the training set (r-squared = .108)! This provides some evidence that our model can be generalized to new (but similar) datasets.
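Beyond the correlation between actual and predicted scores, we may also want to know how large the prediction errors are in the units of Score. Below is a minimal sketch of two common error metrics (RMSE and MAE), computed on the test-set predictions created above.
#root mean squared error of the test-set predictions (in Score units)
sqrt(mean((mr_dataTest$Score - mr_dataTest$test_pred)^2))
#mean absolute error of the test-set predictions (in Score units)
mean(abs(mr_dataTest$Score - mr_dataTest$test_pred))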
Cross-validation involves using multiple training and test sets, then averaging model performance across all test sets. Commonly, the data are divided into ten segments (though this can vary widely!). In the case of ten segments, the data from nine segments are used to train the model, which is then tested on the tenth segment. This process is repeated until every segment has been used as the test set, at which point the results from each test set are averaged. Cross-validation is popular because it allows for larger training sets (and avoids some of the pitfalls of a single random split).
#library(caret)
set.seed(1) #do this to ensure reproducibility - otherwise you will get slightly different results each time
#create cross-validation parameters
TenFoldCV <- trainControl(method = "repeatedcv", number = 10,repeats = 10)
#assign formula from stepwise model (this was the model defined earlier)
step_mod<-formula(stepwise_model)
#create cross-validated model
mTenFold<-train(step_mod,data=mr_data,method='lm',trControl=TenFoldCV)
#get results
mTenFold$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 0.8319323 0.1358885 0.6769704 0.06143786 0.09085329 0.050142
Here, we are most interested in the Rsquared value, which indicates that the model explained 13.6% of the variance in Score (Rsquared = .136). This is not markedly different from the model built on the full dataset (model2), which indicates that our original model was stable across the dataset.
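If we want to see how performance varied across individual folds (rather than only the averages reported above), caret stores the resample-level results in the fitted object. A brief sketch, assuming the default trainControl() settings used above:
#performance for each resample (10 folds x 10 repeats = 100 rows)
head(mTenFold$resample)
#spread of Rsquared across the resamples
summary(mTenFold$resample$Rsquared)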