
Multiple Regression: Using multiple independent variables to predict dependent variable values

Often, particular constructs (such as linguistic proficiency) are multifaceted and are best measured using multiple measures. In such cases, we can use multiple regression to predict a dependent variable (such as linguistic proficiency) using multiple independent variables (such as features of lexical sophistication and syntactic complexity). This tutorial builds on the previous two tutorials on Correlation and Linear Regression, so be sure to check those out. In this tutorial, we are going to predict holistic writing quality scores (Score) using a number of linguistic features related to frequency and an index of syntactic complexity (mean length of clause).

Assumptions

The assumptions for multiple regression are very similar to those of Pearson correlations and (single) linear regression. We do, however, add one important assumption: (non) multicollinearity.

The main assumptions of multiple regression are:

  • The variables must be continuous (see other tests for ordinal or categorical data)

  • The variables must have a linear relationship with the dependent variable

  • There are no outliers (or there are only minimal outliers in large samples)

  • The residuals must be normally distributed

  • The predictor variables are not strongly correlated with each other (strong correlation among predictors is referred to as multicollinearity)

Assumptions 1-3: The variables must be continuous, there must be a linear relationship between each independent (predictor) variable and the dependent variable, and there are no (or only minimal) outliers.

First, we will load our data, then we will make a series of scatterplots. Note that “Score” represents holistic writing quality scores that range from 1-5 in .5 point increments. For the purposes of this tutorial, we will consider “Score” to be a continuous variable. Also note that we will use geom_jitter() instead of geom_point() to help us visualize the linearity.

mr_data <- read.csv("data/multiple_regression_sample.csv", header = TRUE)
summary(mr_data)
##      Score        frequency_AW    frequency_CW    frequency_FW  
##  Min.   :1.000   Min.   :2.963   Min.   :2.232   Min.   :3.598  
##  1st Qu.:3.000   1st Qu.:3.187   1st Qu.:2.656   1st Qu.:3.827  
##  Median :3.500   Median :3.237   Median :2.726   Median :3.903  
##  Mean   :3.427   Mean   :3.234   Mean   :2.723   Mean   :3.902  
##  3rd Qu.:4.000   3rd Qu.:3.284   3rd Qu.:2.789   3rd Qu.:3.975  
##  Max.   :5.000   Max.   :3.489   Max.   :3.095   Max.   :4.235  
##  bigram_frequency      MLC        
##  Min.   :1.240    Min.   : 5.769  
##  1st Qu.:1.440    1st Qu.: 8.357  
##  Median :1.500    Median : 9.584  
##  Mean   :1.500    Mean   : 9.730  
##  3rd Qu.:1.559    3rd Qu.:10.794  
##  Max.   :1.755    Max.   :20.381

Frequency_AW

library(ggplot2)
library(viridis) #color-friendly palettes

g1 <- ggplot(mr_data, aes(x = frequency_AW , y=Score)) +
  geom_jitter() + 
  geom_smooth(method = "loess",color = "red") + #this is a line of best fit based on a moving average
  geom_smooth(method = "lm") + #this is a line of best fit based on the entire dataset
  theme_minimal()

#print(g1)

Scatter plot showing black points and two smooth lines running through the data points: one in blue, which shows a line of best fit, and the other in red, which shows a loess regression. The red line is meant to approximate the blue line. The gray shaded area around each line indicates the confidence interval for that regression.

Frequency_CW

g2 <- ggplot(mr_data, aes(x = frequency_CW , y=Score)) +
  geom_jitter() + 
  geom_smooth(method = "loess",color = "red") + #this is a line of best fit based on a moving average
  geom_smooth(method = "lm") +#this is a line of best fit based on the entire dataset
  theme_minimal()

#print(g2)

Scatter plot showing black points and two smooth lines running through the data points: one in blue, which shows a line of best fit, and the other in red, which shows a loess regression. The red line is meant to approximate the blue line. The gray shaded area around each line indicates the confidence interval for that regression.

Frequency_FW

g3 <- ggplot(mr_data, aes(x = frequency_FW , y=Score )) +
  geom_jitter() + 
  geom_smooth(method = "loess",color = "red") + #this is a line of best fit based on a moving average
  geom_smooth(method = "lm") + #this is a line of best fit based on the entire dataset
  theme_minimal()

#print(g3)

Scatter plot showing black points and two smooth lines running through the data points: one in blue, which shows a line of best fit, and the other in red, which shows a loess regression. The gray shaded area around each line indicates the confidence interval for that regression.

bigram_frequency

g4 <- ggplot(mr_data, aes(x = bigram_frequency , y=Score)) +
  geom_jitter() + 
  geom_smooth(method = "loess",color = "red") + #this is a line of best fit based on a moving average
  geom_smooth(method = "lm") + #this is a line of best fit based on the entire dataset
  theme_minimal()

#print(g4)

Scatter plot showing black points and two smooth lines running through the data points: one in blue, which shows a line of best fit, and the other in red, which shows a loess regression. The gray shaded area around each line indicates the confidence interval for that regression.

MLC

g5 <- ggplot(mr_data, aes(x = MLC , y=Score )) +
  geom_jitter() + 
  geom_smooth(method = "loess",color = "red") + #this is a line of best fit based on a moving average
  geom_smooth(method = "lm") + #this is a line of best fit based on the entire dataset
  theme_minimal()

#print(g5)

Scatter plot showing black points and two smooth lines running through the data points: one in blue, which shows a line of best fit, and the other in red, which shows a loess regression. The gray shaded area around each line indicates the confidence interval for that regression.

For the purposes of this tutorial, we will suggest that our data roughly meet Assumptions 1-3, though we might argue for a few violations (e.g., there seem to be outliers in the MLC scatterplot).
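
If we wanted to look more closely at potential outliers (the MLC values in particular, which we flagged above), one optional approach is a simple boxplot. The sketch below assumes mr_data and ggplot2 have already been loaded, as above.

g_mlc <- ggplot(mr_data, aes(y = MLC)) +
  geom_boxplot() + #points plotted beyond the whiskers are potential outliers
  theme_minimal()

#print(g_mlc)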

Assumption 4: The residuals must be normally distributed

We will check this assumption with QQ plots after our model is created.

Assumption 5: The predictor variables are not strongly correlated with each other (multicollinearity)

If our predictor variables are strongly correlated with each other, the coefficient estimates become unstable and we are likely to overfit the data, causing the model to overestimate the explained variance.

In order to check for multicollinearity, we need to check the correlations between each of our variables. We can do this by running a series of cor.test() analyses, but it is easier to make a correlation matrix (see below).

To create the correlation matrix, we simply use the cor() function with our dataframe as the argument:

cor(mr_data)
##                         Score frequency_AW frequency_CW frequency_FW
## Score             1.000000000   -0.2110184   -0.3245257    0.2142725
## frequency_AW     -0.211018438    1.0000000    0.7224457    0.2480676
## frequency_CW     -0.324525714    0.7224457    1.0000000   -0.2684520
## frequency_FW      0.214272513    0.2480676   -0.2684520    1.0000000
## bigram_frequency  0.004382718    0.5779108    0.3329309    0.4408050
## MLC               0.240080506   -0.2052432   -0.5226900    0.5227331
##                  bigram_frequency         MLC
## Score                 0.004382718  0.24008051
## frequency_AW          0.577910801 -0.20524315
## frequency_CW          0.332930909 -0.52269000
## frequency_FW          0.440805004  0.52273307
## bigram_frequency      1.000000000  0.02642938
## MLC                   0.026429380  1.00000000

There are many thresholds for multicollinearity that can be used, but a common number is |.700|. (Note that studies have used both lower and higher thresholds than this.) To check for multicollinearity, we need to check our correlation matrix for any two indices that exceed a correlation value of |.700|. If any two indices exceed our collinearity threshold, we have to choose which to keep and which to discard. Generally speaking, we will keep whichever index has the strongest relationship with our dependent variable (in this case, Score).

In these data, we see that frequency_CW and frequency_AW are strongly correlated ( r = 0.722 ), so we have to choose which one to keep. If we look at the first column, we see that frequency_CW has a stronger relationship with Score ( r = -0.325) than frequency_AW ( r = -0.211), so we will discard frequency_AW and retain frequency_CW.

In the correlation matrix, we also see that bigram_frequency is not meaningfully related to Score ( r = .004), so we have no reason to believe that it will be useful in our model. Accordingly, we will not include it.
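
If the matrix is hard to scan by eye, a small optional sketch for flagging predictor pairs that exceed the |.700| threshold is shown below; the column selection simply drops Score so that only the predictors are compared with each other.

#flag pairs of predictors whose correlation exceeds |.700|
predictors <- mr_data[ , c("frequency_AW", "frequency_CW", "frequency_FW", "bigram_frequency", "MLC")]
cor_mat <- cor(predictors)
which(abs(cor_mat) > .700 & abs(cor_mat) < 1, arr.ind = TRUE) #the second condition excludes the diagonal (self-correlations)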

Conducting a multiple regression

After checking all of the assumptions, it is time to fit our model. Again, R makes this very easy. We will use the lm() function, which is the same function we used for single linear regression. In this case, however, we will be adding multiple predictor variables (frequency_CW, MLC, and frequency_FW). Note that in a standard multiple regression fit with lm(), the order of the predictors does not change the coefficient estimates; order only matters for sequential (hierarchical) comparisons between models. We can often sequence variables based on theoretical considerations. In this case, we have none, so they are entered in order of the strength of their relationship with Score.

#define model1 as a regression model using frequency_CW, MLC, and frequency_FW to predict Score
model1 <- lm(Score ~ frequency_CW + MLC + frequency_FW, data = mr_data)
summary(model1)
## 
## Call:
## lm(formula = Score ~ frequency_CW + MLC + frequency_FW, data = mr_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40370 -0.51357 -0.07077  0.60695  1.92361 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.84766    1.90214   3.074  0.00223 ** 
## frequency_CW -2.31913    0.42504  -5.456 7.84e-08 ***
## MLC           0.01449    0.02599   0.558  0.57741    
## frequency_FW  0.96171    0.39058   2.462  0.01416 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8332 on 476 degrees of freedom
## Multiple R-squared:  0.1233, Adjusted R-squared:  0.1178 
## F-statistic: 22.32 on 3 and 476 DF,  p-value: 1.55e-13
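
Now that a model has been fit, an additional (optional) way to check Assumption 5 is to compute variance inflation factors (VIFs). The sketch below uses the vif() function from the car package (which we also load later for qqPlot()); VIF values above roughly 5-10 are often taken as a sign of problematic collinearity.

#install.packages("car") #if not already installed
library(car)
vif(model1) #variance inflation factor for each predictor in model1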

Interpreting and refining the model

The summary() output indicates that the model explains 12.3% of the variance in essay scores (r-squared = .1233, adjusted r-squared = .1178) using three variables. However, our Coefficients table also indicates that MLC does not significantly improve the model (p = .577), so we have no evidence that MLC contributes beyond the other predictors. In this case, we can choose to respecify the model without MLC, which will simplify our model (and may make it more accurate). We do this in model2 below.

#define model2 as a regression model using frequency_CW and frequency_FW to predict Score
model2 <- lm(Score ~ frequency_CW + frequency_FW, data = mr_data)
summary(model2)
## 
## Call:
## lm(formula = Score ~ frequency_CW + frequency_FW, data = mr_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40074 -0.50240 -0.06091  0.61036  1.94582 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.8934     1.8990   3.103  0.00203 ** 
## frequency_CW  -2.4295     0.3759  -6.463 2.53e-10 ***
## frequency_FW   1.0631     0.3454   3.078  0.00220 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8326 on 477 degrees of freedom
## Multiple R-squared:  0.1227, Adjusted R-squared:  0.1191 
## F-statistic: 33.37 on 2 and 477 DF,  p-value: 2.729e-14

As we can see, removing MLC slightly improved the adjusted r-squared (.1191 vs. .1178), even though the multiple r-squared decreased marginally (.1227 vs. .1233). In other words, the simpler model explains essentially the same amount of variance with one fewer predictor.
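
If we want a formal test of whether dropping MLC changes model fit, we can (optionally) compare the two nested models with an F test; a minimal sketch is below.

#compare the nested models; a non-significant F test suggests that dropping MLC does not hurt the model
anova(model2, model1)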

To predict Score, we can use the following formula: Score = (frequency_CW * -2.4295) + (frequency_FW * 1.0631) + 5.8934
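
As a quick sanity check, we can apply this formula to a hypothetical essay and compare the result with predict(). The frequency values below (2.70 and 3.90) are made up purely for illustration.

#hypothetical essay with made-up frequency values
new_essay <- data.frame(frequency_CW = 2.70, frequency_FW = 3.90)
#prediction by hand, using the coefficients from model2
(-2.4295 * new_essay$frequency_CW) + (1.0631 * new_essay$frequency_FW) + 5.8934
#the same prediction using predict()
predict(model2, newdata = new_essay)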

Plotting our results

In order to plot our results, we have to add the scores predicted by the model to our dataframe. This is demonstrated below.

#To add a column to our dataframe, we just define it
mr_data$pred_Score <- predict(model2)
#We can use summary() to make sure it worked correctly
summary(mr_data)
##      Score        frequency_AW    frequency_CW    frequency_FW  
##  Min.   :1.000   Min.   :2.963   Min.   :2.232   Min.   :3.598  
##  1st Qu.:3.000   1st Qu.:3.187   1st Qu.:2.656   1st Qu.:3.827  
##  Median :3.500   Median :3.237   Median :2.726   Median :3.903  
##  Mean   :3.427   Mean   :3.234   Mean   :2.723   Mean   :3.902  
##  3rd Qu.:4.000   3rd Qu.:3.284   3rd Qu.:2.789   3rd Qu.:3.975  
##  Max.   :5.000   Max.   :3.489   Max.   :3.095   Max.   :4.235  
##  bigram_frequency      MLC           pred_Score   
##  Min.   :1.240    Min.   : 5.769   Min.   :2.371  
##  1st Qu.:1.440    1st Qu.: 8.357   1st Qu.:3.218  
##  Median :1.500    Median : 9.584   Median :3.432  
##  Mean   :1.500    Mean   : 9.730   Mean   :3.427  
##  3rd Qu.:1.559    3rd Qu.:10.794   3rd Qu.:3.622  
##  Max.   :1.755    Max.   :20.381   Max.   :4.738

Now, we can easily plot the relationship between Score and the values predicted by the model (pred_Score). As we can see, our model includes quite a lot of error!

library(ggplot2)

g6 <- ggplot(mr_data, aes(y=pred_Score, x = Score)) +
  geom_jitter() +
  geom_smooth(method="lm") +
  theme_minimal()

#print(g6)

Scatter plot showing black points and a smooth line running through the data points. The line shows a line of best fit. The gray shaded area around the line indicates the confidence interval for that regression.

Checking the assumption of normality (of the residuals)

If the residuals were perfectly normally distributed, all of the points in the Q-Q plot would fall on the line. As we can see, we have some outliers (particularly on the right side of the graph), but the distribution is roughly normal.

library(car)
qqPlot(model1)

A normal quantile-quantile (Q-Q) plot displaying standardized residuals plotted against theoretical standard normal quantiles. The data points, represented by black circles, mostly align with the blue diagonal line. The blue shaded area around the line shows a confidence band. Some points, especially in the upper right tail, deviate from this line, suggesting potential outliers.
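
As an optional supplement to the Q-Q plot, we could also inspect a histogram of the residuals, which should look roughly bell-shaped; a minimal sketch using base R is below.

#a rough check: the histogram of residuals should be approximately symmetrical and bell-shaped
hist(residuals(model1), breaks = 20, main = "Histogram of residuals", xlab = "Residual")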

Stepwise models

If we have no theoretical basis for presuming a particular order of importance for our variables, we can use a stepwise model to automatically select the optimal number and order of variables. Note that this is common in educational data mining and other related fields, but is sometimes controversial. Basically, it is important to only include variables that you have a good reason to believe (e.g., theoretically or based on previous evidence) are appropriate predictors of the dependent variable.

Below, we will conduct a stepwise regression using the MASS package (don’t forget to install it!). Note that the first argument of stepAIC() is a model (e.g., lm(Score ~ frequency_CW + MLC + frequency_FW, data = mr_data)). Here, we are using our first model (model1).

#install.packages("MASS") #if not already installed
library(MASS)
stepwise_model <- stepAIC(model1, direction="both")
## Start:  AIC=-171.18
## Score ~ frequency_CW + MLC + frequency_FW
## 
##                Df Sum of Sq    RSS     AIC
## - MLC           1    0.2158 330.68 -172.86
## <none>                      330.47 -171.18
## - frequency_FW  1    4.2091 334.67 -167.10
## - frequency_CW  1   20.6682 351.13 -144.06
## 
## Step:  AIC=-172.86
## Score ~ frequency_CW + frequency_FW
## 
##                Df Sum of Sq    RSS     AIC
## <none>                      330.68 -172.86
## + MLC           1    0.2158 330.47 -171.18
## - frequency_FW  1    6.5678 337.25 -165.42
## - frequency_CW  1   28.9601 359.64 -134.57
summary(stepwise_model)
## 
## Call:
## lm(formula = Score ~ frequency_CW + frequency_FW, data = mr_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40074 -0.50240 -0.06091  0.61036  1.94582 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.8934     1.8990   3.103  0.00203 ** 
## frequency_CW  -2.4295     0.3759  -6.463 2.53e-10 ***
## frequency_FW   1.0631     0.3454   3.078  0.00220 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8326 on 477 degrees of freedom
## Multiple R-squared:  0.1227, Adjusted R-squared:  0.1191 
## F-statistic: 33.37 on 2 and 477 DF,  p-value: 2.729e-14

As we see, the stepwise model dropped MLC, just like we did, and ended up with the same model as we saw in model2.
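
If we want to confirm the AIC comparison ourselves, we can (optionally) compute AIC values for both models directly. Note that AIC() and stepAIC() use different additive constants, so the absolute values will differ from the output above, but the difference between the two models is what matters.

#lower AIC indicates a better trade-off between fit and model complexity
AIC(model1, model2)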

Further directions

When predicting scores, it is often important to ensure that our model is generalizable beyond the current dataset. This can be accomplished using training and test sets and/or through cross-validation.

Training and test sets

This procedure involves randomly dividing a dataset into two pieces (often a 2/3, 1/3 split). First, a model is created using the larger (training) set; it is then tested on new data (the smaller test set). See below for an example.

Note that this uses the caret package, so don’t forget to install it.

#install.packages("caret") #if not already installed
library(caret)
set.seed(1) #do this to ensure reproducibility - otherwise you will get slightly different results each time
#create random sample based on Score values, training set will be 67% of data
trainIndex <- createDataPartition(mr_data$Score, p = .67, list = FALSE, times = 1)
#define training data
mr_dataTrain <- mr_data[trainIndex,]
#define test data
mr_dataTest <- mr_data[-trainIndex,]
#create model using the training data:
train_model <- lm(Score ~ frequency_CW + frequency_FW, data = mr_dataTrain)
#get training set results
summary(train_model)
## 
## Call:
## lm(formula = Score ~ frequency_CW + frequency_FW, data = mr_dataTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.07644 -0.50199 -0.07987  0.58978  1.87751 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.3241     2.3813   1.396  0.16371    
## frequency_CW  -1.9190     0.4620  -4.154  4.2e-05 ***
## frequency_FW   1.3688     0.4311   3.175  0.00164 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8348 on 320 degrees of freedom
## Multiple R-squared:  0.1076, Adjusted R-squared:  0.102 
## F-statistic: 19.28 on 2 and 320 DF,  p-value: 1.239e-08

Using only the training set (67% of the data), we see that the model is slightly less predictive (r-squared = .1076) than our original model2, which used the whole dataset.

Now, we can apply this model to the test set, and evaluate its performance.

#predict scores in test data and save results as the variable "test_pred"
mr_dataTest$test_pred <- predict(train_model,mr_dataTest)
#determine the relationship between the actual and predicted scores in the test data
cor.test(mr_dataTest$Score,mr_dataTest$test_pred)
## 
##  Pearson's product-moment correlation
## 
## data:  mr_dataTest$Score and mr_dataTest$test_pred
## t = 5.1419, df = 155, p-value = 8.1e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2394076 0.5079940
## sample estimates:
##       cor 
## 0.3817307
#get R-squared value
cor(mr_dataTest$Score,mr_dataTest$test_pred)^2
## [1] 0.1457183

As we see from the results, the model built on the training data actually worked better on the test set (r-squared = .146) than on the training set (r-squared = .108)! This provides some evidence that our model can be generalized to new (but similar) datasets.
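
Beyond the correlation between actual and predicted scores, we might also want to know how large the prediction errors are on the test set. A minimal sketch computing the root mean squared error (RMSE) and mean absolute error (MAE) is below.

#average size of the prediction errors in the test set
sqrt(mean((mr_dataTest$Score - mr_dataTest$test_pred)^2)) #RMSE
mean(abs(mr_dataTest$Score - mr_dataTest$test_pred)) #MAE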

Cross-validation

Cross-validation involves using multiple training and test sets, then averaging the model performance across all test sets. Commonly, the data are divided into ten segments (though this can vary widely!). In the case of ten segments, the data from nine segments are used to train the model, which is then tested on the tenth segment. This process is repeated until every segment has served as the test set, at which point the results from each test set are averaged. Cross-validation is popular because it allows for larger training sets (and avoids some of the pitfalls of a single random split).

#library(caret)
set.seed(1) #do this to ensure reproducibility - otherwise you will get slightly different results each time
#create cross-validation parameters
TenFoldCV <- trainControl(method = "repeatedcv", number = 10,repeats = 10)
#assign formula from stepwise model (this was the model defined earlier)
step_mod<-formula(stepwise_model)
#create cross-validated model
mTenFold<-train(step_mod,data=mr_data,method='lm',trControl=TenFoldCV)
#get results
mTenFold$results
##   intercept      RMSE  Rsquared       MAE     RMSESD RsquaredSD    MAESD
## 1      TRUE 0.8319323 0.1358885 0.6769704 0.06143786 0.09085329 0.050142

Here, we are most interested in the Rsquared value, which indicates that the model explained 13.6% of the variance in Score (Rsquared = .136). This is not markedly different from the model fit on the full dataset, which indicates that our original model was stable across the dataset.
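
If we want to inspect the coefficients of the final model that caret refits on the full dataset after resampling, we can (as a sketch) look at the finalModel element of the train object.

#the model refit on the full dataset after resampling
summary(mTenFold$finalModel)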