
Regression: Predicting dependent variable values based on independent variable values

To fully understand the information on this page, be sure to read the previous tutorial on correlation.

Regression is used for value prediction. For example, let's imagine that we have a set of writing proficiency scores (the dependent variable) and corresponding mean word frequency values (a measure of lexical sophistication; this would be the independent variable). Regression allows us to predict the writing proficiency scores based on the mean frequency values. The stronger the relationship between the two variables, the better (more accurate) the predictions will be.

Conceptually (and mathematically), regression is related to correlation. The main difference is that the results of a regression include more information, namely the characteristics of the line of best fit through the data.

In this tutorial, we are going to use average word frequency scores (frequency_CW) to predict scores for nwords (which we are using as a [flawed] proxy for proficiency) in a corpus of learner argumentative essays.

Assumptions

The assumptions for (single) linear regression are almost the same as those for correlations.

The main assumptions are:

  • The variables must be continuous (see other tests for ordinal or categorical data)

  • The variables must have a linear relationship

  • There are no outliers (or there are only minimal outliers in large samples)

  • The residuals (i.e., the prediction errors) are normally distributed

Checking assumptions

First, we will load our data:

library(ggplot2) #load ggplot2
library(viridis) #color-friendly palettes

cor_data <- read.csv("data/correlation_sample.csv", header = TRUE) #read the spreadsheet "correlation_sample.csv" into r as a dataframe
summary(cor_data) #get descriptive statistics for the dataset
##      Score        pass.fail            Prompt              nwords     
##  Min.   :1.000   Length:480         Length:480         Min.   : 61.0  
##  1st Qu.:3.000   Class :character   Class :character   1st Qu.:273.0  
##  Median :3.500   Mode  :character   Mode  :character   Median :321.0  
##  Mean   :3.427                                         Mean   :317.7  
##  3rd Qu.:4.000                                         3rd Qu.:355.2  
##  Max.   :5.000                                         Max.   :586.0  
##   frequency_AW    frequency_CW    frequency_FW   bigram_frequency
##  Min.   :2.963   Min.   :2.232   Min.   :3.598   Min.   :1.240   
##  1st Qu.:3.187   1st Qu.:2.656   1st Qu.:3.827   1st Qu.:1.440   
##  Median :3.237   Median :2.726   Median :3.903   Median :1.500   
##  Mean   :3.234   Mean   :2.723   Mean   :3.902   Mean   :1.500   
##  3rd Qu.:3.284   3rd Qu.:2.789   3rd Qu.:3.975   3rd Qu.:1.559   
##  Max.   :3.489   Max.   :3.095   Max.   :4.235   Max.   :1.755
library(psych) #load psych, which provides the describe() function
describe(cor_data) #get more detailed descriptive statistics
##                  vars   n   mean    sd median trimmed   mad   min    max  range
## Score               1 480   3.43  0.89   3.50    3.43  0.74  1.00   5.00   4.00
## pass.fail*          2 480   1.54  0.50   2.00    1.54  0.00  1.00   2.00   1.00
## Prompt*             3 480   1.50  0.50   1.50    1.50  0.74  1.00   2.00   1.00
## nwords              4 480 317.65 78.28 321.00  316.93 57.82 61.00 586.00 525.00
## frequency_AW        5 480   3.23  0.07   3.24    3.23  0.07  2.96   3.49   0.53
## frequency_CW        6 480   2.72  0.11   2.73    2.72  0.10  2.23   3.10   0.86
## frequency_FW        7 480   3.90  0.11   3.90    3.90  0.11  3.60   4.23   0.64
## bigram_frequency    8 480   1.50  0.09   1.50    1.50  0.09  1.24   1.75   0.52
##                   skew kurtosis   se
## Score             0.00    -0.62 0.04
## pass.fail*       -0.14    -1.98 0.02
## Prompt*           0.00    -2.00 0.02
## nwords            0.12     1.10 3.57
## frequency_AW     -0.10     0.54 0.00
## frequency_CW     -0.14     1.08 0.00
## frequency_FW      0.10    -0.21 0.01
## bigram_frequency  0.02    -0.11 0.00

Our variables of interest, content word frequency (frequency_CW) and number of words (nwords), are both continuous variables, so we meet the first assumption.

Next, we will test the assumption of linearity using a loess line (in red) and a line of best fit (in blue). These indicate that our data are roughly linear (because the red line closely mirrors the blue one). Note that these are the same data used in the correlation tutorial.

g1 <- ggplot(cor_data, aes(x = frequency_CW, y = nwords)) +
  geom_point() +
  geom_smooth(method = "loess", color = "red") + #loess: a locally weighted line of best fit
  geom_smooth(method = "lm") + #a straight line of best fit based on the entire dataset
  scale_color_viridis(discrete = TRUE) +
  theme_minimal()

print(g1) #display the plot

Scatter plot showing black points and two smooth lines running through the data points: one in blue, which shows a line of best fit, and the other in red, which shows a loess regression. The gray shaded area around each line indicates the confidence interval for that regression.

Looking again at the scatterplot, we see that there are a few outliers. However, given the size of our dataset, these shouldn’t be a problem.
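If we want a quick numerical check to supplement our visual inspection, we can count how many points fall more than three standard deviations from the mean. This is just one rough heuristic for flagging outliers (not a definitive test), using base R's scale() function:

#optional heuristic: count values more than 3 standard deviations from the mean
sum(abs(scale(cor_data$nwords)) > 3) #extreme nwords values
sum(abs(scale(cor_data$frequency_CW)) > 3) #extreme frequency_CW values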

We will check the distribution of the residuals AFTER running the regression (we have to run the regression to get the residuals).

Preliminary correlation analysis

See previous tutorial on correlation for more details on this preliminary analysis.

cor.test(cor_data$nwords, cor_data$frequency_CW) #run a Pearson correlation between nwords and frequency_CW
## 
##  Pearson's product-moment correlation
## 
## data:  cor_data$nwords and cor_data$frequency_CW
## t = -4.5377, df = 478, p-value = 7.201e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2874922 -0.1158270
## sample estimates:
##        cor 
## -0.2032207

Conducting a linear regression

Linear regression is very easy to do in R. We simply use the lm() function:

#define model1 as a regression model using frequency_CW to predict nwords
model1 <- lm(nwords ~ frequency_CW, data = cor_data)

We then can see the model summary using the summary() function:

#get a summary of model1
summary(model1)
## 
## Call:
## lm(formula = nwords ~ frequency_CW, data = cor_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -245.502  -42.924    3.102   40.703  253.254 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    729.93      90.92   8.028 7.75e-15 ***
## frequency_CW  -151.42      33.37  -4.538 7.20e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.73 on 478 degrees of freedom
## Multiple R-squared:  0.0413, Adjusted R-squared:  0.03929 
## F-statistic: 20.59 on 1 and 478 DF,  p-value: 7.201e-06

There is a wealth of information provided by this output. For the purposes of this tutorial, we are interested in five pieces in particular: the Estimate(s), the Std. Error, the p-value, the Multiple R-squared, and the Adjusted R-squared. Each of these is described below.

Estimate(s)

The estimate(s) provide the information needed to construct the regression line (and therefore predict nwords values based on frequency_CW values).

The intercept estimate (in this case, 729.93) indicates the value of nwords when frequency_CW = 0 (this is the y-intercept).

The frequency_CW estimate indicates the slope of the line of best fit. For every increase of 1 in frequency_CW, the predicted nwords value decreases by 151.42.

For any frequency_CW value, we can predict the nwords value using the following formula: nwords = (frequency_CW value * -151.42) + 729.93. So, if our frequency value is 2.8, we will predict that nwords will be 305.954 (see below for this calculation).

(2.8 * -151.42) + 729.93 #predicted nwords for a frequency_CW value of 2.8
## [1] 305.954
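Rather than doing this calculation by hand, we can also ask R for the same prediction using the predict() function (the result should match our hand calculation, give or take rounding):

#predict nwords for a frequency_CW value of 2.8 using model1
predict(model1, newdata = data.frame(frequency_CW = 2.8))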

Standard Error (Std. Error)

The standard error figures indicate the precision of the coefficient estimates (i.e., the standard deviation of their sampling distributions). Larger standard errors indicate less precise estimates (and a less optimal model). The standard error for the intercept is 90.92, and the standard error for frequency_CW is 33.37.
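If we want to pull these values (along with the estimates, t values, and p-values) directly out of the model for reporting, they are stored as a matrix in the model summary:

#extract the coefficient table from the model summary
summary(model1)$coefficients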

Probability value (p-value)

This indicates the likelihood of observing these data if there were no relationship between our two variables. In this case, the p-value is quite low (7.201e-06, or .000007201, or .0007201%).

Multiple R-squared

This is the effect size for regressions. It is the squared value of the correlation coefficient, and indicates the amount of shared variance between the two variables. In this case, frequency_CW explains 4.13% of the variance in nwords scores (R-squared = .0413), which is quite small.
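We can verify the relationship between the correlation coefficient and R-squared ourselves:

#the squared correlation coefficient matches the Multiple R-squared value
cor(cor_data$nwords, cor_data$frequency_CW)^2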

Adjusted R-squared

This is the effect size adjusted for the number of predictor variables used. As the number of predictor variables increases, the adjusted R-squared will be progressively lower than the unadjusted value. In this case, our Adjusted R-squared value is .03929.
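For reference, the adjustment is based on the sample size (n) and the number of predictor variables (k): Adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - k - 1). We can reproduce the value reported in the model summary by hand:

#reproduce the adjusted R-squared by hand (n = 480 essays, k = 1 predictor)
r2 <- summary(model1)$r.squared
n <- nrow(cor_data)
k <- 1
1 - (1 - r2) * (n - 1) / (n - k - 1) #should match the Adjusted R-squared above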

Checking the assumption of normality (of residuals)

Now that we have run the model, we can check the assumption of normality (of the residuals) using a qq plot. If the residuals were perfectly normally distributed, all of the points would fall on the line. As we can see, we have some outliers (particularly on the right side of the graph), but the distribution is roughly normal.

library(car) #load car, which provides the qqPlot() function
qqPlot(model1) #plot the model residuals against a theoretical normal distribution

A normal quantile-quantile (Q-Q) plot displaying standardized residuals plotted against theoretical standard normal quantiles. The data points, represented by black circles, mostly align with the blue diagonal line. The blue shaded area around the line shows a confidence band. Some points, especially in the upper right tail, deviate from the line, suggesting potential outliers.
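If we want an additional (optional) look at the residuals beyond the qq plot, a simple histogram should also look roughly bell-shaped:

#optional check: a histogram of the model residuals
hist(residuals(model1))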

Example write-up

This study examined the relationship between lexical sophistication (measured as the mean frequency of content words) and writing proficiency (imperfectly measured as the number of words per essay) in a corpus of argumentative essays written by L2 users of English. Descriptive statistics are reported in Table 1.

Table 1.
Descriptive statistics for indices used in this study

Index              n     Mean     Standard Deviation
number of words    480   317.65   78.28
Frequency CW       480   2.72     0.11

The data met the assumptions of Pearson correlations and linear regressions. Therefore, a Pearson correlation was calculated between the number of words in each essay and content word frequency to determine the strength of the relationship between the two indices. The results of the Pearson correlation, which indicated a small, negative relationship between the two variables (r = -0.203, p < .001), are reported in Table 2.

Table 2.
Results of the Pearson correlation analysis with number of words

Index          n     r        p
Frequency CW   480   -0.203   < .001

Finally, a linear regression was conducted to determine the degree to which content word frequency could predict the number of words in an essay (as a proxy for proficiency). The results, which are reported in Table 3, indicated that content word frequency explained 4% of the variance in number of words (R^2 = 0.041). See Figure 1 for a visualization of the results.

Table 3.
Results of the linear regression analysis

Coefficients   R^2     Adj. R^2   B         SE      p
(Intercept)                       729.93    90.92   < .001
Frequency CW   0.041   0.039      -151.42   33.37   < .001

Scatter plot showing black points and a line of best fit running through the data points. The gray shaded area around the line indicates the confidence interval for the regression.

Figure 1. Predicted versus actual scores

Obligatory Stats Joke

Stay tuned: Multiple Regression

In an upcoming tutorial, we will look at using multiple independent variables to predict the values of a dependent variable.
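As a preview, the syntax is a simple extension of what we did above: we just add predictors with +. The sketch below uses bigram_frequency as a second (purely illustrative) predictor from our dataset, not a recommendation for a final model:

#preview: a multiple regression with two predictors (covered in the next tutorial)
model2 <- lm(nwords ~ frequency_CW + bigram_frequency, data = cor_data)
summary(model2)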