Linear mixed effects models (LMEs) are mathematically and conceptually related to linear regression (and, accordingly, to ANOVA). The biggest difference between an LME and a linear regression is that an LME can adjust the line of best fit based on the trajectories of particular individuals (or groups). In this tutorial, we will use linear mixed-effects models to examine the relationship between time spent learning English as an L2 and writing development (measured via an index of syntactic complexity).
We will first treat Time as a categorical variable to see the degree to which changes in syntactic complexity occur between data collection points. This analysis is related to a repeated measures ANOVA both conceptually and mathematically. However, an LME is arguably more precise than an ANOVA because it takes into account the individual trajectories of each participant.
We will then treat Time as a continuous variable to see the degree to which changes in syntactic complexity are linear (i.e., follow a reasonably consistent trajectory).
This dataset comprises L2 English essays written over a two-year period by nine middle school-aged Dutch children studying at an English/Dutch bilingual school in the Netherlands. Essays were collected three times a year (every four months) over two academic years. Included in the dataset are holistic scores for each essay (“Score”) and mean length of T-unit (MLT) values. In this tutorial, we will explore the relationship between MLT and time spent studying English, with the alternative hypothesis that MLT scores will increase as a function of time. For further reference, see Kyle, Crossley, & Verspoor (2021) and Kyle (2016). In other words, we will be attempting to determine whether (and the degree to which) Dutch EFL middle school students write more words per T-unit (a T-unit is an independent clause and all connected dependent clauses) as a function of the time they spend studying English.
mydata <- read.csv("data/RM_sample.csv", header = TRUE)
#First, we create a new variable that is the categorical version of Time
mydata$FTime <- factor(mydata$Time)
summary(mydata)
## Participant Time Score MLT FTime
## Length:54 Min. :1.0 Min. :1.00 Min. : 6.895 1:9
## Class :character 1st Qu.:2.0 1st Qu.:3.00 1st Qu.: 9.438 2:9
## Mode :character Median :3.5 Median :4.00 Median :10.976 3:9
## Mean :3.5 Mean :4.13 Mean :11.517 4:9
## 3rd Qu.:5.0 3rd Qu.:5.00 3rd Qu.:12.906 5:9
## Max. :6.0 Max. :7.00 Max. :18.889 6:9
library(psych)
#get descriptives for MLT at each time point; "group1" is FTime
describeBy(x = mydata$MLT, group = mydata$FTime, mat = TRUE, digits = 3)
## item group1 vars n mean sd median trimmed mad min max range
## X11 1 1 1 9 9.697 2.043 9.792 9.697 2.162 6.895 12.750 5.855
## X12 2 2 1 9 10.364 1.654 10.059 10.364 1.693 8.476 12.824 4.347
## X13 3 3 1 9 11.474 3.779 10.357 11.474 2.848 8.158 18.250 10.092
## X14 4 4 1 9 11.404 2.719 10.600 11.404 1.291 8.579 16.182 7.603
## X15 5 5 1 9 13.321 2.556 13.438 13.321 1.062 10.312 18.889 8.576
## X16 6 6 1 9 12.843 1.811 11.857 12.843 1.596 10.625 16.100 5.475
## skew kurtosis se
## X11 0.259 -1.423 0.681
## X12 0.367 -1.605 0.551
## X13 0.794 -1.128 1.260
## X14 0.793 -1.043 0.906
## X15 0.793 -0.101 0.852
## X16 0.504 -1.351 0.604
First, we can look at the means at each time point. Overall, there appears to be a positive trend (i.e., more words per T-unit at later essay collection points), but this is not perfectly linear.
library(ggplot2)
library(viridis) #color-friendly palettes
g1 <- ggplot(data = mydata, aes(x = FTime, y = MLT, group = Time)) +
geom_boxplot()
#print(g1)
Then, we can get a clearer view by looking at individual trajectories. In the plot below, each line represents a different participant.
library(ggplot2)
# customizing colors
colors <- c("orange", "#56B4E9", "#009E73", "gold2", "#0072B2", "purple", "#CC79A7", "#999999", "#000000")
g2 <- ggplot(data = mydata, aes(x = FTime, y = MLT, group = Participant)) +
geom_line(aes(color=Participant)) +
geom_point(aes(color=Participant)) +
scale_color_manual(values = colors)
#print(g2)
Sometimes, we get a clearer view of individual trajectories when we use facet_wrap(). As we can see, participants generally follow an upward trend, but there are also non-linear patterns (see EFL_6 and EFL_7 at time point 3!).
g3 <- ggplot(data = mydata, aes(x = FTime, y = MLT, group = Participant)) +
geom_line(aes(color=Participant)) +
geom_point(aes(color=Participant)) +
scale_color_manual(values = colors) +
facet_wrap(~Participant)
#print(g3)
Linear mixed effects models allow us to account for both fixed effects (these are the variables we are most interested in, such as time spent studying English) and random effects, which are variables that may affect our results, but are not the main variables of interest.
In this tutorial, we are primarily interested in time spent studying English (Time and/or FTime). However, many researchers have highlighted the idea that individuals take varying paths on their way to proficiency, so individual variation may affect our results. This is our random effect, because we want to account for individual variation, but it isn’t our primary variable of interest.
Linear mixed effects (LME) models are relatively new (particularly when compared to statistics such as t-tests). They are also particularly flexible. For these two reasons, there is not yet a fully accepted consensus on the assumptions for running an LME. However, we will check the following two assumptions, which are the same as for linear regression models:
1. Linearity: the relationship between the predicted and observed values should be roughly linear.
2. Normality: the residuals should be roughly normally distributed.
Note that to check residuals we will have to run our models first. So, we will (a bit paradoxically) check our assumptions AFTER running our models.
As noted above, our random effects are our participants (i.e., our main interest is not in inter-participant variation, but we still need to control for it). In the first model we run, we will presume that all of our participants follow similar trajectories but may have different starting points (intercepts). In other words, the model will presume that participants increase their MLT scores at approximately the same rate, but may be at different writing proficiencies (at least with regard to MLT) at the beginning of the study. First, we will run the analysis with time as a categorical variable (FTime), as we would with a repeated measures ANOVA. Then, we will run it with time as a continuous variable to show the difference.
The most “appropriate” choice (treating Time as a factor or as a continuous variable) will depend on our goals. If we want to see WHERE changes occurred (e.g., across particular time points), then we will want to treat Time as a categorical variable. If we want to see the strength of the relationship (i.e., how strongly [and linearly] related Time and MLT scores are), then we will want to treat Time as a continuous variable.
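Before fitting anything, it may help to see what treating Time as a factor actually does under the hood. The base-R sketch below (using a made-up factor that mirrors this dataset's shape: nine participants at six time points) shows how a six-level factor is expanded into an intercept plus five treatment-coded dummy columns:

```r
# Mirror the shape of the dataset: 9 participants at 6 time points (54 rows)
time_factor <- factor(rep(1:6, each = 9))

# model.matrix() shows the design matrix that will be built for FTime
mm <- model.matrix(~ time_factor)
dim(mm)        # 54 rows, 6 columns: intercept + 5 dummy contrasts
colnames(mm)   # "(Intercept)" "time_factor2" ... "time_factor6"
```

This is why the model summary for the categorical model reports five FTime coefficients rather than six: level 1 is absorbed into the intercept.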
library(lmerTest)
int_model1 <- lmer(MLT~FTime + (1 | Participant), data=mydata)
summary(int_model1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: MLT ~ FTime + (1 | Participant)
## Data: mydata
##
## REML criterion at convergence: 235.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.1754 -0.6463 -0.2230 0.4889 2.8861
##
## Random effects:
## Groups Name Variance Std.Dev.
## Participant (Intercept) 1.214 1.102
## Residual 5.186 2.277
## Number of obs: 54, groups: Participant, 9
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 9.6968 0.8432 40.6842 11.500 2.32e-14 ***
## FTime2 0.6667 1.0735 40.0000 0.621 0.53808
## FTime3 1.7767 1.0735 40.0000 1.655 0.10574
## FTime4 1.7068 1.0735 40.0000 1.590 0.11973
## FTime5 3.6242 1.0735 40.0000 3.376 0.00165 **
## FTime6 3.1460 1.0735 40.0000 2.931 0.00557 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) FTime2 FTime3 FTime4 FTime5
## FTime2 -0.637
## FTime3 -0.637 0.500
## FTime4 -0.637 0.500 0.500
## FTime5 -0.637 0.500 0.500 0.500
## FTime6 -0.637 0.500 0.500 0.500 0.500
Note that the “intercept” here presumes that FTime = 1, and all of the other estimates are in reference to this baseline. So, from FTime 1 to FTime 2, there is a 0.6667 increase in words per T-unit (which is not significant). Likewise, from FTime 1 to FTime 3, there is a 1.7767 increase in words per T-unit (which is also not significant). We don’t see a significant change until FTime 5.
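To make the treatment coding concrete, we can reconstruct the estimated group means by hand from the fixed-effect estimates printed above (each FTime coefficient is the difference from the FTime 1 baseline):

```r
# Fixed-effect estimates copied from the model summary above
intercept <- 9.6968
coefs <- c(FTime2 = 0.6667, FTime3 = 1.7767, FTime4 = 1.7068,
           FTime5 = 3.6242, FTime6 = 3.1460)

# Estimated mean MLT at each time point = intercept + contrast
group_means <- c(FTime1 = intercept, intercept + coefs)
round(group_means, 3)
```

Because the design is balanced, these reconstructed means match the descriptive means from describeBy() above (e.g., 13.321 at Time 5).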
The Random Effects section above indicates that there is indeed variation across participants at each time point (the standard deviation is 1.102 words per T-unit).
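One way to quantify this variation is the intraclass correlation coefficient (ICC): the proportion of the model's total (random + residual) variance attributable to stable between-participant differences. It is not printed by the summary directly, but is easy to compute from the variance components above:

```r
# Variance components copied from the Random effects table above
var_participant <- 1.214
var_residual <- 5.186

# ICC: share of the non-fixed-effect variance due to participant differences
icc <- var_participant / (var_participant + var_residual)
round(icc, 3)  # about 0.19
```

In other words, roughly 19% of the variance remaining after the fixed effects is attributable to which participant wrote the essay.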
We can also see the degree to which our estimates need to change for each participant (i.e., we can see our individualized random effects).
ranef(int_model1)
## $Participant
## (Intercept)
## EFL_1 -0.3623567
## EFL_2 1.2929820
## EFL_3 0.4205164
## EFL_4 -0.6398552
## EFL_5 -1.3619195
## EFL_6 0.2041524
## EFL_7 0.4789303
## EFL_8 0.7436074
## EFL_9 -0.7760572
##
## with conditional variances for "Participant"
Again, this is a random intercepts model, so the only changes we will see are in the intercepts (i.e., starting points). All other estimates will be in relation to these revised intercepts. So, for example, at Time 1, EFL_1’s estimated MLT score will be 9.6968 - 0.3623567 = 9.334443, while EFL_2’s estimated MLT score will be 9.6968 + 1.2929820 = 10.98978. But remember, these are estimates created under the assumption that all participants have the same slope (or rate of change), and we can see from looking at our plots that this estimate is decent for EFL_1, but rather poor for EFL_2.
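The arithmetic above can be done for all nine participants at once (adjustments copied from the ranef() output above). Note, too, that the adjustments sum to approximately zero, since they are deviations around the fixed intercept:

```r
# Fixed intercept plus per-participant adjustments from ranef(int_model1)
fixed_intercept <- 9.6968
ranef_intercepts <- c(EFL_1 = -0.3623567, EFL_2 = 1.2929820, EFL_3 = 0.4205164,
                      EFL_4 = -0.6398552, EFL_5 = -1.3619195, EFL_6 = 0.2041524,
                      EFL_7 = 0.4789303, EFL_8 = 0.7436074, EFL_9 = -0.7760572)

# Estimated Time 1 MLT for each participant
time1_estimates <- fixed_intercept + ranef_intercepts
round(time1_estimates, 3)

# The adjustments are deviations, so they sum to ~0
round(sum(ranef_intercepts), 6)
```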
Now that we have a model, we can check the assumptions of linearity and normality.
First, we will test for linearity: when we plot the residuals against the observed MLT values, they should show no systematic (e.g., curved) pattern. As we see below, the model appears to satisfy the assumption of linearity.
plot(resid(int_model1), mydata$MLT)
Now, we will check the assumption of normality (of residuals) using a qq plot. A perfectly normal distribution of residuals would be when all data points fall perfectly on the line. As we can see, we have some outliers, but the distribution is (arguably) roughly normal.
#install.packages("lattice") #if not already installed
library(lattice)
qqmath(int_model1)
Now that we have checked (and at least arguably met) assumptions, we can proceed to computing effect sizes.
#install.packages("MuMIn") #if not alread installed
library(MuMIn)
r.squaredGLMM(int_model1)
## R2m R2c
## [1,] 0.2042064 0.355122
The “MuMIn” package gives us two R^2 values. The first, R^2m (marginal R^2), is the effect size without considering the random effects (i.e., as if the model were built without regard to participant differences). The second, R^2c (conditional R^2), is the effect size when the random effects are taken into account. Accordingly, R^2c values are always higher than R^2m values (unless the random effects are negligible).
The R^2m value here indicates that FTime accounts for 20.4% of the variance in MLT scores overall. The R^2c value indicates that participant variation and FTime together account for 35.5% of the variance in MLT scores.
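For the curious, these two values follow the Nakagawa & Schielzeth formulation: both are ratios of variance components, differing only in whether the random-effect variance counts as “explained”. The sketch below reconstructs them; the fixed-effect variance (about 1.642) is back-calculated here for illustration (in practice it is the variance of the fixed-effect predicted values):

```r
# Participant and residual variances come from the model summary above;
# var_fixed (~1.642) is the variance of the fixed-effect predictions,
# back-calculated here for illustration
var_fixed <- 1.642
var_participant <- 1.214
var_residual <- 5.186
total <- var_fixed + var_participant + var_residual

r2_marginal <- var_fixed / total                         # fixed effects only
r2_conditional <- (var_fixed + var_participant) / total  # fixed + random
round(c(R2m = r2_marginal, R2c = r2_conditional), 3)
```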
Below, we use the package “emmeans” to get pairwise comparisons from our model (in case we want them). The emmeans package is very flexible, so it is probably worthwhile to read its documentation. As we can see below (look at the comparisons starting below the “$contrasts” line), the only significant pairwise difference (assuming an alpha level of .05) is between Time 1 and Time 5.
#install.packages("emmeans") #if not already installed
library(emmeans)
int_model1.emm <- emmeans(int_model1, specs = pairwise~FTime)
summary(int_model1.emm)
## $emmeans
## FTime emmean SE df lower.CL upper.CL
## 1 9.7 0.843 40.7 7.99 11.4
## 2 10.4 0.843 40.7 8.66 12.1
## 3 11.5 0.843 40.7 9.77 13.2
## 4 11.4 0.843 40.7 9.70 13.1
## 5 13.3 0.843 40.7 11.62 15.0
## 6 12.8 0.843 40.7 11.14 14.5
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## FTime1 - FTime2 -0.6667 1.07 40 -0.621 0.9888
## FTime1 - FTime3 -1.7767 1.07 40 -1.655 0.5683
## FTime1 - FTime4 -1.7068 1.07 40 -1.590 0.6097
## FTime1 - FTime5 -3.6242 1.07 40 -3.376 0.0191
## FTime1 - FTime6 -3.1460 1.07 40 -2.931 0.0579
## FTime2 - FTime3 -1.1100 1.07 40 -1.034 0.9036
## FTime2 - FTime4 -1.0400 1.07 40 -0.969 0.9251
## FTime2 - FTime5 -2.9575 1.07 40 -2.755 0.0865
## FTime2 - FTime6 -2.4792 1.07 40 -2.309 0.2142
## FTime3 - FTime4 0.0699 1.07 40 0.065 1.0000
## FTime3 - FTime5 -1.8475 1.07 40 -1.721 0.5265
## FTime3 - FTime6 -1.3692 1.07 40 -1.275 0.7963
## FTime4 - FTime5 -1.9174 1.07 40 -1.786 0.4858
## FTime4 - FTime6 -1.4392 1.07 40 -1.341 0.7609
## FTime5 - FTime6 0.4783 1.07 40 0.446 0.9976
##
## Degrees-of-freedom method: kenward-roger
## P value adjustment: tukey method for comparing a family of 6 estimates
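Two details of this output are easy to verify by hand: each t-ratio is simply the estimate divided by its standard error, and the Tukey adjustment accounts for all choose(6, 2) = 15 pairwise comparisons among the six time points:

```r
# FTime1 - FTime5 contrast, values copied from the $contrasts table above
estimate <- -3.6242
se <- 1.0735
t_ratio <- estimate / se
round(t_ratio, 3)  # -3.376, matching the table

choose(6, 2)       # 15 pairwise comparisons among 6 time points
```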
We can also run our analysis with Time as a continuous variable (the distance between each Time value is roughly four months). In this case, we will be testing whether the line of best fit is a reasonable approximation of our data. Again, at this point, we are still working with a random intercepts model.
First, we generate descriptive statistics (treating Time as a continuous variable).
#load data and packages if not already loaded
#mydata <- read.csv("data/RM_sample.csv", header = TRUE)
#library(MuMIn)
#library(lattice)
#library(lmerTest)
#library(ggplot2)
#library(psych)
describe(mydata)
## vars n mean sd median trimmed mad min max range skew
## Participant* 1 54 5.00 2.61 5.00 5.00 2.97 1.00 9.00 8.00 0.00
## Time 2 54 3.50 1.72 3.50 3.50 2.22 1.00 6.00 5.00 0.00
## Score 3 54 4.13 1.21 4.00 4.16 1.48 1.00 7.00 6.00 -0.18
## MLT 4 54 11.52 2.73 10.98 11.27 2.70 6.89 18.89 11.99 0.74
## FTime* 5 54 3.50 1.72 3.50 3.50 2.22 1.00 6.00 5.00 0.00
## kurtosis se
## Participant* -1.29 0.35
## Time -1.33 0.23
## Score -0.17 0.17
## MLT 0.03 0.37
## FTime* -1.33 0.23
int_model2 <- lmer(MLT~Time + (1 | Participant), data=mydata)
summary(int_model2) #get model summary
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: MLT ~ Time + (1 | Participant)
## Data: mydata
##
## REML criterion at convergence: 247
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.2466 -0.7065 -0.2796 0.3919 3.0939
##
## Random effects:
## Groups Name Variance Std.Dev.
## Participant (Intercept) 1.256 1.121
## Residual 4.934 2.221
## Number of obs: 54, groups: Participant, 9
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 9.0637 0.7840 37.7386 11.56 5.76e-14 ***
## Time 0.7009 0.1770 44.0000 3.96 0.00027 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## Time -0.790
ranef(int_model2) #get slope adjustments per participant
## $Participant
## (Intercept)
## EFL_1 -0.3748968
## EFL_2 1.3377286
## EFL_3 0.4350693
## EFL_4 -0.6619988
## EFL_5 -1.4090517
## EFL_6 0.2112175
## EFL_7 0.4955048
## EFL_8 0.7693416
## EFL_9 -0.8029144
##
## with conditional variances for "Participant"
r.squaredGLMM(int_model2) #get effect sizes
## R2m R2c
## [1,] 0.1908595 0.3550029
This indicates that Time is a significant predictor of MLT scores (p = .00027) with medium effects (R^2m = 0.191, R^2c = 0.355). Below, we add the predicted scores for each participant to our dataframe using the “mutate” function from the package “dplyr”. Then, we plot the predicted line (dashed) on top of the actual data. We see that each line has the same slope, but a different intercept.
library(dplyr)
#add predicted values to dataframe
mydata <- mydata %>% mutate(int_model2_pred = predict(int_model2,re.form = ~ (1 | Participant)))
g6 <- ggplot(mydata, aes(y=MLT, x=Time)) +
facet_wrap(~ Participant) +
geom_point(aes(color=Participant), show.legend = FALSE) +
geom_line(aes(color=Participant), show.legend = FALSE) +
geom_line(aes(y=int_model2_pred), linetype=2) +
scale_color_manual(values = colors) +
facet_wrap(~Participant)
#print(g6)
These results indicate that there is a significant, positive, medium relationship between Time and MLT. The results also indicate that there is a fairly high degree of variability across participants.
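The population-level regression line implied by the fixed effects alone can be written out directly (estimates copied from the int_model2 summary above):

```r
# Fixed-effects-only prediction: MLT_hat = 9.0637 + 0.7009 * Time
predict_mlt <- function(time_point) 9.0637 + 0.7009 * time_point

# Expected MLT at each of the six collection points
round(predict_mlt(1:6), 3)

# Expected overall gain across the study (five 4-month intervals)
round(predict_mlt(6) - predict_mlt(1), 3)  # about 3.5 words per T-unit
```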
A linear mixed effects model with random intercepts was run using Time to predict the mean length of T-unit (MLT) in the participant essays. Descriptive statistics for each variable can be found in Table 1.
Table 1. Descriptive Statistics for Time and MLT
Index | n | Mean | Standard Deviation |
---|---|---|---|
Time | 54 | 3.500 | 1.720 |
MLT | 54 | 11.520 | 2.730 |
The results of the analysis indicated that there was a significant positive relationship between Time and MLT in the participant essays (p < .001, R^2m = .191, R^2c = .355). Table 2 comprises the estimates, standard errors, t values, and p values for the fixed effect of Time. The results indicate that the participants wrote longer T-units over the two-year period, at a rate of approximately .7 words per T-unit between successive collection points. Approximately 19% of the variance in MLT scores was explained by Time spent studying English, while approximately 16.4% of the variance in MLT scores was explained by the random effects (variation across participants). Figure 1 includes plots of the MLT scores for each participant at each collection point, with the regression lines (indicated by dashed lines) produced by the model.
Table 2.
LME model predicting mean length of T-unit
Fixed Effects | Estimate | SE | t | p |
---|---|---|---|---|
(Intercept) | 9.064 | 0.784 | 11.560 | < 0.001 |
Time | 0.701 | 0.177 | 3.960 | < 0.001 |
Figure 1. Predicted and actual MLT scores over time spent studying English
Feel free to stop here in the tutorial! The rest of the information is for those who are particularly interested in this topic!
Above, we fit linear models with random intercepts but fixed slopes. By doing so, we forced the model to assume that all individuals follow the same slope (i.e., increase/decrease the number of words per T-unit at the same rate over time).
LMEs are quite flexible, however, and also allow us to fit more complicated models. For example, we can tell the model to adjust both the intercept (i.e., the starting MLT scores) and the slope (i.e., the rate and direction of change). More complicated models are often more difficult to interpret, but usually will account for more variance than simpler models. In general, we want to use models that optimize BOTH explanatory power and simplicity (more on this later).
Below, we will run a model with random intercepts AND random slopes, and then evaluate whether the more complicated model is reasonably better than the simpler one. We will almost always get a more accurate model when we add complexity, but the complexity may not always be justified (e.g., a random intercepts and slopes model is considerably harder to explain, and we only want one if it significantly increases our explanatory power). Note that we could treat Time either as a categorical variable or as a continuous variable. However, below we will only look at the continuous model.
int_slope_model <- lmer(MLT~Time + (1 + Time | Participant), data=mydata)
summary(int_slope_model) #get model summary
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: MLT ~ Time + (1 + Time | Participant)
## Data: mydata
##
## REML criterion at convergence: 246.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.1611 -0.7200 -0.2494 0.3593 3.2488
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Participant (Intercept) 0.00000 0.0000
## Time 0.09418 0.3069 NaN
## Residual 4.77437 2.1850
## Number of obs: 54, groups: Participant, 9
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 9.0637 0.6781 43.9396 13.367 < 2e-16 ***
## Time 0.7009 0.2019 18.2750 3.471 0.00268 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## Time -0.775
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
As we see above, our results look almost identical to the previous model (int_model2). The only difference is that we have statistics for one more random effect (Time) - our slope. Interestingly, once we added a random slope to the model, the variance in the intercept vanished (note also the singular-fit message at the bottom of the output). In other words, the optimal model used the same intercept for all participants, but different slopes.
Below, we get the by-participant effects, which now include adjustments for slope AND intercept (though in this case the intercept adjustment is zero for all participants). We also see that our R^2m is almost identical, while our R^2c has increased slightly.
ranef(int_slope_model)
## $Participant
## (Intercept) Time
## EFL_1 0 -0.045129524
## EFL_2 0 0.472795689
## EFL_3 0 0.069918991
## EFL_4 0 -0.251559209
## EFL_5 0 -0.332811778
## EFL_6 0 -0.005069221
## EFL_7 0 0.077786974
## EFL_8 0 0.198346542
## EFL_9 0 -0.184278463
##
## with conditional variances for "Participant"
r.squaredGLMM(int_slope_model)
## R2m R2c
## [1,] 0.1905282 0.3769378
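Combining the fixed slope with each participant's adjustment gives the participant-specific rates of change (values copied from the ranef() and summary() output above). Notably, every participant's estimated slope remains positive:

```r
# Fixed slope from the model summary plus per-participant slope adjustments
fixed_slope <- 0.7009
slope_adjustments <- c(EFL_1 = -0.045129524, EFL_2 = 0.472795689,
                       EFL_3 = 0.069918991, EFL_4 = -0.251559209,
                       EFL_5 = -0.332811778, EFL_6 = -0.005069221,
                       EFL_7 = 0.077786974, EFL_8 = 0.198346542,
                       EFL_9 = -0.184278463)

# Estimated MLT gain per collection point for each participant
participant_slopes <- fixed_slope + slope_adjustments
round(participant_slopes, 3)
```

The slopes range from about 0.368 (EFL_5) to about 1.174 (EFL_2) words per T-unit per collection point.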
When we plot our actual and predicted values, we now see that each participant has a slightly different slope. However, in this case, the differences are not extreme.
library(dplyr)
mydata <- mydata %>% mutate(int_slope_model_pred = predict(int_slope_model, re.form = ~ (1 + Time | Participant))) #include the random slopes in the predictions
g7 <- ggplot(mydata, aes(y=MLT, x=Time)) +
facet_wrap(~ Participant) +
geom_point(aes(color=Participant), show.legend = FALSE) +
geom_line(aes(color=Participant), show.legend = FALSE) +
geom_line(aes(y=int_slope_model_pred), linetype=2) +
scale_color_manual(values = colors) +
scale_y_continuous()
#print(g7)
To determine whether the more complicated model (random intercepts + slopes) is significantly better at modeling our data than the simpler one (random intercepts), we can use the “anova()” function. It indicates that there is no significant difference between our models (p = 0.648), so we might as well use the simpler one.
anova(int_model2,int_slope_model)
## Data: mydata
## Models:
## int_model2: MLT ~ Time + (1 | Participant)
## int_slope_model: MLT ~ Time + (1 + Time | Participant)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## int_model2 4 253.71 261.66 -122.85 245.71
## int_slope_model 6 256.84 268.77 -122.42 244.84 0.8668 2 0.6483
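The chi-square statistic in this comparison can be reconstructed by hand: it is twice the difference in log-likelihoods (equivalently, the difference in deviances), tested on the difference in parameter counts. A sketch using the (rounded) values printed above:

```r
# Log-likelihoods and parameter counts copied from the anova() output above
logLik_simple <- -122.85   # int_model2, 4 parameters
logLik_complex <- -122.42  # int_slope_model, 6 parameters

chisq <- 2 * (logLik_complex - logLik_simple)
df <- 6 - 4
p_value <- pchisq(chisq, df, lower.tail = FALSE)
round(c(chisq = chisq, p = p_value), 3)
```

The small discrepancies relative to the table (0.86 vs. 0.8668, and a p value of about 0.65 vs. 0.6483) come from the rounding in the printed log-likelihoods.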