Back to Homepage

Linear Mixed-Effects (LME) Models

Linear mixed effects models are mathematically and conceptually related to a linear regression (and accordingly to an ANOVA). The biggest difference between and LME and a linear regression is that an LME can adjust the line of best fit based on trajectories of particular individuals (or groups). In this tutorial, we will use linear mixed-effects models to examine the relationship between time spent learning English as an L2 and writing development (measured via an index of syntactic complexity).

We will first treat Time as a categorical variable to see the degree to which changes in syntactic complexity occur between data collection points. This analysis is related to a repeated measures ANOVA both conceptually and mathematically. However, an LME is arguably more precise than an ANOVA because it takes into account the individual trajectories of each participant.

We will then treat Time as a continuous variable to see the degree to which changes in syntactic complexity are linear (i.e., follow a reasonably consistent trajectory).

Description of sample data

This data comprises L2 English essays written over a two year period by nine middle-school aged Dutch children studying at an English/Dutch bilingual school in the Netherlands. Essays were collected three times a year (every four months) over two academic years. Included in the dataset are holistic scores for each essay (“Score”) and mean length of T-unit (MLT) values. In this tutorial, we will explore the relationship between MLT and time spent studying English, with the alternative hypothesis that MLT scores will increase as a function of time. For further reference, see Kyle, Crossley, & Verspoor (2021) and Kyle (2016). In other words, we will be attempting to determine whether (and the degree to which) Dutch EFL middle school students write more words per T-unit (a T-unit is an independent clause and all connected depedent clauses) as a function of the time they spend studying English.

mydata <- read.csv("data/RM_sample.csv", header = TRUE)
#First, we create a new variable that is the categorical version of Time
mydata$FTime <- factor(mydata$Time)
summary(mydata)
##  Participant             Time         Score           MLT         FTime
##  Length:54          Min.   :1.0   Min.   :1.00   Min.   : 6.895   1:9  
##  Class :character   1st Qu.:2.0   1st Qu.:3.00   1st Qu.: 9.438   2:9  
##  Mode  :character   Median :3.5   Median :4.00   Median :10.976   3:9  
##                     Mean   :3.5   Mean   :4.13   Mean   :11.517   4:9  
##                     3rd Qu.:5.0   3rd Qu.:5.00   3rd Qu.:12.906   5:9  
##                     Max.   :6.0   Max.   :7.00   Max.   :18.889   6:9

Look at descriptive statistics

library(psych)
#get descriptives for the MLT at each time point. "group1" is Ftime
describeBy(x=mydata$MLT, group=mydata$FTime, mat=TRUE, digits=3 ) 
##     item group1 vars n   mean    sd median trimmed   mad    min    max  range
## X11    1      1    1 9  9.697 2.043  9.792   9.697 2.162  6.895 12.750  5.855
## X12    2      2    1 9 10.364 1.654 10.059  10.364 1.693  8.476 12.824  4.347
## X13    3      3    1 9 11.474 3.779 10.357  11.474 2.848  8.158 18.250 10.092
## X14    4      4    1 9 11.404 2.719 10.600  11.404 1.291  8.579 16.182  7.603
## X15    5      5    1 9 13.321 2.556 13.438  13.321 1.062 10.312 18.889  8.576
## X16    6      6    1 9 12.843 1.811 11.857  12.843 1.596 10.625 16.100  5.475
##      skew kurtosis    se
## X11 0.259   -1.423 0.681
## X12 0.367   -1.605 0.551
## X13 0.794   -1.128 1.260
## X14 0.793   -1.043 0.906
## X15 0.793   -0.101 0.852
## X16 0.504   -1.351 0.604