
Paired samples difference tests

Often in second language research, we want to know if participants’ performance changes over time (perhaps in relation to a particular instructional method) using a pre-test and a post-test. Because the same individuals complete the pre-test and the post-test, we do NOT meet the assumption of independence, so we cannot use an independent samples t-test. Fortunately, there are multiple methods for measuring differences in paired samples. In this tutorial, we will discuss the dependent samples t-test and the Wilcoxon signed rank test. Note that in some studies, we actually have more than two tests (e.g., a pre-test, a post-test, and a delayed post-test). In such a case, we would need to use a repeated measures ANOVA or a linear mixed-effects model (these will be covered in upcoming tutorials).
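
As a rough illustration of what "paired" means in practice, the sketch below uses invented pre-test and post-test scores for six hypothetical learners (these are NOT the tutorial data). It relies only on base R's t.test() and wilcox.test() with paired = TRUE, and shows that the dependent samples t-test is equivalent to a one-sample t-test on the within-person differences.

```r
# Hypothetical pre/post scores for six learners (invented for illustration)
pre  <- c(52, 61, 47, 70, 58, 65)
post <- c(55, 66, 49, 76, 62, 72)

# Dependent (paired) samples t-test: works on within-person differences
paired_res <- t.test(post, pre, paired = TRUE)

# The same test written as a one-sample t-test on the differences
diff_res <- t.test(post - pre, mu = 0)

# Nonparametric alternative: Wilcoxon signed rank test
wilcox_res <- wilcox.test(post, pre, paired = TRUE)

paired_res$statistic  # identical to diff_res$statistic
diff_res$statistic
wilcox_res$p.value
```

The key point is that both tests ignore how much learners differ from one another and focus only on how much each learner changes.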

Data for this tutorial

In this tutorial, we will not be looking at a pre-post test design, but we will be looking at related data, namely essays written by the same individuals.

The data for this tutorial comprise concreteness scores, range scores, proportion of 1,000 word list words (i.e., the most frequent 1,000 words in the English language), proportion of 2,000 word list words (i.e., the second most frequent 1,000 words in the English language), and proportion of academic word list (AWL) words for 500 L2 English essays written by L1 users of Mandarin Chinese. Each participant wrote two essays. One of these essays was written in response to a “part-time job” prompt (PTJ), and the other to a “smoking” prompt (SMK). See the ICNALE corpus for further information about the characteristics of the learner corpus.

The essays were processed using TAALES and another in-house Python script to generate the index scores. For this tutorial, we will be examining the degree to which writing prompt affects the average concreteness scores of the words used in an essay. Note that (for argumentative essays) concreteness scores tend to be negatively correlated with essay quality score and judgements of lexical proficiency. In other words, argumentative essays that (on average) include less concrete words (i.e., more abstract words) tend to earn higher scores. However, it is not clear how essay prompt affects these scores. This is an important issue in assessment, because we often want to give students different versions of the “same” test, but also want to treat scores across these two versions as equal. This issue is what we will be examining in this tutorial.

mydata <- read.csv("data/paired_samples_data_long.csv", header = TRUE)
summary(mydata)
##  Participant           Prompt             n_words      Classic_1K_Tokens
##  Length:500         Length:500         Min.   :179.0   Min.   :0.7127   
##  Class :character   Class :character   1st Qu.:210.0   1st Qu.:0.8009   
##  Mode  :character   Mode  :character   Median :223.0   Median :0.8549   
##                                        Mean   :230.6   Mean   :0.8471   
##                                        3rd Qu.:244.0   3rd Qu.:0.8945   
##                                        Max.   :342.0   Max.   :0.9665   
##  Classic_2K_Tokens  Classic_AWL_Tokens MRC_Concreteness_AW MRC_Concreteness_CW
##  Min.   :0.008584   Min.   :0.00000    Min.   :275.1       Min.   :326.4      
##  1st Qu.:0.034578   1st Qu.:0.01951    1st Qu.:299.3       1st Qu.:356.6      
##  Median :0.072904   Median :0.03866    Median :305.1       Median :369.0      
##  Mean   :0.082979   Mean   :0.03926    Mean   :305.5       Mean   :369.6      
##  3rd Qu.:0.130045   3rd Qu.:0.05717    3rd Qu.:311.5       3rd Qu.:380.9      
##  Max.   :0.182320   Max.   :0.09796    Max.   :333.2       Max.   :424.5      
##  SUBTLEXus_Range_AW SUBTLEXus_Range_CW
##  Min.   :4689       Min.   :2665      
##  1st Qu.:5662       1st Qu.:3718      
##  Median :5900       Median :4065      
##  Mean   :5911       Mean   :4129      
##  3rd Qu.:6169       3rd Qu.:4540      
##  Max.   :6936       Max.   :5772
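
Note that summary() pools the essays from both prompts. Since we are interested in differences between prompts, descriptive statistics by prompt can be more informative. The following is a minimal sketch using base R's aggregate() on a small invented data frame (the column names mirror mydata, but the values are made up); with the real data, toy would be replaced by mydata.

```r
# Toy data in the same long format as mydata (values are invented)
toy <- data.frame(
  Prompt = c("PTJ", "SMK", "PTJ", "SMK", "PTJ", "SMK"),
  MRC_Concreteness_AW = c(309.0, 307.6, 295.7, 291.1, 321.2, 319.7)
)

# Mean concreteness score for each prompt
by_prompt <- aggregate(MRC_Concreteness_AW ~ Prompt, data = toy, FUN = mean)
by_prompt

# Standard deviation for each prompt
sd_by_prompt <- aggregate(MRC_Concreteness_AW ~ Prompt, data = toy, FUN = sd)
sd_by_prompt
```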

In order to ensure that our paired samples tests are conducted correctly, we will also order our data by participant.

library(dplyr) #load dplyr
mydata.2 <- arrange(mydata, Participant, Prompt) #sort by Participant, then by Prompt
head(mydata.2) #check the first few entries to make sure things are sorted correctly
##      Participant Prompt n_words Classic_1K_Tokens Classic_2K_Tokens
## 1 W_CHN_001_B1_1    PTJ     262         0.8750000        0.04411765
## 2 W_CHN_001_B1_1    SMK     220         0.7762557        0.13698630
## 3 W_CHN_002_B1_1    PTJ     201         0.9014778        0.02955665
## 4 W_CHN_002_B1_1    SMK     299         0.8282828        0.11784512
## 5 W_CHN_003_B1_1    PTJ     236         0.9166667        0.02916667
## 6 W_CHN_003_B1_1    SMK     255         0.7874016        0.15354331
##   Classic_AWL_Tokens MRC_Concreteness_AW MRC_Concreteness_CW SUBTLEXus_Range_AW
## 1        0.058823529            309.0459            368.2685           6023.526
## 2        0.027397260            307.6379            386.1000           5783.734
## 3        0.064039409            295.7416            342.8409           6074.050
## 4        0.026936027            291.0928            360.6436           5863.631
## 5        0.037500000            321.1571            384.7647           5901.640
## 6        0.003937008            319.6798            380.8056           5615.510
##   SUBTLEXus_Range_CW
## 1           4404.554
## 2           3783.229
## 3           4233.294
## 4           3956.669
## 5           4448.389
## 6           3960.383
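
Paired tests assume that every participant contributes exactly one observation per condition, so it can be worth confirming this before going further. The sketch below does so with base R's table() on a small invented data frame (the column names mirror mydata.2, but the values are made up); with the real data, toy would be replaced by mydata.2.

```r
# Toy data in the same long format as mydata.2 (values are invented)
toy <- data.frame(
  Participant = c("P01", "P01", "P02", "P02", "P03", "P03"),
  Prompt      = c("PTJ", "SMK", "PTJ", "SMK", "PTJ", "SMK"),
  MRC_Concreteness_AW = c(309.0, 307.6, 295.7, 291.1, 321.2, 319.7)
)

# Cross-tabulate participants by prompt: every cell should be exactly 1
pair_counts <- table(toy$Participant, toy$Prompt)
pair_counts

# TRUE only if each participant has one PTJ essay and one SMK essay
all_paired <- all(pair_counts == 1)
all_paired
```

If all_paired were FALSE, some participants would have missing or duplicated essays, and the rows of the two prompts would no longer line up.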

Visualizing the data

One way to visualize the data is to use box plots, much like we did with our independent samples t-test. Because our data are in long format (one row per essay, with the prompt stored in the Prompt column), we can map Prompt to the x-axis and ggplot2 will draw one box per prompt.

library(ggplot2)
ggplot(mydata.2, aes(x=Prompt,y=MRC_Concreteness_AW, color = Prompt)) +
  geom_boxplot() +
  geom_jitter(shape=16, position=position_jitter(0.2), color = "grey")

While this view gives us a general impression of the differences between the two prompts, it DOESN’T show us differences by individual. The following plot will be a little messy because we have 250 participants. However, most studies will have far fewer participants (which results in a cleaner plot).

ggplot(mydata.2, aes(x=Prompt,y=MRC_Concreteness_AW)) +
  geom_boxplot() +
  geom_point(aes(color = Participant), show.legend = FALSE) +
  geom_line(aes(group = Participant, color = Participant), show.legend = FALSE)
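
Each line in the plot above corresponds to one within-participant difference, and those differences can also be computed and summarized directly. The following is a minimal sketch on a small invented data frame (the column names mirror mydata.2, but the values are made up). It assumes every participant has exactly one essay per prompt, so that sorting by Participant lines the two prompts up row by row.

```r
# Toy data in the same long format as mydata.2 (values are invented)
toy <- data.frame(
  Participant = c("P01", "P01", "P02", "P02", "P03", "P03", "P04", "P04"),
  Prompt      = c("PTJ", "SMK", "PTJ", "SMK", "PTJ", "SMK", "PTJ", "SMK"),
  MRC_Concreteness_AW = c(309.0, 307.6, 295.7, 291.1, 321.2, 319.7, 300.4, 303.0)
)

# Sort so that both prompt subsets are in the same participant order
toy <- toy[order(toy$Participant, toy$Prompt), ]

# Pull out the scores for each prompt
ptj <- toy$MRC_Concreteness_AW[toy$Prompt == "PTJ"]
smk <- toy$MRC_Concreteness_AW[toy$Prompt == "SMK"]

# Within-participant difference (SMK minus PTJ)
diffs <- smk - ptj
summary(diffs)

# Proportion of participants whose SMK essay was less concrete than their PTJ essay
prop_lower <- mean(diffs < 0)
prop_lower
```

A histogram of diffs (e.g., hist(diffs)) is another quick way to see whether most participants shift in the same direction.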