
Paired samples difference tests

Often in second language research, we want to know if participants’ performance changes over time (perhaps in relation to a particular instructional method) using a pre-test and a post-test. Because the same individuals complete the pre-test and post-test, we do NOT meet the assumption of independence, so we cannot use an independent samples t-test. Fortunately, there are multiple methods for measuring differences in paired samples. In this tutorial, we will discuss the dependent samples t-test and the Wilcoxon signed rank test. Note that in some studies, we actually have more than two tests (e.g., a pre-test, a post-test, and a delayed post-test). In such a case, we would need to use a repeated measures ANOVA or a linear mixed-effects model (these will be covered in upcoming tutorials).
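To see why pairing matters, it helps to know that a paired t-test is mathematically just a one-sample t-test on the within-pair differences. Here is a minimal sketch with simulated pre-test/post-test scores (not the tutorial dataset):

```r
# Toy illustration (simulated scores, not the tutorial dataset): a paired
# t-test is equivalent to a one-sample t-test on the within-pair differences.
set.seed(42)
pre  <- rnorm(30, mean = 50, sd = 10)     # simulated pre-test scores
post <- pre + rnorm(30, mean = 3, sd = 5) # post-test scores (same people)

paired.res <- t.test(post, pre, paired = TRUE)
diff.res   <- t.test(post - pre, mu = 0)

# The two calls produce identical t statistics and p-values
all.equal(unname(paired.res$statistic), unname(diff.res$statistic))
```

This is also why the pairing (i.e., which pre-test score goes with which post-test score) must be preserved in the data.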

Data for this tutorial

In this tutorial, we will not be looking at a pre-test/post-test design, but we will be looking at related data, namely essays written by the same individuals.

The data for this tutorial comprise concreteness scores, range scores, proportion of 1,000 word list words (i.e., the most frequent 1,000 words in the English language), proportion of 2,000 word list words (i.e., the second most frequent 1,000 words in the English language), and proportion of academic word list (AWL) words for 500 L2 English essays written by L1 users of Mandarin Chinese. Each participant wrote two essays. One of these essays was written in response to a “part-time job” prompt (PTJ), and the other to a “smoking” prompt (SMK). See the ICNALE corpus for further information about the characteristics of the learner corpus.

The essays were processed using TAALES and an in-house Python script to generate the index scores. For this tutorial, we will be examining the degree to which writing prompt affects the average concreteness scores of the words used in an essay. Note that (for argumentative essays) concreteness scores tend to be negatively correlated with essay quality scores and judgements of lexical proficiency. In other words, argumentative essays that (on average) include less concrete words (i.e., more abstract words) tend to earn higher scores. However, it is not clear how essay prompt affects these scores. This is an important issue in assessment, because we often want to give students different versions of the “same” test, but also want to treat scores across the two versions as equivalent. This is the issue we will be examining in this tutorial.

mydata <- read.csv("data/paired_samples_data_long.csv", header = TRUE)
summary(mydata)
##  Participant           Prompt             n_words      Classic_1K_Tokens
##  Length:500         Length:500         Min.   :179.0   Min.   :0.7127   
##  Class :character   Class :character   1st Qu.:210.0   1st Qu.:0.8009   
##  Mode  :character   Mode  :character   Median :223.0   Median :0.8549   
##                                        Mean   :230.6   Mean   :0.8471   
##                                        3rd Qu.:244.0   3rd Qu.:0.8945   
##                                        Max.   :342.0   Max.   :0.9665   
##  Classic_2K_Tokens  Classic_AWL_Tokens MRC_Concreteness_AW MRC_Concreteness_CW
##  Min.   :0.008584   Min.   :0.00000    Min.   :275.1       Min.   :326.4      
##  1st Qu.:0.034578   1st Qu.:0.01951    1st Qu.:299.3       1st Qu.:356.6      
##  Median :0.072904   Median :0.03866    Median :305.1       Median :369.0      
##  Mean   :0.082979   Mean   :0.03926    Mean   :305.5       Mean   :369.6      
##  3rd Qu.:0.130045   3rd Qu.:0.05717    3rd Qu.:311.5       3rd Qu.:380.9      
##  Max.   :0.182320   Max.   :0.09796    Max.   :333.2       Max.   :424.5      
##  SUBTLEXus_Range_AW SUBTLEXus_Range_CW
##  Min.   :4689       Min.   :2665      
##  1st Qu.:5662       1st Qu.:3718      
##  Median :5900       Median :4065      
##  Mean   :5911       Mean   :4129      
##  3rd Qu.:6169       3rd Qu.:4540      
##  Max.   :6936       Max.   :5772

In order to ensure that our paired samples tests are conducted correctly, we will also order our data by participant.

library(dplyr) #load dplyr

mydata.2 <-arrange(mydata, Participant, Prompt) #sort by Participant, then by prompt
head(mydata.2) #check the first few entries to make sure things are sorted correctly
##      Participant Prompt n_words Classic_1K_Tokens Classic_2K_Tokens
## 1 W_CHN_001_B1_1    PTJ     262         0.8750000        0.04411765
## 2 W_CHN_001_B1_1    SMK     220         0.7762557        0.13698630
## 3 W_CHN_002_B1_1    PTJ     201         0.9014778        0.02955665
## 4 W_CHN_002_B1_1    SMK     299         0.8282828        0.11784512
## 5 W_CHN_003_B1_1    PTJ     236         0.9166667        0.02916667
## 6 W_CHN_003_B1_1    SMK     255         0.7874016        0.15354331
##   Classic_AWL_Tokens MRC_Concreteness_AW MRC_Concreteness_CW SUBTLEXus_Range_AW
## 1        0.058823529            309.0459            368.2685           6023.526
## 2        0.027397260            307.6379            386.1000           5783.734
## 3        0.064039409            295.7416            342.8409           6074.050
## 4        0.026936027            291.0928            360.6436           5863.631
## 5        0.037500000            321.1571            384.7647           5901.640
## 6        0.003937008            319.6798            380.8056           5615.510
##   SUBTLEXus_Range_CW
## 1           4404.554
## 2           3783.229
## 3           4233.294
## 4           3956.669
## 5           4448.389
## 6           3960.383
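After sorting, it is worth confirming that the rows really do pair up: every participant should have exactly one essay per prompt. A sketch of such a check, shown on a small mock dataframe standing in for mydata.2:

```r
# Sanity check (mock dataframe standing in for mydata.2): after sorting,
# every participant should contribute exactly one row per prompt.
mock <- data.frame(
  Participant = rep(c("W_001", "W_002", "W_003"), each = 2),
  Prompt      = rep(c("PTJ", "SMK"), times = 3)
)
pair.counts <- table(mock$Participant, mock$Prompt)
all(pair.counts == 1) # TRUE when every participant wrote one essay per prompt
```

With the real data, you would run the same `table()` check on mydata.2's Participant and Prompt columns.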

Visualizing the data

One way to visualize the data is to use box plots, much like we did with our independent samples t-test. Because our data are in long format, we can map Prompt to the x-axis and get one boxplot per prompt from a single geom_boxplot() call.

library(ggplot2)
library(viridis) #color-friendly palettes

g1<- ggplot(mydata.2, aes(x=Prompt, y=MRC_Concreteness_AW, color=Prompt)) +
  geom_boxplot() +
  geom_jitter(shape=16, position=position_jitter(0.2), color="grey") +
  #scale_color_viridis_d()  #either do this or use custom colors for color friendly visualization
  scale_color_manual(values = c("#377eb8", "orange"))

#print(g1)

Boxplots comparing two prompts, PTJ and SMK, on the MRC_Concreteness_AW score. The PTJ boxplot is on the left, while the SMK boxplot is on the right. Both boxplots are overlaid with grey jittered points that represent individual data points.

While this view gives us a general impression of the differences between the two prompts, it DOESN’T show us differences by individual. The following plot will be a little messy because we have 250 participants. However, most studies will have far fewer participants (which results in a cleaner plot).

g2 <- ggplot(mydata.2, aes(x=Prompt, y=MRC_Concreteness_AW)) +
  geom_boxplot() +
  geom_point(aes(color = Participant), show.legend = FALSE) +
  geom_line(aes(group = Participant, color = Participant), show.legend = FALSE) +
  scale_color_viridis_d()

#print(g2)

Lines comparing MRC_Concreteness_AW scores between two prompts, PTJ and SMK. The boxplot for PTJ is on the left and SMK on the right, with individual participant scores connected by lines.

Just for illustrative purposes, let’s pretend that our dataset only included the first 15 participants in our larger dataset. This view lets us see that SOME participants have lower concreteness scores for the SMK prompt than for the PTJ prompt, but overall, essays written in response to the part-time job prompt tend to have lower scores.

mydata.3 <- mydata.2[1:30, ] #create new dataframe with the first 30 rows from mydata.2

g3 <- ggplot(mydata.3, aes(x=Prompt,y=MRC_Concreteness_AW)) +
  geom_boxplot() +
  geom_point(aes(color = Participant), show.legend = FALSE) +
  geom_line(aes(group = Participant, color = Participant), show.legend = FALSE)+
  scale_color_viridis_d()

#print(g3)

Lines comparing MRC_Concreteness_AW scores for 30 participants, comparing two prompts, PTJ and SMK. The boxplot for PTJ is on the left and SMK on the right, with individual participant scores connected by lines.

Dependent samples T-test

To conduct a dependent samples T-test, we first have to check for assumptions. These assumptions are almost exactly the same as for an independent samples T-test:

  • The observations must be independent (within each sample)
  • Each sample is normally distributed, and the variance is equal across samples
  • There is only one comparison (a repeated measures ANOVA is appropriate for multiple comparisons, stay tuned)
  • The data is continuous

Testing the assumption of normality

First, we can check visually for normality. Note that we will use a “group” in ggplot and will also add “facet_wrap”, which allows us to see individual plots for each level of a categorical variable (in this case, prompt).

g4 <-  ggplot(mydata.2, aes(x = MRC_Concreteness_AW, group = Prompt, fill = Prompt))+
  geom_histogram(binwidth = 2, color = "black") +#adjust bin width and outline color for bars
  #scale_fill_viridis_d() + #either do this or use custom colors 
  scale_fill_manual(values = c("#377eb8", "orange")) +  #custom colors for the fill
  facet_wrap(~Prompt)

#print(g4)

Plots showing the distribution of MRC_Concreteness_AW scores for two different prompts, PTJ and SMK. The histogram for PTJ is on the left, while the histogram for SMK is on the right.

We can also check this with a density plot:

g5 <- ggplot(mydata.2, aes(x = MRC_Concreteness_AW, group = Prompt, fill = Prompt))+
  geom_density(alpha = 0.4)+
  scale_fill_manual(values = c("#377eb8", "orange")) #custom colors for the fill

#print(g5)

Density plots showing the distribution of MRC_Concreteness_AW scores for two different prompts, PTJ and SMK. The density for PTJ is shaded in blue and for SMK in orange, with two plots having overlapping regions.

Based on the histograms and density plots, is the data normally distributed?
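Another common visual check for normality is a quantile-quantile (Q-Q) plot, in which points falling near the reference line suggest an approximately normal distribution. A minimal sketch using simulated scores as a stand-in for one prompt's concreteness values:

```r
# Q-Q plot sketch: simulated scores stand in for one prompt's
# MRC_Concreteness_AW values. Points near the line suggest approximate
# normality. With the real data, you would pass the actual score vector
# for one prompt instead of sim.scores.
set.seed(1)
sim.scores <- rnorm(250, mean = 305, sd = 9)
qqnorm(sim.scores)
qqline(sim.scores)
```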

We can also run Shapiro-Wilk tests:

#load dplyr package, which helps us manipulate datasets:
library(dplyr)

#create a new dataframe that includes only the "smoking" (SMK) essays:
smk.ds <- mydata %>% filter(Prompt == "SMK")
#create a new dataframe that includes only the "part-time job" (PTJ) essays:
ptj.ds <- mydata %>% filter(Prompt == "PTJ")

#Test normality for MRC_Concreteness_AW in the "smoking" essays
shapiro.test(smk.ds$MRC_Concreteness_AW) #p = 0.8774
## 
##  Shapiro-Wilk normality test
## 
## data:  smk.ds$MRC_Concreteness_AW
## W = 0.99665, p-value = 0.8774
#Test normality for MRC_Concreteness_AW in the "Part-time job" essays
shapiro.test(ptj.ds$MRC_Concreteness_AW) #p = 0.02948
## 
##  Shapiro-Wilk normality test
## 
## data:  ptj.ds$MRC_Concreteness_AW
## W = 0.98758, p-value = 0.02948

According to the Shapiro-Wilk tests, is the data normally distributed? (Hint: it is for one prompt but not the other.)

Testing the assumption of equal variance (homogeneity of variance)

We will re-look at our boxplots below to visually inspect the degree to which our datasets have roughly equal variance:

library(ggplot2)

g6 <- ggplot(mydata.2, aes(x=Prompt,y=MRC_Concreteness_AW, color = Prompt)) +
  geom_boxplot()+
  scale_color_manual(values = c("#377eb8", "orange")) #custom colors

#print(g6)

Boxplots comparing two prompts, PTJ and SMK, on the MRC_Concreteness_AW score. The PTJ boxplot is on the left, while the SMK boxplot is on the right.

According to the boxplots, how does the variance differ (and to what degree)?

Now, let’s run Levene’s test to determine whether the assumption of homogeneity of variance is violated:

library(car)
leveneTest(MRC_Concreteness_AW ~ Prompt, mydata) #the syntax here is variable ~ grouping variable, dataframe
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  3.7595 0.05307 .
##       498                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to Levene’s test, our data meet the assumption of equal variance given an alpha level of .05 (but just barely!!!). Do note, though, that difference tests are robust to violations of homogeneity of variance as long as the number of samples per group is roughly equal.

Running a dependent samples t-test

Provided that our data meet the assumptions, we can run a dependent samples t-test to determine whether there are differences in MRC_Concreteness_AW scores across the two prompts.

t.test(MRC_Concreteness_AW~Prompt, paired = TRUE, data = mydata.2)
## 
##  Paired t-test
## 
## data:  MRC_Concreteness_AW by Prompt
## t = -4.3958, df = 249, p-value = 1.636e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.245634 -1.618281
## sample estimates:
## mean of the differences 
##               -2.931957

The results indicate that the differences between the two prompts are indeed significantly different (p = .00001636).
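One caveat about the call above: in recent versions of R (4.4.0 and later), the formula interface with `paired = TRUE` has been removed from `t.test()`, so the two-vector form is the safer pattern. A sketch of the equivalent call, shown here with mock data standing in for mydata.2 (sorted by Participant, as above):

```r
# Vector-based paired t-test (the formula interface with paired = TRUE is
# defunct in R >= 4.4.0). Mock data stand in for mydata.2, already sorted
# by Participant so the two vectors line up pair by pair.
set.seed(7)
mock2 <- data.frame(
  Participant         = rep(sprintf("P%03d", 1:50), each = 2),
  Prompt              = rep(c("PTJ", "SMK"), times = 50),
  MRC_Concreteness_AW = rnorm(100, mean = 305, sd = 9)
)
ptj.scores <- mock2$MRC_Concreteness_AW[mock2$Prompt == "PTJ"]
smk.scores <- mock2$MRC_Concreteness_AW[mock2$Prompt == "SMK"]
res <- t.test(x = ptj.scores, y = smk.scores, paired = TRUE)
res$parameter # df = number of pairs - 1, i.e., 49 here
```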

Now, we will check the effect size using Cohen’s d:

library(psych)
cohen.d(mydata.2,"Prompt")
## Call: cohen.d(x = mydata.2, group = "Prompt")
## Cohen d statistic of difference between two means
##                     lower effect upper
## Participant         -0.18   0.00  0.18
## n_words             -0.41  -0.24 -0.06
## Classic_1K_Tokens   -3.60  -3.27 -2.93
## Classic_2K_Tokens    4.66   5.15  5.63
## Classic_AWL_Tokens  -2.75  -2.47 -2.19
## MRC_Concreteness_AW  0.14   0.32  0.49
## MRC_Concreteness_CW  0.91   1.11  1.31
## SUBTLEXus_Range_AW  -1.35  -1.15 -0.94
## SUBTLEXus_Range_CW  -1.68  -1.46 -1.24
## 
## Multivariate (Mahalanobis) distance between groups
## [1] 6.5
## r equivalent of difference between two means
##         Participant             n_words   Classic_1K_Tokens   Classic_2K_Tokens 
##                0.00               -0.12               -0.85                0.93 
##  Classic_AWL_Tokens MRC_Concreteness_AW MRC_Concreteness_CW  SUBTLEXus_Range_AW 
##               -0.78                0.16                0.49               -0.50 
##  SUBTLEXus_Range_CW 
##               -0.59

If we look at the entry for MRC_Concreteness_AW, we see that our effect size (d = .32) represents a small effect.
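Note that `cohen.d()` here treats the two prompts as independent groups. For paired data, a common alternative (sometimes written d_z) divides the mean of the within-pair differences by the standard deviation of those differences; it is also equal to t divided by the square root of the number of pairs. A sketch with simulated paired scores standing in for the two prompts:

```r
# Paired-samples effect size d_z = mean(differences) / sd(differences),
# shown with simulated paired scores standing in for the two prompts.
set.seed(3)
a <- rnorm(250, mean = 304, sd = 9)        # e.g., PTJ-like scores
b <- a + rnorm(250, mean = 3, sd = 10)     # e.g., SMK-like scores
diffs <- b - a

d.z    <- mean(diffs) / sd(diffs)
t.stat <- t.test(b, a, paired = TRUE)$statistic

# For a paired t-test, d_z is identical to t / sqrt(number of pairs)
all.equal(unname(d.z), unname(t.stat) / sqrt(length(diffs)))
```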

Paired samples Wilcoxon signed rank test

If our data do not meet the assumptions of the paired samples T-test, we can use the paired samples Wilcoxon signed rank test.

This test is easy to compute in R:

wilcox.test(MRC_Concreteness_AW~Prompt, paired = TRUE, data = mydata.2)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  MRC_Concreteness_AW by Prompt
## V = 10656, p-value = 1.104e-05
## alternative hypothesis: true location shift is not equal to 0

Based on this test, we see that MRC_Concreteness_AW values differ significantly between the two prompts (p = .00001104).

We then compute the effect size:

library(rcompanion) #don't forget to install if you haven't already
wilcoxonR(mydata.2$MRC_Concreteness_AW,mydata.2$Prompt)
##      r 
## -0.161

Our effect size is small (r = -.161), according to Cohen’s (1988) guidelines for r.

Sign test

Another alternative is the Sign test. This takes a little more work to run in R, but is still fairly simple.

#create filtered versions of each variable
concreteness.smk = mydata.2$MRC_Concreteness_AW [mydata.2$Prompt == "SMK"] 
concreteness.ptj = mydata.2$MRC_Concreteness_AW [mydata.2$Prompt == "PTJ"]
#load library
library(BSDA)
#run sign test
SIGN.test(x = concreteness.smk, y = concreteness.ptj, alternative = "two.sided", conf.level = 0.95)
## 
##  Dependent-samples Sign-Test
## 
## data:  concreteness.smk and concreteness.ptj
## S = 156, p-value = 0.0001061
## alternative hypothesis: true median difference is not equal to 0
## 95 percent confidence interval:
##  1.316834 5.096621
## sample estimates:
## median of x-y 
##      3.028521 
## 
## Achieved and Interpolated Confidence Intervals: 
## 
##                   Conf.Level L.E.pt U.E.pt
## Lower Achieved CI     0.9336 1.6432 4.7569
## Interpolated CI       0.9500 1.3168 5.0966
## Upper Achieved CI     0.9503 1.3110 5.1027

As we see, the Sign test also indicates that there is a significant difference in concreteness scores across the two prompts (p = .0001061).
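Under the hood, the sign test is simply a binomial test: we count how many within-pair differences are positive (dropping ties) and test whether that count is consistent with a 50/50 split. A sketch of this logic with simulated paired scores standing in for the two prompts:

```r
# The sign test as a binomial test: count positive within-pair differences
# among the non-zero differences and test against p = 0.5.
# Simulated paired scores stand in for the two prompts.
set.seed(5)
x <- rnorm(40, mean = 305, sd = 9)       # e.g., SMK-like scores
y <- x + rnorm(40, mean = 3, sd = 10)    # e.g., PTJ-like scores
diffs <- x - y
diffs <- diffs[diffs != 0]               # drop ties (zero differences)
n.pos <- sum(diffs > 0)

manual.sign <- binom.test(n.pos, length(diffs), p = 0.5,
                          alternative = "two.sided")
manual.sign$p.value
```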

Follow up

Based on the results of our study, is there a significant difference in concreteness scores across the two prompts? If so, which prompt resulted in higher concreteness scores? Why do you think this might be? If there are significant differences, can we say that the difference is meaningful? What evidence do we have (and/or do not have!) for or against this idea?