Often in second language research, we want to know whether participants’ performance changes over time (perhaps in relation to a particular instructional method) using a pre-test and a post-test. Because the same individuals complete the pre-test and post-test, we do NOT meet the assumption of independence, so we cannot use an independent samples t-test. Fortunately, there are multiple methods for measuring differences in paired samples. In this tutorial, we will discuss the dependent samples t-test and the Wilcoxon signed rank test. Note that some studies actually have more than two tests (e.g., a pre-test, a post-test, and a delayed post-test). In such cases, we would need to use a repeated measures ANOVA or a linear mixed-effects model (these will be covered in upcoming tutorials).
In this tutorial, we will not be looking at a pre-/post-test design, but we will be looking at related data, namely essays written by the same individuals.
The data for this tutorial comprise concreteness scores, range scores, proportion of 1K word list words (i.e., the 1,000 most frequent words in the English language), proportion of 2K word list words (i.e., the second most frequent 1,000 words in the English language), and proportion of academic word list (AWL) words for 500 L2 English essays written by L1 users of Mandarin Chinese. Each participant wrote two essays: one in response to a “part-time job” prompt (PTJ), and the other in response to a “smoking” prompt (SMK). See the ICNALE corpus for further information about the characteristics of the learner corpus.
The essays were processed using TAALES and an in-house Python script to generate the index scores. For this tutorial, we will examine the degree to which writing prompt affects the average concreteness scores of the words used in an essay. Note that (for argumentative essays) concreteness scores tend to be negatively correlated with essay quality scores and judgements of lexical proficiency. In other words, argumentative essays that (on average) include less concrete (i.e., more abstract) words tend to earn higher scores. However, it is not clear how essay prompt affects these scores. This is an important issue in assessment because we often want to give students different versions of the “same” test, but we also want to treat scores across the two versions as equivalent. This is the issue we will examine in this tutorial.
mydata <- read.csv("data/paired_samples_data_long.csv", header = TRUE) #load the data (long format: one row per essay)
summary(mydata)
## Participant Prompt n_words Classic_1K_Tokens
## Length:500 Length:500 Min. :179.0 Min. :0.7127
## Class :character Class :character 1st Qu.:210.0 1st Qu.:0.8009
## Mode :character Mode :character Median :223.0 Median :0.8549
## Mean :230.6 Mean :0.8471
## 3rd Qu.:244.0 3rd Qu.:0.8945
## Max. :342.0 Max. :0.9665
## Classic_2K_Tokens Classic_AWL_Tokens MRC_Concreteness_AW MRC_Concreteness_CW
## Min. :0.008584 Min. :0.00000 Min. :275.1 Min. :326.4
## 1st Qu.:0.034578 1st Qu.:0.01951 1st Qu.:299.3 1st Qu.:356.6
## Median :0.072904 Median :0.03866 Median :305.1 Median :369.0
## Mean :0.082979 Mean :0.03926 Mean :305.5 Mean :369.6
## 3rd Qu.:0.130045 3rd Qu.:0.05717 3rd Qu.:311.5 3rd Qu.:380.9
## Max. :0.182320 Max. :0.09796 Max. :333.2 Max. :424.5
## SUBTLEXus_Range_AW SUBTLEXus_Range_CW
## Min. :4689 Min. :2665
## 1st Qu.:5662 1st Qu.:3718
## Median :5900 Median :4065
## Mean :5911 Mean :4129
## 3rd Qu.:6169 3rd Qu.:4540
## Max. :6936 Max. :5772
In order to ensure that our paired samples tests are conducted correctly, we will also order our data by participant.
library(dplyr) #load dplyr
mydata.2 <- arrange(mydata, Participant, Prompt) #sort by Participant, then by Prompt
head(mydata.2) #check the first few entries to make sure things are sorted correctly
## Participant Prompt n_words Classic_1K_Tokens Classic_2K_Tokens
## 1 W_CHN_001_B1_1 PTJ 262 0.8750000 0.04411765
## 2 W_CHN_001_B1_1 SMK 220 0.7762557 0.13698630
## 3 W_CHN_002_B1_1 PTJ 201 0.9014778 0.02955665
## 4 W_CHN_002_B1_1 SMK 299 0.8282828 0.11784512
## 5 W_CHN_003_B1_1 PTJ 236 0.9166667 0.02916667
## 6 W_CHN_003_B1_1 SMK 255 0.7874016 0.15354331
## Classic_AWL_Tokens MRC_Concreteness_AW MRC_Concreteness_CW SUBTLEXus_Range_AW
## 1 0.058823529 309.0459 368.2685 6023.526
## 2 0.027397260 307.6379 386.1000 5783.734
## 3 0.064039409 295.7416 342.8409 6074.050
## 4 0.026936027 291.0928 360.6436 5863.631
## 5 0.037500000 321.1571 384.7647 5901.640
## 6 0.003937008 319.6798 380.8056 5615.510
## SUBTLEXus_Range_CW
## 1 4404.554
## 2 3783.229
## 3 4233.294
## 4 3956.669
## 5 4448.389
## 6 3960.383
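Before moving on, it can be worth a quick sanity check (a minimal sketch in base R) to confirm that each participant contributed exactly one essay per prompt; if any participant is missing an essay, the pairing (and thus any paired test) will be thrown off.
#each participant should appear exactly twice (one PTJ essay, one SMK essay)
table(table(mydata.2$Participant)) #inner table counts rows per participant; outer table tabulates those counts
#and each prompt should account for half of the 500 essays
table(mydata.2$Prompt)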
One way to visualize the data is to use box plots, much like we did with our independent samples t-test. Because our data are in long format, we can map Prompt to the x-axis and ggplot will draw one box per prompt; below, we also overlay the individual data points using geom_jitter().
library(ggplot2)
library(viridis) #color-friendly palettes
g1 <- ggplot(mydata.2, aes(x=Prompt, y=MRC_Concreteness_AW, color=Prompt)) +
geom_boxplot() +
geom_jitter(shape=16, position=position_jitter(0.2), color="grey") +
#scale_color_viridis_d() #either do this or use custom colors for color friendly visualization
scale_color_manual(values = c("#377eb8", "orange"))
#print(g1)
While this view gives us a general impression of the differences between the two prompts, it DOESN’T show us differences by individual. The following plot will be a little messy because we have 250 participants. However, most studies will have far fewer participants (which results in a cleaner plot).
g2 <- ggplot(mydata.2, aes(x=Prompt, y=MRC_Concreteness_AW)) +
geom_boxplot() +
geom_point(aes(color = Participant), show.legend = FALSE) +
geom_line(aes(group = Participant, color = Participant), show.legend = FALSE) +
scale_color_viridis_d()
#print(g2)
Just for illustrative purposes, let’s pretend that our dataset only included the first 15 participants in our larger dataset. This view lets us see that SOME participants have lower concreteness scores for the SMK prompt than for the PTJ prompt, but overall, essays written in response to the part-time job prompt tend to have lower scores.
mydata.3 <- mydata.2[1:30, ] #create new dataframe with the first 30 rows from mydata.2
g3 <- ggplot(mydata.3, aes(x=Prompt,y=MRC_Concreteness_AW)) +
geom_boxplot() +
geom_point(aes(color = Participant), show.legend = FALSE) +
geom_line(aes(group = Participant, color = Participant), show.legend = FALSE)+
scale_color_viridis_d()
#print(g3)
To conduct a dependent samples t-test, we first have to check the assumptions, which are almost exactly the same as for an independent samples t-test: the dependent variable should be continuous, the data should be (approximately) normally distributed, and the two sets of scores should have roughly equal variance. The key difference is that the observations are paired (each participant contributes a score to both conditions) rather than independent.
First, we can check for normality visually. Note that we will use a “group” aesthetic in ggplot and will also add facet_wrap(), which allows us to see individual plots for each level of a categorical variable (in this case, Prompt).
g4 <- ggplot(mydata.2, aes(x = MRC_Concreteness_AW, group = Prompt, fill = Prompt))+
geom_histogram(binwidth = 2, color = "black") +#adjust bin width and outline color for bars
#scale_fill_viridis_d() + #either do this or use custom colors
scale_fill_manual(values = c("#377eb8", "orange")) + #custom colors for the fill
facet_wrap(~Prompt)
#print(g4)
We can also check this with a density plot:
g5 <- ggplot(mydata.2, aes(x = MRC_Concreteness_AW, group = Prompt, fill = Prompt))+
geom_density(alpha = 0.4)+
scale_fill_manual(values = c("#377eb8", "orange")) #custom colors for the fill
#print(g5)
Based on the histograms and density plots, do the data appear to be normally distributed?
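If we want one more visual check, a quantile-quantile (Q-Q) plot is a common option; points that fall along the reference line suggest approximate normality. Below is a minimal sketch using ggplot2’s stat_qq() and stat_qq_line() (the object name g.qq is arbitrary):
g.qq <- ggplot(mydata.2, aes(sample = MRC_Concreteness_AW, color = Prompt)) +
  stat_qq() + #plot sample quantiles against theoretical normal quantiles
  stat_qq_line() + #add a normal reference line
  scale_color_manual(values = c("#377eb8", "orange")) + #custom colors
  facet_wrap(~Prompt) #one panel per prompt
#print(g.qq)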
We can also run Shapiro-Wilk tests:
#load dplyr package, which helps us manipulate datasets:
library(dplyr)
#create a new dataframe that includes only the "smoking" (SMK) essays:
smk.ds <- mydata %>% filter(Prompt == "SMK")
#create a new dataframe that includes only the "part-time job" (PTJ) essays:
ptj.ds <- mydata %>% filter(Prompt == "PTJ")
#Test normality for MRC_Concreteness_AW in the "smoking" essays
shapiro.test(smk.ds$MRC_Concreteness_AW) #p = 0.8774
##
## Shapiro-Wilk normality test
##
## data: smk.ds$MRC_Concreteness_AW
## W = 0.99665, p-value = 0.8774
#Test normality for MRC_Concreteness_AW in the "Part-time job" essays
shapiro.test(ptj.ds$MRC_Concreteness_AW) #p = 0.02948
##
## Shapiro-Wilk normality test
##
## data: ptj.ds$MRC_Concreteness_AW
## W = 0.98758, p-value = 0.02948
According to the Shapiro-Wilk tests, are the data normally distributed? (Hint: they are for one prompt but not the other.)
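Note that, strictly speaking, the normality assumption for a dependent samples t-test concerns the within-participant difference scores rather than the raw scores in each condition. A minimal sketch of that check (this assumes mydata.2 is sorted by Participant, as above, so that the PTJ and SMK values line up row by row):
#extract each participant's scores (in matching Participant order)
ptj.scores <- mydata.2$MRC_Concreteness_AW[mydata.2$Prompt == "PTJ"]
smk.scores <- mydata.2$MRC_Concreteness_AW[mydata.2$Prompt == "SMK"]
#test the normality of the difference scores
shapiro.test(ptj.scores - smk.scores)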
We will take another look at our boxplots below to visually inspect the degree to which our two sets of scores have roughly equal variance:
library(ggplot2)
g6 <- ggplot(mydata.2, aes(x=Prompt,y=MRC_Concreteness_AW, color = Prompt)) +
geom_boxplot()+
scale_color_manual(values = c("#377eb8", "orange")) #custom colors
#print(g6)
According to the boxplots, how does the variance differ (and to what degree)?
Now, let’s run Levene’s test to determine whether the assumption of homogeneity of variance is violated:
library(car)
leveneTest(MRC_Concreteness_AW ~ Prompt, mydata) #the syntax here is variable ~ grouping variable, dataframe
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 3.7595 0.05307 .
## 498
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to Levene’s test, our data meet the assumption of equal variance given an alpha level of .05 (but just barely!). Note, however, that tests of differences are robust to violations of homogeneity of variance as long as the number of samples per group is roughly equal (as is the case here, with 250 essays per prompt).
Provided that our data meet the assumptions, we can run a dependent samples t-test to determine whether there are differences in MRC_Concreteness_AW scores across the two prompts.
t.test(MRC_Concreteness_AW~Prompt, paired = TRUE, data = mydata.2)
##
## Paired t-test
##
## data: MRC_Concreteness_AW by Prompt
## t = -4.3958, df = 249, p-value = 1.636e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.245634 -1.618281
## sample estimates:
## mean of the differences
## -2.931957
The results indicate that concreteness scores do indeed differ significantly across the two prompts (p = .00001636).
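As a side note, a paired t-test is mathematically identical to a one-sample t-test on the difference scores; the sketch below (reusing the ptj.scores and smk.scores vectors created earlier) should reproduce the same t, df, and p values:
#one-sample t-test on the within-participant differences
t.test(ptj.scores - smk.scores, mu = 0)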
Now, we will check the effect size using Cohen’s d:
library(psych)
cohen.d(mydata.2,"Prompt")
## Call: cohen.d(x = mydata.2, group = "Prompt")
## Cohen d statistic of difference between two means
## lower effect upper
## Participant -0.18 0.00 0.18
## n_words -0.41 -0.24 -0.06
## Classic_1K_Tokens -3.60 -3.27 -2.93
## Classic_2K_Tokens 4.66 5.15 5.63
## Classic_AWL_Tokens -2.75 -2.47 -2.19
## MRC_Concreteness_AW 0.14 0.32 0.49
## MRC_Concreteness_CW 0.91 1.11 1.31
## SUBTLEXus_Range_AW -1.35 -1.15 -0.94
## SUBTLEXus_Range_CW -1.68 -1.46 -1.24
##
## Multivariate (Mahalanobis) distance between groups
## [1] 6.5
## r equivalent of difference between two means
## Participant n_words Classic_1K_Tokens Classic_2K_Tokens
## 0.00 -0.12 -0.85 0.93
## Classic_AWL_Tokens MRC_Concreteness_AW MRC_Concreteness_CW SUBTLEXus_Range_AW
## -0.78 0.16 0.49 -0.50
## SUBTLEXus_Range_CW
## -0.59
If we look at the entry for MRC_Concreteness_AW, we see that our effect size (d = .32) represents a small effect.
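One caveat: psych’s cohen.d() treats the two prompts as if they were independent groups. For paired designs, some researchers instead report d_z, the mean of the difference scores divided by their standard deviation (see Cohen, 1988). A minimal sketch, again reusing the ptj.scores and smk.scores vectors from above:
#paired-design effect size (d_z): mean difference / SD of the differences
mean(ptj.scores - smk.scores) / sd(ptj.scores - smk.scores)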
If our data do not meet the assumptions of the paired samples t-test, we can use the paired samples Wilcoxon signed rank test.
This test is easy to compute in R:
wilcox.test(MRC_Concreteness_AW~Prompt, paired = TRUE, data = mydata.2)
##
## Wilcoxon signed rank test with continuity correction
##
## data: MRC_Concreteness_AW by Prompt
## V = 10656, p-value = 1.104e-05
## alternative hypothesis: true location shift is not equal to 0
Based on this test, we see that MRC_Concreteness_AW values differ significantly between the two prompts (p = .00001104).
We then compute the effect size:
library(rcompanion) #don't forget to install if you haven't already
wilcoxonR(mydata.2$MRC_Concreteness_AW,mydata.2$Prompt)
## r
## -0.161
Our effect size is small (r = -.161; we interpret its absolute value), according to Cohen’s (1988) guidelines for r.
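Note that wilcoxonR() is designed for the two-sample (Mann-Whitney) case. The rcompanion package also provides wilcoxonPairedR() for matched pairs, which may be a better fit for our design (a sketch, assuming mydata.2 is sorted so that each participant’s two essays are adjacent, as above):
#matched-pairs effect size for the signed rank test
wilcoxonPairedR(x = mydata.2$MRC_Concreteness_AW, g = mydata.2$Prompt)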
Another alternative is the Sign test. This takes a little more work to run in R, but is still fairly simple.
#create filtered versions of each variable
concreteness.smk <- mydata.2$MRC_Concreteness_AW[mydata.2$Prompt == "SMK"]
concreteness.ptj <- mydata.2$MRC_Concreteness_AW[mydata.2$Prompt == "PTJ"]
#load library
library(BSDA)
#run sign test
SIGN.test(x = concreteness.smk, y = concreteness.ptj, alternative = "two.sided", conf.level = 0.95)
##
## Dependent-samples Sign-Test
##
## data: concreteness.smk and concreteness.ptj
## S = 156, p-value = 0.0001061
## alternative hypothesis: true median difference is not equal to 0
## 95 percent confidence interval:
## 1.316834 5.096621
## sample estimates:
## median of x-y
## 3.028521
##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.9336 1.6432 4.7569
## Interpolated CI 0.9500 1.3168 5.0966
## Upper Achieved CI 0.9503 1.3110 5.1027
As we see, the Sign test also indicates that there is a significant difference in concreteness scores across the two prompts (p = .0001061).
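Under the hood, the sign test simply counts how many participants scored higher on one prompt than the other (ignoring ties) and compares that count against a binomial distribution with p = 0.5. The minimal sketch below should closely reproduce the S statistic and p-value reported above:
#count the positive (and non-tied) differences and run an exact binomial test
differences <- concreteness.smk - concreteness.ptj
binom.test(sum(differences > 0), sum(differences != 0)) #two-sided by default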
Based on the results of our study, is there a significant difference in concreteness scores across the two prompts? If so, which prompt resulted in higher concreteness scores? Why do you think this might be? If there are significant differences, can we say that the difference is meaningful? What evidence do we have (and/or not have) for/against this idea?