Often in second language research, we want to know whether participants’ performance changes over time (perhaps in relation to a particular instructional method), which we measure with a pre-test and a post-test. Because the same individuals complete the pre-test and post-test, we do NOT meet the assumption of independence, so we cannot use an independent samples t-test. Fortunately, there are multiple methods for measuring differences in paired samples. In this tutorial, we will discuss the dependent samples t-test and the Wilcoxon signed rank test. Note that in some studies, we actually have more than two tests (e.g., a pre-test, a post-test, and a delayed post-test). In such a case, we would need to use a repeated measures ANOVA or a linear mixed-effects model (these will be covered in upcoming tutorials).
In this tutorial, we will not be looking at a pre-post test design, but we will be looking at related data, namely essays written by the same individuals.
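To make the paired logic concrete, here is a minimal sketch using invented pre-test and post-test scores for six hypothetical learners (these numbers are not from the tutorial data). It assumes only base R's `t.test()` and `wilcox.test()`, and it shows that a dependent samples t-test is the same as a one-sample t-test on the within-person differences.

```r
# Hypothetical scores for six learners (invented for illustration only)
pre  <- c(62, 70, 55, 81, 66, 74)
post <- c(68, 73, 59, 80, 73, 79)

# Dependent (paired) samples t-test: is the mean within-person difference zero?
paired_t <- t.test(post, pre, paired = TRUE)

# The same test written as a one-sample t-test on the differences
diff_t <- t.test(post - pre, mu = 0)

# Nonparametric alternative: Wilcoxon signed rank test
paired_w <- wilcox.test(post, pre, paired = TRUE)

# The two t statistics are identical
paired_t$statistic
diff_t$statistic
```

Because both t-tests work on the same six difference scores, they report the same t value and degrees of freedom (n - 1 = 5). An independent samples t-test would ignore the pairing.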
The data for this tutorial comprise concreteness scores, range scores, proportion of 1,000 word list words (i.e., the most frequent 1,000 words in the English language), proportion of 2,000 word list words (i.e., the second most frequent 1,000 words in the English language), and proportion of academic word list (AWL) words for 500 L2 English essays written by L1 users of Mandarin Chinese. Each of the 250 participants wrote two essays. One of these essays was written in response to a “part-time job” prompt (PTJ), and the other to a “smoking” prompt (SMK). See the ICNALE corpus for further information about the characteristics of the learner corpus.
The essays were processed using TAALES and another in-house Python script to generate the index scores. For this tutorial, we will be examining the degree to which writing prompt affects the average concreteness scores of the words used in an essay. Note that (for argumentative essays) concreteness scores tend to be negatively correlated with essay quality score and judgements of lexical proficiency. In other words, argumentative essays that (on average) include less concrete words (i.e., more abstract words) tend to earn higher scores. However, it is not clear how essay prompt affects these scores. This is an important issue in assessment, because we often want to give students different versions of the “same” test, but also want to treat scores across these two versions as equal. This issue is what we will be examining in this tutorial.
mydata <- read.csv("data/paired_samples_data_long.csv", header = TRUE)
summary(mydata)
## Participant Prompt n_words Classic_1K_Tokens
## Length:500 Length:500 Min. :179.0 Min. :0.7127
## Class :character Class :character 1st Qu.:210.0 1st Qu.:0.8009
## Mode :character Mode :character Median :223.0 Median :0.8549
## Mean :230.6 Mean :0.8471
## 3rd Qu.:244.0 3rd Qu.:0.8945
## Max. :342.0 Max. :0.9665
## Classic_2K_Tokens Classic_AWL_Tokens MRC_Concreteness_AW MRC_Concreteness_CW
## Min. :0.008584 Min. :0.00000 Min. :275.1 Min. :326.4
## 1st Qu.:0.034578 1st Qu.:0.01951 1st Qu.:299.3 1st Qu.:356.6
## Median :0.072904 Median :0.03866 Median :305.1 Median :369.0
## Mean :0.082979 Mean :0.03926 Mean :305.5 Mean :369.6
## 3rd Qu.:0.130045 3rd Qu.:0.05717 3rd Qu.:311.5 3rd Qu.:380.9
## Max. :0.182320 Max. :0.09796 Max. :333.2 Max. :424.5
## SUBTLEXus_Range_AW SUBTLEXus_Range_CW
## Min. :4689 Min. :2665
## 1st Qu.:5662 1st Qu.:3718
## Median :5900 Median :4065
## Mean :5911 Mean :4129
## 3rd Qu.:6169 3rd Qu.:4540
## Max. :6936 Max. :5772
In order to ensure that our paired samples tests are conducted correctly, we will also order our data by participant.
library(dplyr) #load dplyr
mydata.2 <- arrange(mydata, Participant, Prompt) #sort by Participant, then by prompt
head(mydata.2) #check the first few entries to make sure things are sorted correctly
## Participant Prompt n_words Classic_1K_Tokens Classic_2K_Tokens
## 1 W_CHN_001_B1_1 PTJ 262 0.8750000 0.04411765
## 2 W_CHN_001_B1_1 SMK 220 0.7762557 0.13698630
## 3 W_CHN_002_B1_1 PTJ 201 0.9014778 0.02955665
## 4 W_CHN_002_B1_1 SMK 299 0.8282828 0.11784512
## 5 W_CHN_003_B1_1 PTJ 236 0.9166667 0.02916667
## 6 W_CHN_003_B1_1 SMK 255 0.7874016 0.15354331
## Classic_AWL_Tokens MRC_Concreteness_AW MRC_Concreteness_CW SUBTLEXus_Range_AW
## 1 0.058823529 309.0459 368.2685 6023.526
## 2 0.027397260 307.6379 386.1000 5783.734
## 3 0.064039409 295.7416 342.8409 6074.050
## 4 0.026936027 291.0928 360.6436 5863.631
## 5 0.037500000 321.1571 384.7647 5901.640
## 6 0.003937008 319.6798 380.8056 5615.510
## SUBTLEXus_Range_CW
## 1 4404.554
## 2 3783.229
## 3 4233.294
## 4 3956.669
## 5 4448.389
## 6 3960.383
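Sorting only helps if every participant actually has one essay per prompt. The following is a small sketch of how that could be checked. It uses a toy data frame (`toy`, a hypothetical stand-in built from the first rows of the output above) rather than the full data, and base R's `table()`. The same lines should work on `mydata.2` if its columns are named as shown.

```r
# Toy stand-in for mydata.2 (values copied from the head() output above)
toy <- data.frame(
  Participant = rep(c("W_CHN_001_B1_1", "W_CHN_002_B1_1", "W_CHN_003_B1_1"), each = 2),
  Prompt = rep(c("PTJ", "SMK"), times = 3),
  MRC_Concreteness_AW = c(309.0459, 307.6379, 295.7416, 291.0928, 321.1571, 319.6798)
)

# Cross-tabulate participants by prompt: every cell should be exactly 1
pair_table <- table(toy$Participant, toy$Prompt)

# Participants with a missing or duplicated essay for either prompt
incomplete <- rownames(pair_table)[rowSums(pair_table != 1) > 0]
length(incomplete) # 0 means every participant has a complete PTJ/SMK pair
```

If `incomplete` is not empty, those participants would need to be inspected or removed before running a paired test, because the test assumes that the two sets of scores line up person by person.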
One way to visualize the data is to use box plots, much like we did with our independent samples t-test. Because our data are in long format (one row per essay, with the prompt stored in the Prompt column), we can map Prompt to the x-axis and ggplot will draw one box per prompt.
library(ggplot2)
ggplot(mydata.2, aes(x=Prompt,y=MRC_Concreteness_AW, color = Prompt)) +
geom_boxplot() +
geom_jitter(shape=16, position=position_jitter(0.2), color = "grey")
While this view gives us a general impression of the differences between the two prompts, it DOESN’T show us differences by individual. The following plot will be a little messy because we have 250 participants. However, most studies will have far fewer participants (which results in a cleaner plot).
ggplot(mydata.2, aes(x=Prompt,y=MRC_Concreteness_AW)) +
geom_boxplot() +
geom_point(aes(color = Participant), show.legend = FALSE) +
geom_line(aes(group = Participant, color = Participant), show.legend = FALSE)
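Another way to look at individual differences is to compute one difference score per participant (SMK minus PTJ), since these difference scores are what paired tests evaluate. The sketch below again uses a toy data frame (`toy`, hypothetical, built from the head() output shown earlier) and base R only. The direction of subtraction is an arbitrary choice made for illustration.

```r
# Toy stand-in for mydata.2 (values copied from the head() output above)
toy <- data.frame(
  Participant = rep(c("W_CHN_001_B1_1", "W_CHN_002_B1_1", "W_CHN_003_B1_1"), each = 2),
  Prompt = rep(c("PTJ", "SMK"), times = 3),
  MRC_Concreteness_AW = c(309.0459, 307.6379, 295.7416, 291.0928, 321.1571, 319.6798)
)

# Sort so that the PTJ and SMK rows line up participant by participant
toy <- toy[order(toy$Participant, toy$Prompt), ]
ptj <- toy[toy$Prompt == "PTJ", ]
smk <- toy[toy$Prompt == "SMK", ]

# Guard against misaligned pairs before subtracting
stopifnot(identical(ptj$Participant, smk$Participant))

# One difference score per participant (SMK minus PTJ)
diffs <- data.frame(
  Participant = ptj$Participant,
  Difference = smk$MRC_Concreteness_AW - ptj$MRC_Concreteness_AW
)
diffs
```

In this toy subset, all three differences are negative, meaning these writers used slightly less concrete words in the SMK essay. With the full data, a histogram or box plot of the Difference column would show how consistent that pattern is across participants.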