The objectives of this tutorial are to:
In research, we often want to know whether groups differ with regard to a particular characteristic. For example, in Tutorial 2 we looked at how particular classes of vehicles differed with regard to highway fuel efficiency. In the dataset that is visualized below, we can see (for example) that there doesn’t seem to be much of a difference in highway fuel efficiency between compact and midsize vehicles. There does, however, seem to be a reasonably large difference in highway fuel efficiency between midsize vehicles and pickups.
library(ggplot2) #import ggplot2
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'pillar'
library(viridis) #color-friendly palettes
## Loading required package: viridisLite
g1 <- ggplot(data = mpg) + # create plot using the mpg data frame
geom_boxplot(mapping = aes(x = class, y = hwy, fill=class)) + #create boxplots for each vehicle class based on highway fuel efficiency
scale_fill_viridis(discrete = TRUE)
#print(g1)
While visualizing data can help us identify general trends, the use of inferential statistics (such as t-tests) helps us to formally (and more precisely) discuss relationships in the data (such as differences between groups). When conducting inferential statistics, we use two primary metrics for measuring relationships in a data set, namely p-values and effect sizes. These concepts were introduced in Tutorial 3, but will be revisited (and in some cases expanded on) below. For now, our discussions will be limited to measuring differences in a particular characteristic across two groups (e.g., differences in highway fuel efficiency across midsize vehicles and pickups).
When using an inferential statistic to determine whether a difference exists between two groups, it is common practice to use a null hypothesis, which essentially posits that there is no difference between the two groups (e.g., two types of vehicles) with regard to the characteristic of interest (e.g., highway fuel efficiency). The null hypothesis is used purely for statistical reasons: difference tests (such as the ones we will cover in this tutorial) are designed to measure how likely it would be to observe a difference as large as the one in our samples if the two samples actually came from the same distribution. In most cases, researchers presume that there are some differences between the groups (this is called the alternative hypothesis), and hope that they can reject the null hypothesis.
As alluded to in the previous paragraph, a probability value (p-value) indicates the probability of observing a difference at least as large as the one we found if the two distributions of data (e.g., highway fuel efficiency for midsize vehicles and highway fuel efficiency for pickups) actually came from the same distribution. If that probability is at or below a particular threshold (referred to as the alpha or α value), we consider the two distributions to be significantly different. In social science research, we commonly set α to .05. As a reminder, p = .05 means that, if the two groups really came from the same distribution, we would see a difference at least this large only 5% of the time. In less precise, but perhaps easier to grasp terms, a small p-value means that the observed difference would be surprising if there were really no difference between the two groups with regard to a particular characteristic.
It is important to note that the larger our samples are, the more sure we can be about how well the distributions (and specifically, the mean score) in our samples reflect the overall population distribution. This means that the larger our studies are (with regard to participants and/or observations), the more likely we are to get a small p value. In other words, there is a strong link between sample size and p value.
Accordingly, we also need to use a way of measuring the differences of two groups that is independent of sample size. We refer to this type of measurement as an effect size.
An effect size is a measure of the relationship between variables that is independent of sample size. If we are looking at differences between two independent samples using a t-test (as we will later in this tutorial), the effect size indicates the difference in mean scores between two groups. Effect sizes are almost always standardized scores, which means that they can be compared across studies.
While p values indicate how certain we can be about the observed differences between two groups, effect sizes tell us how big the differences actually are. If we are analyzing a small data set, it is entirely possible for an observed difference between groups to fail to reach statistical significance while demonstrating a large effect. Conversely, if we are analyzing a large data set, it is possible to obtain a tiny p value (i.e., find that a difference this large would almost never occur if the two distributions were actually the same) while also finding that the effect size (the size of the difference) is negligible.
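The link between sample size and p value can be sketched with a few lines of base R. This is only an illustration under simplifying assumptions (two groups of equal size and equal variance), and `p_from_d()` is a hypothetical helper written for this example, not a function from any package. It converts a fixed standardized difference d into the t statistic and two-sided p value it would produce at a given sample size.

```r
# p value implied by a fixed standardized difference d with n observations per group
# (assumes equal group sizes and equal variances; illustrative helper only)
p_from_d <- function(d, n_per_group) {
  t_stat <- d * sqrt(n_per_group / 2)   # t = d / sqrt(1/n + 1/n)
  df <- 2 * n_per_group - 2
  2 * pt(-abs(t_stat), df)              # two-sided p value
}

# The same tiny effect (d = .10) evaluated at increasingly large sample sizes
sample_sizes <- c(20, 100, 500, 5000)
p_tiny_effect <- sapply(sample_sizes, function(n) p_from_d(d = 0.10, n))
data.frame(n_per_group = sample_sizes, d = 0.10, p = signif(p_tiny_effect, 3))

# A large effect (d = .80) in a very small study (5 per group)
p_large_effect_small_n <- p_from_d(d = 0.80, n_per_group = 5)
p_large_effect_small_n
```

The effect size is held constant in the first comparison, so any change in p is driven by sample size alone. The last line shows the opposite case: a large effect that does not reach p < .05 because the study is very small.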
In applied linguistics research, we sometimes want to know whether two independent groups (e.g., intact classes) differ with regard to some measure (motivation, vocabulary knowledge, writing skill, etc.). We also often want to know whether one teaching method works better than another method with regard to some outcome (e.g., vocabulary test score, writing quality score, etc.). In order to determine whether two groups differ in some regard (i.e., to address the first issue outlined above), we can use an independent samples t-test (for parametric data) or a Wilcoxon test (for non-parametric data). In order to determine whether one teaching method works better than another we will need a different set of statistical tests (stay tuned!), but we could use an independent samples t-test to determine whether two groups were different with regard to some variable prior to testing a teaching method.
In this tutorial, we will be looking at argumentative essays written in response to two argumentative prompts (one about smoking in public places and the second about whether college students should have part time jobs). Specifically, we will be determining the degree to which either prompt tends to elicit longer essays (measured in number of words per essay). In short, we will be addressing the following research question:
Do the responses to the two essay prompts (prompt A and prompt B) differ with regard to number of words?
Our null hypothesis will be that there is no difference in number of words between the two prompts.
Independent samples t-tests are rather simple tests that use the sample means and the variance in each sample to estimate how likely the observed difference in means would be if the two samples were part of the same population.
Following are the assumptions for an independent samples t-test:
First, we will load some data and check assumptions. You can download the data here. Note that if the file doesn’t download directly you can right click on your browser window and choose the “save as” option to save the file on your computer.
mydata <- read.csv("data/distribution_sample.csv", header = TRUE) #this presumes that we have a folder in our working directory named "data" and that we have a file named "distribution_sample.csv" in that folder.
summary(mydata)
## Prompt Score Nwords Frequency
## Length:480 Min. :1.000 Min. : 61.0 Min. :2.963
## Class :character 1st Qu.:3.000 1st Qu.:273.0 1st Qu.:3.187
## Mode :character Median :3.500 Median :321.0 Median :3.237
## Mean :3.427 Mean :317.7 Mean :3.234
## 3rd Qu.:4.000 3rd Qu.:355.2 3rd Qu.:3.284
## Max. :5.000 Max. :586.0 Max. :3.489
In addition to visualizing data, getting descriptive statistics can be helpful in understanding the nature of your dataset. Below we get descriptive statistics for our entire dataset, then we split our dataset by prompt and get descriptive statistics for each prompt.
#The first time you use the psych and dplyr libraries, you will need to install them:
#install.packages("psych")
#install.packages("dplyr")
library(psych) #many useful functions
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(dplyr) #helps us sort and filter datasets
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
describe(mydata) #get descriptive statistics
## vars n mean sd median trimmed mad min max range skew
## Prompt* 1 480 1.50 0.50 1.50 1.50 0.74 1.00 2.00 1.00 0.00
## Score 2 480 3.43 0.89 3.50 3.43 0.74 1.00 5.00 4.00 0.00
## Nwords 3 480 317.65 78.28 321.00 316.93 57.82 61.00 586.00 525.00 0.12
## Frequency 4 480 3.23 0.07 3.24 3.23 0.07 2.96 3.49 0.53 -0.10
## kurtosis se
## Prompt* -2.00 0.02
## Score -0.62 0.04
## Nwords 1.10 3.57
## Frequency 0.54 0.00
Prompt A
#create a new dataframe that includes only responses to Prompt A:
promptA <- mydata %>% filter(Prompt == "A")
describe(promptA)
## vars n mean sd median trimmed mad min max range skew
## Prompt* 1 240 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN
## Score 2 240 3.38 0.86 3.50 3.39 0.74 1.00 5.00 4.00 0.06
## Nwords 3 240 324.69 79.33 325.00 325.08 57.82 61.00 558.00 497.00 -0.06
## Frequency 4 240 3.24 0.07 3.24 3.24 0.07 2.96 3.45 0.49 -0.16
## kurtosis se
## Prompt* NaN 0.00
## Score -0.67 0.06
## Nwords 1.01 5.12
## Frequency 0.42 0.00
Prompt B
#create a new dataframe that includes only responses to Prompt B:
promptB <- mydata %>% filter(Prompt == "B")
describe(promptB)
## vars n mean sd median trimmed mad min max range skew
## Prompt* 1 240 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN
## Score 2 240 3.47 0.91 3.50 3.48 0.74 1.00 5.00 4.00 -0.07
## Nwords 3 240 310.61 76.75 313.00 308.98 59.30 86.00 586.00 500.00 0.30
## Frequency 4 240 3.23 0.08 3.23 3.23 0.07 2.98 3.49 0.51 -0.04
## kurtosis se
## Prompt* NaN 0.00
## Score -0.60 0.06
## Nwords 1.34 4.95
## Frequency 0.56 0.00
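The per-prompt summaries above can also be produced in a single step. The sketch below uses base R's `aggregate()` on a small simulated stand-in for the tutorial data; the `demo` data frame and its values are made up for illustration and only share the column names `Prompt` and `Nwords` with `mydata`. With the real data, `demo` would simply be replaced by `mydata`.

```r
# Hypothetical stand-in for mydata (same column names, simulated values)
set.seed(42)
demo <- data.frame(
  Prompt = rep(c("A", "B"), each = 30),
  Nwords = round(rnorm(60, mean = 320, sd = 78))
)

# n, mean, and standard deviation of Nwords for each prompt
group_stats <- aggregate(
  Nwords ~ Prompt, data = demo,
  FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x))
)

# aggregate() returns the three statistics as one matrix column; flatten it
group_stats <- do.call(data.frame, group_stats)
group_stats
```

This gives one row per prompt, which is convenient when building a descriptive statistics table for a write-up. The `group_by()` and `summarise()` functions in dplyr offer an equivalent approach.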
First, we will visually inspect the data for normality using density plots. We can either display these as two plots (using facet_wrap()) or as a single plot.
library(ggplot2)
g2 <- ggplot(mydata, aes(x=Nwords, color = Prompt)) +
geom_density() +
facet_wrap(~Prompt) +
scale_color_viridis(discrete = TRUE)
#print(g2)
g3 <- ggplot(mydata, aes(x=Nwords, color=Prompt, fill=Prompt)) +
geom_density(alpha=0.4) +
scale_color_viridis(discrete=TRUE) +
scale_fill_viridis(discrete=TRUE)
#print(g3)
Either way, we observe that the two datasets roughly (but certainly not perfectly) represent a normal distribution.
We can also use the (rather stringent) Shapiro-Wilk test on each dataset (Prompt A and Prompt B). As we see below, the Shapiro-Wilk test indicates that the data from both prompts differ significantly from a normal distribution.
#Test normality for Nwords in PromptA
shapiro.test(promptA$Nwords) #p = 0.001872
##
## Shapiro-Wilk normality test
##
## data: promptA$Nwords
## W = 0.98008, p-value = 0.001872
#Test normality for Nwords in PromptB
shapiro.test(promptB$Nwords) #p = 0.0005323
##
## Shapiro-Wilk normality test
##
## data: promptB$Nwords
## W = 0.9766, p-value = 0.0005323
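If you would rather not create a separate data frame for each prompt, the same test can be run per group with `tapply()`. The sketch below is one way to do this and uses a simulated stand-in (`demo`, a hypothetical data frame with made-up values) so that it runs on its own; with the tutorial data you would substitute `mydata`.

```r
# Hypothetical stand-in for mydata (same column names, simulated values)
set.seed(42)
demo <- data.frame(
  Prompt = rep(c("A", "B"), each = 30),
  Nwords = round(rnorm(60, mean = 320, sd = 78))
)

# Run shapiro.test() on Nwords within each prompt and keep only the p values
shapiro_p <- tapply(demo$Nwords, demo$Prompt,
                    function(x) shapiro.test(x)$p.value)
round(shapiro_p, 3)
```

The result is a named vector with one p value per prompt, which makes it easy to compare groups at a glance.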
Much like the assumption of normality, we can check the assumption of equal variance (usually referred to as “homogeneity of variance”) both visually and with a statistical test (e.g., Levene’s test).
We can get an idea of the variance in distribution plots, but one of the clearest ways to examine the variance is using a boxplot. Below, we see that the variance appears to be similar across the two prompts. (Note that the boxes represent the middle 50% of the data, and the line within each box represents the median value. The boxes are roughly the same size, which indicates that the variance is roughly equal.)
custom_colors <- c("orange", "#377eb8") #for color-friendly option
g4 <- ggplot(data = mydata, aes(x = Prompt, y = Nwords, fill = Prompt)) +
geom_boxplot() +
scale_fill_manual(values = custom_colors)
#print(g4)
In addition to visualizing our data, we can run Levene’s test, which is available via the car package (don’t forget to install it if you haven’t installed it yet). The results below indicate that the variance in Nwords across the two prompts in our dataset is not significantly different (p = 0.769). In other words, we very clearly meet the assumption of equal variance.
#install.packages("car")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
leveneTest(Nwords ~ Prompt, mydata) #the syntax here is variable, grouping variable, dataframe
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.0866 0.7687
## 478
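To make the logic of the test less of a black box, the median-centered version of Levene’s test shown above can be reproduced with base R: take each essay’s absolute distance from its own group median, then run a one-way ANOVA on those distances. The sketch below does this on a simulated stand-in (`demo`, a hypothetical data frame with made-up values). It also stores `Prompt` as a factor from the start, which should avoid the “group coerced to factor” warning if the same is done to `mydata` before calling `leveneTest()`.

```r
# Hypothetical stand-in for mydata (same column names, simulated values)
set.seed(42)
demo <- data.frame(
  Prompt = factor(rep(c("A", "B"), each = 30)),  # stored as a factor up front
  Nwords = round(rnorm(60, mean = 320, sd = 78))
)

# Absolute deviation of each essay from its own prompt's median
demo$abs_dev <- abs(demo$Nwords - ave(demo$Nwords, demo$Prompt, FUN = median))

# One-way ANOVA on those deviations: the F test here is the Levene statistic
levene_by_hand <- anova(lm(abs_dev ~ Prompt, data = demo))
levene_by_hand
```

A large F value (and small p value) would mean that essays in one prompt sit systematically farther from their median than essays in the other, i.e., that the spread differs between groups.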
Let’s revisit our assumptions (and whether or not we meet them):
So, we meet all assumptions except (possibly) the assumption of normality. Below, we will see what to do if we meet all assumptions, and an alternative test we can use if we don’t meet the assumption of normality.
If our data meets the assumptions of a t-test, then we can use the t-test to examine differences between two independent groups (e.g., to determine whether there are differences in essay length based on prompt). Our first step is to visualize the data.
The prototypical plot used to examine two independent groups is the boxplot. We already made one above, but we will repeat it here for good measure :).
Based on the boxplots, we see that the median number of words for Prompt A is slightly higher than the median number of words for Prompt B, though it is unclear whether this difference will be statistically significant or not. Regardless, given the overlap in the boxplots, it is unlikely that the effect will be large. But, we have inferential tests (like the t-test!) to objectively determine this.
#custom_colors <- c("orange", "#377eb8") #for color-friendly option
g5 <- ggplot(data = mydata, aes(x = Prompt, y = Nwords, fill = Prompt)) +
geom_boxplot() +
scale_fill_manual(values = custom_colors)
#print(g5)
A second (arguably way cooler) way to visualize the data is with violin plots. A violin plot is similar to a boxplot except that the distribution of the data is represented more precisely. If you look at one side of the violin plot (and rotate it 90 degrees) it will resemble the density plots that we made above. In the plot below, we have created boxplots inside the violin plots.
#custom_colors <- c("orange", "#377eb8") #for color-friendly option
g6 <- ggplot(data = mydata, aes(x = Prompt, y = Nwords)) +
geom_violin(aes(color = Prompt)) +
geom_boxplot(aes(fill = Prompt), width = .2) +
scale_color_manual(values = custom_colors) +
scale_fill_manual(values = c("white", "white"))
#print(g6)
Conducting (and interpreting) an independent samples t-test
Next, we will run an independent samples t-test. Remember that if you want to learn the ins and outs of what a function in R can do, you can always use “?function_name” or “help(function_name)”.
The results indicate that there is a significant difference (p = .049) between the two prompts with regard to number of words per essay.
t.test(Nwords~Prompt, alternative = 'two.sided', conf.level = .95, var.equal = TRUE, data = mydata)
##
## Two Sample t-test
##
## data: Nwords by Prompt
## t = 1.9761, df = 478, p-value = 0.04872
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
## 0.0795708 28.0787625
## sample estimates:
## mean in group A mean in group B
## 324.6917 310.6125
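To see where these numbers come from, the test can be recomputed from the descriptive statistics alone. The sketch below plugs in the group sizes, means, and standard deviations reported earlier in this tutorial and applies the standard equal-variance (pooled) formulas. Because the standard deviations are rounded to two decimals, the results should land very close to, but not exactly on, the output above.

```r
# Summary statistics reported earlier in this tutorial
n_a <- 240; mean_a <- 324.6917; sd_a <- 79.33
n_b <- 240; mean_b <- 310.6125; sd_b <- 76.75

# Pooled variance: each group's variance weighted by its degrees of freedom
df <- n_a + n_b - 2
pooled_var <- ((n_a - 1) * sd_a^2 + (n_b - 1) * sd_b^2) / df

# Standard error of the difference in means, then t and the two-sided p value
se_diff <- sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_stat <- (mean_a - mean_b) / se_diff
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for the difference in means
ci <- (mean_a - mean_b) + c(-1, 1) * qt(0.975, df) * se_diff

round(c(t = t_stat, df = df, p = p_value, ci_lower = ci[1], ci_upper = ci[2]), 4)
```

Note how the standard error shrinks as the group sizes grow, which is exactly why larger samples produce larger t values (and smaller p values) for the same difference in means.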
Next, we will check the effect size to examine how large the difference in number of words is between the two prompts (while taking into account the variance in scores across the samples). To do so, we will use the cohen.d() function in the psych package, which calculates the effect size measure d. d indicates the difference between the mean scores of two groups, expressed in terms of the pooled standard deviation. Accordingly, if d = .5, then the two means differ by .5 standard deviations.
#library(psych)
cohen.d(mydata,"Prompt") #this will generate results for all variables in the data
## Call: cohen.d(x = mydata, group = "Prompt")
## Cohen d statistic of difference between two means
## lower effect upper
## Score -0.08 0.10 0.28
## Nwords -0.36 -0.18 0.00
## Frequency -0.22 -0.05 0.13
##
## Multivariate (Mahalanobis) distance between groups
## [1] 0.35
## r equivalent of difference between two means
## Score Nwords Frequency
## 0.05 -0.09 -0.02
The results indicate an effect of d = .18 (we can ignore the negative sign here), which means that the difference between the two means is .18 standard deviations. Cohen’s (1988) recommendations for interpreting the effect size measure d suggest that d values between .20 and .49 are “small”, values between .50 and .79 are “medium”, and values of .80 and above are “large”. This suggests that the difference in number of words between prompts is below the threshold for a “small” effect (I usually refer to this as a “negligible” effect).
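As a sanity check, d can be computed by hand as the difference in means divided by the pooled standard deviation. The sketch below uses the rounded descriptive statistics from this tutorial; `interpret_d()` is a hypothetical helper written for this example that applies the Cohen (1988) labels described above. The psych package may compute its estimate slightly differently, so treat this as an approximation of its output rather than a reimplementation.

```r
# Descriptive statistics reported earlier in this tutorial
n_a <- 240; mean_a <- 324.69; sd_a <- 79.33
n_b <- 240; mean_b <- 310.61; sd_b <- 76.75

# Pooled standard deviation, then the standardized mean difference
pooled_sd <- sqrt(((n_a - 1) * sd_a^2 + (n_b - 1) * sd_b^2) / (n_a + n_b - 2))
d <- (mean_a - mean_b) / pooled_sd

# Illustrative helper: label |d| using the thresholds described above
interpret_d <- function(d) {
  as.character(cut(abs(d),
                   breaks = c(0, 0.2, 0.5, 0.8, Inf),
                   labels = c("negligible", "small", "medium", "large"),
                   right = FALSE))  # intervals are [lower, upper)
}

round(d, 2)
interpret_d(d)
```

The sign of d only reflects which group is subtracted from which (here, A minus B), which is why it can be ignored when describing the size of the effect.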
One way to write up these results is as follows:
Results
The purpose of this study was to determine whether there was a difference in essay length between two timed writing prompts. Descriptive statistics for this analysis can be found in Table 1. A visualization of the results can be found in Figure 1.
Table 1.
Number of words per essay in each prompt
Prompt | n | Mean | Standard Deviation |
---|---|---|---|
Prompt A | 240 | 324.69 | 79.33 |
Prompt B | 240 | 310.61 | 76.75 |
Full Dataset | 480 | 317.65 | 78.28 |
Figure 1. Box plot indicating the number of words per essay in each prompt
Assumptions for an independent samples t-test were checked. A visual inspection of density plots indicated that the distribution of the data was roughly normal. A visual inspection of density plots and box plots suggested that the variance was equal across the two prompts. A Levene’s test confirmed that there were no significant differences between the variance in each prompt. An independent samples t-test was then conducted to determine whether there were differences in the number of words per essay across the two essay prompts. The results of the t-test indicated that there was a significant (p = .049) but negligible (d = .18) difference in the number of words per essay across the two prompts. The full results can be found in Table 2.
Table 2. Results of the independent samples t-test
Variable | n | df | t | p | d |
---|---|---|---|---|---|
number of words per essay | 480 | 478 | 1.976 | 0.049 | .18 |
If your data is not normally distributed, you shouldn’t use a parametric test (such as the t-test we conducted above). Instead, you should use a non-parametric alternative to the independent samples t-test, such as the Mann-Whitney U test (also called a Wilcoxon rank-sum test).
Note that if your variance is roughly equal, this will test whether the medians in the two groups are equal. If your variance is not roughly equal, then this will test whether the distributions are equal. In our case, the variance IS equal (according to both visual inspection and Levene’s test), so we can interpret the results in a similar manner as an independent samples t-test.
Conducting the Mann-Whitney U test is straightforward in R. Note that your continuous variable should come first (e.g., Nwords), followed by your categorical variable (e.g., Prompt).
wilcox.test(Nwords~Prompt,data = mydata)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Nwords by Prompt
## W = 32590, p-value = 0.01263
## alternative hypothesis: true location shift is not equal to 0
The test indicated that there is a significant difference in number of words between essays written in response to Prompt A and essays written in response to Prompt B (p = .013).
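The W statistic in the output above is based on ranks rather than raw values, which is what makes the test robust to non-normal data. The sketch below illustrates this with a tiny made-up example (the word counts are hypothetical and contain no ties): all essays are ranked together, the ranks for the first group are summed, and the smallest possible rank sum for that group is subtracted.

```r
# Hypothetical word counts for two small groups (no tied values)
words_a <- c(310, 355, 290, 402)
words_b <- c(280, 300, 265, 330)

# Rank all essays together, then sum the ranks that belong to group A
all_ranks <- rank(c(words_a, words_b))
rank_sum_a <- sum(all_ranks[seq_along(words_a)])

# W is group A's rank sum minus its smallest possible value, n(n + 1) / 2
n_a <- length(words_a)
w_by_hand <- rank_sum_a - n_a * (n_a + 1) / 2

# Compare with the statistic reported by wilcox.test()
w_from_r <- unname(wilcox.test(words_a, words_b)$statistic)
c(by_hand = w_by_hand, wilcox.test = w_from_r)
```

W ranges from 0 (every value in the first group is lower than every value in the second) to n_A × n_B (every value is higher), so values near the middle of that range indicate heavily overlapping groups.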
Below, we will calculate the effect size. In this case, we will use the measure r, which can be interpreted using the same guidelines as correlations (Cohen, 1988, suggests that .10 = small, .30 = medium, .50 = large).
#install.packages("rcompanion") #don't forget to install if you haven't already
library(rcompanion)
## Warning: package 'rcompanion' was built under R version 4.1.2
##
## Attaching package: 'rcompanion'
## The following object is masked from 'package:psych':
##
## phi
wilcoxonR(mydata$Nwords,mydata$Prompt)
## r
## 0.114
As with our previous calculations, the effect size is quite low (but in this case, it meets the threshold for a “small” effect).
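The measure r is commonly defined as the standardized test statistic Z divided by the square root of the total number of observations. As a rough check on the value above, Z can be approximated from the two-sided p value reported by `wilcox.test()`. This is an approximation for illustration only: it is an assumption that this matches how rcompanion works internally, and that package may obtain Z in a different way, so small discrepancies are possible.

```r
# Values taken from the wilcox.test() output earlier in this tutorial
p_reported <- 0.01263   # two-sided p value
n_total <- 480          # total number of essays across both prompts

# Approximate |Z| from the two-sided p value, then convert to r
z_approx <- qnorm(1 - p_reported / 2)
r_approx <- z_approx / sqrt(n_total)
round(r_approx, 3)
```

Because r divides by the square root of the sample size, it stays small here even though the p value is well below .05, which again highlights the difference between statistical significance and effect size.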
Now, determine whether there are differences between prompts for Frequency. Don’t forget to check for assumptions!