
Differences between two independent samples

Tutorial objectives

The objectives of this tutorial are to:

  • Learn what difference tests measure
  • Revisit important statistical terms
    • p values
    • effect sizes
  • Continue to build proficiency with creating (and interpreting) data visualizations
    • density plots
    • box plots
    • violin plots (new to this tutorial)
  • Learn the assumptions of an independent samples t-test
  • Learn how to test the assumptions of an independent samples t-test
  • Learn how to interpret the output of an independent samples t-test (p values)
  • Learn how to calculate and interpret the effect size for an independent samples t-test
  • Learn how to determine whether differences exist between two independent groups when some of the assumptions of an independent t-test are not met.

Measuring differences between two independent samples (t-test, Wilcoxon test)

In research, we often want to know whether groups differ with regard to a particular characteristic. For example, in Tutorial 2 we looked at how particular classes of vehicles differed with regard to highway fuel efficiency. In the dataset that is visualized below, we can see (for example) that there doesn’t seem to be much of a difference in highway fuel efficiency between compact and midsize vehicles. There does, however, seem to be a reasonably large difference in highway fuel efficiency between midsize vehicles and pickups.

library(ggplot2) #import ggplot2
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'pillar'
library(viridis) #color-friendly palettes
## Loading required package: viridisLite
g1 <- ggplot(data = mpg) + # create plot using the mpg data frame
  geom_boxplot(mapping = aes(x = class, y = hwy, fill=class)) + #create boxplots for each vehicle class based on highway fuel efficiency
  scale_fill_viridis(discrete = TRUE)

#print(g1)

Boxplot displaying highway fuel efficiency across different vehicle classes. Each box represents the interquartile range of efficiency, with lines extending to the highest and lowest observed values and outliers shown as individual black dots.

While visualizing data can help us identify general trends, the use of inferential statistics (such as t-tests) helps us to formally (and more precisely) discuss relationships in the data (such as differences between groups). When conducting inferential statistics, we use two primary metrics for measuring relationships in a data set, namely p-values and effect sizes. These concepts were introduced in Tutorial 3, but will be revisited (and in some cases expanded on) below. For now, our discussions will be limited to measuring differences in a particular characteristic across two groups (e.g., differences in highway fuel efficiency across midsize vehicles and pickups).

probability (p) values revisited

When using an inferential statistic to determine whether a difference exists between two groups, it is common practice to use a null hypothesis, which essentially posits that there is no difference between the two groups (e.g., two types of vehicles) with regard to the characteristic of interest (e.g., highway fuel efficiency). The null hypothesis is used purely for statistical reasons - difference tests (such as the ones we will cover in this tutorial) are designed to measure the probability that two sample distributions could actually represent the same distribution. In most cases, researchers presume that there ARE some differences between the groups (this is called the alternative hypothesis), and hope that they can reject the null hypothesis.

As alluded to in the previous paragraph, a probability value (p-value) indicates how likely we would be to observe a difference as large as the one in our samples (e.g., between highway fuel efficiency for midsize vehicles and for pickups) if the two samples actually came from the same distribution. If that probability is at or below a particular threshold (referred to as the alpha or α value), we consider the two distributions to be significantly different. In social science research, we commonly set α to p = .05. As a reminder, p = .05 means that there is only a 5% chance of observing a difference this large if the two samples actually came from the same distribution. In less precise, but perhaps easier to grasp terms, p = .05 means that there is only a 5% chance that there are no differences across two groups with regard to a particular characteristic.
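To make the role of the α threshold concrete, here is a small optional simulation (the group sizes, means, and standard deviations are arbitrary made-up values, not our dataset). When two samples truly come from the same distribution, a t-test returns p < .05 only about 5% of the time.

set.seed(42) #for reproducibility
p_values <- replicate(1000, {
  group1 <- rnorm(50, mean = 300, sd = 75) #both groups are drawn from the
  group2 <- rnorm(50, mean = 300, sd = 75) #exact same distribution
  t.test(group1, group2)$p.value
})
mean(p_values < .05) #proportion of "significant" results; should be close to 0.05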

It is important to note that the larger our samples are, the more sure we can be about how well the distributions (and specifically, the mean score) in our samples reflect the overall population distribution. This means that the larger our studies are (with regard to participants and/or observations), the more likely we are to get a small p value. In other words, there is a strong link between sample size and p value.
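The sketch below (again using arbitrary made-up values rather than our essay data) illustrates this link: the true difference in means is identical in both cases, but only the larger samples reliably produce a small p value.

set.seed(42)
small_a <- rnorm(20, mean = 320, sd = 75) #20 observations per group
small_b <- rnorm(20, mean = 300, sd = 75)
t.test(small_a, small_b)$p.value #likely well above .05

large_a <- rnorm(2000, mean = 320, sd = 75) #2000 observations per group
large_b <- rnorm(2000, mean = 300, sd = 75)
t.test(large_a, large_b)$p.value #likely far below .05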

Accordingly, we also need to use a way of measuring the differences of two groups that is independent of sample size. We refer to this type of measurement as an effect size.

effect sizes revisited

An effect size is a sample-size-independent measure of the relationship between variables. If we are looking at differences between two independent samples using a t-test (as we will later in this tutorial), the effect size indicates the difference in mean scores between the two groups. Effect sizes are almost always standardized scores, which means that they can be compared across studies.

While p values indicate how certain we can be about the observed differences between two groups, effect sizes tell us how big the differences actually are. If we are analyzing a small data set, it is entirely possible for an observed difference between groups to fail to reach statistical significance while demonstrating a large effect. Conversely, if we are analyzing a large data set, it is possible to obtain a tiny p value (i.e., find that there is almost no chance that two distributions actually represent the same distribution) while also finding that the effect size (the size of the difference) is negligible.
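As a preview of the effect size we will use later in this tutorial (Cohen's d), here is a minimal sketch of how a standardized mean difference is calculated: the difference in means divided by the pooled standard deviation. The function below assumes equal group sizes, and the variable names are purely illustrative.

cohens_d <- function(x, y) {
  pooled_sd <- sqrt((var(x) + var(y)) / 2) #pooled SD (assumes equal group sizes)
  (mean(x) - mean(y)) / pooled_sd #standardized difference in means
}

set.seed(42)
groupA <- rnorm(20, mean = 320, sd = 75)
groupB <- rnorm(20, mean = 300, sd = 75)
cohens_d(groupA, groupB) #estimates the size of the difference, regardless of sample size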

Differences between two independent groups in Applied Linguistics research

In applied linguistics research, we sometimes want to know whether two independent groups (e.g., intact classes) differ with regard to some measure (motivation, vocabulary knowledge, writing skill, etc.). We also often want to know whether one teaching method works better than another method with regard to some outcome (e.g., vocabulary test score, writing quality score, etc.). In order to determine whether two groups differ in some regard (i.e., to address the first issue outlined above), we can use an independent samples t-test (for parametric data) or a Wilcoxon test (for non-parametric data). In order to determine whether one teaching method works better than another we will need a different set of statistical tests (stay tuned!), but we could use an independent samples t-test to determine whether two groups were different with regard to some variable prior to testing a teaching method.

In this tutorial, we will be looking at argumentative essays written in response to two argumentative prompts (one about smoking in public places and the other about whether college students should have part-time jobs). Specifically, we will be determining the degree to which either prompt tends to elicit longer essays (measured in number of words per essay). In short, we will be addressing the following research question:

Do the responses to the two essay prompts (prompt A and prompt B) differ with regard to number of words?

Our null hypothesis will be that there is no difference in number of words between the two prompts.

Conducting an independent samples t-test: Assumptions

Independent samples t-tests are rather simple tests that use the sample means and the variance in each sample to determine the probability that the two samples are part of the same population.
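For the curious, here is a minimal sketch of what that calculation looks like under the hood (this is not the full t.test() implementation, just the equal-variance t statistic): the difference in sample means is divided by a standard error built from the pooled variance and the two sample sizes.

t_statistic <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  pooled_var <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2) #pooled variance
  (mean(x) - mean(y)) / sqrt(pooled_var * (1 / n1 + 1 / n2)) #difference in means / standard error
}
#applied to the essay data later in this tutorial, this reproduces the t value
#reported by t.test() with var.equal = TRUE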

Following are the assumptions for an independent samples t-test:

  • Each sample is normally distributed
  • The variance is roughly equal across samples
  • The groups are independent (e.g., the data were not generated from the same individuals)
  • The data is continuous
  • There is only one comparison (an ANOVA is appropriate for multiple comparisons, stay tuned)

Checking assumptions

First, we will load some data and check assumptions. You can download the data here. Note that if the file doesn’t download directly you can right click on your browser window and choose the “save as” option to save the file on your computer.

mydata <- read.csv("data/distribution_sample.csv", header = TRUE) #this presumes that we have a folder in our working directory named "data" and that we have a file named "distribution_sample.csv" in that folder.
summary(mydata)
##     Prompt              Score           Nwords        Frequency    
##  Length:480         Min.   :1.000   Min.   : 61.0   Min.   :2.963  
##  Class :character   1st Qu.:3.000   1st Qu.:273.0   1st Qu.:3.187  
##  Mode  :character   Median :3.500   Median :321.0   Median :3.237  
##                     Mean   :3.427   Mean   :317.7   Mean   :3.234  
##                     3rd Qu.:4.000   3rd Qu.:355.2   3rd Qu.:3.284  
##                     Max.   :5.000   Max.   :586.0   Max.   :3.489

Getting descriptive statistics

In addition to visualizing data, getting descriptive statistics can be helpful in understanding the nature of your dataset. Below we get descriptive statistics for our entire dataset, then we split our dataset into two prompts and get descriptive statistics for each.

#The first time you use the psych and dplyr libraries, you will need to install them:
#install.packages("psych")
#install.packages("dplyr")
library(psych) #many useful functions
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(dplyr) #helps us sort and filter datasets
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
describe(mydata) #get descriptive statistics
##           vars   n   mean    sd median trimmed   mad   min    max  range  skew
## Prompt*      1 480   1.50  0.50   1.50    1.50  0.74  1.00   2.00   1.00  0.00
## Score        2 480   3.43  0.89   3.50    3.43  0.74  1.00   5.00   4.00  0.00
## Nwords       3 480 317.65 78.28 321.00  316.93 57.82 61.00 586.00 525.00  0.12
## Frequency    4 480   3.23  0.07   3.24    3.23  0.07  2.96   3.49   0.53 -0.10
##           kurtosis   se
## Prompt*      -2.00 0.02
## Score        -0.62 0.04
## Nwords        1.10 3.57
## Frequency     0.54 0.00

Prompt A

#create a new dataframe that includes only responses to Prompt A:
promptA <- mydata %>% filter(Prompt == "A")
describe(promptA)
##           vars   n   mean    sd median trimmed   mad   min    max  range  skew
## Prompt*      1 240   1.00  0.00   1.00    1.00  0.00  1.00   1.00   0.00   NaN
## Score        2 240   3.38  0.86   3.50    3.39  0.74  1.00   5.00   4.00  0.06
## Nwords       3 240 324.69 79.33 325.00  325.08 57.82 61.00 558.00 497.00 -0.06
## Frequency    4 240   3.24  0.07   3.24    3.24  0.07  2.96   3.45   0.49 -0.16
##           kurtosis   se
## Prompt*        NaN 0.00
## Score        -0.67 0.06
## Nwords        1.01 5.12
## Frequency     0.42 0.00

Prompt B

#create a new dataframe that includes only responses to Prompt B:
promptB <- mydata %>% filter(Prompt == "B")
describe(promptB)
##           vars   n   mean    sd median trimmed   mad   min    max  range  skew
## Prompt*      1 240   1.00  0.00   1.00    1.00  0.00  1.00   1.00   0.00   NaN
## Score        2 240   3.47  0.91   3.50    3.48  0.74  1.00   5.00   4.00 -0.07
## Nwords       3 240 310.61 76.75 313.00  308.98 59.30 86.00 586.00 500.00  0.30
## Frequency    4 240   3.23  0.08   3.23    3.23  0.07  2.98   3.49   0.51 -0.04
##           kurtosis   se
## Prompt*        NaN 0.00
## Score        -0.60 0.06
## Nwords        1.34 4.95
## Frequency     0.56 0.00

Step 1: Checking the assumption of normality

First, we will visually inspect the data for normality using density plots. We can either display these as two plots (using facet_wrap()) or as a single plot.

library(ggplot2)

g2 <- ggplot(mydata, aes(x=Nwords, color = Prompt)) + 
  geom_density() +
  facet_wrap(~Prompt) +
  scale_color_viridis(discrete = TRUE)

#print(g2)

Density plots showing the distribution of word counts across different prompts: Prompt A in purple and Prompt B in yellow.

g3 <- ggplot(mydata, aes(x=Nwords, color=Prompt, fill=Prompt)) +
  geom_density(alpha=0.4) +
  scale_color_viridis(discrete=TRUE) + 
  scale_fill_viridis(discrete=TRUE) 

#print(g3)

Overlaid density plots displaying word count distributions for two prompts. Prompt A in purple and Prompt B in yellow, both showing tails to the left and right of the peaks.

Either way, we observe that the two datasets roughly (but certainly not perfectly) represent a normal distribution.

We can also use the (rather stringent) Shapiro-Wilk test on each dataset (Prompt A and Prompt B). As we see below, the Shapiro-Wilk test indicates that the data from both prompts significantly vary from a normal distribution.

#Test normality for Nwords in PromptA
shapiro.test(promptA$Nwords) #p = 0.001872
## 
##  Shapiro-Wilk normality test
## 
## data:  promptA$Nwords
## W = 0.98008, p-value = 0.001872
#Test normality for Nwords in PromptB
shapiro.test(promptB$Nwords) #p = 0.0005323
## 
##  Shapiro-Wilk normality test
## 
## data:  promptB$Nwords
## W = 0.9766, p-value = 0.0005323

Step 2: Checking the assumption of equal variance

Much like the assumption of normality, we can check the assumption of equal variance (usually referred to as “homogeneity of variance”) both visually and with a statistical test (e.g., Levene’s test).

We can get an idea of the variance in distribution plots, but one of the clearest ways to examine the variance is using a boxplot. Below, we see that the variance appears to be similar across the two prompts. (Note that the boxes represent the middle 50% of the data, and the line within each box represents the median value. The boxes are roughly the same size, which indicates that the variance is roughly equal.)

custom_colors <- c("orange", "#377eb8") #for color-friendly option

g4 <- ggplot(data = mydata, aes(x = Prompt, y = Nwords, fill = Prompt)) +
  geom_boxplot() +
  scale_fill_manual(values = custom_colors)

#print(g4)

Boxplots showing word counts for two prompts. Prompt A in orange and Prompt B in blue. Each boxplot central line indicates the median word count, with the box itself showing the interquartile range. Whiskers mark the range within 1.5 times the interquartile range, excluding outliers, which are plotted as black dots beyond the whiskers.

In addition to visualizing our data, we can run Levene’s test, which is available via the car package (don’t forget to install it if you haven’t installed it yet). The results below indicate that the variance in Nwords across the two prompts in our dataset is not significantly different (p = 0.769). In other words, we very clearly meet the assumption of equal variance.

#install.packages("car")
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:psych':
## 
##     logit
leveneTest(Nwords ~ Prompt, mydata) #the syntax here is: continuous variable ~ grouping variable, followed by the dataframe
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.0866 0.7687
##       478

Step 3: Other assumptions

Let’s revisit our assumptions (and whether or not we meet them):

  • Each sample is normally distributed (visually the data approaches a normal distribution, but the Shapiro-Wilk test indicates that it is not strictly normal - we could make an argument for this either way)
  • The variance is roughly equal across samples (both visual inspection and Levene’s test indicate that the variance is roughly equal)
  • The groups are independent (our data are not repeated measures - each essay was written by a different individual)
  • The data is continuous (the variable Nwords is indeed continuous)
  • There is only one comparison (yes, we are only looking at the difference in Nwords across Prompt)

So, we meet all assumptions except (possibly) the assumption of normality. Below, we will see what to do if we meet all assumptions, and an alternative test we can use if we don’t meet the assumption of normality.

Examining differences between two independent samples: Independent samples t-test

If our data meets the assumptions of a t-test, then we can use the t-test to examine differences between two independent groups (e.g., to determine whether there are differences in essay length based on prompt). Our first step is to visualize the data.

Visualizing two groups

The prototypical plot used to examine two independent groups is the boxplot. We already made one above, but we will repeat it here for good measure :).

Based on the boxplots, we see that the median number of words for Prompt A is slightly higher than the median number of words for Prompt B, though it is unclear whether this difference will be statistically significant or not. Regardless, given the overlap in the boxplots, it is unlikely that the effect will be large. But, we have inferential tests (like the t-test!) to objectively determine this.

#custom_colors <- c("orange", "#377eb8") #for color-friendly option

g5 <- ggplot(data = mydata, aes(x = Prompt, y = Nwords, fill = Prompt)) +
  geom_boxplot() +
  scale_fill_manual(values = custom_colors)

#print(g5)

Boxplots showing word counts for two prompts. Prompt A in orange and Prompt B in blue. Each boxplot central line indicates the median word count, with the box itself showing the interquartile range. Whiskers mark the range within 1.5 times the interquartile range, excluding outliers, which are plotted as black dots beyond the whiskers.

A second (arguably way cooler) way to visualize the data is with violin plots. A violin plot is similar to a boxplot except that the distribution of the data is represented more precisely. If you look at one side of the violin plot (and rotate it 90 degrees) it will resemble the density plots that we made above. In the plot below, we have created boxplots inside the violin plots.

#custom_colors <- c("orange", "#377eb8") #for color-friendly option

g6 <- ggplot(data = mydata, aes(x = Prompt, y = Nwords)) + 
  geom_violin(aes(color = Prompt)) +
  geom_boxplot(aes(fill = Prompt), width = .2) +
  scale_color_manual(values = custom_colors) +
  scale_fill_manual(values = c("white", "white")) 

#print(g6)

Violin plots overlaid with boxplots representing word counts for two prompts. The violin plot for Prompt A, outlined in orange, and for Prompt B in blue, indicating the density distribution of word counts. The central white boxplot for each prompt shows the median and interquartile range. Outliers are represented by black dots beyond the whiskers of the boxplots.

Conducting (and interpreting) an independent samples t-test

Next, we will run an independent samples t-test. Remember that if you want to learn the ins and outs of what a function in R can do, you can always use “?function_name” or “help(function_name)”.

The results indicate that there is a significant difference (p = .049) between the two prompts with regard to number of words per essay.

t.test(Nwords~Prompt, alternative = 'two.sided', conf.level = .95, var.equal = TRUE, data = mydata)
## 
##  Two Sample t-test
## 
## data:  Nwords by Prompt
## t = 1.9761, df = 478, p-value = 0.04872
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
##   0.0795708 28.0787625
## sample estimates:
## mean in group A mean in group B 
##        324.6917        310.6125

Next, we will check the effect size to examine how large the difference in number of words is between the two prompts (while taking into account the variance in scores across the samples). To do so, we will use the cohen.d() function in the psych package, which calculates the effect size measure d. d indicates the difference between the mean scores of two groups, expressed in terms of the pooled standard deviation. Accordingly, if d = .5, then the two means differ by .5 standard deviations.

#library(psych)
cohen.d(mydata,"Prompt") #this will generate results for all variables in the data
## Call: cohen.d(x = mydata, group = "Prompt")
## Cohen d statistic of difference between two means
##           lower effect upper
## Score     -0.08   0.10  0.28
## Nwords    -0.36  -0.18  0.00
## Frequency -0.22  -0.05  0.13
## 
## Multivariate (Mahalanobis) distance between groups
## [1] 0.35
## r equivalent of difference between two means
##     Score    Nwords Frequency 
##      0.05     -0.09     -0.02

The results indicate an effect of d = .18 (we can ignore the negative sign here), which means that the difference between the two means is .18 standard deviations. Cohen’s (1988) recommendations for interpreting the effect size measure d suggest that d values between .20 and .49 are “small”, values between .50 and .79 are “medium”, and values above .80 are “large”. This suggests that the difference in number of words between the two prompts is below the threshold for a “small” effect (I usually refer to this as a “negligible” effect).
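As a rough sanity check, we can reconstruct this value by hand from the descriptive statistics we calculated earlier (this sketch uses the rounded means and standard deviations, so the result is approximate).

mean_A <- 324.69 #Prompt A mean Nwords (from describe() above)
mean_B <- 310.61 #Prompt B mean Nwords
sd_A <- 79.33 #Prompt A standard deviation
sd_B <- 76.75 #Prompt B standard deviation

pooled_sd <- sqrt((sd_A^2 + sd_B^2) / 2) #group sizes are equal, so a simple average works
(mean_A - mean_B) / pooled_sd #approximately 0.18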

Writing up an independent samples t-test result

One way to write up these results is as follows:

Results

The purpose of this study was to determine whether there was a difference in essay length between two timed writing prompts. Descriptive statistics for this analysis can be found in Table 1. A visualization of the results can be found in Figure 1.

Table 1.
Number of words per essay in each prompt

Prompt         n     Mean     Standard Deviation
Prompt A       240   324.69   79.33
Prompt B       240   310.61   76.75
Full Dataset   480   317.65   78.28

Boxplots showing word counts for two prompts. Prompt A in orange and Prompt B in blue. Each boxplot central line indicates the median word count, with the box itself showing the interquartile range. Whiskers mark the range within 1.5 times the interquartile range, excluding outliers, which are plotted as black dots beyond the whiskers.

Figure 1. Box plot indicating the number of words per essay in each prompt

The assumptions for an independent samples t-test were checked. A visual inspection of density plots indicated that the distribution of the data was roughly normal. A visual inspection of density plots and box plots suggested that the variance was equal across the two prompts. A Levene’s test confirmed that there were no significant differences between the variance in each prompt. An independent samples t-test was then conducted to determine whether there were differences in the number of words per essay across the two essay prompts. The results of the t-test indicated that there was a significant (p = .049) but negligible (d = .18) difference in the number of words per essay across the two prompts. The full results can be found in Table 2.

Table 2. Results of the independent samples t-test

Variable                    n     df    t       p       d
number of words per essay   480   478   1.976   0.049   .18

Mann-Whitney U test (Wilcoxon rank-sum test)

If your data is not normally distributed, you shouldn’t use a parametric test (such as the t-test we conducted above). Instead, you should use a non-parametric test for two independent samples, such as the Mann-Whitney U test (also called the Wilcoxon rank-sum test).

Note that if your variance is roughly equal, this will test whether the medians in the two groups are equal. If your variance is not roughly equal, then this will test whether the distributions are equal. In our case, the variance IS equal (according to both visual inspection and Levene’s test), so we can interpret the results in a similar manner as an independent samples t-test.

Conducting the Mann-Whitney U test is straightforward in R. Note that your continuous variable should come first (e.g., Nwords), followed by your categorical variable (e.g., Prompt).

wilcox.test(Nwords~Prompt,data = mydata)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Nwords by Prompt
## W = 32590, p-value = 0.01263
## alternative hypothesis: true location shift is not equal to 0

The test indicated that there is a significant difference in number of words between essays written in response to Prompt A and essays written in response to Prompt B (p = .013).

Below, we will calculate the effect size. In this case, we will use the measure r, which can be interpreted using the same guidelines as correlations (Cohen, 1988 suggests that .100 = small, .300 = medium, .500 = large).

#install.packages("rcompanion") #don't forget to install if you haven't already
library(rcompanion) 
## Warning: package 'rcompanion' was built under R version 4.1.2
## 
## Attaching package: 'rcompanion'
## The following object is masked from 'package:psych':
## 
##     phi
wilcoxonR(mydata$Nwords,mydata$Prompt)
##     r 
## 0.114

As with our previous calculations, the effect size is quite low (but in this case, it meets the threshold for a “small” effect).

Time for practice!

Now, determine whether there are differences between prompts for Frequency. Don’t forget to check for assumptions!