# Differences between two independent samples

## Tutorial objectives

The objectives of this tutorial are to:

• Learn what difference tests measure
• Revisit important statistical terms
  • p values
  • effect sizes
• Continue to build proficiency with creating (and interpreting) data visualizations
  • density plots
  • box plots
  • violin plots (new to this tutorial)
• Learn the assumptions of an independent samples t-test
• Learn how to test the assumptions of an independent samples t-test
• Learn how to interpret the output of an independent samples t-test (p values)
• Learn how to calculate and interpret the effect size for an independent samples t-test
• Learn how to determine whether differences exist between two independent groups when some of the assumptions of an independent t-test are not met.

## Measuring differences between two independent samples (t-test, Wilcoxon test)

In research, we often want to know whether groups differ with regard to a particular characteristic. For example, in Tutorial 2 we looked at how particular classes of vehicles differed with regard to highway fuel efficiency. In the dataset that is visualized below, we can see (for example) that there doesn’t seem to be much of a difference in highway fuel efficiency between compact and midsize vehicles. There does, however, seem to be a reasonably large difference in highway fuel efficiency between midsize vehicles and pickups.

``````library(ggplot2) #import ggplot2
ggplot(data = mpg) + # create plot using the mpg data frame
  geom_boxplot(mapping = aes(x = class, y = hwy)) #create boxplots for each vehicle class based on highway fuel efficiency``````

While visualizing data can help us identify general trends, the use of inferential statistics (such as t-tests) helps us to formally (and more precisely) discuss relationships in the data (such as differences between groups). When conducting inferential statistics, we use two primary metrics for measuring relationships in a dataset, namely p values and effect sizes. These concepts were introduced in Tutorial 3, but will be revisited (and in some cases expanded on) below. For now, our discussion will be limited to measuring differences in a particular characteristic across two groups (e.g., differences in highway fuel efficiency across midsize vehicles and pickups).

### Probability (p) values revisited

When using an inferential statistic to determine whether a difference exists between two groups, it is common practice to use a null hypothesis, which essentially posits that there is no difference between the two groups (e.g., two types of vehicles) with regard to the characteristic of interest (e.g., highway fuel efficiency). The null hypothesis is used purely for statistical reasons: difference tests (such as the ones we will cover in this tutorial) are designed to measure the probability that two sample distributions could actually represent the same distribution. In most cases, researchers presume that there ARE differences between the groups (this is called the alternative hypothesis) and hope to reject the null hypothesis.

As alluded to in the previous paragraph, a probability value (p value) indicates the probability of observing a difference as large as (or larger than) the one in our samples if the two distributions of data (e.g., highway fuel efficiency for midsize vehicles and highway fuel efficiency for pickups) actually came from the same distribution. If that probability is at or below a particular threshold (referred to as the alpha or α value), we consider the two distributions to be significantly different. In social science research, we commonly set α to p = .05. As a reminder, p = .05 means that there is only a 5% chance of observing a difference this large if the two samples actually came from the same distribution. In less precise, but perhaps easier to grasp terms, p = .05 means that there is only a 5% chance that there are no differences across the two groups with regard to a particular characteristic.

It is important to note that the larger our samples are, the more sure we can be about how well the distributions (and specifically, the mean score) in our samples reflect the overall population distribution. This means that the larger our studies are (with regard to participants and/or observations), the more likely we are to get a small p value. In other words, there is a strong link between sample size and p value.
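The link between sample size and p values is easy to demonstrate with a quick simulation (the vectors below are randomly generated for illustration and are not part of any tutorial dataset): the underlying difference between the group means is held constant, and only the number of observations changes.

```r
# A small simulation illustrating the link between sample size and p values.
# The group means differ by the same (small) amount; only the sample size changes.
set.seed(42) # make the random draws reproducible

small_a <- rnorm(20, mean = 100, sd = 15)   # 20 observations per group
small_b <- rnorm(20, mean = 105, sd = 15)
t.test(small_a, small_b)$p.value  # with n = 20, often fails to reach significance

large_a <- rnorm(2000, mean = 100, sd = 15) # 2000 observations per group
large_b <- rnorm(2000, mean = 105, sd = 15)
t.test(large_a, large_b)$p.value  # with n = 2000, p will be very small
```

With the same five-point difference in means, the small samples will often fail to reach significance while the large samples yield a very small p value.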

Accordingly, we also need a way of measuring the difference between two groups that is independent of sample size. We refer to this type of measurement as an effect size.

### Effect sizes revisited

An effect size is a sample-size-independent measure of the relationship between variables. If we are looking at differences between two independent samples using a t-test (as we will later in this tutorial), the effect size indicates the size of the difference in mean scores between the two groups. Effect sizes are almost always standardized scores, which means that they can be compared across studies.

While p values indicate how certain we can be about the observed differences between two groups, effect sizes tell us how big the differences actually are. If we are analyzing a small data set, it is entirely possible for an observed difference between groups to fail to reach statistical significance while demonstrating a large effect. Conversely, if we are analyzing a large data set, it is possible to obtain a tiny p value (i.e., find that there is almost no chance that two distributions actually represent the same distribution) while also finding that the effect size (the size of the difference) is negligible.
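For an independent samples t-test, the most commonly reported effect size is Cohen's d: the difference between the two group means divided by the pooled standard deviation. The sketch below computes d by hand (the two vectors are invented for illustration, not taken from the tutorial data):

```r
# Cohen's d: (mean1 - mean2) / pooled standard deviation.
# The vectors below are invented illustration data.
group1 <- c(310, 295, 330, 342, 318, 305)
group2 <- c(280, 301, 276, 290, 311, 268)

cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # pooled standard deviation weights each group's variance by its degrees of freedom
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

cohens_d(group1, group2)
```

By Cohen's conventional benchmarks, d ≈ .2 is considered a small effect, d ≈ .5 medium, and d ≈ .8 large. Packages such as psych (which we load below) also provide ready-made effect size functions, but the arithmetic is simple enough to do by hand.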

## Differences between two independent groups in Applied Linguistics research

In applied linguistics research, we sometimes want to know whether two independent groups (e.g., intact classes) differ with regard to some measure (motivation, vocabulary knowledge, writing skill, etc.). We also often want to know whether one teaching method works better than another method with regard to some outcome (e.g., vocabulary test score, writing quality score, etc.). In order to determine whether two groups differ in some regard (i.e., to address the first issue outlined above), we can use an independent samples t-test (for parametric data) or a Wilcoxon test (for non-parametric data). In order to determine whether one teaching method works better than another we will need a different set of statistical tests (stay tuned!), but we could use an independent samples t-test to determine whether two groups were different with regard to some variable prior to testing a teaching method.

In this tutorial, we will be looking at argumentative essays written in response to two prompts (one about smoking in public places and one about whether college students should have part-time jobs). Specifically, we will be determining the degree to which either prompt tends to elicit longer essays (measured in number of words per essay). In short, we will be addressing the following research question:

Do the responses to the two essay prompts (prompt A and prompt B) differ with regard to number of words?

Our null hypothesis will be that there is no difference in number of words between the two prompts.

### Conducting an independent samples t-test: Assumptions

Independent samples t-tests are rather simple tests that use the sample means and the variance in each sample to determine the probability that the two samples were drawn from the same population.
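The arithmetic can be sketched directly. For the Welch variant (which R's t.test() uses by default), the t statistic is the difference in sample means divided by the square root of the summed variance-over-n terms. The toy vectors below are invented for illustration:

```r
# The arithmetic behind a (Welch) independent samples t-test, using toy vectors.
# R's t.test() defaults to the Welch variant, which does not assume equal variances.
x <- c(12, 15, 11, 14, 13, 16)
y <- c(10, 9, 12, 8, 11, 10)

# t = (mean difference) / (standard error of the mean difference)
t_manual <- (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))

t_builtin <- t.test(x, y)$statistic # extract t from the built-in test

all.equal(t_manual, unname(t_builtin)) # the hand calculation matches t.test()
```

Seeing the formula laid out makes it clear why both the size of the mean difference and the variance within each sample matter: a large mean difference can still yield a small t statistic if the samples are highly variable.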

Following are the assumptions for an independent samples t-test:

• Each sample is normally distributed
• The variance is roughly equal across samples
• The groups are independent (e.g., the data were not generated from the same individuals)
• The data are continuous
• There is only one comparison (an ANOVA is appropriate for multiple comparisons, stay tuned)

### Checking assumptions

First, we will load some data and check assumptions. You can download the data here. Note that if the file doesn’t download directly you can right click on your browser window and choose the “save as” option to save the file on your computer.

``````mydata <- read.csv("data/distribution_sample.csv", header = TRUE) #this presumes that we have a folder in our working directory named "data" and that we have a file named "distribution_sample.csv" in that folder
summary(mydata)``````
``````##     Prompt              Score           Nwords        Frequency
##  Length:480         Min.   :1.000   Min.   : 61.0   Min.   :2.963
##  Class :character   1st Qu.:3.000   1st Qu.:273.0   1st Qu.:3.187
##  Mode  :character   Median :3.500   Median :321.0   Median :3.237
##                     Mean   :3.427   Mean   :317.7   Mean   :3.234
##                     3rd Qu.:4.000   3rd Qu.:355.2   3rd Qu.:3.284
##                     Max.   :5.000   Max.   :586.0   Max.   :3.489``````

#### Getting descriptive statistics

In addition to visualizing data, getting descriptive statistics can be helpful in understanding the nature of your dataset. Below we get descriptive statistics for our entire dataset, then we split our dataset by prompt and get descriptive statistics for each.

``````#The first time you use the psych and dplyr libraries, you will need to install them:
#install.packages("psych")
#install.packages("dplyr")
library(psych) #many useful functions``````
``````##
## Attaching package: 'psych'``````
``````## The following objects are masked from 'package:ggplot2':
##
##     %+%, alpha``````
``library(dplyr) #helps us sort and filter datasets``
``````##
## Attaching package: 'dplyr'``````
``````## The following objects are masked from 'package:stats':
##
##     filter, lag``````
``````## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union``````
``describe(mydata) #get descriptive statistics``
``````##           vars   n   mean    sd median trimmed   mad   min    max  range  skew
## Prompt*      1 480   1.50  0.50   1.50    1.50  0.74  1.00   2.00   1.00  0.00
## Score        2 480   3.43  0.89   3.50    3.43  0.74  1.00   5.00   4.00  0.00
## Nwords       3 480 317.65 78.28 321.00  316.93 57.82 61.00 586.00 525.00  0.12
## Frequency    4 480   3.23  0.07   3.24    3.23  0.07  2.96   3.49   0.53 -0.10
##           kurtosis   se
## Prompt*      -2.00 0.02
## Score        -0.62 0.04
## Nwords        1.10 3.57
## Frequency     0.54 0.00``````

Prompt A

``````#create a new dataframe that includes only responses to Prompt A:
promptA <- mydata %>% filter(Prompt == "A")
describe(promptA)``````
``````##           vars   n   mean    sd median trimmed   mad   min    max  range  skew
## Prompt*      1 240   1.00  0.00   1.00    1.00  0.00  1.00   1.00   0.00   NaN
## Score        2 240   3.38  0.86   3.50    3.39  0.74  1.00   5.00   4.00  0.06
## Nwords       3 240 324.69 79.33 325.00  325.08 57.82 61.00 558.00 497.00 -0.06
## Frequency    4 240   3.24  0.07   3.24    3.24  0.07  2.96   3.45   0.49 -0.16
##           kurtosis   se
## Prompt*        NaN 0.00
## Score        -0.67 0.06
## Nwords        1.01 5.12
## Frequency     0.42 0.00``````

Prompt B

``````#create a new dataframe that includes only responses to Prompt B:
promptB <- mydata %>% filter(Prompt == "B")
describe(promptB)``````
``````##           vars   n   mean    sd median trimmed   mad   min    max  range  skew
## Prompt*      1 240   1.00  0.00   1.00    1.00  0.00  1.00   1.00   0.00   NaN
## Score        2 240   3.47  0.91   3.50    3.48  0.74  1.00   5.00   4.00 -0.07
## Nwords       3 240 310.61 76.75 313.00  308.98 59.30 86.00 586.00 500.00  0.30
## Frequency    4 240   3.23  0.08   3.23    3.23  0.07  2.98   3.49   0.51 -0.04
##           kurtosis   se
## Prompt*        NaN 0.00
## Score        -0.60 0.06
## Nwords        1.34 4.95
## Frequency     0.56 0.00``````

#### Step 1: Check for normality

First, we will visually inspect the data for normality using density plots. We can either display these as two plots (using facet_wrap()) or as a single plot.

``````library(ggplot2)
ggplot(mydata, aes(x = Nwords, color = Prompt)) +
  geom_density() +
  facet_wrap(~Prompt)``````
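Visual inspection can also be supplemented with a formal check such as the Shapiro-Wilk test (shapiro.test() in base R), where a small p value indicates evidence against normality. With our tutorial data, the same call would be applied to promptA$Nwords and promptB$Nwords; the sketch below uses simulated vectors so that it runs on its own:

```r
# A formal complement to density plots: the Shapiro-Wilk normality test.
# Simulated data are used here for illustration; in the tutorial, the same
# call would be applied to promptA$Nwords and promptB$Nwords.
set.seed(1)
normal_sample <- rnorm(240, mean = 320, sd = 78) # roughly mirrors the Nwords descriptives
skewed_sample <- rexp(240, rate = 1 / 320)       # clearly non-normal

shapiro.test(normal_sample)$p.value # for truly normal data, p is typically well above .05
shapiro.test(skewed_sample)$p.value # for heavily skewed data, p is extremely small
```

Keep in mind the sample size caveat from earlier in this tutorial: with very large samples, the Shapiro-Wilk test can flag trivially small departures from normality, so it is best interpreted alongside the density plots rather than in place of them.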