In applied linguistics research, we sometimes want to know whether two independent groups (e.g., intact classes) differ with regard to some measure (motivation, vocabulary knowledge, writing skill, etc.). We also often want to know whether one teaching method works better than another with regard to some outcome (e.g., vocabulary test score, writing quality score, etc.). To determine whether two groups differ in some regard (i.e., to address the first issue outlined above), we can use an independent samples t-test (for parametric data) or a Wilcoxon test (for non-parametric data). To determine whether one teaching method works better than another, we will need a different set of statistical tests (stay tuned!), but we could use an independent samples t-test to determine whether two groups were similar with regard to some variable prior to testing a teaching method.

In this tutorial, we will be looking at argumentative essays written in response to two prompts and determining whether the essays differ with regard to number of words. In short, we will be addressing the following research question:

Do the responses to the two essay prompts (prompt A and prompt B) differ with regard to number of words?

Our null hypothesis will be that there is no difference in number of words between the two prompts.

Independent samples t-tests are rather simple tests that use the sample means and the variance in each sample to estimate the probability of observing a difference in means at least this large if the two samples were drawn from the same population.
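To make that description concrete, here is a minimal sketch of the computation: the t statistic is just the difference in sample means scaled by a standard error built from the sample variances. The data below are simulated (not the essay dataset used later in this tutorial), and all object names are illustrative.

```
# Sketch of what an independent samples t-test computes, using
# simulated data (not the essay dataset used in this tutorial)
set.seed(42)
group1 <- rnorm(30, mean = 300, sd = 50)  # hypothetical word counts
group2 <- rnorm(30, mean = 320, sd = 50)

# Student's t statistic: difference in means divided by the pooled
# standard error (this version assumes equal variances)
n1 <- length(group1); n2 <- length(group2)
pooled_var <- ((n1 - 1) * var(group1) + (n2 - 1) * var(group2)) / (n1 + n2 - 2)
t_manual <- (mean(group1) - mean(group2)) / sqrt(pooled_var * (1/n1 + 1/n2))

# R's built-in t.test() gives the same statistic when var.equal = TRUE
t_builtin <- t.test(group1, group2, var.equal = TRUE)$statistic
all.equal(unname(t_builtin), t_manual)  # TRUE
```

The larger the absolute value of t (relative to the sample sizes), the less likely it is that the two samples come from the same population.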

Following are the assumptions for an independent samples t-test:

- Each sample is normally distributed
- The variance is roughly equal across samples
- The data do not represent repeated measures (e.g., pre- and post- test scores from the same individuals)
- The data are continuous
- There is only one comparison (an ANOVA is appropriate for multiple comparisons, stay tuned)

Let’s load some data (this is the dataset that we used in class on Wednesday) and check assumptions.

```
mydata <- read.csv("data/distribution_sample.csv", header = TRUE)
summary(mydata)
```

```
##  Prompt     Score           Nwords        Frequency    
##  A:240   Min.   :1.000   Min.   : 61.0   Min.   :2.963  
##  B:240   1st Qu.:3.000   1st Qu.:273.0   1st Qu.:3.187  
##          Median :3.500   Median :321.0   Median :3.237  
##          Mean   :3.427   Mean   :317.7   Mean   :3.234  
##          3rd Qu.:4.000   3rd Qu.:355.2   3rd Qu.:3.284  
##          Max.   :5.000   Max.   :586.0   Max.   :3.489  
```

First, we will visually inspect the data using histograms. The histograms below suggest that both distributions are roughly (but not perfectly) normal.

`library(ggplot2)`

```
ggplot(mydata, aes(x = Nwords)) +
  geom_histogram(binwidth = 20) +
  facet_wrap(~ Prompt)
```

Alternatively, we could use density plots, which show similar information to histograms but add smoothing lines. Again, the plots indicate that both distributions are roughly (but not perfectly) normal.

```
ggplot(mydata, aes(x = Nwords, color = Prompt, fill = Prompt)) +
  geom_density(alpha = 0.4)
```

We can also use the (rather stringent) Shapiro-Wilk test. As we see below, the Shapiro-Wilk test indicates that the data from both prompts deviate significantly from a normal distribution.

```
#load dplyr package, which helps us manipulate datasets:
library(dplyr)
```

```
##
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
## 
##     filter, lag
```

```
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

```
#create a new dataframe that includes only responses to Prompt A:
promptA <- mydata %>% filter(Prompt == "A")
#create a new dataframe that includes only responses to Prompt B:
promptB <- mydata %>% filter(Prompt == "B")
#Test normality for Nwords in PromptA
shapiro.test(promptA$Nwords) #p = 0.001872
```

```
##
## Shapiro-Wilk normality test
##
## data: promptA$Nwords
## W = 0.98008, p-value = 0.001872
```

```
#Test normality for Nwords in PromptB
shapiro.test(promptB$Nwords) #p = 0.0005323
```

```
##
## Shapiro-Wilk normality test
##
## data: promptB$Nwords
## W = 0.9766, p-value = 0.0005323
```

Much like the assumption of normality, we can check the assumption of equal variance (usually referred to as “homogeneity of variance”) both visually and with a statistical test (e.g., Levene’s test).

We can get an idea of the variance from distribution plots, but one of the clearest ways to examine variance is with a boxplot. Below, we see that the variance appears to be similar across the two prompts. (Note: the boxes represent the middle 50% of the data, and the line within each box represents the median value. The boxes are roughly the same size, which indicates that the variance is roughly equal.)

```
ggplot(data = mydata) +
  geom_boxplot(mapping = aes(x = Prompt, y = Nwords))
```

In addition to visualizing our data, we can run Levene’s test, which is available via the car package. The results below indicate that the variance in Nwords across the two prompts is not significantly different (*p* = 0.769). In other words, we very clearly meet the assumption of equal variance.

`library(car)`

`## Loading required package: carData`

```
##
## Attaching package: 'car'
```

```
## The following object is masked from 'package:dplyr':
##
## recode
```

`leveneTest(Nwords ~ Prompt, mydata) #the syntax here is outcome variable ~ grouping variable, dataframe`

```
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.0866 0.7687
##       478
```

Let’s revisit our assumptions (and whether or not we meet them):

- Each sample is normally distributed (visually, the data approach a normal distribution, but the Shapiro-Wilk test indicates that they are not strictly normal)
- The variance is roughly equal across samples (both visual inspection and Levene’s test indicate that the variance is roughly equal)
- The data do not represent repeated measures (our data are not repeated measures - each essay was written by a different individual)
- The data are continuous (the variable Nwords is indeed continuous)
- There is only one comparison (yes, we are only looking at differences in Nwords across the two prompts)

So, we meet all assumptions except (possibly) the assumption of normality. Below, we will see what to do if we meet all assumptions, and an alternative test we can use if we don’t meet the assumption of normality.
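As a preview of that alternative, the Wilcoxon test mentioned at the outset is available in base R as `wilcox.test()`. The sketch below uses simulated skewed data rather than our essay data, and all object names are illustrative.

```
# Sketch of a Wilcoxon rank-sum test, the non-parametric alternative
# for two independent groups; simulated skewed data stand in for
# the essay word counts
set.seed(1)
groupA <- rexp(40, rate = 1/300)  # hypothetical skewed word counts
groupB <- rexp(40, rate = 1/330)

result <- wilcox.test(groupA, groupB)
result$p.value  # compare against the conventional alpha of .05
```

Because the test works on ranks rather than raw values, it does not require normally distributed data.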

If our data meet the assumptions of a t-test, then we can use the t-test to examine differences between two independent groups (e.g., to determine whether there are differences in essay length based on prompt). Our first step is to visualize the data.

The prototypical plot used to examine two independent groups is the boxplot. We already made one above, but we will repeat it here for good measure (with one additional parameter so it looks a little nicer) :).

Based on the boxplots, we see that the median number of words for Prompt A is slightly higher than the median number of words for Prompt B, though it is unclear whether this difference is statistically significant. Regardless, given the overlap in the boxplots, it is unlikely that the effect will be large. But we have inferential tests (like the t-test!) to determine this objectively.

```
ggplot(data = mydata) +
  geom_boxplot(mapping = aes(x = Prompt, y = Nwords, color = Prompt))
```
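Since the overlap in the boxplots suggests that any effect will be small, we could also quantify the size of the difference with a standardized effect size such as Cohen’s d (this is not part of the output above; the sketch below uses simulated data standing in for the word counts, and all names are illustrative).

```
# Sketch of Cohen's d, a standardized effect size for two-group
# comparisons; simulated data stand in for the essay word counts
set.seed(7)
a <- rnorm(50, mean = 320, sd = 60)
b <- rnorm(50, mean = 310, sd = 60)

# difference in means divided by the pooled standard deviation
pooled_sd <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
                    (length(a) + length(b) - 2))
cohens_d <- (mean(a) - mean(b)) / pooled_sd
round(cohens_d, 2)  # values near 0.2 / 0.5 / 0.8 are often read as small / medium / large
```

Unlike a *p*-value, Cohen’s d tells us how large the difference is, not merely whether it is likely to be real.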

A second (arguably way cooler) way to visualize the data is with violin plots. A violin plot is similar to a boxplot except that the distribution of the data is represented more precisely. If you look at one side of a violin plot (and rotate it 90 degrees), it will resemble the density plots that we made above.

```
ggplot(data = mydata) +
  geom_violin(mapping = aes(x = Prompt, y = Nwords, color = Prompt)) +
  geom_boxplot(mapping = aes(x = Prompt, y = Nwords), width = .2)
```