**The objectives of this tutorial are to:**

- Learn what difference tests measure
- Revisit important statistical terms
  - *p* values
  - effect sizes

- Continue to build proficiency with creating (and interpreting) data visualizations
  - density plots
  - box plots
  - violin plots (new to this tutorial)

- Learn the assumptions of an independent samples t-test
- Learn how to test the assumptions of an independent samples t-test
- Learn how to interpret the output of an independent samples t-test (*p* values)
- Learn how to calculate and interpret the effect size for an independent samples t-test
- Learn how to determine whether differences exist between two independent groups when some of the assumptions of an independent samples t-test are not met

In research, we often want to know whether groups differ with regard to a particular characteristic. For example, in Tutorial 2 we looked at how particular classes of vehicles differed with regard to highway fuel efficiency. In the dataset that is visualized below, we can see (for example) that there doesn’t seem to be much of a difference in highway fuel efficiency between compact and midsize vehicles. There does, however, seem to be a reasonably large difference in highway fuel efficiency between midsize vehicles and pickups.

```
library(ggplot2) #import ggplot2
ggplot(data = mpg) + #create plot using the mpg data frame
  geom_boxplot(mapping = aes(x = class, y = hwy)) #create box plots for each vehicle class based on highway fuel efficiency
```

While visualizing data can help us identify general trends, the use
of inferential statistics (such as t-tests) helps us to formally (and
more precisely) discuss relationships in the data (such as differences
between groups). When conducting inferential statistics, we use two
primary metrics for measuring relationships in a data set, namely
**p-values** and **effect sizes**. These
concepts were introduced in Tutorial
3, but will be revisited (and in some cases expanded on) below. For
now, our discussions will be limited to measuring differences in a
particular characteristic across two groups (e.g., differences in
highway fuel efficiency across midsize vehicles and pickups).

When using an inferential statistic to determine whether a difference
exists between two groups, it is common practice to use a **null
hypothesis**, which essentially posits that there is no
difference between the two groups (e.g., two types of vehicles) with
regard to the characteristic of interest (e.g., highway fuel
efficiency). The **null hypothesis** is used purely for
statistical reasons - difference tests (such as the ones we will cover
in this tutorial) are designed to measure the probability that two
sample distributions could actually represent the same distribution. In
most cases, researchers presume that there ARE some differences between
the groups (this is called the **alternative hypothesis**),
and hope that they can reject the **null hypothesis**.

As alluded to in the previous paragraph, a probability value
(**p-value**) indicates how likely we would be to observe a difference
as large as the one in our data (e.g., between highway fuel efficiency
for midsize vehicles and highway fuel efficiency for pickups) if the two
samples actually came from the same distribution. If that probability is
at or below a particular threshold (referred to as the **alpha** or
**α** value), we consider the two distributions to be
significantly different. In social science research, we commonly set
**α** to .05. As a reminder, *p* = .05
means that if there were actually no difference between the two groups,
we would observe a difference this large only 5% of the time. In less
precise, but perhaps easier to grasp terms, a *p* value at or below .05
suggests that it is quite unlikely that there is no difference between
the two groups with regard to the characteristic of interest.
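
To make the logic of **α** concrete, here is a small simulation (not part of the tutorial's dataset; the group size, mean, and standard deviation are arbitrary choices for illustration). We repeatedly draw two samples from the *same* distribution and count how often a t-test nonetheless returns *p* < .05:

```r
#simulate two samples drawn from the SAME distribution many times,
#then check how often t.test() returns p < .05 (our alpha)
set.seed(42) #for reproducibility
false_positives <- 0
for (i in 1:1000) {
  g1 <- rnorm(30, mean = 25, sd = 5) #both groups share one distribution
  g2 <- rnorm(30, mean = 25, sd = 5)
  if (t.test(g1, g2)$p.value < .05) {
    false_positives <- false_positives + 1
  }
}
false_positives / 1000 #should be close to .05 (roughly 5% false positives)
```

In other words, with **α** set at .05, we accept that about 5% of the time we will declare a "significant" difference that is really just sampling noise.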

It is important to note that the larger our samples are, the more
sure we can be about how well the distributions (and specifically, the
mean score) in our samples reflect the overall population distribution.
This means that the larger our studies are (with regard to participants
and/or observations), the more likely we are to get a small *p*
value. In other words, there is a strong link between sample size and
*p* value.
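
The link between sample size and *p* value is easy to demonstrate with simulated data (the means, standard deviation, and group sizes below are made up for illustration). The underlying mean difference is identical in both comparisons; only the number of observations changes:

```r
#same underlying mean difference (2 units, sd = 5) at two sample sizes
set.seed(1) #for reproducibility
small_p <- t.test(rnorm(10, 30, 5), rnorm(10, 32, 5))$p.value
large_p <- t.test(rnorm(1000, 30, 5), rnorm(1000, 32, 5))$p.value
small_p #with n = 10 per group, likely NOT below .05
large_p #with n = 1000 per group, far below .05
```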

Accordingly, we also need a way of measuring the difference
between two groups that is independent of sample size. We refer to this type
of measurement as an **effect size**.

An **effect size** is a sample-size independent measure
of the relationship between variables. If we are looking at differences
between two independent samples using a t-test (as we will later in this
tutorial), the **effect size** indicates the standardized difference in
mean scores between the two groups. Because effect sizes are almost always
standardized scores, they can be compared across
studies.

While *p* values indicate how certain we can be about the
observed differences between two groups, effect sizes tell us how big
the differences actually are. If we are analyzing a small data set, it
is entirely possible for an observed difference between groups to fail
to reach statistical significance while demonstrating a large effect.
Conversely, if we are analyzing a large data set, it is possible to
obtain a tiny *p* value (i.e., find that there is almost no
chance that two distributions actually represent the same distribution)
while also finding that the effect size (the size of the difference) is
negligible.
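
The "large sample, tiny effect" scenario can be simulated directly. The most common effect size for a two-group comparison is Cohen's *d* (the mean difference divided by the pooled standard deviation); the sketch below computes it by hand with made-up numbers (group sizes, means, and standard deviations are arbitrary):

```r
#Cohen's d: the standardized difference between two group means
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

#a huge sample with a tiny mean difference: significant p, negligible d
set.seed(3) #for reproducibility
a <- rnorm(50000, mean = 100, sd = 15)
b <- rnorm(50000, mean = 100.5, sd = 15)
t.test(a, b)$p.value #well below .05 (statistically significant)
cohens_d(a, b) #around -0.03: a negligible effect
```

The `psych` package (which we load later in this tutorial) also provides a `cohen.d()` function that does this calculation for you.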

In applied linguistics research, we sometimes want to know whether two independent groups (e.g., intact classes) differ with regard to some measure (motivation, vocabulary knowledge, writing skill, etc.). We also often want to know whether one teaching method works better than another method with regard to some outcome (e.g., vocabulary test score, writing quality score, etc.). In order to determine whether two groups differ in some regard (i.e., to address the first issue outlined above), we can use an independent samples t-test (for parametric data) or a Wilcoxon test (for non-parametric data). In order to determine whether one teaching method works better than another we will need a different set of statistical tests (stay tuned!), but we could use an independent samples t-test to determine whether two groups were different with regard to some variable prior to testing a teaching method.
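
As a quick preview of the non-parametric option, the base R function is `wilcox.test()`, which is called much like `t.test()`. The skewed, simulated "word count" data below are purely illustrative (the rates and group size are arbitrary):

```r
#wilcox.test(): the non-parametric counterpart to t.test()
set.seed(7) #for reproducibility
x <- rexp(40, rate = 1/300) #skewed (non-normal) hypothetical word counts
y <- rexp(40, rate = 1/250)
wilcox.test(x, y) #Wilcoxon rank-sum (a.k.a. Mann-Whitney U) test
```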

In this tutorial, we will be looking at argumentative essays written in response to two argumentative prompts (one about smoking in public places and the second about whether college students should have part time jobs). Specifically, we will be determining the degree to which either prompt tends to elicit longer essays (measured in number of words per essay). In short, we will be addressing the following research question:

Do the responses to the two essay prompts (prompt A and prompt B) differ with regard to number of words?

Our **null hypothesis** will be that there is no
difference in number of words between the two prompts.

Independent samples t-tests are rather simple tests that use the sample means and the variance in each sample to determine the probability that the two samples are part of the same population.
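Under the equal-variance (Student's) version of the test, this boils down to a single formula: the difference between the sample means divided by the pooled standard error. A quick sketch with made-up word counts (not the tutorial data) shows the hand calculation matching the built-in `t.test()`:

```r
#Student's t by hand, checked against t.test(..., var.equal = TRUE)
g1 <- c(310, 295, 342, 280, 355, 301) #hypothetical word counts, group 1
g2 <- c(265, 290, 248, 302, 275, 260) #hypothetical word counts, group 2
n1 <- length(g1)
n2 <- length(g2)
pooled_var <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)
t_by_hand <- (mean(g1) - mean(g2)) / sqrt(pooled_var * (1/n1 + 1/n2))
t_by_hand
t.test(g1, g2, var.equal = TRUE)$statistic #same value
```

Note that by default `t.test()` runs the Welch version of the test, which does not assume equal variances; `var.equal = TRUE` requests the Student's version computed above.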

Following are the assumptions for an independent samples t-test:

- Each sample is normally distributed
- The variance is roughly equal across samples
- The groups are independent (e.g., the data were not generated from the same individuals)
- The data are continuous
- There is only one comparison (an ANOVA is appropriate for multiple comparisons, stay tuned)
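
Base R includes quick formal checks for the first two assumptions: `shapiro.test()` for normality and `var.test()` for equality of variances. Here is a sketch using the built-in `mtcars` data (fuel efficiency for automatic vs. manual cars) rather than the essay data we work with below:

```r
#quick assumption checks using the built-in mtcars data
auto   <- mtcars$mpg[mtcars$am == 0] #automatic transmission cars
manual <- mtcars$mpg[mtcars$am == 1] #manual transmission cars
shapiro.test(auto)     #Shapiro-Wilk: p > .05 = no evidence of non-normality
shapiro.test(manual)
var.test(auto, manual) #F test: p > .05 = variances are roughly equal
```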

First, we will load some data and check assumptions. You can download the data here. Note that if the file doesn’t download directly you can right click on your browser window and choose the “save as” option to save the file on your computer.

```
mydata <- read.csv("data/distribution_sample.csv", header = TRUE) #this presumes that our working directory contains a folder named "data" with the file "distribution_sample.csv" in it
summary(mydata)
```

```
##     Prompt              Score           Nwords        Frequency
##  Length:480         Min.   :1.000   Min.   : 61.0   Min.   :2.963
##  Class :character   1st Qu.:3.000   1st Qu.:273.0   1st Qu.:3.187
##  Mode  :character   Median :3.500   Median :321.0   Median :3.237
##                     Mean   :3.427   Mean   :317.7   Mean   :3.234
##                     3rd Qu.:4.000   3rd Qu.:355.2   3rd Qu.:3.284
##                     Max.   :5.000   Max.   :586.0   Max.   :3.489
```

In addition to visualizing data, getting descriptive statistics can be helpful in understanding the nature of your dataset. Below we get descriptive statistics for our entire dataset, then split the dataset by prompt and get descriptive statistics for each.

```
#The first time you use the psych and dplyr libraries, you will need to install them:
#install.packages("psych")
#install.packages("dplyr")
library(psych) #many useful functions
```

```
##
## Attaching package: 'psych'
```

```
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
```

```
library(dplyr) #helps us sort and filter datasets
```

```
##
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
##
## filter, lag
```

```
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

```
describe(mydata) #get descriptive statistics
```

```
##           vars   n   mean    sd median trimmed   mad   min    max  range  skew
## Prompt*      1 480   1.50  0.50   1.50    1.50  0.74  1.00   2.00   1.00  0.00
## Score        2 480   3.43  0.89   3.50    3.43  0.74  1.00   5.00   4.00  0.00
## Nwords       3 480 317.65 78.28 321.00  316.93 57.82 61.00 586.00 525.00  0.12
## Frequency    4 480   3.23  0.07   3.24    3.23  0.07  2.96   3.49   0.53 -0.10
##           kurtosis   se
## Prompt*      -2.00 0.02
## Score        -0.62 0.04
## Nwords        1.10 3.57
## Frequency     0.54 0.00
```

**Prompt A**

```
#create a new dataframe that includes only responses to Prompt A:
promptA <- mydata %>% filter(Prompt == "A")
describe(promptA)
```

```
##           vars   n   mean    sd median trimmed   mad   min    max  range  skew
## Prompt*      1 240   1.00  0.00   1.00    1.00  0.00  1.00   1.00   0.00   NaN
## Score        2 240   3.38  0.86   3.50    3.39  0.74  1.00   5.00   4.00  0.06
## Nwords       3 240 324.69 79.33 325.00  325.08 57.82 61.00 558.00 497.00 -0.06
## Frequency    4 240   3.24  0.07   3.24    3.24  0.07  2.96   3.45   0.49 -0.16
##           kurtosis   se
## Prompt*        NaN 0.00
## Score        -0.67 0.06
## Nwords        1.01 5.12
## Frequency     0.42 0.00
```

**Prompt B**

```
#create a new dataframe that includes only responses to Prompt B:
promptB <- mydata %>% filter(Prompt == "B")
describe(promptB)
```

```
##           vars   n   mean    sd median trimmed   mad   min    max  range  skew
## Prompt*      1 240   1.00  0.00   1.00    1.00  0.00  1.00   1.00   0.00   NaN
## Score        2 240   3.47  0.91   3.50    3.48  0.74  1.00   5.00   4.00 -0.07
## Nwords       3 240 310.61 76.75 313.00  308.98 59.30 86.00 586.00 500.00  0.30
## Frequency    4 240   3.23  0.08   3.23    3.23  0.07  2.98   3.49   0.51 -0.04
##           kurtosis   se
## Prompt*        NaN 0.00
## Score        -0.60 0.06
## Nwords        1.34 4.95
## Frequency     0.56 0.00
```

First, we will visually inspect the data for normality using density plots. We can either display these as two plots (using facet_wrap()) or as a single plot.

```
library(ggplot2)
ggplot(mydata, aes(x = Nwords, color = Prompt)) +
  geom_density() +
  facet_wrap(~Prompt)
```
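
And here is the single-plot alternative mentioned above: dropping `facet_wrap()` overlays the two density curves, which can make the comparison easier to see. (The `if (!exists(...))` guard builds a rough stand-in data frame with made-up numbers so the sketch runs on its own; if you have already loaded `distribution_sample.csv` as `mydata`, it is ignored.)

```r
library(ggplot2)
#stand-in data so this sketch runs on its own (only used if mydata is absent);
#the means and SDs loosely mimic the real data, but are NOT the real data
if (!exists("mydata")) {
  set.seed(9)
  mydata <- data.frame(Prompt = rep(c("A", "B"), each = 240),
                       Nwords = round(c(rnorm(240, 325, 79),
                                        rnorm(240, 311, 77))))
}
ggplot(mydata, aes(x = Nwords, color = Prompt)) + #same mapping as above
  geom_density() #overlaid density curves, one per prompt
```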