The objectives of this tutorial are to:
The data represents a sample of a simplified (and adapted) version of the meta data for the written section of the International Corpus Network of Asian Learners of English. This corpus includes argumentative essays written on two prompts by college students for a range of countries in Asia. As we see below, this file includes a number of variables including a participant identifier, nationality of the participant, their score on a vocabulary size test (VST), and their English proficiency.
In previous tutorials, we have used dataframes that are part of the R base package or are part of ggplot2. Here, we will load data from a spreadsheet that is formatted as a comma separated values (.csv) document.
Our first step will be to open a new R script and save it in a convenient location (e.g., in a folder named “Rday3” on your desktop).
Our second step will be to download the target .csv file, extract it, and save a copy of the target .csv file (“ICNALE_500_simple.csv”) in a folder called “data” in the Rday3 folder.
Third, with our new R script open in RStudio, we will click on “Session” in the toolbar and “Set Working Directory”” to “Source File Location”. This will let RStudio know where to look for your file(s).
Finally, we will load our target .csv file (“ICNALE_500_simple.csv”) as a dataframe (and in this case we will call that dataframe “fundata”).
fundata <- read.csv("data/ICNALE_500_simple.csv", header = TRUE) #this presumes that the .csv file is in a subfolder entitled "data"
summary(fundata)
## Participant Nationality VST Proficiency_CEFR
## Length:500 Length:500 Min. :10.00 Length:500
## Class :character Class :character 1st Qu.:24.00 Class :character
## Mode :character Mode :character Median :33.00 Mode :character
## Mean :32.56
## 3rd Qu.:41.00
## Max. :50.00
## Proficiency_Number
## Min. :1.00
## 1st Qu.:1.00
## Median :2.00
## Mean :2.24
## 3rd Qu.:3.00
## Max. :4.00
Based on the summary output, we see that our dataframe has four variables (Participant, Nationality, VST, Proficiency_CEFR, and Proficiency_Number).
These variable names represent the following: - Participant: Participant identifier - Nationality: Nationality of participant - VST: Score on receptive vocabulary size test - Proficiency_CEFR: Categorical proficiency rating based on the Common European Framework of Reference for Languges (CEFR) - Proficiency_Number: CEFR rating converted to a numerical scale (ranging from 1 to 4)
Lets apply what we learned in the last tutorial to this new data. Don’t forget to load ggplot2!
library(ggplot2)
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'pillar'
library(viridis) #color-friendly palettes
## Loading required package: viridisLite
First, lets create a scatterplot with a line of best fit to examine the relationship between VST (y-axis) and Proficiency_Number (x-axis). What does the scatterplot indicate about the relationship between VST scores and Proficiency?
g1 <- ggplot(data = fundata) +
geom_point(mapping = aes(x = Proficiency_Number, y = VST)) +
geom_smooth(mapping = aes(x = Proficiency_Number, y = VST), method = lm) +
scale_color_viridis(discrete = TRUE) +
labs(
x = "Proficiency_Number",
y = "VST",
title = "Relationship between Proficiency Number and VST")
#print(g1)
## `geom_smooth()` using formula 'y ~ x'
Now try to make the following plot, which shows the data by nationality (hint, I also used geom_jitter()).
g2<- ggplot(data = fundata) + #this line indicates my dataset
geom_jitter(mapping = aes(x = Proficiency_Number, y = VST, color = Nationality)) + #this line sets the x and y axis
geom_smooth(mapping = aes(x = Proficiency_Number, y = VST), method = lm) +
scale_color_viridis(discrete = TRUE) +
labs(
x = "Proficiency_Number",
y = "VST",
title = "Relationship between Proficiency Number and VST")
#print(g2)
## `geom_smooth()` using formula 'y ~ x'
Now, lets look at the trends by creating a different scatterplot for each (note that ncol = 3). What do these scatterplots tell us about the relationship between VST scores and Proficiency across L1 groups/learning environments?
g3 <- ggplot(data = fundata) + #this line indicates my dataset
geom_jitter(mapping = aes(x = Proficiency_Number, y = VST, color = Nationality)) + #this line sets the x and y axis
geom_smooth(mapping = aes(x = Proficiency_Number, y = VST), method = lm) +
facet_wrap(~ Nationality, ncol = 3) +
scale_color_viridis(discrete = TRUE) +
labs(
x = "Proficiency_Number",
y = "VST",
title = "Relationship between Proficiency Number and VST across Nationalities"
)
#print(g3)
## `geom_smooth()` using formula 'y ~ x'
Now, lets treat proficiency as a categorical variable. Create a graph with boxplots that shows the range of VST scores (y-axis) by Proficiency_CEFR (x-axis).
g4<- ggplot(data = fundata) +
geom_boxplot(mapping = aes(x = Proficiency_CEFR, y = VST)) +
labs(
x = "Proficiency_CEFR",
y = "VST",
title = "VST Distribution by Proficiency Level"
)
#print(g4)
Now, lets remove the outliers in our base plot, and add the actual data points to the boxplot (vary the points by nationality).
g5<- ggplot(data = fundata) +
geom_boxplot(mapping = aes(x = Proficiency_CEFR, y = VST), outlier.shape = NA) +
geom_jitter(mapping = aes(x = Proficiency_CEFR, y = VST, color = Nationality), width = 0.2)+
scale_color_viridis(discrete = TRUE)+
labs(
x = "Proficiency_CEFR",
y = "VST",
title = "VST Distribution by Proficiency Level")
#print(g5)
In this section, R code is provided to help demonstrate concepts, but don’t worry about the code. The main purpose of this section is to introduce the concepts (not the code). We will revisit some of the code in later tutorials as applicable.
To demonstrate the concepts of mean, median, and mode, we will create a short list of numbers:
samplelist <- c(1,1,2,3,4,4,5,5,6,6,7,7,7)
mean The mean is what the general population refers to as the “average” score.
To get the mean, we divide the sum of the observed values by the number of observations. In R, we can use the “mean()” function to calculate the mean score. As we see below, the mean value of our sample list is 4.461538.
mean(samplelist)
## [1] 4.461538
median The median is the middle observed values (when all values are arranged in ascending/descending order).
Our sample list is sorted in ascending order and includes 13 items. The middle observed value (the 7th value) is 5. In R we can use the “median()” function to calculate the median score.
median(samplelist)
## [1] 5
mode The mode is the observed value that occurs most often.
To get the mode, we can simply count the frequency of each item in the list. The most frequent item is the mode. In our sample list, 7 occurs more than any other number (in this case, 3 times) and is therefore the mode. Base R does not have a function to calculate the mode (though there are packages that can be used for this purpose). Because we won’t need to calculate the mode for any analyses this term, we won’t worry about this too much.
The definitions below may seem a bit technical. If that is the case, don’t worry too much. We will discuss these (particularly standard deviation) in class.
population variance The variance in a set of observed values indicate how these values are distributed. To calculate the variance, we first subtract the mean from each of our observed values and square each of these values. We then calculate the mean for the squared values.
population standard deviation The standard deviation is the square root of the variance.
sample variance When we calculate the variance for a sample, we need to account for the fact that smaller samples will not perfectly reflect the population. In this case, when we calculate the mean for the squared values, we have to subtract one from the denominator. In small samples, this will have a large effect, but in large samples, the difference will be minimal.
sample standard deviation The sample standard deviation is simply the square root of the sample variance.
Below, we will load our sample data, which comes from the ICNALE corpus. In this case, we will be looking specifically at Vocabulary Size Test (VST) scores. Then, we will look at some descriptive statistics related to VST scores in the dataset, including the mean, median, variance, and standard deviation.
fundata <- read.csv("data/ICNALE_500_simple.csv", header = TRUE) #load data
cat("mean VST score:", mean(fundata$VST),"\n") #mean VST score
## mean VST score: 32.562
cat("median VST score:", median(fundata$VST),"\n") #median VST score
## median VST score: 33
cat("variance in VST score", var(fundata$VST), "\n") #variance in VST scores - by default this is the sample variance
## variance in VST score 94.65146
cat("standard deviation of VST scores", sd(fundata$VST), "\n") #standard deviation of VST scores - by default this is the sample standard deviation
## standard deviation of VST scores 9.728898
As we can see, the mean score is around 32.5, the median score is similar (33), the variance is 94.65, and the standard deviation is around 9.7 (which,as a reminder, is the square root of the variance. Below, we will see some of the ways in which these statistics are useful.
Many statistical tests require data that is normally distributed. These statistical tests are referred to as “parametric” statistical tests. Below, we will see how to to determine the degree to which our data is normally distributed both visually and statistically.
In a normal distribution, the median and the mean values will be the same, and ~68% of the data points will be within one standard deviation of the mean. The figure below shows the shape of a normal distribution.
library(ggplot2)
#don't worry too much about this code - take a look at the plot
g6<- ggplot(data = data.frame(x = c(-3, 3)), aes(x)) +
stat_function(fun = dnorm, n = 500, args = list(mean = 0, sd = 1)) + ylab("") + xlab("VST (but not really - this is too perfect)") +
scale_y_continuous(breaks = NULL)+
labs(
title = "Normal Distribution Curve"
)
#print(g6)
One way that we can check the degree to which our data is normally distributed is to check whether its distribution is similar to a perfectly normal distribution. In the plot below, the blue line represents a normal distribution with the mean and standard deviation of VST scores. The red line represents the actual distribution of VST scores.
#This code may be useful in the future, but don't worry about it too much right now.
custom_colors <- c("Density" = "#377eb8", "Normal Curve" = "orange")
g7<- ggplot(fundata, aes(x = VST)) +
geom_density(aes(color = "Density")) + #Assigning a name to the color aesthetic
# geom_histogram(aes(y = ..density.., fill = "Histogram"), binwidth = 3) +
stat_function(fun = dnorm, n = 500, args = list(mean = mean(fundata$VST), sd = sd(fundata$VST)), aes(color ="Normal Curve")) +
ylab("") +
xlab("VST") +
scale_y_continuous(breaks = NULL) +
scale_color_manual(values = custom_colors) + #scale_color_viridis(discrete = TRUE) is possible, but for better visibility
labs(
title = "Density and Normal Distribution of VST Scores"
)
#print(g7)
It appears as though our data is not normally distributed (the red line that represents the actual data doesn’t look very much like the blue line, which represents a normal distribution).
Another way we can visually check our distribution is to use qq plots. If our data is normally distributed, then our actual data points will fall along our theoretical line (see plot below).
#don't worry too much about this code for now
g8<- ggplot(fundata, (aes(sample=VST))) + #note that the sample is the variable we are examinining
stat_qq() + #this adds the points
stat_qq_line() + #this adds the theoretical line
labs(
title = "Q-Q Plot of VST")
#print(g8)
In this case, we see that many of our observed data points fall above or below the theoretical line. This is further indication that our data is not normally distributed
However, we can also test to see the degree to which the observed distribution differs from the theoretical one using the Shapiro-Wilk test. In this case, if the p value is small (e.g., below .05), then we can assume that our observed distribution significantly differs from a normal distribution (i.e., it is not normal). Note that R often uses scientific notation for very small numbers. If you see a number followed by “e” and a negative number n, then you simply move the decimal place to the left n times. For example, 5.151e-3 should be interpreted as .005151.
shapiro.test(fundata$VST)
##
## Shapiro-Wilk normality test
##
## data: fundata$VST
## W = 0.95537, p-value = 3.717e-11
As we can see from the output, the VST scores are not normally distributed, indicating that we should not use parametric statistics using these values. Fortunately, there are a number of non-parametric tests that can be used (more on this later).
Each statistical test has a variety of assumptions. As we learn about different tests, we will learn about these assumptions (and what to do if our data violates them).
In quantitative research, we want to determine the probability that our observations (e.g., relationships between two variables or differences between groups) are due to chance (or, alternatively, are due to something such as a particular teaching method). As we will see below, probability estimates are inextricably connected to sample size. Accordingly, we also want to know how large the differences between the two populations are or how strong the relationship between two variables are (via an effect size).
Statistical significance is inherently tied to probability. In this course we will talk about a number of (relatively) complicated methods for determining the probability of observing a particular result, but in this tutorial we will keep things reasonably simple. In real research, for example, we might want to know whether vocabulary test scores are significantly different between two groups of students (who received different instructional methods, were in different educational context, had different L1s, etc.). We will get to that issue soon enough, but for now we are going to examine a simpler issue. Given a particular population (lets say a group of intermediate language learners at a particular institution), what is the probability that a particular student will earn a particular score?
Let us assume for a moment that the scores are normally distributed with a range from 30 to 50, a mean of 40, and a standard deviation of 5.
#don't worry too much about this code - see the plot
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
distr <- data.frame(x=seq(30, 50, length=300)) %>%
mutate( density = dnorm(x, mean=40, sd=5))
g9<- ggplot(distr, aes(x=x, y=density)) +
geom_line() +
geom_area() +
theme_bw() +
labs(
x = "Score",
y = "Density",
title = "Normal Distribution of Scores")
#print(g9)
One thing that we can examine is the probability that a student in the target population will receive a particular VST score. For example, what is the probability that a student randomly selected from our sample population would earn a 34 on the VST? By looking at a density plot, we can see that there is a very small chance that we would randomly select a smaller score than 34 and a much larger chance that we would randomly select a score higher than 34.
#don't worry too much about this code - see the plot
distr <- data.frame(x=seq(30, 50, length=300)) %>%
mutate(density = dnorm(x, mean=40, sd=5),
group = ifelse(x<=34, 'Worse score than 34','Better score than 34'))
g10<- ggplot(distr, aes(x = x, y = density, fill = group)) +
geom_line() +
geom_area(alpha = 0.5) + # Adjusting transparency for better visualization
scale_fill_viridis(discrete = TRUE) + # Applying a color-blind friendly palette
theme_bw() +
labs(
x = "Score",
y = "Density",
title = "Probability Distribution of Scores")
#print(g10)
To check these probabilities precisely, we can use the pnorm() function.
#don't worry too much about this code
pnorm(34, mean = 40, sd = 5)
## [1] 0.1150697
The pnorm() function indicates that 11.5% of students (p = .115) scored lower than our randomly selected student. In other words, there is only an 11.5% chance that we would observe a smaller score from a randomly selected student. Conversely, this means that there is an 88.5% (0.885) chance thqt we would observe a higher score from a randomly selected student.
In classic quantitative research, we are rarely interested in what a single student does, rather, we tend to be interested in what groups of students tend to do. When designing studies, we take a sample of students from a population and then measure particular characteristics of the sample.
It is important to note that the characteristics of our samples will not necessarily perfectly reflect the characteristics of our population (due to sampling error). The larger the sample we take, however, the more likely our sample will reflect the population characteristics.
So, if we sample ten students from our vocabulary learning population (which has a mean score of 40) multiple times (we will do this three times below), we can expect to get a wider range of mean scores than if we sample 100 students from our population (we will do this three times below as well).
set.seed(124) #set random number generator seed
mean(rnorm(10, mean = 40, sd = 5)) #get random sample of 10 data points from a normal distribution with a mean of 40 and an sd of 5; get the mean score
## [1] 41.07383
mean(rnorm(10, mean = 40, sd = 5))
## [1] 39.0382
mean(rnorm(10, mean = 40, sd = 5))
## [1] 39.13158
set.seed(124) #set random number generator seed
mean(rnorm(100, mean = 40, sd = 5)) #get random sample of 100 data points from a normal distribution with a mean of 40 and an sd of 5; get the mean score
## [1] 40.0481
mean(rnorm(100, mean = 40, sd = 5))
## [1] 39.51771
mean(rnorm(100, mean = 40, sd = 5))
## [1] 39.82173
As we can see, the range of mean scores for the randomly sampled groups is closer to the population mean when the sample size is larger. When the sample size is 10, the mean scores can range over a point from the mean; when the sample size is 100, the mean scores are less than .5 point from the population mean. Note that if our standard deviation was wider, then these differences would be more pronounced.
This range will be even smaller if we take a larger sample (e.g., n=1000):
set.seed(124) #set random number generator seed
mean(rnorm(1000, mean = 40, sd = 5)) #get random sample of 100 data points from a normal distribution with a mean of 40 and an sd of 5; get the mean score
## [1] 39.67322
mean(rnorm(1000, mean = 40, sd = 5))
## [1] 39.99218
mean(rnorm(1000, mean = 40, sd = 5))
## [1] 40.06402
Importantly, if we randomly sample a population many times (e.g., 1,000) with a particular sample size, we can get a probability distribution of the mean scores. With this distribution, we can determine the probability that, given a random sample of n participants from a population with a particular distribution, we would get a particular mean score.
In (slightly) more complex situations, we might want to know whether one group of students (whose teacher used a new fancy vocabulary instruction method) earned higher vocabulary test scores than a second group of students (whose teacher used a boring old method of vocabulary instruction). The question we will ask (statistically) is the probability that the mean VST scores from the two groups could have come from the same population (i.e., students who studied vocabulary) OR if there is something special about one of the groups. We do this by checking the degree to which the mean scores could overlap in a repeated random sample from each group of observed scores (the actual calculation is slightly more complicated than this, but not much).
If the probability is high that the mean VST scores came from the same larger group (i.e., individuals who are in a language course), then we might say that the teaching method didn’t make a difference in VST scores. If the probability that the mean scores could come from the same distribution of scores is low (usually this is set to a probability or p value of .05), then we might say that the teaching method affected the scores.
There are many caveats and more complicated situations that we will tackle when we look at each test covered in this course. Stay tuned.
As we noted above, the larger the sample size is, the more confident we can be that the sample mean represents the population mean.
A very important limitation of tests of statistical significance is that they are affected by both the sample size and the size of the observed differences (i.e., the effect). When the sample size is large enough, almost all differences between two groups will be considered statistically significant (due to the Central Limit Theorem).
Effect sizes, however, are not affected by sample size and simply indicate the difference in average values across two populations (for difference tests) or how closely related two sets of variables are (for tests related to correlations). There are a number of different effect size calculations (we will learn a few of these as we learn different statistical tests), but arguably the most common are Cohen’s D (for difference tests) and r (for correlations). We’ll discuss r when we discuss correlations.
Cohen’s D is calculated as the |((mean of group A - mean of group B)/pooled standard deviation)|. In short, this tells us how many standard deviations apart the means for the two groups are.
Below, we have two samples (lets call them two classes with VST scores). The red line represents the distribution of VST scores in our previously discussed class (mean=40, sd = 5). The blue line represents the distribution of VST scores in an imaginary, high achieving class (mean = 45, sd = 5).
#don't worry too much about this code - see the plot
distr1 <- data.frame(x = seq(30, 50, length = 300)) %>%
mutate(density = dnorm(x, mean = 40, sd = 5))
distr2 <- data.frame(x = seq(35, 55, length = 300)) %>%
mutate(density = dnorm(x, mean = 45, sd = 5))
# Combine the data frames and add a grouping variable (added for plotting)
combined_distr <- rbind(distr %>% mutate(group = "distr1"),
distr2 %>% mutate(group = "distr2"))
# Plot
g11<- ggplot(combined_distr, aes(x = x, y = density, color = group)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_bw() +
labs(
x = "VST",
y = "Density",
title = "Comparative Normal Distributions of Two Samples")
#print(g11)
<img src=“plot11.png” alt=Plot showing two overlapping normal distribution curves. The first purple curve peaks at a mean of 40, and the second yellow curve peaks at a mean of 45 with the same standard deviation. The curves intersect around a score of 42, highlighting the range where the distributions of both classes overlap.”>
To calculate Cohen’s D for the difference between these two samples, we subtract the mean of the blue class (45) from the mean of the red class (40), which leaves us with (40-45 = -5). We then divide this number by the pooled standard deviation (in this case 5). So, Cohen’s D = (|-5/5| = 1.00), which is a large effect (according to Cohen, 1988).
More soon!