
Correlation: Examining the relationship between variables

The basics

A correlation is a measure that indicates the degree to which the values of one variable change with respect to another variable. For example, we might examine the relationship between the temperature and the number of students wearing a sweatshirt/jacket on campus. In second language learning, we might look at the relationship between the number of hours a student spends studying and their scores on a proficiency test. For most of this tutorial, we will be looking at a common type of correlational analysis (Pearson’s product-moment correlation), though the calculation of other correlation coefficients will be briefly discussed at the end.

Importantly, correlation values (hereafter referred to as r) are effect sizes that range from -1.0 to 1.0. When we interpret the size of a correlation, we are concerned with the absolute value of the number (i.e., how far the value is from zero), not its direction. For example, an r value of -.700 is bigger than an r value of .300 (-.700 is further from zero than .300).

The sign of the r value tells us whether the values of variable A increase as the values of variable B increase (resulting in a positive sign) OR whether the values of variable A decrease as the values of variable B increase (a negative sign). For example, we would expect a NEGATIVE correlation between temperature and the number of students wearing jackets on campus (as one value goes down [temperature] the other goes up [number of jacket-wearing individuals] and vice versa). On the other hand, we would expect a POSITIVE relationship between the number of hours a student spends studying a language and their scores on a language proficiency test (as one value goes up [time spent studying] the other also goes up [proficiency scores] and vice versa).
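To make the sign of r concrete, here is a minimal sketch using simulated data (the variables and values below are entirely hypothetical, invented just for illustration). The cor() function returns the r value directly.

set.seed(1) #make the simulation reproducible
temperature <- runif(100, min = 0, max = 30) #100 hypothetical daily temperatures
jackets <- 200 - 5 * temperature + rnorm(100, sd = 20) #jacket counts drop as temperature rises
hours <- runif(100, min = 0, max = 20) #hypothetical hours spent studying
scores <- 50 + 2 * hours + rnorm(100, sd = 10) #proficiency scores rise with study time

cor(temperature, jackets) #negative r value
cor(hours, scores) #positive r value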

Finally, p values are also calculated for correlation analyses. These values tell us the probability that the observed relationship between the variables would occur in our sample if there were actually no relationship between them in the larger population.

Assumptions

Pearson’s correlation has the following assumptions:

  • The variables must be continuous (see other tests for ordinal or categorical data)

  • The variables must have a linear relationship (i.e., do not have a curvilinear or other relationship)

  • There are no outliers (or there are only minimal outliers in large samples)

  • The variables must have a bivariate normal distribution (Note that this is conceptually related to, but different from the normal distributions that we have been discussing so far. Also note that this is likely checked fairly rarely in the real world.)

Checking assumptions

In our first example, we will examine the relationship between the number of words (as a proxy for proficiency) and lexical sophistication (measured as word frequency) in a corpus of argumentative essays written as part of a standardized test of English proficiency. Based on theories of language learning (e.g., Ellis, 2002), we would expect that more proficient writers would (on average) use less frequent words (which are considered to be MORE sophisticated).

These variables (and a few others) are included in the “correlation_sample.csv” file included on our Canvas page.

library(ggplot2) #load ggplot2
library(viridis) #color-friendly palettes

cor_data <- read.csv("data/correlation_sample.csv", header = TRUE) #read the spreadsheet "correlation_sample.csv" into R as a data frame
summary(cor_data) #get descriptive statistics for the dataset
##      Score        pass.fail            Prompt              nwords     
##  Min.   :1.000   Length:480         Length:480         Min.   : 61.0  
##  1st Qu.:3.000   Class :character   Class :character   1st Qu.:273.0  
##  Median :3.500   Mode  :character   Mode  :character   Median :321.0  
##  Mean   :3.427                                         Mean   :317.7  
##  3rd Qu.:4.000                                         3rd Qu.:355.2  
##  Max.   :5.000                                         Max.   :586.0  
##   frequency_AW    frequency_CW    frequency_FW   bigram_frequency
##  Min.   :2.963   Min.   :2.232   Min.   :3.598   Min.   :1.240   
##  1st Qu.:3.187   1st Qu.:2.656   1st Qu.:3.827   1st Qu.:1.440   
##  Median :3.237   Median :2.726   Median :3.903   Median :1.500   
##  Mean   :3.234   Mean   :2.723   Mean   :3.902   Mean   :1.500   
##  3rd Qu.:3.284   3rd Qu.:2.789   3rd Qu.:3.975   3rd Qu.:1.559   
##  Max.   :3.489   Max.   :3.095   Max.   :4.235   Max.   :1.755

Assumption 1: Continuous or ratio data

Our data for each variable is continuous (it is not, for example, categorical), so we can continue with our analysis.
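If you want to verify this in R rather than by eye, a quick optional check on the two columns we will use is sketched below (the column names follow the dataset read in above).

sapply(cor_data[, c("nwords", "frequency_CW")], is.numeric) #both columns should return TRUE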

Assumption 2: Linearity

To check the linearity of our data, we will create a scatterplot. For our data to meet the criteria of linearity, it will need to fall in roughly a straight line (and not one that is curvilinear).

The blue line below represents the (straight) line of best fit for the data, while the red line represents a line of best fit based on a moving average (called a “Loess” line). In order to meet the assumption of linearity, we want the red line to approximate the blue line. In this case, we can make a pretty strong argument that we meet the assumption of linearity.

g1 <- ggplot(cor_data, aes(x = frequency_CW, y = nwords)) +
  geom_point() +
  geom_smooth(method = "loess", color = "red") + #line of best fit based on a moving average (loess)
  geom_smooth(method = "lm") + #straight line of best fit based on the entire dataset
  scale_color_viridis(discrete = TRUE) +
  theme_minimal()

#print(g1)

Scatter plot showing black points and two smooth lines running through the data points: one in blue, which shows a linear line of best fit, and one in red, which shows a loess regression. The red line is meant to approximate the blue line. The gray shaded area around each line indicates the confidence interval for that regression.

Assumption 3: Minimal outliers

Outliers can strongly affect our correlation coefficients (particularly in small datasets). Because our dataset is fairly large (480 participants), having a few outliers will not be a large problem. To check for outliers, let's take a look at the scatterplot again.

The same scatter plot as above: black points with a blue (linear) line of best fit and a red (loess) line. The distribution of points appears random, without obvious clusters.

It appears as though we only have a few outliers (and one particularly extreme one in the bottom right section of the plot). Based on the size of the dataset, there don't seem to be any large issues to worry about. If we look at the red line, we can actually see the effect of the outliers (e.g., the slight rise in the line at a frequency of around 2.9 and the rise in the line at a frequency of around 2.25). However, because these perturbations in the line are rather small, we can conclude that they are not causing large issues.
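If you prefer a numeric check to a visual one, a common rule of thumb (not part of the Pearson test itself, and the cutoff of 3 is conventional rather than fixed) is to flag points that fall more than three standard deviations from the mean of either variable. The sketch below uses scale() to convert each variable to z-scores.

z_nwords <- scale(cor_data$nwords) #convert nwords to z-scores (mean 0, sd 1)
z_freq <- scale(cor_data$frequency_CW) #same for content word frequency
cor_data[abs(z_nwords) > 3 | abs(z_freq) > 3, c("nwords", "frequency_CW")] #rows that are extreme on either variable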

Assumption 4: Normality

Next, we will check the assumption of bivariate normality for our variables using two-dimensional density plots. We can read the plot in much the same way as a contour map. In this case, the lighter colored regions have higher density (think altitude) than the darker shaded regions. A perfect bivariate normal distribution would look like a three-dimensional normal distribution curve. What we have below roughly follows this shape: the most dense regions are in the middle of the plot. The contours of our “mountain” are not perfectly symmetrical (we see some ridges and valleys leading from the summit), but they are reasonably symmetrical. Thus, our visual inspection indicates that we have a roughly normal bivariate distribution.

g2 <- ggplot(cor_data, aes(x = frequency_CW, y = nwords)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") + #shade each contour region by its density level
  scale_fill_viridis() +
  theme_minimal()

#print(g2)

Two-dimensional density plot with shades representing the distribution of frequency_CW and nwords. The brightest area indicates the highest data concentration.

We can also check this using a statistical test. As we see below, our distribution does not significantly deviate from a normal distribution with regard to skewness (symmetry of the distribution; p = 0.210), but it does significantly deviate with regard to kurtosis (height/width of the distribution; p = 0.00000000199). So, our data is not perfectly normal. We can either proceed with a Pearson correlation (on the basis of our visual inspection and the skewness results) or we can use a nonparametric test. Note that the r value will still be valid if we run a Pearson test, but we may have to question the p value.

library(dplyr) #data manipulation (select, pipes)
library(QuantPsyc) #provides mult.norm() for multivariate normality tests

cor_data.small <- cor_data %>% dplyr::select(nwords, frequency_CW) #keep only the two variables being tested

mult.norm(cor_data.small)$mult.test #test multivariate skewness and kurtosis
##             Beta-hat    kappa        p-val
## Skewness  0.07322971 5.858376 2.099774e-01
## Kurtosis 10.19041532 5.998699 1.989042e-09

Calculating Pearson’s r

Now, we will actually calculate the correlation between number of words (nwords) and content word frequency (frequency_CW) using the cor.test function, which will give us an r value and a p value.

cor.test(cor_data$nwords,cor_data$frequency_CW)
## 
##  Pearson's product-moment correlation
## 
## data:  cor_data$nwords and cor_data$frequency_CW
## t = -4.5377, df = 478, p-value = 7.201e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2874922 -0.1158270
## sample estimates:
##        cor 
## -0.2032207

Interpreting the p value

In this case, the results indicate a significant relationship between number of words (nwords) and content word frequency (frequency_CW): p = 7.201e-06, which translates to .000007201. In other words, there is a .0007% chance that we would observe a relationship at least this strong in our sample if there were no relationship between number of words and content word frequency in the larger population.
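If the scientific notation is hard to parse, R can rewrite the value for you:

format(7.201e-06, scientific = FALSE) #"0.000007201"
7.201e-06 * 100 #as a percentage: 0.0007201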

Interpreting the effect size (r)

The correlation value (which is indicated above as “cor”; r = -.203) indicates that a) there is an inverse relationship between number of words and content word frequency, and b) this relationship is fairly weak (Cohen, 1988 suggests that small/weak correlations range from .100 to .299).

Below are the full guidelines from Cohen (1988):

  • Small: .100 - .299

  • Medium: .300 - .499

  • Large: .500 - 1.000

Note that Plonsky & Oswald (2014) suggest the following guidelines for applied linguistics research:

  • Small: .250 - .399

  • Medium: .400 - .599

  • Large: .600 - 1.000
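If you find yourself interpreting many correlations, these benchmarks can be wrapped in a small helper function. The sketch below is hypothetical (interpret_r is not from any package) and uses the Plonsky & Oswald (2014) benchmarks as its defaults:

interpret_r <- function(r, benchmarks = c(small = .25, medium = .40, large = .60)) {
  size <- abs(r) #interpret the distance from zero, not the sign
  if (size >= benchmarks["large"]) return("large")
  if (size >= benchmarks["medium"]) return("medium")
  if (size >= benchmarks["small"]) return("small")
  "below small" #weaker than the smallest benchmark
}

interpret_r(-.203) #"below small" using Plonsky & Oswald (2014)
interpret_r(-.203, benchmarks = c(small = .10, medium = .30, large = .50)) #"small" using Cohen (1988)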

Sample Write-Up

This study examined the relationship between lexical sophistication (measured as the mean frequency of content words) and writing proficiency (imperfectly measured as the number of words per essay) in a corpus of argumentative essays written by L2 users of English. Descriptive statistics are reported in Table 1.

Table 1.
Descriptive statistics for indices used in this study

Index             n     Mean     Standard Deviation
number of words   480   317.65   78.28
Frequency CW      480   2.72     0.11

The data largely met the assumptions of a Pearson correlation. Therefore, a Pearson correlation was conducted between the number of words in each essay and content word frequency to determine the strength of the relationship between the two indices. The results of the Pearson correlation, which indicated a small, negative relationship between the two variables (r = -0.203, p < .001), are reported in Table 2. A scatterplot of the data (with a line of best fit) is displayed in Figure 1.

Table 2.
Results of the Pearson correlation analysis with number of words

Index          n     r        p
Frequency CW   480   -0.203   < .001
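Rather than retyping numbers, the values reported in Table 2 can be pulled directly from the object returned by cor.test(), which helps avoid transcription errors in a write-up (a brief sketch; cor.test() returns a standard htest list with these fields):

pearson_results <- cor.test(cor_data$nwords, cor_data$frequency_CW) #store the test results
pearson_results$estimate #r = -0.203
pearson_results$p.value #7.201e-06
pearson_results$conf.int #95% CI: -0.287 to -0.116

The code below reproduces the scatterplot shown in Figure 1.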
g3 <- ggplot(cor_data, aes(x = frequency_CW, y = nwords)) +
  geom_point() +
  #geom_smooth(method = "loess", color = "red") + #uncomment to add a loess line based on a moving average
  geom_smooth(method = "lm") + #straight line of best fit based on the entire dataset
  scale_color_viridis(discrete = TRUE) +
  theme_minimal()

#print(g3)

Scatter plot showing black points and a blue line of best fit (linear regression) running through the data points. The gray shaded area around the line indicates the confidence interval for the regression.

Figure 1. Relationship between Number of Words and Content Word Frequency

Other correlation tests

If your variables do not meet the assumptions of the Pearson product-moment correlation, all is not lost! The following correlation coefficients can also be calculated.

Ordinal data: Spearman’s Rho and Kendall’s Tau

If your data violates the assumption of bivariate normality, includes ordinal variables (and/or continuous ones), or has a curvilinear relationship (but one that still generally follows an upward or downward trend), you can use Spearman's Rho or Kendall's Tau. If you have a lot of data points with the same score (i.e., ties), Kendall's Tau is preferred.

Below, we will repeat our analyses above, but this time with the two non-parametric tests.

cor.test(cor_data$nwords, cor_data$frequency_CW, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  cor_data$nwords and cor_data$frequency_CW
## S = 21225779, p-value = 0.0008638
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1515772
cor.test(cor_data$nwords, cor_data$frequency_CW, method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  cor_data$nwords and cor_data$frequency_CW
## z = -3.2893, p-value = 0.001004
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.1006522

The results from both tests indicate a significant relationship. Further, both tests indicate that the relationship between number of words and content word frequency is negative. The Spearman's Rho value (-.152) indicates a small/weak effect (according to Cohen, 1988), and the more conservative Kendall's Tau value (-.101) also falls in the small/weak range.