Back to Homepage

Loading data, more visualization, assumptions, significance, effect size

Tutorial Objectives

The objectives of this tutorial are to:

  • Load data into R from a .csv file
  • Continue to build proficiency with creating (and interpreting) data visualizations
    • scatterplots
    • boxplots
  • Be introduced to some important terms and concepts in statistical analysis:
    • mean
    • median
    • mode
    • variance
    • standard deviation
    • normal distribution
    • probability
    • statistical significance
    • effect size

Part I: Loading data, more practice with visualizing data

Description of Data

The data represent a sample of a simplified (and adapted) version of the metadata for the written section of the International Corpus Network of Asian Learners of English. This corpus includes argumentative essays written on two prompts by college students from a range of countries in Asia. As we see below, this file includes a number of variables, including a participant identifier, the nationality of the participant, their score on a vocabulary size test (VST), and their English proficiency.

Loading data

In previous tutorials, we have used dataframes that are part of the R base package or are part of ggplot2. Here, we will load data from a spreadsheet that is formatted as a comma separated values (.csv) document.

Our first step will be to open a new R script and save it in a convenient location (e.g., in a folder named “Rday3” on your desktop).

Our second step will be to download the target .csv file, extract it, and save a copy of the target .csv file (“ICNALE_500_simple.csv”) in a folder called “data” in the Rday3 folder.

Third, with our new R script open in RStudio, we will click on “Session” in the toolbar, then “Set Working Directory” > “To Source File Location”. This will let RStudio know where to look for your file(s).
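If you prefer to set the working directory in code rather than through the menus, `getwd()` and `setwd()` offer one way to do this (the path below is only an example; adjust it to wherever you saved your Rday3 folder):

```r
# See which folder R currently treats as the working directory
getwd()

# The menu route (Session > Set Working Directory > To Source File Location)
# can also be done with setwd(); the path here is only an example:
# setwd("~/Desktop/Rday3")
```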

Finally, we will load our target .csv file (“ICNALE_500_simple.csv”) as a dataframe (and in this case we will call that dataframe “fundata”).

fundata <- read.csv("data/ICNALE_500_simple.csv", header = TRUE) #this presumes that the .csv file is in a subfolder entitled "data"
summary(fundata)
##  Participant        Nationality             VST        Proficiency_CEFR  
##  Length:500         Length:500         Min.   :10.00   Length:500        
##  Class :character   Class :character   1st Qu.:24.00   Class :character  
##  Mode  :character   Mode  :character   Median :33.00   Mode  :character  
##                                        Mean   :32.56                     
##                                        3rd Qu.:41.00                     
##                                        Max.   :50.00                     
##  Proficiency_Number
##  Min.   :1.00      
##  1st Qu.:1.00      
##  Median :2.00      
##  Mean   :2.24      
##  3rd Qu.:3.00      
##  Max.   :4.00

Based on the summary output, we see that our dataframe has five variables (Participant, Nationality, VST, Proficiency_CEFR, and Proficiency_Number).

These variable names represent the following:

  • Participant: Participant identifier
  • Nationality: Nationality of the participant
  • VST: Score on a receptive vocabulary size test
  • Proficiency_CEFR: Categorical proficiency rating based on the Common European Framework of Reference for Languages (CEFR)
  • Proficiency_Number: CEFR rating converted to a numerical scale (ranging from 1 to 4)
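Several of the terms listed in the tutorial objectives (mean, median, variance, standard deviation) can be computed directly with base R functions. Here is a quick sketch using a small made-up vector of scores (illustrative values, not drawn from fundata):

```r
# A small, made-up vector of test scores (for illustration only)
scores <- c(10, 24, 33, 41, 50)

mean(scores)   # arithmetic mean: 31.6
median(scores) # middle value when sorted: 33
var(scores)    # variance (average squared deviation from the mean)
sd(scores)     # standard deviation (square root of the variance)
```

Note that base R’s `mode()` function returns the storage mode of an object, not the statistical mode.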

Visualizing the data

Let’s apply what we learned in the last tutorial to this new data. Don’t forget to load ggplot2!

library(ggplot2)
library(viridis) #color-friendly palettes
## Loading required package: viridisLite

Scatterplots

First, let’s create a scatterplot with a line of best fit to examine the relationship between VST (y-axis) and Proficiency_Number (x-axis). What does the scatterplot indicate about the relationship between VST scores and proficiency?

g1 <- ggplot(data = fundata) +
  geom_point(mapping = aes(x = Proficiency_Number, y = VST)) +
  geom_smooth(mapping = aes(x = Proficiency_Number, y = VST), method = "lm") +
  scale_color_viridis(discrete = TRUE) +
  labs(
    x = "Proficiency_Number",
    y = "VST",
    title = "Relationship between Proficiency Number and VST")

print(g1)
## `geom_smooth()` using formula 'y ~ x'
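The objectives also mention boxplots. As a sketch (assuming the fundata dataframe loaded above, with its Proficiency_CEFR and VST columns), a boxplot of VST scores by CEFR level could be built like this:

```r
# Boxplot of VST scores grouped by CEFR proficiency level
g2 <- ggplot(data = fundata) +
  geom_boxplot(mapping = aes(x = Proficiency_CEFR, y = VST)) +
  labs(
    x = "Proficiency (CEFR)",
    y = "VST",
    title = "VST scores by CEFR proficiency level")

print(g2)
```

Because Proficiency_CEFR is a categorical variable, each level gets its own box, making it easy to compare medians and spread across proficiency groups.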