**The objectives of this tutorial are to:**

- Load data into R from a .csv file
- Continue to build proficiency with creating (and interpreting) data visualizations
  - scatterplots
  - boxplots
- Be introduced to some important terms and concepts in statistical analysis:
  - mean
  - median
  - mode
  - variance
  - standard deviation
  - normal distribution
  - probability
  - statistical significance
  - effect size

The data represents a sample of a simplified (and adapted) version of the metadata for the written section of the International Corpus Network of Asian Learners of English (ICNALE). This corpus includes argumentative essays written on two prompts by college students from a range of countries in Asia. As we see below, the file includes several variables: a participant identifier, the participant’s nationality, their score on a vocabulary size test (VST), and their English proficiency.

In previous tutorials, we have used dataframes that are part of the R base package or are part of ggplot2. Here, we will load data from a spreadsheet that is formatted as a comma-separated values (.csv) document.

Our first step will be to open a new R script and save it in a convenient location (e.g., in a folder named “Rday3” on your desktop).

Our second step will be to download the target .csv file, extract it, and save a copy of the target .csv file (“ICNALE_500_simple.csv”) in a folder called “data” in the Rday3 folder.

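Third, with our new R script open in RStudio, we will click on “Session” in the toolbar, then “Set Working Directory”, then “To Source File Location”. This lets RStudio know where to look for your file(s): relative paths such as “data/ICNALE_500_simple.csv” are resolved starting from the working directory.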
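The same step can also be done (and checked) from code. The sketch below is only an illustration: it uses a temporary folder as a stand-in for “Rday3” and an empty placeholder file named `example.csv`, both of which are assumptions made so that the example runs anywhere. In your own script you would point `setwd()` at your real project folder instead.

```r
# Hypothetical project folder: a temporary directory stands in for "Rday3"
project_dir <- file.path(tempdir(), "Rday3")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Empty placeholder file (hypothetical name) so there is something to find
file.create(file.path(project_dir, "data", "example.csv"))

# setwd() changes the working directory and returns the previous one
old_wd <- setwd(project_dir)

getwd()                                    # where R now looks for relative paths
found <- file.exists("data/example.csv")   # relative path, resolved from getwd()
print(found)

# Restore the original working directory
setwd(old_wd)
```

If `file.exists()` returns `FALSE` for your own file, the working directory or the folder name is usually the culprit, and checking `getwd()` is a quick way to see which.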

Finally, we will load our target .csv file (“ICNALE_500_simple.csv”) as a dataframe (and in this case we will call that dataframe “fundata”).

```
fundata <- read.csv("data/ICNALE_500_simple.csv", header = TRUE) #this presumes that the .csv file is in a subfolder entitled "data"
summary(fundata)
```

```
##  Participant        Nationality             VST        Proficiency_CEFR  
##  Length:500         Length:500         Min.   :10.00   Length:500        
##  Class :character   Class :character   1st Qu.:24.00   Class :character  
##  Mode  :character   Mode  :character   Median :33.00   Mode  :character  
##                                        Mean   :32.56                     
##                                        3rd Qu.:41.00                     
##                                        Max.   :50.00                     
##  Proficiency_Number
##  Min.   :1.00      
##  1st Qu.:1.00      
##  Median :2.00      
##  Mean   :2.24      
##  3rd Qu.:3.00      
##  Max.   :4.00      
```

Based on the summary output, we see that our dataframe has five variables (Participant, Nationality, VST, Proficiency_CEFR, and Proficiency_Number).

These variable names represent the following:

- Participant: Participant identifier
- Nationality: Nationality of participant
- VST: Score on receptive vocabulary size test
- Proficiency_CEFR: Categorical proficiency rating based on the Common European Framework of Reference for Languages (CEFR)
- Proficiency_Number: CEFR rating converted to a numerical scale (ranging from 1 to 4)
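Because `summary()` reports character columns only as `Length` and `Class`, it can help to inspect the column types with `str()` and to convert categorical columns to factors so that counts per category are shown. The sketch below uses a small toy dataframe with the same column names as `fundata`; every value in it (including the nationality codes and CEFR labels) is invented for illustration and does not come from the corpus.

```r
# Toy stand-in for fundata: same column names, invented values
toy <- data.frame(
  Participant = c("P001", "P002", "P003", "P004"),
  Nationality = c("AAA", "BBB", "AAA", "CCC"),
  VST = c(21, 35, 44, 28),
  Proficiency_CEFR = c("A2", "B1", "B2", "B1"),
  Proficiency_Number = c(1, 2, 3, 2),
  stringsAsFactors = FALSE
)

# Show each column's type and its first few values
str(toy)

# Convert a categorical column to a factor so summary() reports counts
toy$Nationality <- factor(toy$Nationality)
nationality_counts <- summary(toy$Nationality)
print(nationality_counts)

# Cross-tabulate the categorical rating against its numeric version
print(table(toy$Proficiency_CEFR, toy$Proficiency_Number))
```

The same calls should work on `fundata` once it is loaded, e.g. `str(fundata)`.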

Let’s apply what we learned in the last tutorial to this new data. Don’t forget to load ggplot2!

`library(ggplot2)`

```
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'pillar'
```

`library(viridis) #color-friendly palettes`

`## Loading required package: viridisLite`

First, let’s create a scatterplot with a line of best fit to examine
the relationship between VST (y-axis) and Proficiency_Number (x-axis).
**What does the scatterplot indicate about the relationship
between VST scores and proficiency?**

```
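g1 <- ggplot(data = fundata) +
  # one point per participant
  geom_point(mapping = aes(x = Proficiency_Number, y = VST)) +
  # straight line of best fit (linear model) with a confidence band
  geom_smooth(mapping = aes(x = Proficiency_Number, y = VST), method = "lm") +
  # has no visible effect here because no color aesthetic is mapped
  scale_color_viridis(discrete = TRUE) +
  labs(
    x = "Proficiency_Number",
    y = "VST",
    title = "Relationship between Proficiency Number and VST")
#print(g1) # uncomment to display the plot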
```

`` ## `geom_smooth()` using formula 'y ~ x' ``
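The formula `y ~ x` in this message means the line of best fit is a simple linear regression of the y-axis variable on the x-axis variable. As a rough sketch of what that line represents, the example below fits the same kind of model with base R’s `lm()` and also computes a correlation with `cor()`. The four data points are invented for illustration (they are not ICNALE values), and the names `toy_fit`, `slope`, and `intercept` are hypothetical.

```r
# Invented toy values: proficiency level (x) and vocabulary score (y)
toy <- data.frame(
  Proficiency_Number = c(1, 2, 3, 4),
  VST = c(20, 28, 42, 50)
)

# Fit a straight line: VST = intercept + slope * Proficiency_Number
toy_fit <- lm(VST ~ Proficiency_Number, data = toy)

intercept <- coef(toy_fit)[["(Intercept)"]]
slope <- coef(toy_fit)[["Proficiency_Number"]]
print(c(intercept = intercept, slope = slope))

# Pearson correlation: direction and strength of the linear relationship
r <- cor(toy$Proficiency_Number, toy$VST)
print(r)
```

A positive slope means that predicted VST scores rise as proficiency level rises. On the real data, replacing `toy` with `fundata` would give the intercept and slope of the line drawn by `geom_smooth()`.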