Python Tutorial 9: Calculating and Outputting Text-Level Variables
(updated 11-19-2020)
In this tutorial, we will apply concepts learned in previous tutorials to create a program that reads in files (e.g., learner corpus texts), calculates a number of indices (i.e., number of words, average frequency score, lexical diversity score) for each text, and then writes the output to a tab-delimited spreadsheet file. This basic program has the building blocks for the creation of much more complex programs (like TAALED and TAALES).
We will use a frequency list derived from the Brown Corpus and will process a version of the NICT JLE learner corpus. Of course, this code could be used for a wide variety of purposes (and corpus types). Click here to download the version of the corpus used in this tutorial.
Import packages and frequency list
First, we will import the packages necessary for subsequent functions:
import math #for logarithmic transformation
import glob #for grabbing filenames
import operator #for dictionary sorting
Then, we will import our frequency list using code we generated while completing the exercises from Python Tutorial 3:
def splitter(input_string): #presumes that the list is tab-delimited
    output_list = []
    for x in input_string.split("\n")[1:]: #iterate through sample string split by "\n", skip header row
        cols = x.split("\t") #split the item by "\t"
        word = cols[0] #the first item will be the word
        freq = cols[1] #the second will be the frequency value
        output_list.append([word,freq]) #append the [word, freq] list to the output list
    return(output_list)
def freq_dicter(input_list):
    output_dict = {}
    for x in input_list: #iterate through list
        word = x[0] #word is the first item
        freq = float(x[1]) #frequency is second item (convert to float using float())
        output_dict[word] = freq #assign key:value pair
    return(output_dict)
def file_freq_dicter(filename):
    #out_dict = {} #if you use the previously defined function freq_dicter() this is not necessary
    spreadsheet = open(filename).read() #open and read the file here
    split_ss = splitter(spreadsheet) #split the string into rows
    out_dict = freq_dicter(split_ss) #iterate through the rows and assign the word as the key and the frequency as the value
    return(out_dict)
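To see how these functions work together, we can run splitter() and freq_dicter() on a small made-up string (the words and frequency values below are invented for illustration):
sample_string = "word\tfrequency\nthe\t1000\nof\t500" #hypothetical two-word frequency list
print(splitter(sample_string)) #[['the', '1000'], ['of', '500']]
print(freq_dicter(splitter(sample_string))) #{'the': 1000.0, 'of': 500.0}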
Finally, we import our frequency list (see the bottom of this tutorial for the code used to create it). The frequency list can be downloaded here.
brown_freq = file_freq_dicter("brown_freq_2020-11-19.txt")
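To check that the frequency dictionary loaded correctly, we can look up a common word (the exact frequency value you see will depend on the version of the frequency list you are using):
print("the" in brown_freq) #True
print(brown_freq["the"]) #the raw frequency of "the" in the Brown Corpus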
Functions for calculating lexical indices
First, we will use the safe_divide() function from Tutorial 3.
def safe_divide(numerator,denominator): #this function has two arguments
    if denominator == 0: #if the denominator is 0
        output = 0 #then the output is 0
    else: #otherwise
        output = numerator/denominator #the output is the numerator divided by the denominator
    return(output) #return output
Then we will create a simple function for calculating the number of words in a text.
def word_counter(low): #low = list of words
    nwords = len(low)
    return(nwords)
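As a quick (and admittedly trivial) check of these two helper functions, consider the made-up example below:
sample_tokens = ["this","is","a","short","sample","text"] #hypothetical tokenized text
print(word_counter(sample_tokens)) #6
print(safe_divide(10,4)) #2.5
print(safe_divide(10,0)) #0 (instead of raising a ZeroDivisionError)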
Next, we will create a function that will calculate the average (log-transformed) frequency value for the words in each text. This function will look up each word in a text, find the frequency of that word in a reference corpus, log-transform the frequency value (to help account for the Zipfian distribution of words in a corpus), and create an averaged score for the whole text.
def frequency_count(tok_text,freq_dict):
    freq_sum = 0
    word_sum = 0
    for x in tok_text:
        if x in freq_dict: #if the word is in the frequency dictionary
            freq_sum += math.log(freq_dict[x]) #add the (logged) frequency value to the freq_sum counter
            word_sum += 1 #add one to the word_sum counter
        else:
            continue #if the word isn't in the frequency dictionary, we will ignore it in our index calculation
    return(safe_divide(freq_sum,word_sum)) #return average (logged) frequency score for words in the text
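The hypothetical example below assumes that the misspelled token "misspeled" does not occur in brown_freq; it is simply skipped, and the returned score is the average of the logged frequencies of the remaining words (the exact value will depend on the frequency list):
sample_tokens = ["the","dog","barked","misspeled"] #hypothetical tokenized text
print(frequency_count(sample_tokens,brown_freq)) #average logged frequency of the words that are in brown_freq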
Finally (for now), we will create a function that calculates a score representing the diversity of lexical items in a text. This particular index, moving average type-token ratio (MATTR), has been shown to be independent of text length (unlike many other well-known indices). See Covington et al. (2010), Kyle et al. (2020), and/or Zenker and Kyle (2020) for more details.
def lexical_diversity(tok_text,window_length = 50): #this is for moving average type-token ratio (MATTR). See Covington et al. (2010); Kyle et al. (2020); Zenker & Kyle (2020)
    if len(tok_text) < (window_length + 1):
        ma_ttr = safe_divide(len(set(tok_text)),len(tok_text))
    else:
        sum_ttr = 0
        denom = 0
        for x in range(len(tok_text)):
            small_text = tok_text[x:(x + window_length)]
            if len(small_text) < window_length:
                break
            denom += 1
            sum_ttr += safe_divide(len(set(small_text)),float(window_length))
        ma_ttr = safe_divide(sum_ttr,denom)
    return ma_ttr
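Because the made-up text below is shorter than the default window length of 50, lexical_diversity() falls back to a simple type-token ratio (number of unique words divided by the total number of words):
sample_tokens = ["the","cat","sat","on","the","mat"]
print(lexical_diversity(sample_tokens)) #5 unique words / 6 total words = 0.8333...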
Tokenizer
We will use a version of the tokenize() function from Tutorial 4 to tokenize our texts.
def tokenize(input_string): #input_string = text string
    tokenized = [] #empty list that will be returned
    ##### CHANGES TO CODE HERE #######
    #these are the punctuation marks in the Brown corpus + '"'
    punct_list = ['-',',','.',"'",'&','`','?','!',';',':','(',')','$','/','%','*','+','[',']','{','}','"']
    #this is a sample (but potentially incomplete) list of items to replace with spaces
    replace_list = ["\n","\t"]
    #this is a sample (but potentially incomplete) list of items to ignore
    ignore_list = [""]
    ##### CHANGES TO CODE HERE #######
    #iterate through the punctuation list and delete each item
    for x in punct_list:
        input_string = input_string.replace(x, "") #instead of adding a space before punctuation marks, we will delete them (by replacing with nothing)
    #iterate through the replace list and replace each item with a space
    for x in replace_list:
        input_string = input_string.replace(x," ")
    #our examples will be in English, so for now we will lowercase the text
    #this is, of course, optional
    input_string = input_string.lower()
    #then we split the string into a list
    input_list = input_string.split(" ")
    for x in input_list:
        if x not in ignore_list: #if the item is not in the ignore list
            tokenized.append(x) #add it to the list "tokenized"
    return(tokenized)
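A quick check of the tokenizer on a short made-up string shows that punctuation is removed, the text is lowercased, and the newline character is treated as a word boundary:
print(tokenize("Hello, world!\nThis is a sample sentence."))
#['hello', 'world', 'this', 'is', 'a', 'sample', 'sentence']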
Putting it all together
Now, we can create a function text_processor() that will read all files in a folder, calculate a number of lexical indices for each text, and then output those to a tab-delimited file.
def text_processor(folder,outname): #folder name, name of output file
    corp_dict = {} #dictionary to store all data (not absolutely necessary, but potentially helpful)
    outf = open(outname,"w") #create output file
    outf.write("\t".join(["filename","nwords","av_freq","mattr"])) #write header
    filenames = glob.glob(folder + "/*") #get filenames in folder
    for filename in filenames: #iterate through filenames
        print(filename)
        text_d = {} #create text dictionary to store indices for each text
        simple_fname = filename.split("/")[-1] #get last part of filename
        text = tokenize(open(filename, errors = "ignore").read()) #read file and tokenize it
        #add data to the text dictionary:
        text_d["filename"] = simple_fname
        text_d["nwords"] = word_counter(text) #calculate number of words
        text_d["av_freq"] = frequency_count(text,brown_freq) #calculate average frequency
        text_d["mattr"] = lexical_diversity(text) #calculate moving average type-token ratio
        ### add more stuff to dictionary here as needed ###
        #add text dictionary to corpus dictionary (not absolutely necessary, but potentially helpful)
        corp_dict[simple_fname] = text_d
        out_line = [text_d["filename"],str(text_d["nwords"]),str(text_d["av_freq"]),str(text_d["mattr"])] #create line for output, make sure to turn any numbers to strings
        outf.write("\n" + "\t".join(out_line)) #write line
    outf.flush() #flush buffer
    outf.close() #close file
    return(corp_dict)
Below, we use our text_processor() function to calculate lexical indices in the NICT JLE learner corpus (though almost any properly formatted corpus could be used!):
nict_data = text_processor("NICT_JLE_CLEANED","lex_richness_JLE.txt") #write data to a file named "lex_richness_JLE.txt"
print(nict_data["file00301.txt"]) #print the data for one file
{'filename': 'file00301.txt', 'nwords': 1109, 'av_freq': 7.110700325247902, 'mattr': 0.7109433962264144}
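Because text_processor() also returns the corpus dictionary, we can work with the results directly in Python in addition to (or instead of) opening the spreadsheet file. The hypothetical snippet below, for example, prints the names of any texts that are shorter than 200 words:
for fname, indices in nict_data.items():
    if indices["nwords"] < 200: #200 is an arbitrary threshold chosen for illustration
        print(fname, indices["nwords"])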
The output file generated by this code is available here.
Appendix: Code for generating frequency list
The code used for generating the frequency list can be found below (note that the tokenize() function above is referenced here as well).
def corpus_freq(dir_name):
    freq = {} #create an empty dictionary to store the word : frequency pairs
    #create a list that includes all files in the dir_name folder that end in ".txt"
    filenames = glob.glob(dir_name + "/*.txt")
    #iterate through each file:
    for filename in filenames:
        #open the file as a string
        text = open(filename, errors = "ignore").read()
        #tokenize the text using our tokenize() function
        tokenized = tokenize(text)
        #iterate through the tokenized text and add words to the frequency dictionary
        for x in tokenized:
            #the first time we see a particular word we create a key:value pair
            if x not in freq:
                freq[x] = 1
            #when we see a word subsequent times, we add (+=) one to the frequency count
            else:
                freq[x] += 1
    return(freq)
def freq_writer(freq_list,filename):
    outf = open(filename, "w")
    outf.write("word\tfrequency") #write header
    for x in freq_list:
        outf.write("\n" + x[0] + "\t" + str(x[1])) #newline character + word + tab + string version of the frequency
    outf.flush() #flush buffer
    outf.close() #close file
brown_rf = sorted(corpus_freq("brown_corpus").items(),key=operator.itemgetter(1),reverse = True)
freq_writer(brown_rf, "brown_freq_2020-11-19.txt")
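If you generate your own version of the frequency list, you can check the first few lines of the output file to make sure it is formatted as expected (a header row followed by tab-delimited word and frequency pairs, sorted from most to least frequent):
print(open("brown_freq_2020-11-19.txt").read().split("\n")[:5]) #header row plus the four most frequent words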