Skip to the content.

Python Tutorial 9: Calculating and Outputting Text-Level Variables

Back to Tutorial Index

(updated 11-19-2020)

In this tutorial, we will apply concepts learned in previous tutorials to create a program that reads in files (e.g., learner corpus texts), calculates a number of indices (i.e., number of words, average frequency score, lexical diversity score) for each text, and then writes the output to a tab-delimited spreadsheet file. This basic program has the building blocks for the creation of much more complex programs (like TAALED and TAALES).

We will use a frequency list derived from the Brown Corpus and will process a version of the NICT JLE learner corpus. Of course, this code could be used for a wide variety of purposes (and corpus types). Click here to download version of corpus used in this tutorial.

Import packages and frequency list

First, we will import the packages necessary for subsequent functions:

import math #for logarithmic transformation
import glob #for grabbing filenames
import operator #for dictionary sorting

Then, we will import our frequency list using code we generated while completing the exercises from Python Tutorial 3:

def splitter(input_string): #presumes that the list is tab-delimitted
	output_list = []
	#insert code here
	for x in input_string.split("\n")[1:]: #iterate through sample string split by "\n", skip header row
		cols = x.split("\t") #split the item by "\t"
		word = cols[0] #the first item will be the word
		freq = cols[1] #the second will be the frequency value
		output_list.append([word,freq]) #append the [word, freq] list to the output list


def freq_dicter(input_list):
	output_dict = {}
	#insert code here
	for x in input_list: #iterate through list
		word = x[0] #word is the first item
		freq = float(x[1]) #frequency is second item (convert to float using float())
		output_dict[word] = freq #assign key:value pair


def file_freq_dicter(filename):
	#out_dict = {} #if you use the previously defined function freq_dicter() this is not necessary
	spreadsheet = open(filename).read() #open and read the file here
	split_ss = splitter(spreadsheet)#split the string into rows
	out_dict = freq_dicter(split_ss)#iterate through the rows and assign the word as the key and the frequency as the value


Finally, we import our frequency list (see bottom of this tutorial for the code used to create the frequency list). The frequency list can be downloaded here.

brown_freq = file_freq_dicter("brown_freq_2020-11-19.txt")

Functions for calculating lexical indices

First, we will use the safe_divide() function from Tutorial 3.

def safe_divide(numerator,denominator): #this function has two arguments
	if denominator == 0: #if the denominator is 0
		output = 0 #the the output is 0
	else: #otherwise
		output = numerator/denominator #the output is the numerator divided by the denominator

	return(output) #return output

Then we will create a simple function for calculating the number of words in a text.

def word_counter(low): #list of words
	nwords = len(low)

Next, we will create a function that will calculate the average (log-transformed) frequency value for the words in each text. This function will look up each word in a text, find the frequency of that word in a reference corpus, log-transform the frequency value (to help account for the Zipfian distribution of words in a corpus), and create an averaged score for the whole text.

def frequency_count(tok_text,freq_dict):
	freq_sum = 0
	word_sum = 0
	for x in tok_text:
		if x in freq_dict: #if the word is in the frequency dictionary
			freq_sum += math.log(freq_dict[x]) #add the (logged) frequency value to the freq_sum counter
			word_sum += 1 #add one to the word_sum counter
			continue #if the word isn't in the frequency dictionary, we will ignore it in our index calculation

	return(safe_divide(freq_sum,word_sum)) #return average (logged) frequency score for words in the text

Finally (for now), we will create a function that calculates a score representing the diversity of lexical items in a text. This particular index, moving average type-token ratio (MATTR) has been shown to be independent of text length (unlike many other well-known indices). See Covington et al. (2010), Kyle et al. (2020), and/or Zenker and Kyle (2020) for more details.

def lexical_diversity(tok_text,window_length = 50): #this is for moving average type token ratio (TTR). See Covington et al., 2010; Kyle et al. (2020); Zenker & Kyle (2020)
	if len(tok_text) < (window_length + 1):
		ma_ttr = safe_divide(len(set(tok_text)),len(tok_text))

		sum_ttr = 0
		denom = 0
		for x in range(len(tok_text)):
			small_text = tok_text[x:(x + window_length)]
			if len(small_text) < window_length:
			denom += 1
			sum_ttr+= safe_divide(len(set(small_text)),float(window_length))
		ma_ttr = safe_divide(sum_ttr,denom)

	return ma_ttr


We will use a version of the tokenize() function from Tutorial 4 to tokenize our texts.

def tokenize(input_string): #input_string = text string
	tokenized = [] #empty list that will be returned

	##### CHANGES TO CODE HERE #######
	#these are the punctuation marks in the Brown corpus + '"'
	punct_list = ['-',',','.',"'",'&','`','?','!',';',':','(',')','$','/','%','*','+','[',']','{','}','"']

	#this is a sample (but potentially incomplete) list of items to replace with spaces
	replace_list = ["\n","\t"]

	#This is a sample (but potentially incomplete) list if items to ignore
	ignore_list = [""]

	##### CHANGES TO CODE HERE #######
	#iterate through the punctuation list and delete each item
	for x in punct_list:
		input_string = input_string.replace(x, "") #instead of adding a space before punctuation marks, we will delete them (by replacing with nothing)

	#iterate through the replace list and replace it with a space
	for x in replace_list:
		input_string = input_string.replace(x," ")

	#our examples will be in English, so for now we will lower them
	#this is, of course optional
	input_string = input_string.lower()

	#then we split the string into a list
	input_list = input_string.split(" ")

	for x in input_list:
		if x not in ignore_list: #if item is not in the ignore list
			tokenized.append(x) #add it to the list "tokenized"


Putting it all together

Now, we can create a function text_processor() that will read all files in a folder, calculate a number of lexical indices, and then outputs those to a tab-delimited file.

def text_processor(folder,outname): #folder name, name of output file
	corp_dict = {} #dictionary to store all data (not absolutely necessary, but potentially helpful)
	outf = open(outname,"w") #create output file
	outf.write("\t".join(["filename","nwords","av_freq","mattr"])) #write header
	filenames = glob.glob(folder + "/*") #get filenames in folder

	for filename in filenames: #iterate through filenames
		text_d = {} #create text dictionary to store indices for each text
		simple_fname = filename.split("/")[-1] #get last part of filename
		text = tokenize(open(filename, errors = "ignore").read()) #read file and tokenize it

		#add data to the text dictionary:
		text_d["filename"] = simple_fname
		text_d["nwords"] = word_counter(text) #calculate number of words
		text_d["av_freq"] = frequency_count(text,brown_freq) #calculate average frequency
		text_d["mattr"] = lexical_diversity(text)

		### add more stuff to dictionary here as needed ###

		#add text dictionary to corpus dictionary (not absolutely necessary, but potentially helpful)
		corp_dict[simple_fname] = text_d

		out_line = [text_d["filename"],str(text_d["nwords"]),str(text_d["av_freq"]),str(text_d["mattr"])] #create line for output, make sure to turn any numbers to strings
		outf.write("\n" + "\t".join(out_line)) #write line

	outf.flush() #flush buffer
	outf.close() #close_file


Below, we use our text_processor() function to calculate lexical indices in the NICT JLE learner corpus (though almost any properly-formatted corpus could be used!)

nict_data = text_processor("NICT_JLE_CLEANED","lex_richness_JLE.txt") #write data to a file named "lex_richness_JLE.txt"

print(nict_data["file00301.txt"]) #print the data for one file
{'filename': 'file00301.txt', 'nwords': 1109, 'av_freq': 7.110700325247902, 'mattr': 0.7109433962264144}

The output file generated by this code is available here.

Appendix: Code for generating frequency list

The code used for generating the frequency list can be found below (note that the tokenize() function above is referenced here as well).

def corpus_freq(dir_name):
	freq = {} #create an empty dictionary to store the word : frequency pairs

	#create a list that includes all files in the dir_name folder that end in ".txt"
	filenames = glob.glob(dir_name + "/*.txt")

	#iterate through each file:
	for filename in filenames:
		#open the file as a string
		text = open(filename, errors = "ignore").read()
		#tokenize text using our tokenize() function
		tokenized = tokenize(text)

		#iterate through the lemmatized text and add words to the frequency dictionary
		for x in tokenized:
			#the first time we see a particular word we create a key:value pair
			if x not in freq:
				freq[x] = 1
			#when we see a word subsequent times, we add (+=) one to the frequency count
				freq[x] += 1


def freq_writer(freq_list,filename):
	outf = open(filename, "w")
	outf.write("word\tfrequency") #write header

	for x in freq_list:
		outf.write("\n" + x[0] + "\t" + str(x[1])) #newline character + word + tab + string version of Frequency
	outf.flush() #flush buffer
	outf.close() #close file

brown_rf = sorted(corpus_freq("brown_corpus").items(),key=operator.itemgetter(1),reverse = True)
freq_writer(brown_rf, "brown_freq_2020-11-19.txt")