
Python Tutorial 4: Tokenization, Lemmatization, and Frequency Lists

Back to Tutorial Index

(updated 10-2-2020)

In this tutorial, we will work on basic corpus analysis functions.

We will work on completing smaller tasks, including:

  1. tokenizing (and cleaning) a text
  2. lemmatizing a tokenized text
  3. calculating word frequencies

We will then combine these tasks in a larger function that will read in a corpus and output a frequency dictionary. While there are many ways that we can use Python to accomplish this end goal, we will focus on writing simple scripts that are scalable (i.e., can be used with corpora of various sizes) and easily extended/revised for your own purposes.


Tokenization

We will read in a corpus file as a string. Our first step will be to convert the string of characters into a list of strings (words) that we can count and otherwise manipulate. We will also want to ensure that our characters are in the desired format (e.g., lower case, upper case, or a mix of the two) and that unwanted characters (such as punctuation marks) are separated from words (and/or removed).

For this function, we will use the .split() method (which we have discussed in previous tutorials).
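As a quick refresher, .split() turns a string into a list of substrings, splitting on whatever delimiter we provide:

```python
sample = "this is a sample"
#split on single spaces, producing a list of word strings
print(sample.split(" "))
#> ['this', 'is', 'a', 'sample']
```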

For cleaning, we will use the .replace() method, which allows us to replace any string of characters with another string of characters.

In the example below, we will replace all periods “.” with a space and a period “ .”, which will separate periods from words, but will still retain them in our corpus.

#In this example, we will replace any periods with a space + a period (separating the period from the preceding word)
text = "This is a sample string."
clean_text = text.replace("."," .")
print(clean_text)
> This is a sample string .

Note that we could also use regular expressions to delete/replace characters. While regular expressions can be very powerful, they are also more complicated, so we will hold off on discussing them (for now).
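For the curious, here is a brief preview of the same idea using the re module; this is just a sketch of what a regular-expression version could look like, not something we will rely on in this tutorial:

```python
import re

text = "This is a sample string."
#the character class [.?!,'] matches any one of those punctuation marks;
#the parentheses capture it, and \1 re-inserts it after a space
clean_text = re.sub(r"([.?!,'])", r" \1", text)
print(clean_text)
#> This is a sample string .
```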

Below, we use the .replace() and .split() methods to write a function called tokenize(), which will take a string as an argument and output a tokenized list.

def tokenize(input_string):
	tokenized = [] #empty list that will be returned

	#this is a sample (but incomplete!) list of punctuation characters
	punct_list = [".", "?","!",",","'"]

	#this is a sample (but potentially incomplete) list of items to replace with spaces
	replace_list = ["\n","\t"]

	#This is a sample (but potentially incomplete) list of items to ignore
	ignore_list = [""]

	#iterate through the punctuation list and replace each item with a space + the item
	for x in punct_list:
		input_string = input_string.replace(x," " + x)

	#iterate through the replace list and replace each item with a space
	for x in replace_list:
		input_string = input_string.replace(x," ")

	#our examples will be in English, so for now we will lower-case them
	#this is, of course, optional
	input_string = input_string.lower()

	#then we split the string into a list
	input_list = input_string.split(" ")

	#finally, we ignore unwanted items
	for x in input_list:
		if x not in ignore_list: #if item is not in the ignore list
			tokenized.append(x) #add it to the list "tokenized"

	#Then, we return the list
	return(tokenized)

Now, we can try out our new function:

s1 = "This is a sample sentence. This is one too! Is this?"
l1 = tokenize(s1)
> ['this', 'is', 'a', 'sample', 'sentence', '.', 'this', 'is', 'one', 'too', '!', 'is', 'this', '?']


Lemmatization

There are many methods of lemmatizing. Here, we will use a very simple (but imperfect) dictionary-based method, which is increasingly referred to as “flemmatization” (see, e.g., Kyle, 2020). Note that with the methods below, we can also familize a text (as is commonly done in the Paul Nation tradition of vocabulary analysis; see Nation, 2006 for more details on word families).

In order to lemmatize our corpus, we need to complete two tasks. First, we need to load a lemma dictionary. Then, we will use that dictionary to convert a tokenized corpus into a lemmatized version.

For this tutorial, we will load a lemma dictionary that I already generated from the list provided by Laurence Anthony. For sake of simplicity and brevity, I am not going to go over generating the dictionary from a text file, but if you are interested, the code is available here.
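To give a rough idea of what that generation code does, the sketch below builds a small lemma dictionary from lines of text and pickles it. The line format here (lemma, tab, comma-separated word forms) is a simplifying assumption for illustration; Anthony's actual list is formatted differently, so use the linked code for the real thing:

```python
import pickle

def make_lemma_dict(lemma_lines):
	#lemma_lines is a list of strings, e.g., "be\tam,are,is,was,were"
	lemma_d = {} #holder for {"word form" : "lemma"} pairs
	for line in lemma_lines:
		lemma, forms = line.strip().split("\t") #separate the lemma from its forms
		for form in forms.split(","): #each form points back to its lemma
			lemma_d[form] = lemma
	return(lemma_d)

sample_lines = ["be\tam,are,is,was,were", "run\tran,running,runs"]
sample_dict = make_lemma_dict(sample_lines)
print(sample_dict["is"])
#> be

#the dictionary can then be saved for later use:
#pickle.dump(sample_dict, open("sample_lemmas.pickle","wb"))
```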

Instead, we will load the lemma dictionary directly, using the pickle module. Make sure that “ant_lemmas.pickle” is in your working directory, then run the following code to load it:

import pickle #load pickle module
lemma_dict = pickle.load(open("ant_lemmas.pickle","rb")) #open pickled dictionary and assign it to lemma_dict

The lemma dictionary includes word form : lemma pairs, as is demonstrated below (using “is” and “ran” as sample look-up keys):

print(lemma_dict["is"])
> be
print(lemma_dict["ran"])
> run

Now that we have a lemma dictionary, we can easily turn a tokenized text into a lemmatized text.

The function lemmatize() below takes two arguments (a list of words and a lemma_dictionary) and returns a list of lemmas.

  1. tokenized is a tokenized list of words
  2. lemma_d is a lemma dictionary that consists of {“word” : “lemma”} pairs

def lemmatize(tokenized,lemma_d): #takes a tokenized list of words and a lemma dictionary as arguments
	lemmatized = [] #holder for lemma list

	for word in tokenized: #iterate through words in text
		if word in lemma_d: #if word is in lemma dictionary
			lemmatized.append(lemma_d[word]) #add the lemma to the lemmatized list
		else:
			lemmatized.append(word) #otherwise, add the raw word to the lemmatized list

	return(lemmatized) #return lemmatized corpus

Now, we can create a lemmatized version of our sample text:

lemma1 = lemmatize(l1, lemma_dict)
> ['this', 'be', 'a', 'sample', 'sentence', '.', 'this', 'be', 'one', 'too', '!', 'be', 'this', '?']

Frequency calculation

To calculate frequency, we will create a dictionary that stores each word in our corpus and the number of times that the word is encountered. The sample function below, freq_simple(), takes one argument (a list of words) and returns a frequency dictionary.

def freq_simple(tok_list):
	#first we define an empty dictionary
	freq = {}

	#then we iterate through our list
	for x in tok_list:
		#the first time we see a particular word we create a key:value pair
		if x not in freq:
			freq[x] = 1
		#when we see a word subsequent times, we add (+=) one to the frequency count
		else:
			freq[x] += 1
	#finally, we return the frequency dictionary
	return(freq)

Now, we can use our function to create a frequency dictionary for the sample text we tokenized and lemmatized above.

freq1 = freq_simple(lemma1)
> {'this': 3, 'be': 3, 'a': 1, 'sample': 1, 'sentence': 1, '.': 1, 'one': 1, 'too': 1, '!': 1, '?': 1}

As we can see above, “this” and “be” each occurred three times, while all other words only occurred once.
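As a side note, Python's built-in collections module provides a Counter class that performs this same counting for us. Our hand-rolled version is worth understanding (and easier to extend), but Counter is a handy shortcut:

```python
from collections import Counter

lemma1 = ['this', 'be', 'a', 'sample', 'sentence', '.', 'this', 'be', 'one', 'too', '!', 'be', 'this', '?']
freq1 = Counter(lemma1) #Counter behaves like a frequency dictionary
print(freq1["this"])
#> 3
```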

Putting it all together: Creating a corpus frequency list

In this section, we will create a lemmatized frequency list for the Brown corpus. To do so, we will write a function that:

  1. reads in each file in a corpus folder
  2. tokenizes and lemmatizes each text
  3. compiles the frequency counts for all texts in a single dictionary

While this may seem like a lot to do, we have already created most of the building blocks. We will use our tokenize() and lemmatize() functions. We will also integrate pieces of the code of our freq_simple() function.

In addition, we will use a new module, glob, which creates lists of filenames that match certain criteria.
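For example, the following snippet lists all .txt files in a folder called “sample_corpus” (a hypothetical folder name used here only for illustration):

```python
import glob

#glob.glob() returns a list of path strings that match the pattern
filenames = glob.glob("sample_corpus/*.txt")
for filename in filenames:
	print(filename)
```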

import glob
def corpus_freq(dir_name,lemma_d):
	freq = {} #create an empty dictionary to store the word : frequency pairs

	#create a list that includes all files in the dir_name folder that end in ".txt"
	filenames = glob.glob(dir_name + "/*.txt")

	#iterate through each file:
	for filename in filenames:
		#open the file as a string
		text = open(filename, errors = "ignore").read()
		#tokenize text using our tokenize() function
		tokenized = tokenize(text)
		#lemmatize text using the lemmatize() function
		lemmatized = lemmatize(tokenized,lemma_d)

		#iterate through the lemmatized text and add words to the frequency dictionary
		for x in lemmatized:
			#the first time we see a particular word we create a key:value pair
			if x not in freq:
				freq[x] = 1
			#when we see a word subsequent times, we add (+=) one to the frequency count
			else:
				freq[x] += 1

	#finally, we return the frequency dictionary
	return(freq)

Now, let's try out our function. To do so, download the Brown corpus folder and put it in your working directory. For Windows users, you may have two folders named “brown_corpus” (one within the other). If so, make sure to take the folder that has 15 text files in it and put it directly in your working directory (i.e., not inside another folder). Then, we can run the following code:

brown_freq = corpus_freq("brown_corpus",lemma_dict)
print(brown_freq["be"]) #check the frequency of a sample word
> 43817

Although frequency dictionaries can be very useful in a variety of applications, we may also want to look at a sorted version of the list, and we may also want to write the frequency list to a file. We will complete these final (for now) steps below.

import operator #this module will help us convert our dictionary into an ordered structure
#we won't take the time to completely break it down, but the following code sorts our dictionary by value (i.e., by frequency) in descending order
sorted_brown = sorted(brown_freq.items(),key=operator.itemgetter(1),reverse = True)

#print the first 20 items in our list
for x in sorted_brown[:20]:
	print(x[0],"\t",x[1]) #print the word, a tab, and then the frequency
> the      69971
,        58334
.        54328
be       43817
of       36412
a        30641
and      28853
to       26158
in       21341
he       19422
'        18674
it       13043
have     12437
that     10787
for      9491
``       8837
i        8474
they     8264
with     7289
on       6741
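Incidentally, the same sort can be written with a lambda function instead of operator.itemgetter(); both versions order the (word, frequency) pairs by their second element. A small self-contained illustration:

```python
#a miniature frequency dictionary (values taken from the Brown output above)
freq = {"the": 69971, "be": 43817, "of": 36412}

#key=lambda x: x[1] sorts each (word, frequency) tuple by its frequency
sorted_freq = sorted(freq.items(), key=lambda x: x[1], reverse=True)
print(sorted_freq[0])
#> ('the', 69971)
```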

Writing our sorted list to a file

We will write a simple tab-delimited text file below.

def freq_writer(freq_list,filename):
	outf = open(filename, "w")
	outf.write("word\tfrequency") #write header

	for x in freq_list:
		outf.write("\n" + x[0] + "\t" + str(x[1])) #newline character + word + tab + string version of Frequency
	outf.flush() #flush buffer
	outf.close() #close file

#this will write sorted_brown to a file named "brown_freq.txt" in your working directory
freq_writer(sorted_brown,"brown_freq.txt")

As you will see as you look at the frequency list, there may be some items that we want to ignore (e.g., commas, periods, and other punctuation), but otherwise our scripts worked as intended!


Exercises

For this exercise, you will create a python script that will:

  1. read in and tokenize a corpus of your choosing
  2. lemmatize the tokenized corpus
  3. create a sorted frequency list and write it to a file

You can, of course, use the code that we developed in the tutorial to accomplish the above tasks, but you must alter the code to ignore and/or delete punctuation marks. The tutorial did not explicitly show you how to do this, but you should be able to adapt what we learned to accomplish this task.

For this exercise, you can use any corpus of your choosing EXCEPT for the Brown corpus. If you aren’t familiar with other corpora, you can use this corpus of transcribed L2 speech.

You will submit a .zip folder that includes a) your corpus, b) your python script (be sure to include comments so I know what each line in your script does), and c) your frequency list.