Python Tutorial 6: Collocation
(updated 10-14-2020)
Concordancing is a core corpus analysis method that is essentially qualitative in nature. One quantitative extension of concordancing is collocation analysis, wherein statistics based on the frequency of co-occurrence are used to highlight linguistic patterns.
In this tutorial, we will build on previously covered corpus analyses, such as frequency and concordancing. However, we will add one more analytical step: the calculation of strength of association statistics.
Collocation analysis
Our goal will be to create a series of functions that will:
- generate the corpus frequency for all words in a corpus
- generate the corpus frequency for words in a particular lexical context
- calculate statistics that provide information about the strength of association between two lexical items
Tokenizing
First, we will need to tokenize our texts. In this case, we will use a slightly improved version of the tokenize() function from Python Tutorial 4 that includes a more complete list of punctuation marks. We can of course also choose to lemmatize our texts, but for this tutorial we won’t do that.
def tokenize(input_string): #input_string = text string
    tokenized = [] #empty list that will be returned
    #these are the punctuation marks in the Brown corpus + '"'
    punct_list = ['-',',','.',"'",'&','`','?','!',';',':','(',')','$','/','%','*','+','[',']','{','}','"']
    #this is a sample (but potentially incomplete) list of items to replace with spaces
    replace_list = ["\n","\t"]
    #this is a sample (but potentially incomplete) list of items to ignore
    ignore_list = [""]
    #iterate through the punctuation list and delete each item
    for x in punct_list:
        input_string = input_string.replace(x, "") #instead of adding a space before punctuation marks, we will delete them (by replacing them with nothing)
    #iterate through the replace list and replace each item with a space
    for x in replace_list:
        input_string = input_string.replace(x," ")
    #our examples will be in English, so for now we will lowercase the text
    #this is, of course, optional
    input_string = input_string.lower()
    #then we split the string into a list
    input_list = input_string.split(" ")
    for x in input_list:
        if x not in ignore_list: #if the item is not in the ignore list
            tokenized.append(x) #add it to the list "tokenized"
    return(tokenized)
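To check that the function behaves as expected, we can run it on a short made-up string (the example string below is just an illustration):
#quick check of the tokenize() function on a made-up example string
print(tokenize("The cat sat\non the mat!"))
> ['the', 'cat', 'sat', 'on', 'the', 'mat']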
Frequency
In Python Tutorial 4 we created a frequency function that defined a new dictionary, then added to the newly defined dictionary. In this tutorial, we will modify this slightly by creating a function that updates a pre-existing dictionary instead of creating a new one. This will allow us to easily calculate frequency values for a number of different pieces of a text (e.g., left context, right context, different forms of our search term(s), etc.).
#here we use a version of the freq_simple() function that updates a pre-existing dictionary instead of returning a new dictionary
def freq_update(tok_list,freq_dict): #this takes a list (tok_list) and a dictionary (freq_dict) as arguments
    for x in tok_list: #for x in list
        if x not in freq_dict: #if x not in dictionary
            freq_dict[x] = 1 #create new entry
        else: #else: add one to entry
            freq_dict[x] += 1
Below, we use the freq_update() function with a sample dictionary:
sampd = {"a" : 1} #define dictionary, include one key-value pair
samp_list = ["a","b"] #new list of items
freq_update(samp_list,sampd) #update dictioary based on the new list
> {'a': 2, 'b': 1}
Calculating context frequency
Now, we will use our tokenize() and freq_update() functions, along with pieces of the concord() and concord_regex() functions from Python Tutorial 5 to create a function that calculates the frequency of all target item hits, collocates in the left context, collocates in the right context, total collocate frequency, and total frequency for all items in a text.
Note that we will use the built-in type() function, which tells us the Python type (e.g., str, list, int, float) of a particular object. This will allow us to use a single function with targets that are lists OR regular expressions.
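As a quick illustration (the targets below are just examples), type() distinguishes between a regular expression string and a list of search words:
print(type("investigat.*")) #a regular expression target is a string
print(type(["golf"])) #a target can also be a list of words
> <class 'str'>
<class 'list'>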
Our function context_freq() will take the following arguments:
- tok_list a tokenized list of strings
- target a list of the target strings (e.g., words) OR a regular expression string
- nleft length of preceding context (in number of words)
- nright length of following context (in number of words)
context_freq() will return a dictionary that consists of five dictionaries:
- “left_freq” is the frequency of collocates in the left context
- “right_freq” is the frequency of collocates in the right context
- “combined_freq” is the frequency of collocates in either context
- “target_freq” is the frequency of each target hit
- “corp_freq” is the frequency for all words in the corpus
import re
def context_freq(tok_list,target,nleft = 10,nright = 10):
    left_freq = {} #frequency of items to the left
    right_freq = {} #frequency of items to the right
    combined_freq = {} #combined left and right frequency
    target_freq = {} #frequency dictionary for all target hits
    corp_freq = {} #total frequency for all words
    for idx, x in enumerate(tok_list): #iterate through the token list using the enumerate function. idx = list index, x = list item
        freq_update([x],corp_freq) #here we update the corpus frequency for all words. Note that we put x in a one-item list [x] to conform with the freq_update() parameters (it takes a list as an argument)
        hit = False #set Boolean value to False - this will allow us to use a list or a regular expression as a search term
        if type(target) == str and re.compile(target).match(x) != None: #if the target is a string (i.e., a regular expression) and the regular expression finds a match in the string (the slightly strange syntax here literally means "if it doesn't not find a match")
            hit = True #then we have a search hit
        elif type(target) == list and x in target: #if the target is a list and the current word (x) is in the list
            hit = True #then we have a search hit
        if hit == True: #if we have a search hit:
            if idx < nleft: #deal with the left context if the search term comes early in the text
                left = tok_list[:idx] #get all words before the current one (there are fewer than nleft of them)
                freq_update(left,left_freq) #update the frequency dictionary for the left context
                freq_update(left,combined_freq) #update the frequency dictionary for all contexts
            else:
                left = tok_list[idx-nleft:idx] #get nleft words before the current one
                freq_update(left,left_freq) #update the frequency dictionary for the left context
                freq_update(left,combined_freq) #update the frequency dictionary for all contexts
            t = x
            freq_update([t],target_freq) #update the frequency dictionary for target hits; again, we put the item in a one-item list to conform with the freq_update() parameters
            right = tok_list[idx+1:idx+nright+1] #get nright words after the current one
            freq_update(right,right_freq) #update the frequency dictionary for the right context
            freq_update(right,combined_freq) #update the frequency dictionary for all contexts
    output_dict = {"left_freq" : left_freq,"right_freq" : right_freq, "combined_freq" : combined_freq, "target_freq" : target_freq, "corp_freq" : corp_freq}
    return(output_dict)
Now, we will test our function on a sample string, which is an excerpt from the NICT JLE corpus:
sample = """ I look sometimes he rides a bicycle everyday
Yes But I don't know her and XXX05's mother because I have two young two child, but my child is very old tha older than hers child
I like the exercise very well I go to the training gym everyday But I m I take care of two child and my husband After that, I go to the gym and department store, supermarket, and that's all
Yes I my husband we want to the play golf with family, but I don't like a sports He said "You should go to the gym" I go to the gym for five years old
Yes A little But I cannot play golf very well
Once a week I practice once a week with my sons
son is thirteen years old
Yes But he plays the golf better than me
But my husband is best golf daughter is my daughter is is play golf than sometimes
"""
#use the context_freq() function to search for collocates of "golf" (with 5 words of left context and 5 words of right context)
golf_freqs = context_freq(tokenize(sample),["golf"],5,5)
print(golf_freqs["target_freq"]) #print the "target_freq" dictionary
print(golf_freqs["left_freq"]) #print the "left_freq" dictionary
As we can see below, golf occurs five times in our text. In the left (preceding) context, we see that but, is, and play occur most frequently (three times each).
> {'golf': 5}
{'we': 1, 'want': 1, 'to': 1, 'the': 2, 'play': 3, 'little': 1, 'but': 3, 'i': 1, 'cannot': 1, 'yes': 1, 'he': 1, 'plays': 1, 'my': 2, 'husband': 1, 'is': 3, 'best': 1, 'daughter': 1}
A quick sidebar: The head() function
As we proceed, we will continue to create functions that return one or more dictionaries with words as keys and frequency or strength of association statistics as values. To preview, save, and disseminate these results as sorted lists, we could continue to use the sorted() function and the variations of the freq_writer() function (from Python Tutorial 4). However, in the long run it will be more efficient to make a single multipurpose function for these tasks. The head() function below is an extended adaptation of the function with the same name in R.
Our Python version of the head() function takes six arguments:
- stat_dict is a dictionary that consists of {string : number} key : value pairs (e.g., a frequency dictionary)
- hits is the number of items to include (default is top 20 items). If you want to include all items in the corpus, choose a very large number (e.g., 10000000000).
- hsort is a Boolean value. By default, this is True (and the dictionary is sorted with the highest value first)
- output is a Boolean value. By default it is False. If True, the function will return a sorted list (instead of just printing it)
- filename by default is None. If a filename is provided (e.g., results.txt), a list will be written to the working directory.
- sep is a string. By default, this is a “\t” character. It is only used when lists are written to a file.
By default, the head() function prints a sorted list of items in the stat_dict. It can also return a sorted list and/or write the list to a file.
import operator
def head(stat_dict,hits = 20,hsort = True,output = False,filename = None, sep = "\t"):
    #first, create a sorted list. Presumes that operator has been imported
    sorted_list = sorted(stat_dict.items(),key=operator.itemgetter(1),reverse = hsort)[:hits]
    if output == False and filename == None: #if we aren't writing a file or returning a list
        for x in sorted_list: #iterate through the output
            print(x[0] + "\t" + str(x[1])) #print the sorted list in a nice format
    elif filename is not None: #if a filename was provided
        outf = open(filename,"w") #create a blank file in the working directory using the filename
        outf.write("item\tstatistic") #write header
        for x in sorted_list: #iterate through the list
            outf.write("\n" + x[0] + sep + str(x[1])) #write each line to the file using the separator
        outf.flush() #flush the file buffer
        outf.close() #close the file
    if output == True: #if output is True
        return(sorted_list) #return the sorted list
#print top 10 items in the left context dictionary
head(golf_freqs["left_freq"],hits = 10)
> play 3
but 3
is 3
the 2
my 2
we 1
want 1
to 1
little 1
i 1
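If we also want to save the results and/or work with them in our code, we can use the filename and output arguments (the filename below is just an example; any name will work):
#write the top 10 left-context collocates to a file and also get them back as a sorted list
golf_left = head(golf_freqs["left_freq"],hits = 10,output = True,filename = "golf_left_freq.txt") #example filename
print(golf_left[0]) #each item in the returned list is a (word, value) tuple
> ('play', 3)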
Calculating context frequencies for a corpus
We will now update our context_freq() function so that it will calculate context frequencies (etc.) for an entire corpus. We will also set some default values to make our function a little easier to use.
We will call our updated function corpus_context_freq(), which will take the following arguments:
- dir_name the name of the folder in which the corpus files reside
- target a list of the target strings (e.g., words) OR a regular expression string
- nleft length of preceding context (in number of words; default value is 5)
- nright length of following context (in number of words; default value is 5)
corpus_context_freq() will return a dictionary that consists of five dictionaries:
- “left_freq” is the frequency of collocates in the left context
- “right_freq” is the frequency of collocates in the right context
- “combined_freq” is the frequency of collocates in either context
- “target_freq” is the frequency of each target hit
- “corp_freq” is the frequency for all words in the corpus
import re
import glob #needed to gather the corpus filenames
def corpus_context_freq(dir_name,target,nleft = 5,nright = 5): #if we wanted to add lemmatization, we would need to add an argument for a lemma dictionary
    left_freq = {} #frequency of items occurring to the left
    right_freq = {} #frequency of items occurring to the right
    combined_freq = {} #combined left and right frequency
    target_freq = {} #frequency dictionary for all target hits
    corp_freq = {} #total frequency for all words
    #create a list that includes all files in the dir_name folder that end in ".txt"
    filenames = glob.glob(dir_name + "/*.txt")
    #print(filenames)
    #iterate through each file:
    for filename in filenames:
        #open the file as a string
        text = open(filename, errors = "ignore").read()
        #tokenize the text using our tokenize() function
        tok_list = tokenize(text)
        #if we wanted to lemmatize our text, we would use the lemmatize function here
        for idx, x in enumerate(tok_list): #iterate through the token list using the enumerate function. idx = list index, x = list item
            freq_update([x],corp_freq) #here we update the corpus frequency for all words. Note that we put x in a one-item list [x] to conform with the freq_update() parameters (it takes a list as an argument)
            hit = False #set Boolean value to False - this will allow us to use a list or a regular expression as a search term
            if type(target) == str and re.compile(target).match(x) != None: #if the target is a string (i.e., a regular expression) and the regular expression finds a match in the string (the slightly strange syntax here literally means "if it doesn't not find a match")
                hit = True #then we have a search hit
            elif type(target) == list and x in target: #if the target is a list and the current word (x) is in the list
                hit = True #then we have a search hit
            if hit == True: #if we have a search hit:
                if idx < nleft: #deal with the left context if the search term comes early in a text
                    left = tok_list[:idx] #get all words before the current one (there are fewer than nleft of them)
                    freq_update(left,left_freq) #update the frequency dictionary for the left context
                    freq_update(left,combined_freq) #update the frequency dictionary for all contexts
                else:
                    left = tok_list[idx-nleft:idx] #get nleft words before the current one
                    freq_update(left,left_freq) #update the frequency dictionary for the left context
                    freq_update(left,combined_freq) #update the frequency dictionary for all contexts
                t = x
                freq_update([t],target_freq) #update the frequency dictionary for target hits; again, we put the item in a one-item list to conform with the freq_update() parameters
                right = tok_list[idx+1:idx+nright+1] #get nright words after the current one
                freq_update(right,right_freq) #update the frequency dictionary for the right context
                freq_update(right,combined_freq) #update the frequency dictionary for all contexts
    output_dict = {"left_freq" : left_freq,"right_freq" : right_freq, "combined_freq" : combined_freq, "target_freq" : target_freq, "corp_freq" : corp_freq}
    return(output_dict)
Now, we will test our function on the Brown corpus (don’t forget to set your working directory!):
# search for words that start with "investigat"
brown_context_freq = corpus_context_freq("brown_corpus","investigat.*")
head(brown_context_freq["target_freq"]) #get frequency of various target hits
Below, we see that investigation is the most frequent target hit, followed by investigations and investigated (along with others).
> investigation 51
investigations 22
investigated 18
investigators 13
investigate 11
investigating 8
investigator 4
investigative 3
investigates 1
We can also look at the combined collocate frequency:
#get ten most frequent collocates regardless of context
head(brown_context_freq["combined_freq"],hits = 10)
> the 102
of 75
and 43
in 37
to 34
a 25
by 21
have 14
on 11
for 11
We can also examine the left and right context frequencies specifically by using the head() function with “left_freq” and “right_freq”, respectively.
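For example:
#get the ten most frequent collocates in the left (preceding) context
head(brown_context_freq["left_freq"],hits = 10)
#get the ten most frequent collocates in the right (following) context
head(brown_context_freq["right_freq"],hits = 10)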
We can also check the overall frequency of words in the corpus:
#get ten most frequent words in the corpus
head(brown_context_freq["corp_freq"],hits = 10)
> the 69971
of 36412
and 28853
to 26158
a 23308
in 21341
that 10594
is 10109
was 9815
he 9548
Strength of association
As we saw in the previous section, the most frequent collocates of forms of investigate were also among the most frequent words in the corpus. In short, the co-occurrence of our target item and these frequent words may simply be a function of their raw frequencies and may not tell us much about the relationship between the two words specifically.
Next, we will create a function soa() that calculates the strength of association between items in an attempt to control for the raw frequency of each item in the corpus.
soa() will take two arguments:
- freq_dict is a dictionary of frequency dictionaries generated by the corpus_context_freq() function
- cut_off is a minimum frequency cut-off for the calculation of strength of association. Any items with frequency values below this number will be ignored. The default value is five.
soa() will return a dictionary of dictionaries that consists of various strength of association measures. These include (see the code below for all equations):
- “mi” Mutual Information (MI) score; highlights restrictive collocations
- “tscore” T score (T); highlights frequent collocations
- “faith_coll_cue” Faithfulness; Probability of seeing the target item given the presence of the collocate
- “faith_target_cue” Faithfulness; Probability of seeing the collocate given the presence of the target item
- “deltap_coll_cue” Delta P; Probability of seeing the target item given the presence of the collocate MINUS the probability of seeing the target item given the presence of any word other than the collocate
- “deltap_target_cue” Delta P; Probability of seeing the collocate given the presence of the target item MINUS the probability of seeing the collocate given the presence of any word other than the target item
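Before looking at the full function, here is a minimal sketch of the MI and T score calculations using made-up frequency values (the numbers below are purely illustrative):
import math

#purely illustrative (made-up) frequencies:
observed = 20 #times the collocate occurs in the context window of the target
collocate_freq = 1000 #total frequency of the collocate in the corpus
target_freq = 100 #total frequency of the target item in the corpus
corpus_size = 1000000 #total number of words in the corpus

expected = (target_freq * collocate_freq)/corpus_size #co-occurrence frequency expected by chance = 0.1
mi = math.log2(observed/expected) #log2(20/0.1) = log2(200), roughly 7.64
tscore = (observed - expected)/math.sqrt(observed) #(20 - 0.1)/sqrt(20), roughly 4.45
print(round(mi,2), round(tscore,2))
> 7.64 4.45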
import math
def soa(freq_dict,cut_off = 5):
    mi = {}
    tscore = {}
    faith_coll_cue = {}
    faith_target_cue = {}
    deltap_coll_cue = {}
    deltap_target_cue = {}
    corpus_size = sum(freq_dict["corp_freq"].values()) #get the size of the corpus by summing the frequency of all words. This will stay consistent for all iterations below
    target_freq = sum(freq_dict["target_freq"].values()) #get the total number of corpus hits for all forms of the target item. This will stay consistent for all iterations below
    #iterate through the context hits
    for collocate in freq_dict["combined_freq"]:
        observed = freq_dict["combined_freq"][collocate] #frequency of the target co-occurring with the collocate
        collocate_freq = freq_dict["corp_freq"][collocate] #total frequency of the collocate in the corpus
        if freq_dict["combined_freq"][collocate] >= cut_off: #check to make sure that the collocate occurs frequently enough to surpass the threshold
            expected = ((target_freq * collocate_freq)/corpus_size)
            mi[collocate] = math.log2(observed/expected) #calculate MI score and add it to the dict
            tscore[collocate] = (observed-expected)/(math.sqrt(observed)) #calculate T score and add it to the dict
            #contingency table used for the Faithfulness and Delta P calculations:
            #           y      -y
            #        _____________
            #    x  |  a   |  b
            #   -x  |  c   |  d
            #
            # delta P = P(outcome|cue) - P(outcome|-cue)
            # delta P(y|x) = (a/(a+b)) - (c/(c+d))
            # delta P(x|y) = (a/(a+c)) - (b/(b+d))
            # x = collocate
            # y = target
            a = observed
            b = collocate_freq - a
            c = target_freq - a
            d = corpus_size - (a+b+c)
            faith_coll_cue[collocate] = (a/(a+b)) #P(target | collocate)
            faith_target_cue[collocate] = (a/(a+c)) #P(collocate | target)
            deltap_coll_cue[collocate] = (a/(a+b)) - (c/(c+d)) #P(target | collocate) - P(target | -collocate)
            deltap_target_cue[collocate] = (a/(a+c)) - (b/(b+d)) #P(collocate | target) - P(collocate | -target)
    #create the output dictionary:
    output_dict = {"mi" : mi,"tscore" : tscore, "faith_coll_cue" : faith_coll_cue, "faith_target_cue" : faith_target_cue, "deltap_coll_cue" : deltap_coll_cue, "deltap_target_cue" : deltap_target_cue}
    return(output_dict)
Now, we can use our soa() function to look for collocates of all items that begin with “investigat” using the dictionary (brown_context_freq) that we previously generated. We will then look at the lists generated by the various strength of association statistics, starting with MI:
brown_soa = soa(brown_context_freq)
head(brown_soa["mi"],hits = 10)
The results indicate that “bureau” is the most strongly associated item (according to the MI score), likely as part of a name (i.e., the Federal Bureau of Investigation; we would need to do a concordance search to check).
> bureau 10.49096922934631
original 8.552661551752552
federal 8.29664757359653
report 8.281645410257283
committee 8.109879061990803
city 6.620776576677164
number 6.356519029573929
used 6.247167914950834
other 5.769181163083601
been 5.645770956144033
Next, we will take a look at some of the results with regard to T score:
head(brown_soa["tscore"],hits = 10)
The T score results highlight frequent collocations such as “the”. This may be due to the fact that the most frequent forms of “investigat” were nouns (but again, we would need to look at concordance lines to confirm this).
> the 13.203587775764168
of 10.759308340956522
and 8.468529620947434
to 8.329895145485272
in 8.1601389590553
a 6.259020362668412
by 5.161870288339875
have 4.689569565875919
as 4.6001553312696615
was 4.419892835586372
When Faithfulness (collocate cue) is used, we get a list that is identical to that generated by MI for the first ten items (note that this is not always true):
head(brown_soa["faith_coll_cue"],hits = 10)
The results here indicate that when “bureau” occurs in the corpus, a form of “investigat” will occur 18.6% of the time.
> bureau 0.18604651162790697
original 0.04854368932038835
federal 0.04065040650406504
report 0.040229885057471264
committee 0.03571428571428571
city 0.01272264631043257
number 0.01059322033898305
used 0.009819967266775777
other 0.007050528789659225
been 0.006472491909385114
When Faithfulness (target cue) is used, we get a list that is identical to that generated by T for the first ten items (note that this is not always true):
head(brown_soa["faith_target_cue"],hits = 10)
The results here indicate that when a form of “investigat” occurs in the corpus there is a (somewhat nonsensical) 146.56% chance that “the” will occur. The probability value here is higher than 1 because “the” can occur in the context window more than once!
> the 1.465648854961832
of 0.9541984732824428
and 0.6030534351145038
to 0.5801526717557252
in 0.549618320610687
a 0.3435114503816794
by 0.21374045801526717
have 0.17557251908396945
as 0.17557251908396945
was 0.16793893129770993
Updated concordancer
In Python Tutorial 5, we made two versions of a rather simple concordancer. For a version of the concordancer that allows for n-gram searches and target + context searches, see this updated version.
Exercises
- Using the words investigation and investigations as the search terms, identify the most strongly associated collocates that occur immediately before the search term with regard to MI and T score (i.e., your left context should be set to 1 and your right context should be set to 0). Choose one of the items from each of your collocate lists and hypothesize about the nature of the relationship between the collocate and the target word. Is the relationship grammatical? Idiomatic? Something else?
- Now do the same thing, but set your collocate search to include only the word immediately following the search term.
- Pick a search term and a corpus of your choosing. You are welcome to use the Brown corpus, but you are also welcome to use some other corpus. Determine the items most strongly associated with your search term based on two statistics of your choosing. Use two span settings (left and right should both be no more than ten and no less than two), and compare your results. What specific effects does search span seem to have on your findings?