Python Tutorial 7: Keyness
(updated 10-26-2020)
In this tutorial we will create a function that identifies items that occur more frequently in one corpus as compared to another (i.e., “key” items). To this end, we will reuse and revise some functions from previous tutorials. We will also explore a new way to represent texts (via n-grams).
Getting started
First, we will load the head() function from Python Tutorial 6. Please see Python Tutorial 6 for more details on this function. We will be using it to preview the various lists we will generate in this tutorial.
import operator
def head(stat_dict,hits = 20,hsort = True,output = False,filename = None, sep = "\t"):
    #first, create sorted list. Presumes that operator has been imported
    sorted_list = sorted(stat_dict.items(),key=operator.itemgetter(1),reverse = hsort)[:hits]

    if output == False and filename == None: #if we aren't writing a file or returning a list
        for x in sorted_list: #iterate through the output
            print(x[0] + "\t" + str(x[1])) #print the sorted list in a nice format

    elif filename is not None: #if a filename was provided
        outf = open(filename,"w") #create a blank file in the working directory using the filename
        outf.write("item\tstatistic") #write header
        for x in sorted_list: #iterate through list
            outf.write("\n" + x[0] + sep + str(x[1])) #write each line to a file using the separator
        outf.flush() #flush the file buffer
        outf.close() #close the file

    if output == True: #if output is true
        return(sorted_list) #return the sorted list
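To see how head() behaves before we use it on real data, here is a quick (optional) check with a small toy dictionary (the dictionary below is just for illustration):

toy_freq = {"pizza" : 15, "awesome" : 7, "sentence" : 3} #toy frequency dictionary for illustration only
head(toy_freq,hits = 2) #print the two items with the highest values
head(toy_freq,hits = 2,hsort = False) #print the two items with the lowest values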
Tokenization and n-grams
It is often interesting (and important) to examine linguistic units beyond single words. This is particularly true for keyness analysis. Below, we adapt our tokenize() function from previous tutorials to output items of n words in length. We start by defining a new function ngrammer() that takes a tokenized list as an argument and outputs a list of n-grams.
N-grams
ngrammer() takes three arguments:
- token_list a tokenized list of words
- gram_size number of words to include in n-gram
- separator character (or characters) used to join the n-grams. By default this is a space (“ “).
def ngrammer(token_list, gram_size, separator = " "):
    ngrammed = [] #empty list for n-grams
    for idx, x in enumerate(token_list): #iterate through the token list using enumerate()
        ngram = token_list[idx:idx+gram_size] #get the current word plus the following words in the n-gram window (this is a list)
        if len(ngram) == gram_size: #don't include shorter ngrams that we would get at the end of a text
            ngrammed.append(separator.join(ngram)) #join the list of ngram items using the separator (by default this is a space), add to ngrammed list
    return(ngrammed) #return list of ngrams
Now, we will test our function using a sample tokenized list:
sampl = ["this", "is", "an", "awesome", "sentence", "about", "pizza"]
bigram_sampl = ngrammer(sampl,2) #create bigram version of tokenized text
print(bigram_sampl)
> ['this is', 'is an', 'an awesome', 'awesome sentence', 'sentence about', 'about pizza']
trigram_sampl = ngrammer(sampl,3)
print(trigram_sampl)
> ['this is an', 'is an awesome', 'an awesome sentence', 'awesome sentence about', 'sentence about pizza']
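We can also change the separator argument if we want our n-gram items to be joined by something other than a space (the underscore below is just one option):

bigram_sampl_under = ngrammer(sampl,2,separator = "_") #join the bigram items with an underscore instead of a space
print(bigram_sampl_under)
> ['this_is', 'is_an', 'an_awesome', 'awesome_sentence', 'sentence_about', 'about_pizza']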
Revising tokenize() function to work with n-grams
Now, we will revise our tokenize() function to work with n-grams. We will only need to add a few lines at the end of our previous version and add the arguments needed for the ngrammer() function.
The updated version of the tokenize() function takes three arguments and outputs a tokenized text (a list):
- input_string a raw string consisting of language data
- gram_size number of words to include in n-gram. By default this is one (i.e., the default is to tokenize “normally”)
- separator character (or characters) used to join the n-grams (if gram_size > 1). By default this is a space (“ “).
def tokenize(input_string,gram_size = 1, separator = " "): #input_string = text string
    tokenized = [] #empty list that will be returned

    #these are the punctuation marks in the Brown corpus + '"'
    punct_list = ['-',',','.',"'",'&','`','?','!',';',':','(',')','$','/','%','*','+','[',']','{','}','"']

    #this is a sample (but potentially incomplete) list of items to replace with spaces
    replace_list = ["\n","\t"]

    #this is a sample (but potentially incomplete) list of items to ignore
    ignore_list = [""]

    #iterate through the punctuation list and delete each item
    for x in punct_list:
        input_string = input_string.replace(x, "") #instead of adding a space before punctuation marks, we will delete them (by replacing them with nothing)

    #iterate through the replace list and replace each item with a space
    for x in replace_list:
        input_string = input_string.replace(x," ")

    #our examples will be in English, so for now we will lower-case them
    #this is, of course, optional
    input_string = input_string.lower()

    #then we split the string into a list
    input_list = input_string.split(" ")

    for x in input_list:
        if x not in ignore_list: #if the item is not in the ignore list
            tokenized.append(x) #add it to the list "tokenized"

    if gram_size == 1: #if we are looking at single words, simply return tokenized
        return(tokenized)
    else: #otherwise, return the n-gram version of the text, using the ngrammer() function
        return(ngrammer(tokenized,gram_size,separator))
Now we can use our tokenize() function to create various versions of a text:
Single word tokenized:
samps = "This is an awesome sentence about pizza."
tok_samps = tokenize(samps) #tokenize with default gram_size (1)
print(tok_samps)
> ['this', 'is', 'an', 'awesome', 'sentence', 'about', 'pizza']
Bigram tokenized:
bigram_samps = tokenize(samps,2)
print(bigram_samps)
> ['this is', 'is an', 'an awesome', 'awesome sentence', 'sentence about', 'about pizza']
Trigram tokenized:
trigram_samps = tokenize(samps,3)
print(trigram_samps)
> ['this is an', 'is an awesome', 'an awesome sentence', 'awesome sentence about', 'sentence about pizza']
Calculating corpus frequency
To calculate keyness, we will need to first create frequency dictionaries for each of our comparison corpora. To do so, we will revise the corpus_freq() function from Python Tutorial 4. In this case, we will add the arguments needed to use our revised version of the tokenize() function and change one line slightly to accommodate these arguments.
The corpus_freq() function takes three arguments:
- dir_name name of folder that holds our corpus files (don’t forget to set your working directory!)
- gram_size number of words to include in n-gram. By default this is one (i.e., the default is to tokenize “normally”)
- separator character (or characters) used to join the n-grams (if gram_size > 1). By default this is a space (“ “).
import glob
def corpus_freq(dir_name,gram_size = 1,separator = " "):
    freq = {} #create an empty dictionary to store the word : frequency pairs

    #create a list that includes all files in the dir_name folder that end in ".txt"
    filenames = glob.glob(dir_name + "/*.txt")

    #iterate through each file:
    for filename in filenames:
        #open the file as a string
        text = open(filename, errors = "ignore").read()
        #tokenize the text using our tokenize() function (which calls ngrammer() if gram_size > 1)
        tokenized = tokenize(text,gram_size,separator)

        #iterate through the tokenized text and add words to the frequency dictionary
        for x in tokenized:
            #the first time we see a particular word we create a key:value pair
            if x not in freq:
                freq[x] = 1
            #when we see a word subsequent times, we add (+=) one to the frequency count
            else:
                freq[x] += 1

    return(freq) #return frequency dictionary
Now we can generate various frequency lists. Below we will test out our function using the Brown Corpus.
First, we will use the default settings to get a word frequency list:
brown_freq = corpus_freq("brown_corpus")
head(brown_freq,10)
> the 69971
of 36412
and 28853
to 26158
a 23308
in 21341
that 10594
is 10109
was 9815
he 9548
Then bigrams:
brown_bi_freq = corpus_freq("brown_corpus",2)
head(brown_bi_freq,10)
> of the 9739
in the 6055
to the 3500
on the 2482
and the 2256
for the 1858
to be 1718
at the 1660
with the 1543
of a 1480
Then trigrams:
brown_tri_freq = corpus_freq("brown_corpus",3)
head(brown_tri_freq,10)
> one of the 404
the united states 340
as well as 238
some of the 179
out of the 174
the fact that 167
the end of 149
part of the 144
it was a 143
there was a 142
And beyond…:
brown_quad_freq = corpus_freq("brown_corpus",4)
head(brown_quad_freq,10)
> of the united states 111
at the same time 87
the end of the 77
in the united states 70
at the end of 63
the rest of the 58
on the other hand 58
one of the most 58
on the basis of 56
as well as the 48
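If we want to examine more than the top few hits, we can also use the head() function to write a longer list to a file (the filename and number of hits below are just examples):

head(brown_quad_freq,hits = 100,filename = "brown_quad_freq.txt") #write the 100 most frequent quadgrams to a file in the working directory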
Calculating keyness
A number of statistical procedures have been suggested for calculating keyness. Our function will calculate three keyness statistics described in Gabrielatos (2018), though our function could be easily extended to include more! In our calculation of the three keyness statistics, we will ignore items that only occur in one of the corpora. However, the function will report these items (and their normalized frequency) in separate dictionaries (see below).
The keyness() function takes two arguments:
- freq_dict1 frequency dictionary for target corpus (raw frequencies)
- freq_dict2 frequency dictionary for comparison corpus (raw frequencies)
and returns a dictionary of dictionaries:
- “log-ratio” log ratio (see Gabrielatos (2018); Hardie (2014))
- “%diff” percent difference (see Gabrielatos (2018); Gabrielatos and Marchi (2011))
- “odds-ratio” odds ratio (see Gabrielatos (2018); Everitt (2002))
- “c1_only” items that only occur in the target corpus (corpus 1)
- “c2_only” items that only occur in the comparison corpus (corpus 2)
import math
def keyness(freq_dict1,freq_dict2): #this assumes that raw frequencies were used. effect options = "log-ratio", "%diff", "odds-ratio"
    keyness_dict = {"log-ratio": {},"%diff" : {},"odds-ratio" : {}, "c1_only" : {}, "c2_only":{}}

    #first, we need to determine the size of our corpora:
    size1 = sum(freq_dict1.values()) #calculate corpus size by adding all of the values in the frequency dictionary
    size2 = sum(freq_dict2.values()) #calculate corpus size by adding all of the values in the frequency dictionary

    #How to calculate three measures of keyness:
    def log_ratio(freq1,size1,freq2,size2): #see Gabrielatos (2018); Hardie (2014)
        freq1_norm = freq1/size1 * 1000000 #norm per million words
        freq2_norm = freq2/size2 * 1000000 #norm per million words
        index = math.log2(freq1_norm/freq2_norm) #calculate log ratio
        return(index)

    def perc_diff(freq1,size1,freq2,size2): #see Gabrielatos (2018); Gabrielatos and Marchi (2011)
        freq1_norm = freq1/size1 * 1000000 #norm per million words
        freq2_norm = freq2/size2 * 1000000 #norm per million words
        index = ((freq1_norm-freq2_norm) * 100)/freq2_norm #calculate perc_diff
        return(index)

    def odds_ratio(freq1,size1,freq2,size2): #see Gabrielatos (2018); Everitt (2002)
        index = (freq1/(size1-freq1))/(freq2/(size2-freq2))
        return(index)

    #make a list that combines the keys from each frequency dictionary:
    all_words = set(list(freq_dict1.keys()) + list(freq_dict2.keys())) #set() creates a set object that includes only unique items

    #if our items only occur in one corpus, we will add them to our "c1_only" or "c2_only" dictionaries, and then ignore them
    for item in all_words:
        if item not in freq_dict1:
            keyness_dict["c2_only"][item] = freq_dict2[item]/size2 * 1000000 #add normalized frequency (per million words) to c2_only dictionary
            continue #move to next item in the list
        if item not in freq_dict2:
            keyness_dict["c1_only"][item] = freq_dict1[item]/size1 * 1000000 #add normalized frequency (per million words) to c1_only dictionary
            continue #move to next item in the list
        keyness_dict["log-ratio"][item] = log_ratio(freq_dict1[item],size1,freq_dict2[item],size2) #calculate keyness using log-ratio
        keyness_dict["%diff"][item] = perc_diff(freq_dict1[item],size1,freq_dict2[item],size2) #calculate keyness using %diff
        keyness_dict["odds-ratio"][item] = odds_ratio(freq_dict1[item],size1,freq_dict2[item],size2) #calculate keyness using odds-ratio

    return(keyness_dict) #return dictionary of dictionaries
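Before applying keyness() to real corpora, it can be helpful to check that it behaves as expected using a pair of toy frequency dictionaries (the dictionaries and values below are purely illustrative):

toy_c1 = {"pizza" : 8, "the" : 92} #toy target corpus frequencies (100 total words)
toy_c2 = {"pizza" : 2, "the" : 90, "soda" : 8} #toy comparison corpus frequencies (100 total words)
toy_key = keyness(toy_c1,toy_c2)
print(toy_key["log-ratio"]["pizza"]) #pizza is four times more frequent (per million words) in the target corpus, so the log ratio is log2(4)
print(toy_key["c2_only"]["soda"]) #soda only occurs in the comparison corpus; its normalized frequency is 8/100 * 1000000
> 2.0
80000.0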
Now we will test our keyness function using two subsets of the Brown corpus. The first will be texts from newspapers (reportage, editorials, and reviews). The second will be texts that represent various types of fiction (general, mystery, science, adventure, and romance). In practice, we would want to use larger corpora (if possible), but for our purposes these two subcorpora will be adequate.
To start, download the two corpora: brown_press.zip and brown_fiction.zip. Then expand the corpora and place them in your working directory.
After the two corpora have been placed in your working directory, we can create frequency dictionaries for each:
brown_news_freq = corpus_freq("brown_press")
head(brown_news_freq,10)
> the 12711
of 6191
and 4701
to 4453
a 4263
in 3837
is 2016
for 1810
that 1792
s 1309
brown_fic_freq = corpus_freq("brown_fiction")
head(brown_fic_freq,10)
> the 14100
and 6957
to 5947
a 5590
of 5194
he 5111
was 4153
in 3652
i 3356
it 2955
Then, we can use the keyness() function to determine which items (e.g., words, bigrams, etc.) occur more frequently in the newspaper corpus as compared to the fiction corpus. We will start by looking at the words that only occur in the newspaper corpus. As we can see, there are a number of proper nouns (e.g., kennedy and khrushchev) along with dates and other words commonly used in newspapers.
brown_key_news_fic = keyness(brown_news_freq,brown_fic_freq) #this will include all of our keyness dictionaries. Note that this is directional (if we switch the frequency dictionaries we will get different but complementary results)
head(brown_key_news_fic["c1_only"],10) #items that only occur in the newspaper corpus (the first frequency list we entered into the keyness() function)
> kennedy 772.398157358065
per 566.7957701476447
khrushchev 433.4320595246695
1960 333.4092765574381
democratic 311.1819914536089
dallas 305.62517017765157
mantle 300.06834890169426
1961 283.39788507382235
jr 277.84106379786505
laos 272.28424252190774
Now, we will look at words that only occur in our corpus of fiction. We see that the most frequent hit is a string of “=” characters that our punct_list didn’t account for.
head(brown_key_news_fic["c2_only"],10)
> ==== 382.79104601814095
bottle 249.64633435965717
shook 216.3601564450362
jess 195.5562952483981
linda 187.23475076974285
kate 183.07397853041525
laughed 178.91320629108762
scotty 178.91320629108762
curt 166.43088957310476
matsuo 153.94857285512194
Next, we will look at the top hits for words that are shared across the two corpora using percent difference (%diff). The results indicate (for example) that administration occurs with a frequency that is 11,652% higher in the newspaper corpus than the fiction corpus.
head(brown_key_news_fic["%diff"],10)
> administration 11652.632544079483
1 10717.764046254979
soviet 10183.553476069548
berlin 8313.816480420537
communist 8180.263837874182
4 7512.500625142393
election 7245.395340049677
international 7111.842697503319
vote 6844.737412410604
industry 6711.184769864246
We can also look at the words that occur less frequently in the newspaper corpus than the fiction corpus. The results indicate (for example) that drink has a 97% lower frequency in the newspaper corpus than the fiction corpus.
head(brown_key_news_fic["%diff"],10,hsort = False) #reverse order
> drink -97.69736823195935
lips -97.57177013552077
wondered -97.15845441390728
horses -97.15845441390728
nodded -97.09668168377482
cousin -96.96471266940095
stairs -96.82017517746768
ai -96.82017517746768
hell -96.43859619876379
silent -96.3904691203687
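The exercises below ask about n-grams rather than single words. The workflow is the same; we simply change the gram_size argument when building our frequency dictionaries. A bigram version is sketched below (the variable names are just suggestions):

brown_news_bi_freq = corpus_freq("brown_press",2) #bigram frequency dictionary for the newspaper corpus
brown_fic_bi_freq = corpus_freq("brown_fiction",2) #bigram frequency dictionary for the fiction corpus
brown_key_news_fic_bi = keyness(brown_news_bi_freq,brown_fic_bi_freq) #directional keyness (newspaper corpus as target)
head(brown_key_news_fic_bi["%diff"],10) #ten most "key" bigrams in the newspaper corpus according to %diff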
Exercises
1. What are the five most frequent quadgrams that only occur in the newspaper corpus? Be sure to report the frequencies.
2. What are the five most frequent quadgrams that only occur in the fiction corpus? Be sure to report the frequencies.
3. What are the ten most “key” trigrams in the newspaper corpus? Be sure to report the keyness values (and method used).
4. What are the ten least “key” trigrams in the newspaper corpus? Be sure to report the keyness values (and method used).
5. Check the frequency of the items identified in Exercise 3 in each corpus. What are some related limitations of the keyness method? How might we mitigate this/these issue(s)? (This was purposefully vague… but think about how frequent an item needs to be across contexts to be both “important” and “useful” - and note that this answer may change depending on purpose.) Don’t spend too much time on this question!