Advanced Concordancing
(updated 11-19-2020)
This supplemental code shows one way to conduct concordance searches in more complex situations (e.g., when the target items are n-grams and/or when you want to search for a target item + an item in the context).
Preliminaries
First, we will import necessary packages and define a tokenization function (note that this will likely need to be refined for one’s particular purposes).
import glob
import re
import random
def tokenize(input_string):
tokenized = [] #empty list that will be returned
punct_list = [".", "?","!",",","'"] #this is a sample (but incomplete!) list of punctuation characters
replace_list = ["\n","\t"] #this is a sample (but potentially incomplete) list of items to replace with spaces
ignore_list = [""] #This is a sample (but potentially incomplete) list if items to ignore
for x in punct_list: #iterate through the punctuation list and replace each item with a space + the item
input_string = input_string.replace(x," " + x)
for x in replace_list: #iterate through the replace list and replace it with a space
input_string = input_string.replace(x," ")
input_string = input_string.lower() #our examples will be in English, so for now we will lower them
input_list = input_string.split(" ") #then we split the string into a list
#finally, we ignore unwanted items
for x in input_list:
if x not in ignore_list: #if item is not in the ignore list
tokenized.append(x) #add it to the list "tokenized"
#Then, we return the list
return(tokenized)
Advanced concordancing
As alluded to above, the concord2() function will allow for more complex concordance searches. Target items can be words, n-grams, or a mixture of the two. Searches can also be restricted with regard to particular items (in this case words) in the context.
The concord2() function takes the following arguments:
- tok_list a tokenized list of strings
- target a list of the target strings (e.g., words and/or n-grams)
- nleft length of preceding context (in number of words)
- nright length of following context (in number of words)
- cntxt_search a list of the target strings (e.g., words to search for in the context)
def concord2(tok_list,target,nleft,nright,cntxt_search = None): #target is list of target items, cntxt_search is list of items to search for in context.
hits = [] #empty list for search hits
for idx, x in enumerate(tok_list): #iterate through token list using the enumerate function. idx = list index, x = list item
hit = False #Boolean value to check for hits
ngram = False #whether target item is an ngram
for y in target: #iterate through target items
if len(y.split(" ")) == 1: #if the target item is a word
if x == y: #if the word is a match with the target item
hit = True
break #if we have a hit move on to concordance lines
else:
gram_size = len(y.split(" ")) #length of n-gram
x_ngram = " ".join(tok_list[idx:idx+gram_size]) #make string version of current ngram frame
if x_ngram == y: #check to see if current nogram matches target ngram
hit = True
x = x_ngram #change current item to ngram
ngram = True
break #if we have a hit move on to concordance lines
if hit == True: #if the item matches one of the target items
if idx < nleft: #deal with left context if search term comes early in a text
left = tok_list[:idx] #get x number of words before the current one (based on nleft)
else:
left = tok_list[idx-nleft:idx] #get x number of words before the current one (based on nleft)
t = x #set t as the item
if ngram == False:
right = tok_list[idx+1:idx+nright+1] #get x number of words after the current one (based on nright)
else:
right = tok_list[idx+1+gram_size:idx+gram_size+nright+1]
if cntxt_search == None:
hits.append([left,t,right]) #append a list consisting of a list of left words, the target word, and a list of right words
else:
cntxt_hit = False
for item in right + left:
if item in cntxt_search:
cntxt_hit = True
break
if cntxt_hit == True:
hits.append([left,t,right]) #append a list consisting of a list of left words, the target word, and a list of right words
return(hits)
Now we can use the function to get concordance lines for n-grams:
sample = "I like to eat healthy food. On the other hand, I also really like pizza. But to be precise, on the other hand, I like pepperoni pizza in my hand (right before it goes in my mouth)."
for x in concord2(tokenize(sample),["other hand"], 5,5):
print(x)
[['healthy', 'food', '.', 'on', 'the'], 'other hand', ['i', 'also', 'really', 'like', 'pizza']]
[['be', 'precise', ',', 'on', 'the'], 'other hand', ['i', 'like', 'pepperoni', 'pizza', 'in']]
And, we can constrain our searches so that they only include hits with particular words in the context:
for x in concord2(tokenize(sample),["other hand"], 5,5, cntxt_search = ["precise"]):
print(x)
[['be', 'precise', ',', 'on', 'the'], 'other hand', ['i', 'like', 'pepperoni', 'pizza', 'in']]
Now, we can update our corpus concordance function to include our concord2() function.
The corp_conc2() function takes the following arguments:
- corp_folder name of folder that includes the corpus files (this should be in your working directory!)
- target a list of the target strings (e.g., words and/or n-grams)
- nleft length of preceding context (in number of words)
- nright length of following context (in number of words)
- cntxt_search a list of the target strings (e.g., words to search for in the context)
def corp_conc2(corp_folder,target,nhits,nleft,nright,cntxt_search = None): #cntxt_search is list of items to search for in context
hits = []
filenames = glob.glob(corp_folder + "/*.txt") #make a list of all .txt file in corp_folder
for filename in filenames: #iterate through filename
text = tokenize(open(filename).read())
#add concordance hits for each text to corpus-level list:
for x in concord2(text,target,nleft,nright,cntxt_search): #here we use the concord() function to generate concordance lines
hits.append(x)
# now we generate the random sample
if len(hits) <= nhits: #if the number of search hits are less than or equal to the requested sample:
print("Search returned " + str(len(hits)) + " hits.\n Returning all " + str(len(hits)) + " hits")
return(hits) #return entire hit list
else:
print("Search returned " + str(len(hits)) + " hits.\n Returning a random sample of " + str(nhits) + " hits")
return(random.sample(hits,nhits)) #return the random sample
We can now test our function using the Brown corpus and an n-gram search (this presumes that you have the Brown corpus in your working directory):
brown_otoh = corp_conc2("brown_corpus",["on the other hand"],25,5,5)
for x in brown_otoh:
print(x)
Search returned 58 hits.
Returning a random sample of 25 hits
[['it', 'time', 'and', 'again', '.'], 'on the other hand', ['the', 'women', 'class', 'members', 'appeared']]
[['signal', 'ambiguity', 'or', 'uncertainty', '.'], 'on the other hand', ['facts', 'may', 'be', 'concealed', '--']]
[['.', 'sex', 'was', 'both', '.'], 'on the other hand', ['some', 'unwed', 'mothers', 'had', 'had']]
[['to', 'achieve', 'those', 'goals', '.'], 'on the other hand', ['it', 'is', 'no', 'interference', 'with']]
[['astwood', ',', '1954', ')', '.'], 'on the other hand', ['there', 'are', 'a', 'few', 'antithyroid']]
[['well', 'developed', 'respiratory', 'bronchioles', ','], 'on the other hand', ['appear', 'to', 'be', 'the', 'only']]
[['a', 'busted', 'front', 'spring', '.'], 'on the other hand', ['howsomever', ',', 'maybe', 'you', 'wouldn']]
[['setback', 'to', 'the', 'constitution', '.'], 'on the other hand', ['molesworth', 'was', 'naturally', 'assailed', 'in']]
[['individual', 'objects', '.', 'if', ','], 'on the other hand', ['they', 'opted', 'for', 'representation', ',']]
[['original', 'cession', 'was', 'invalid', '.'], 'on the other hand', ['he', 'did', 'not', 'want', 'to']]
[['real', 'headaches', 'in', 'store', '.'], 'on the other hand', ['the', 'process', 'of', 'obsoleting', 'an']]
[[',', 'dolores', 'would', 'crack', '.'], 'on the other hand', ['if', 'she', 'didn', "'t", 'remove']]
[[',', 'bestial', 'and', 'unworthy', '.'], 'on the other hand', ['wifely', 'supremacy', 'demeans', 'the', 'husband']]
[['happier', 'one', '.', 'research', ','], 'on the other hand', ['has', 'shown', 'many', 'stepmothers', 'to']]
[['be', 'moot', '.', 'if', ','], 'on the other hand', ['it', 'is', 'not', 'settled', ',']]
[['of', 'which', 'it', 'arises', '.'], 'on the other hand', ['we', 'cannot', 'regard', 'artistic', 'invention']]
[['newport', ',', 'and', 'providence', '.'], 'on the other hand', ['dr', '.', 'ezra', 'styles', 'recorded']]
[['not', 'seem', 'very', 'bright', '.'], 'on the other hand', ['to', 'greet', 'them', 'with', 'delight']]
[['cause', 'increased', 'convulsive', 'discharges', '.'], 'on the other hand', ['the', 'temporary', 'reduction', 'in', 'hypothalamic']]
[['that', 'enacted', 'for', '1960', '.'], 'on the other hand', ['the', 'new', 'authority', 'of', '$3']]
[[';', ';', 'while', 'jones', ','], 'on the other hand', ['appeared', 'perfectly', 'confident', 'and', 'ulyate']]
[['heads', 'and', 'two', 'tails', '.'], 'on the other hand', ['they', ',', 'or', 'it', ',']]
[['of', 'time', 'and', 'change', '.'], 'on the other hand', ['christian', 'faith', 'knows', 'that', 'death']]
[['to', 'himself', '.', 'or', ','], 'on the other hand', ['are', 'unlikely', 'facts', 'being', 'stated']]
[['face-to-face', 'group', 'of', 'individuals', '.'], 'on the other hand', ['many', 'a', 'pastor', 'is', 'so']]
We can also test our function with a constrained context:
brown_otoh_modal = corp_conc2("brown_corpus",["on the other hand"],25,5,5,cntxt_search = ["may","would","could","might"])
for x in brown_otoh_modal:
print(x)
Search returned 5 hits.
Returning all 5 hits
[['signal', 'ambiguity', 'or', 'uncertainty', '.'], 'on the other hand', ['facts', 'may', 'be', 'concealed', '--']]
[['would', 'be', 'no', 'epidemic', '.'], 'on the other hand', ['a', 'similar', 'attack', 'might', 'have']]
[['your', 'hands', '--', 'now', '.'], 'on the other hand', ['you', 'may', 'seek', 'his', 'favor']]
[['like', '.', 'his', 'election', ','], 'on the other hand', ['would', 'unquestionably', 'strengthen', 'the', '``']]
[[',', 'dolores', 'would', 'crack', '.'], 'on the other hand', ['if', 'she', 'didn', "'t", 'remove']]