corpus-toolkit Documentation Page
This page includes details on the arguments each function in the corpus-toolkit package takes.
Note that this page is a work in progress! All (heavily commented) code is also available here
Default lists
default_punct_list = [",",".","?","'",'"',"!",":",";","(",")","[","]","''","``","--"] #we can add more items to this if needed
default_space_list = ["\n","\t","    ","   ","  "] #newlines, tabs, and runs of multiple spaces are replaced with a single space
ignore_list = [""," ","  ","   ","    "] #list of items we want to ignore in our frequency calculations
ldcorpus()
This function will load all files that match a certain filename ending (e.g., “.txt”) in a folder. By default it loads all files ending in “.txt” and prints the name of each file being loaded.
ldcorpus() is a generator function that loads all corpus files in a folder. It takes three arguments (two of which have default values):
- dirname (string variable) This is the name of the directory that one's files are in. It will not gather files in nested folders.
- ending (string variable) This is the ending for the target filenames. By default, this is ".txt".
- verbose (Boolean variable) This determines whether filenames are printed to the console during loading. By default, this is set to True.
import glob
import os

dirsep = os.path.sep #directory separator; the package defines this at the module level (os.path.sep is used here as a stand-in)

def ldcorpus(dirname,ending = ".txt",verbose = True):
    filenames = glob.glob(dirname + "/*" + ending) #gather all text names
    nfiles = len(filenames) #get total number of files in corpus
    fcount = 0 #counter for corpus files
    for x in filenames:
        fcount += 1 #update file count
        sm_fname = x.split(dirsep)[-1] #get filename (without the directory path)
        if verbose == True:
            print("Processing", sm_fname, "(" + str(fcount), "of", nfiles, "files)")
        text = open(x, errors = "ignore").read()
        yield(text)
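A minimal usage sketch is given below; the folder name "my_corpus" is a hypothetical example. Because ldcorpus() is a generator, each text is produced only when the loop requests it.
corp = ldcorpus("my_corpus") #"my_corpus" is a hypothetical folder of .txt files
for text in corp: #iterate through the generator; each text is a string
    print(len(text)) #e.g., print the number of characters in each text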
tokenize()
tokenize() is a generator function that tokenizes a list of texts. It takes eight arguments (seven of which have default values):
- corpus (list of texts) This is a list of corpus texts (strings).
- remove_list (list of characters) This is a list of characters to be removed from each text. By default, this is the default_punct_list.
- space_list (list of characters) This is a list of characters (and character sequences) to be replaced by a single space. By default, this is the default_space_list.
- split_token (string variable) This is the character used to split the text string. By default, this is a single space " ".
- lower (Boolean variable) This determines whether all characters in each text are set to lower case. By default, this is True.
- lemma (dictionary) This is the lemma dictionary used to lemmatize tokens in each text. It consists of lower-case unlemmatized words as keys and lemmas as values. By default, this is a pre-loaded lemma list. If set to False, texts are not lemmatized.
- ngram (Boolean variable or integer) This sets the n-gram length for tokenization. By default, this is set to False (and no n-grams are created).
- ngrm_connect (string variable) This sets the character(s) used to join words in an n-gram. By default, this is set to "__".
def tokenize(corpus, remove_list = default_punct_list, space_list = default_space_list, split_token = " ", lower = True, lemma = lemma_dict, ngram = False, ngrm_connect = "__"):
    for text in corpus: #iterate through each string in the corpus_list
        for item in remove_list:
            text = text.replace(item,"") #replace each item in list with "" (i.e., nothing)
        for item in space_list:
            text = text.replace(item," ")
        if lower == True:
            text = text.lower()
        #then we will tokenize the document
        tokenized = text.split(split_token) #split string into list using the split token (by default this is a space " ")
        if lemma != False: #if lemma isn't False
            tokenized = lemmatize(tokenized,lemma)
        if ngram != False:
            tokenized = ngrammer(tokenized,ngram,ngrm_connect)
        yield(tokenized)
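A minimal usage sketch, chaining tokenize() with ldcorpus() (the folder name "my_corpus" is again hypothetical). Note that because both functions are generators, the resulting object can only be iterated over once.
tok_corp = tokenize(ldcorpus("my_corpus")) #load and tokenize a hypothetical folder of .txt files
for tokenized_text in tok_corp: #iterate through the tokenized texts
    print(tokenized_text[:10]) #e.g., print the first ten tokens of each text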
concord()
The concord() function takes a list of tokenized corpus texts (i.e., a list of lists) such as that generated by the tokenize() function, a list of search terms, and (optionally) a list of collocates and returns a list of concordance lines (which can optionally be written to a file). The concord() function takes nine arguments (seven of which have default values):
- tokenized_corp (list of lists) This is a list consisting of tokenized texts (represented as lists of strings).
- target (list of strings) This is a list of search strings. If regex = True, search terms are interpreted as regular expressions.
- nhits (integer) This indicates how many hits should be returned (if nhits > number of actual corpus hits, the findings represent a random sample). By default, this is 25.
- nleft (integer) This indicates how many tokens should be included in the left context. By default, this is 10.
- nright (integer) This indicates how many tokens should be included in the right context. By default, this is 10.
- collocates (list of strings) This is a secondary list of search strings used to explore contexts in which collocates occur. If regex = True, search terms are interpreted as regular expressions. By default, this is an empty list (and no collocates are included in the search).
- outname (string) This indicates the name of the output file. By default, this is an empty string (and no file is written).
- sep (string) This indicates how tokens are separated when concordance lines are written to a file. By default, this is a tab character (“\t”).
- regex (Boolean value) This indicates whether search terms (and collocates) are treated as regular expressions. By default, this is set to “False”.
#Note that the concord() function relies heavily on the concord_text() function, which is included below.
import re #regular expression module (used when regex = True)
import random #used by concord() to generate random samples

def concord_text(tok_list,target,nleft,nright,collocates = [],regex = False):
    hits = [] #empty list for search hits
    for idx, x in enumerate(tok_list): #iterate through token list using the enumerate function. idx = list index, x = list item
        match = False
        if regex == False:
            if x in target: #if the item matches one of the target items
                match = True
        if regex == True:
            for item in target:
                if re.compile(item).match(x) != None:
                    match = True
        if match == True:
            if idx < nleft: #deal with left context if search term comes early in a text
                left = tok_list[:idx] #get all words before the current one
            else:
                left = tok_list[idx-nleft:idx] #get x number of words before the current one (based on nleft)
            t = x #set t as the item
            right = tok_list[idx+1:idx+nright+1] #get x number of words after the current one (based on nright)
            if len(collocates) == 0: #if no collocates are defined
                hits.append([left,t,right]) #append a list consisting of a list of left words, the target word, and a list of right words
            else:
                colmatch = False #switch that indicates whether a collocate was found in the context
                if regex == False:
                    for y in left + right:
                        if y in collocates:
                            colmatch = True
                if regex == True:
                    for y in left + right:
                        for item in collocates:
                            if re.compile(item).match(y) != None:
                                colmatch = True
                if colmatch == True:
                    hits.append([left,t,right]) #append a list consisting of a list of left words, the target word, and a list of right words
    return(hits)
def concord(tokenized_corp,target,nhits=25,nleft=10,nright=10,collocates = [], outname = "",sep = "\t", regex = False):
    hits = []
    for text in tokenized_corp:
        for hit in concord_text(text,target,nleft,nright,collocates,regex):
            hits.append(hit)
    # now we generate the random sample
    if len(hits) <= nhits: #if the number of search hits is less than or equal to the requested sample size:
        print("Search returned " + str(len(hits)) + " hits.\n Returning all " + str(len(hits)) + " hits")
        if len(outname) > 0:
            write_concord(outname,hits,sep)
        return(hits) #return entire hit list
    else:
        print("Search returned " + str(len(hits)) + " hits.\n Returning a random sample of " + str(nhits) + " hits")
        if len(outname) > 0:
            write_concord(outname,hits,sep)
        return(random.sample(hits,nhits)) #return the random sample
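A minimal usage sketch, assuming a hypothetical folder "my_corpus" and the search term "run" (which will match the lemma form because tokenize() lemmatizes by default):
hits = concord(tokenize(ldcorpus("my_corpus")),["run"],nhits = 10,nleft = 5,nright = 5)
for left, target, right in hits: #each hit is a [left context, target, right context] list
    print(" ".join(left),"[" + target + "]"," ".join(right))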
frequency()
The frequency() function takes a list of tokenized corpus texts (i.e., a list of lists) such as that generated by the tokenize() function and returns a frequency (or range) dictionary with linguistic items (e.g., words) as keys and frequency (or range) values as values. The frequency() function takes four arguments (three of which have default values):
- corpus_list (list of lists) This is a list consisting of tokenized texts (represented as lists of strings).
- ignore (list of strings) This is a list of strings to ignore when calculating frequency. By default, this is the pre-defined ignore_list.
- calc (string) This indicates whether the function will produce frequency or range values. The options are "freq" (default) or "range".
- normed (Boolean value) This indicates whether frequencies are normed (per million words) or represent raw frequencies. By default, this is set to False (raw frequencies are reported).
def frequency(corpus_list, ignore = ignore_list, calc = 'freq', normed = False): #options for calc are 'freq' or 'range'
    freq_dict = {} #empty dictionary
    for tokenized in corpus_list: #iterate through the tokenized texts
        if calc == 'range': #if range was selected:
            tokenized = list(set(tokenized)) #this creates a list of types (unique words)
        for token in tokenized: #iterate through each word in the texts
            if token in ignore: #if the token is in the ignore list
                continue #move on to the next word
            if token not in freq_dict: #if the token isn't already in the dictionary:
                freq_dict[token] = 1 #set the token as the key and the value as 1
            else: #if it is in the dictionary
                freq_dict[token] += 1 #add one to the count
    ### Normalization:
    if normed == True and calc == 'freq':
        corp_size = sum(freq_dict.values()) #this sums all of the values in the dictionary
        for x in freq_dict:
            freq_dict[x] = freq_dict[x]/corp_size * 1000000 #norm per million words
    elif normed == True and calc == "range":
        corp_size = len(corpus_list) #number of documents in corpus (note that this requires a list rather than a generator)
        for x in freq_dict:
            freq_dict[x] = freq_dict[x]/corp_size * 100 #create percentage (norm by 100)
    return(freq_dict)
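A minimal usage sketch with a hypothetical folder "my_corpus". Because ldcorpus() and tokenize() are generators (and are exhausted after a single pass), the corpus is re-loaded for each calculation:
freq_dict = frequency(tokenize(ldcorpus("my_corpus"))) #raw frequency counts
range_dict = frequency(tokenize(ldcorpus("my_corpus")),calc = "range") #document range counts
normed_freq = frequency(tokenize(ldcorpus("my_corpus")),normed = True) #frequency per million words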
head()
The head() function takes a dictionary of word (or n-gram) keys with a statistic (e.g., frequency) as the values and returns a sorted representation of the dictionary. The main purpose of the head() function is to quickly check any lists generated by other functions. It can print the results, save the sorted list as a Python object, or write a sorted list to file. The head() function takes six arguments (five of which have default values):
- stat_dict (dictionary) This is a dictionary of item : statistic key - value pairs (e.g., the output of the frequency() function).
- hits (integer) This is an integer indicating how many items in the sample should be printed. (Note that this is ignored if writing to a file or saving a Python object.) By default, this is set to 20.
- hsort (Boolean value) If True, the values are sorted from largest to smallest. If False, the values are sorted from smallest to largest. By default, this is set to True.
- output (Boolean value) If True, the head() function outputs a list consisting of (item, statistic) tuples, and no items are printed to the console. By default, this is set to False.
- filename (string) Providing a filename string (e.g., "my_frequency.txt") will cause the head() function to write a spreadsheet file including all items and their statistic (e.g., frequency value). By default, this is set to None, and no file is written.
- sep (string) This sets the character(s) used to separate columns when lists are written to a file. By default, this is a tab character "\t".
def head(stat_dict,hits = 20,hsort = True,output = False,filename = None, sep = "\t"):
    #first, create sorted list. Presumes that operator has been imported
    sorted_list = sorted(stat_dict.items(),key=operator.itemgetter(1),reverse = hsort)[:hits]
    if output == False and filename == None: #if we aren't writing a file or returning a list
        for x in sorted_list: #iterate through the output
            print(x[0] + "\t" + str(x[1])) #print the sorted list in a nice format
    elif filename is not None: #if a filename was provided
        outf = open(filename,"w") #create a blank file in the working directory using the filename
        for x in sorted_list: #iterate through list
            outf.write(x[0] + sep + str(x[1]) + "\n") #write each line to a file using the separator
        outf.flush() #flush the file buffer
        outf.close() #close the file
    if output == True: #if output is true
        return(sorted_list) #return the sorted list
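A minimal usage sketch, using the freq_dict created in the frequency() example above (the output filename "my_frequency.txt" is a hypothetical example):
head(freq_dict,hits = 10) #print the ten most frequent items to the console
freq_sorted = head(freq_dict,output = True) #save the sorted (item, statistic) tuples as a Python object
head(freq_dict,filename = "my_frequency.txt") #write the sorted list to a tab-delimited file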
tag()
The tag() function is a generator function that uses spaCy to tokenize, lemmatize, and tag each text in a corpus (e.g., one loaded with ldcorpus()). It yields a list of tagged tokens (word + connector + tag) for each text, and presumes that a spaCy model has been loaded as nlp. The tag() function takes nine arguments (eight of which have default values):
- corpus (list of texts) This is a list (or generator) of corpus texts (strings).
- tp (string) This is the tag type. The options are "penn" (Penn Treebank tags), "upos" (universal part of speech tags), or "dep" (dependency relationships). If set to None, tokens are not tagged. By default, this is "upos".
- lemma (Boolean value) This determines whether tokens are lemmatized. By default, this is True.
- pron (Boolean value) This determines whether spaCy's pronoun lemmatization ("-PRON-") is used. If False, the raw (lowered) form of each pronoun is used instead. By default, this is False.
- lower (Boolean value) This determines whether tokens are set to lower case when lemma = False. By default, this is True.
- connect (string) This sets the character(s) used to join each word and its tag. By default, this is "_".
- ignore (list of strings) This is a list of universal POS tags to ignore (tokens with these tags are excluded from the output). By default, this is ["PUNCT","SPACE","SYM"].
- ngram (Boolean variable or integer) This sets the n-gram length. By default, this is set to False.
- ngrm_connect (string variable) This sets the character(s) used to join words in an n-gram. By default, this is set to "__".
def tag(corpus,tp = "upos", lemma = True, pron = False, lower = True, connect = "_",ignore = ["PUNCT","SPACE","SYM"],ngram = False,ngrm_connect = "__"):
    #check to make sure a valid tag type was chosen (None is also allowed: tokens are left untagged)
    if tp not in ["penn","upos","dep",None]:
        print("Please use a valid tag type: 'penn','upos', or 'dep'")
    else:
        for text in corpus:
            doc = nlp(text) #use spacy to tokenize, lemmatize, pos tag, and parse the text
            text_list = [] #empty list for output
            for token in doc: #iterate through the tokens in the document
                if token.pos_ in ignore: #if the universal POS tag is in our ignore list, then move to next word
                    continue
                if lemma == True: #if we chose lemma (this is the default)
                    if pron == False: #if we don't want Spacy's pronoun lemmatization
                        if token.lemma_ == "-PRON-":
                            word = token.text.lower() #then use the raw form of the word
                        else:
                            word = token.lemma_ #otherwise the word form will be a lemma
                    else:
                        word = token.lemma_ #then the word form will be a lemma
                else:
                    if lower == True: #if we chose lemma = False but we want our words lowered (this is the default)
                        word = token.text.lower() #then lower the word
                    else:
                        word = token.text #if we chose lemma = False and lower = False, just give us the word
                if tp == None: #if tp = None, then just give the tokenized word (and nothing else)
                    text_list.append(word)
                else:
                    if tp == "penn":
                        tagged = token.tag_ #modified penn tag
                    elif tp == "upos":
                        tagged = token.pos_ #universal pos tag
                    elif tp == "dep":
                        tagged = token.dep_ #dependency relationship
                    tagged_token = word + connect + tagged #add word, connector ("_" by default), and tag
                    text_list.append(tagged_token) #add to list
            if ngram != False:
                text_list = ngrammer(text_list,ngram,ngrm_connect)
            yield(text_list) #yield text list
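A minimal usage sketch with a hypothetical folder "my_corpus". Note that tag() takes untokenized texts (such as those produced by ldcorpus()) and presumes that a spaCy model has been loaded as nlp:
tagged_corp = tag(ldcorpus("my_corpus")) #lemmatized, upos-tagged texts (e.g., "run_VERB")
tagged_freq = frequency(tagged_corp) #frequency of the word_tag items
head(tagged_freq,hits = 10) #print the ten most frequent word_tag items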
write_corpus()
The write_corpus() function writes a tokenized corpus (such as one generated by the tokenize() or tag() functions) to a folder, with one file per corpus document. The write_corpus() function takes four arguments (two of which have default values):
- new_dirname (string) This is the name of the folder that the new corpus files will be written to (it is created if it does not already exist).
- corpus (list of lists) This is a list (or generator) of tokenized texts.
- dirname (string or False) This is the name of the folder that the original corpus files were loaded from. If provided, the original filenames are reused for the new files. By default, this is False, and the new files are numbered sequentially (e.g., "1.txt", "2.txt").
- ending (string) This is the filename ending used when searching the original folder and when naming the new files. By default, this is "txt".
def write_corpus(new_dirname,corpus, dirname = False, ending = "txt"):
    name_list = []
    if dirname != False:
        for x in glob.glob(dirname + "/*" + ending):
            simple_name = x.split(dirsep)[-1] #split the long directory name by the file separator and take the last item (the short filename)
            name_list.append(simple_name)
    try:
        os.mkdir(new_dirname + "/") #make the new folder
    except FileExistsError: #if folder already exists, then print message
        print("Writing files to existing folder")
    for i, document in enumerate(corpus): #use enumerate to iterate through the corpus list
        if dirname == False:
            new_filename = new_dirname + "/" + str(i+1) + "." + ending #create new filename
        else:
            new_filename = new_dirname + "/" + name_list[i] #create new filename
        outf = open(new_filename,"w") #create outfile with new filename
        corpus_string = " ".join(document) #turn corpus list into string
        outf.write(corpus_string) #write corpus list
        outf.flush()
        outf.close()
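A minimal usage sketch; the folder names "my_corpus" and "my_corpus_lemmatized" are hypothetical. Passing the original dirname preserves the original filenames in the new folder:
write_corpus("my_corpus_lemmatized",tokenize(ldcorpus("my_corpus")),dirname = "my_corpus")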
dep_bigram()
The dep_bigram() function takes a corpus of (untokenized) texts, uses spaCy to parse each text, and collects dependency bigrams (dependent_head pairs) for a given dependency relationship (e.g., "dobj"). It returns a dictionary of dictionaries that includes bigram frequency ("bi_freq"), dependent frequency ("dep_freq"), head frequency ("head_freq"), bigram range ("range"), and a list of example sentences ("samples"). The dep_bigram() function takes nine arguments (seven of which have default values):
- corpus (list of texts) This is a list (or generator) of corpus texts (strings).
- dep (string) This is the dependency relationship to search for (e.g., "dobj").
- lemma (Boolean value) This determines whether the lemma form of the dependent and head is used. By default, this is True.
- lower (Boolean value) This determines whether the dependent and head are set to lower case when lemma = False. By default, this is True.
- pron (Boolean value) This determines whether spaCy's pronoun lemmatization ("-PRON-") is used. By default, this is False.
- dep_upos (string) If provided, hits are restricted to dependents with this universal POS tag. By default, this is None.
- head_upos (string) If provided, hits are restricted to heads with this universal POS tag. By default, this is None.
- dep_text (string) If provided, hits are restricted to dependents that match this string. By default, this is None.
- head_text (string) If provided, hits are restricted to heads that match this string. By default, this is None.
def dep_bigram(corpus,dep,lemma = True, lower = True, pron = False, dep_upos = None, head_upos = None, dep_text = None, head_text = None):
    bi_freq = {} #holder for dependency bigram frequency
    dep_freq = {} #holder for dependent frequency
    head_freq = {} #holder for head frequency
    range_freq = {} #holder for range information
    match_sentences = [] #holder for sentences that include matches
    def dicter(item,d): #d is a dictionary
        if item not in d:
            d[item] = 1
        else:
            d[item] += 1
    textid = 0
    for text in corpus: #iterate through corpus texts
        textid += 1
        range_list = [] #for range information
        doc = nlp(text) #tokenize, tag, and parse text using spaCy
        for sentence in doc.sents: #iterate through sentences
            index_start = 0 #for identifying sentence-level indexes later
            sent_text = [] #holder for sentence
            dep_headi = [] #list for storing [dep,head] indexes
            first_token = True #for identifying index of first token
            for token in sentence: #iterate through tokens in the sentence
                if first_token == True:
                    index_start = token.i #if this is the first token, set the index start number
                    first_token = False #then set first token to False
                sent_text.append(token.text) #add word to sentence
                if token.dep_ == dep: #if the token's dependency tag matches the one designated
                    dep_tg = token.pos_ #get upos tag for the dependent (only used if dep_upos is specified)
                    head_tg = token.head.pos_ #get upos tag for the head (only used if head_upos is specified)
                    if lemma == True: #if lemma is true, use lemma form of dependent and head
                        if pron == False: #if we don't want Spacy's pronoun lemmatization
                            if token.lemma_ == "-PRON-":
                                dependent = token.text.lower() #then use the raw form of the word
                                headt = token.head.text.lower()
                            else:
                                dependent = token.lemma_
                                headt = token.head.lemma_
                        else:
                            dependent = token.lemma_
                            headt = token.head.lemma_
                    if lemma == False: #if lemma is false, use the token form
                        if lower == True: #if lower is true, lower it
                            dependent = token.text.lower()
                            headt = token.head.text.lower()
                        else: #if lower is false, don't lower
                            dependent = token.text
                            headt = token.head.text
                    if dep_upos != None and dep_upos != dep_tg: #if dependent tag is specified and upos doesn't match, skip item
                        continue
                    if head_upos != None and head_upos != head_tg: #if head tag is specified and upos doesn't match, skip item
                        continue
                    if dep_text != None and dep_text != dependent: #if dependent text is specified and text doesn't match, skip item
                        continue
                    if head_text != None and head_text != headt: #if head text is specified and text doesn't match, skip item
                        continue
                    dep_headi.append([token.i-index_start,token.head.i-index_start]) #add sentence-level index numbers for dependent and head
                    dep_bigram = dependent + "_" + headt #create dependency bigram
                    range_list.append(dep_bigram) #add to document-level range list
                    dicter(dep_bigram,bi_freq) #add values to frequency dictionary
                    dicter(dependent,dep_freq) #add values to frequency dictionary
                    dicter(headt,head_freq) #add values to frequency dictionary
            ### this section is for creating a list of sentences that include our hits ###
            for x in dep_headi: #iterate through hits
                temp_sent = sent_text.copy() #because there may be multiple hits in each sentence (but we only want to display one hit at a time), we make a temporary copy of the sentence that we will modify
                depi = sent_text[x[0]] + "_" + dep + "_dep" #e.g., word_dobj_dep
                headi = sent_text[x[1]] + "_" + dep + "_head" #e.g., word_dobj_head
                temp_sent[x[0]] = depi #change dependent word to depi in temporary sentence
                temp_sent[x[1]] = headi #change head word to headi in temporary sentence
                temp_sent.append(str(textid)) #add text index to the sentence to indicate where the example originated
                match_sentences.append(temp_sent) #add temporary sentence to match_sentences for output
        for x in list(set(range_list)): #create a type list of the dep_bigrams in the text
            dicter(x,range_freq) #add document counts to the range_freq dictionary
    bigram_dict = {"bi_freq":bi_freq,"dep_freq":dep_freq,"head_freq": head_freq, "range":range_freq, "samples":match_sentences} #create a dictionary of dictionaries
    return(bigram_dict) #return dictionary of dictionaries
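A minimal usage sketch with a hypothetical folder "my_corpus", collecting verb + direct object ("dobj") bigrams:
bg_dict = dep_bigram(ldcorpus("my_corpus"),"dobj") #collect direct object + verb bigrams
head(bg_dict["bi_freq"],hits = 10) #print the ten most frequent dependent_head pairs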