corpus-toolkit Documentation Page
This page includes details on the arguments each function in the corpus-toolkit package takes.
Note that this page is a work in progress! All (heavily commented) code is also available here
Default lists
default_punct_list = [",",".","?","'",'"',"!",":",";","(",")","[","]","''","``","--"] #we can add more items to this if needed
default_space_list = ["\n","\t","    ","   ","  "] #newlines, tabs, and runs of multiple spaces are replaced with a single space
ignore_list = [""," ","  ","   ","    "] #list of items we want to ignore in our frequency calculations
ldcorpus()
This function will load all files that match a certain filename ending (e.g., “.txt”) in a folder. By default it loads all files ending in “.txt” and prints the name of each file being loaded.
ldcorpus() is a generator function that loads all corpus files in a folder. It takes three arguments (two of which have default values):
- dirname (string variable) This is the name of the directory that one's files are in. It will not gather files in nested folders.
- ending (string variable) This is the ending for the target filenames. By default, this is ".txt".
- verbose (Boolean variable) This determines whether filenames are printed to the console during loading. By default, this is set to True.
import glob
import os

dirsep = os.path.sep #directory separator; the package defines this at the module level (os.path.sep is used here as a stand-in)

def ldcorpus(dirname,ending = ".txt",verbose = True):
    filenames = glob.glob(dirname + "/*" + ending) #gather all text names
    nfiles = len(filenames) #get total number of files in corpus
    fcount = 0 #counter for corpus files
    for x in filenames:
        fcount += 1 #update file count
        sm_fname = x.split(dirsep)[-1] #get filename (without the directory path)
        if verbose == True:
            print("Processing", sm_fname, "(" + str(fcount), "of", nfiles, "files)")
        text = open(x, errors = "ignore").read()
        yield(text)
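A minimal usage sketch is given below; the folder name "my_corpus" is a hypothetical example. Because ldcorpus() is a generator, each text is produced only when the loop requests it.
corp = ldcorpus("my_corpus") #"my_corpus" is a hypothetical folder of .txt files
for text in corp: #iterate through the generator; each text is a string
    print(len(text)) #e.g., print the number of characters in each text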
tokenize()
tokenize() is a generator function that tokenizes a list of texts. It takes eight arguments (seven of which have default values):
- corpus (list of texts) This is a list of corpus texts (strings).
- remove_list (list of characters) This is a list of characters to be removed from each text. By default, this is the default_punct_list.
- space_list (list of characters) This is a list of characters (and character sequences) to be replaced by a single space. By default, this is the default_space_list.
- split_token (string variable) This is the character used to split the text string. By default, this is a single space " ".
- lower (Boolean variable) This determines whether all characters in each text are set to lower case. By default, this is True.
- lemma (dictionary) This is the lemma dictionary used to lemmatize tokens in each text. It consists of lower-case unlemmatized words as keys and lemmas as values. By default, this is a pre-loaded lemma list. If set to False, texts are not lemmatized.
- ngram (Boolean variable or integer) This sets the n-gram length for tokenization. By default, this is set to False (and no n-grams are created).
- ngrm_connect (string variable) This sets the character(s) used to join words in an n-gram. By default, this is set to "__".
def tokenize(corpus, remove_list = default_punct_list, space_list = default_space_list, split_token = " ", lower = True, lemma = lemma_dict, ngram = False, ngrm_connect = "__"):
    for text in corpus: #iterate through each string in the corpus_list
        for item in remove_list:
            text = text.replace(item,"") #replace each item in list with "" (i.e., nothing)
        for item in space_list:
            text = text.replace(item," ")
        if lower == True:
            text = text.lower()
        #then we will tokenize the document
        tokenized = text.split(split_token) #split string into list using the split token (by default this is a space " ")
        if lemma != False: #if lemma isn't False
            tokenized = lemmatize(tokenized,lemma)
        if ngram != False:
            tokenized = ngrammer(tokenized,ngram,ngrm_connect)
        yield(tokenized)
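A minimal usage sketch, chaining tokenize() with ldcorpus() (the folder name "my_corpus" is again hypothetical). Note that because both functions are generators, the resulting object can only be iterated over once.
tok_corp = tokenize(ldcorpus("my_corpus")) #load and tokenize a hypothetical folder of .txt files
for tokenized_text in tok_corp: #iterate through the tokenized texts
    print(tokenized_text[:10]) #e.g., print the first ten tokens of each text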
concord()
The concord() function takes a list of tokenized corpus texts (i.e., a list of lists) such as that generated by the tokenize() function, a list of search terms, and (optionally) a list of collocates and returns a list of concordance lines (which can optionally be written to a file). The concord() function takes nine arguments (seven of which have default values):
- tokenized_corp (list of lists) This is a list consisting of tokenized texts (represented as lists of strings).
- target (list of strings) This is a list of search strings. If regex = True, search terms are interpreted as regular expressions.
- nhits (integer) This indicates how many hits should be returned (if nhits > number of actual corpus hits, the findings represent a random sample). By default, this is 25.
- nleft (integer) This indicates how many tokens should be included in the left context. By default, this is 10.
- nright (integer) This indicates how many tokens should be included in the right context. By default, this is 10.
- collocates (list of strings) This is a secondary list of search strings used to explore contexts in which collocates occur. If regex = True, search terms are interpreted as regular expressions. By default, this is an empty list (and no collocates are included in the search).
- outname (string) This indicates the name of the output file. By default, this is an empty string (and no file is written).
- sep (string) This indicates how tokens are separated when concordance lines are written to a file. By default, this is a tab character (“\t”).
- regex (Boolean value) This indicates whether search terms (and collocates) are treated as regular expressions. By default, this is set to “False”.
#Note that the concord() function relies heavily on the concord_text() function, which is included below.
import re #regular expression module (used when regex = True)
import random #used by concord() to generate random samples

def concord_text(tok_list,target,nleft,nright,collocates = [],regex = False):
    hits = [] #empty list for search hits
    for idx, x in enumerate(tok_list): #iterate through token list using the enumerate function. idx = list index, x = list item
        match = False
        if regex == False:
            if x in target: #if the item matches one of the target items
                match = True
        if regex == True:
            for item in target:
                if re.compile(item).match(x) != None:
                    match = True
        if match == True:
            if idx < nleft: #deal with left context if search term comes early in a text
                left = tok_list[:idx] #get all words before the current one
            else:
                left = tok_list[idx-nleft:idx] #get x number of words before the current one (based on nleft)
            t = x #set t as the item
            right = tok_list[idx+1:idx+nright+1] #get x number of words after the current one (based on nright)
            if len(collocates) == 0: #if no collocates are defined
                hits.append([left,t,right]) #append a list consisting of a list of left words, the target word, and a list of right words
            else:
                colmatch = False #switch that indicates whether a collocate was found in the context
                if regex == False:
                    for y in left + right:
                        if y in collocates:
                            colmatch = True
                if regex == True:
                    for y in left + right:
                        for item in collocates:
                            if re.compile(item).match(y) != None:
                                colmatch = True
                if colmatch == True:
                    hits.append([left,t,right]) #append a list consisting of a list of left words, the target word, and a list of right words
    return(hits)
def concord(tokenized_corp,target,nhits=25,nleft=10,nright=10,collocates = [], outname = "",sep = "\t", regex = False):
    hits = []
    for text in tokenized_corp:
        for hit in concord_text(text,target,nleft,nright,collocates,regex):
            hits.append(hit)
    # now we generate the random sample
    if len(hits) <= nhits: #if the number of search hits is less than or equal to the requested sample size:
        print("Search returned " + str(len(hits)) + " hits.\n Returning all " + str(len(hits)) + " hits")
        if len(outname) > 0:
            write_concord(outname,hits,sep)
        return(hits) #return entire hit list
    else:
        print("Search returned " + str(len(hits)) + " hits.\n Returning a random sample of " + str(nhits) + " hits")
        if len(outname) > 0:
            write_concord(outname,hits,sep)
        return(random.sample(hits,nhits)) #return the random sample
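A minimal usage sketch, assuming a hypothetical folder "my_corpus" and the search term "run" (which will match the lemma form because tokenize() lemmatizes by default):
hits = concord(tokenize(ldcorpus("my_corpus")),["run"],nhits = 10,nleft = 5,nright = 5)
for left, target, right in hits: #each hit is a [left context, target, right context] list
    print(" ".join(left),"[" + target + "]"," ".join(right))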
frequency()
The frequency() function takes a list of tokenized corpus texts (i.e., a list of lists) such as that generated by the tokenize() function and returns a frequency (or range) dictionary with linguistic items (e.g., words) as keys and frequency (or range) values as values. The frequency() function takes four arguments (three of which have default values):
- corpus_list (list of lists) This is a list consisting of tokenized texts (represented as lists of strings).
- ignore (list of strings) This is a list of strings to ignore when calculating frequency. By default, this is the pre-defined ignore_list.
- calc (string) This indicates whether the function will produce frequency or range values. The options are "freq" (default) or "range".
- normed (Boolean value) This indicates whether frequencies are normed (per million words) or represent raw frequencies. By default, this is set to False (raw frequencies are reported).
def frequency(corpus_list, ignore = ignore_list, calc = 'freq', normed = False): #options for calc are 'freq' or 'range'
    freq_dict = {} #empty dictionary
    for tokenized in corpus_list: #iterate through the tokenized texts
        if calc == 'range': #if range was selected:
            tokenized = list(set(tokenized)) #this creates a list of types (unique words)
        for token in tokenized: #iterate through each word in the texts
            if token in ignore: #if the token is in the ignore list
                continue #move on to the next word
            if token not in freq_dict: #if the token isn't already in the dictionary:
                freq_dict[token] = 1 #set the token as the key and the value as 1
            else: #if it is in the dictionary
                freq_dict[token] += 1 #add one to the count
    ### Normalization:
    if normed == True and calc == 'freq':
        corp_size = sum(freq_dict.values()) #this sums all of the values in the dictionary
        for x in freq_dict:
            freq_dict[x] = freq_dict[x]/corp_size * 1000000 #norm per million words
    elif normed == True and calc == "range":
        corp_size = len(corpus_list) #number of documents in corpus (note that this requires a list rather than a generator)
        for x in freq_dict:
            freq_dict[x] = freq_dict[x]/corp_size * 100 #create percentage (norm by 100)
    return(freq_dict)
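A minimal usage sketch with a hypothetical folder "my_corpus". Because ldcorpus() and tokenize() are generators (and are exhausted after a single pass), the corpus is re-loaded for each calculation:
freq_dict = frequency(tokenize(ldcorpus("my_corpus"))) #raw frequency counts
range_dict = frequency(tokenize(ldcorpus("my_corpus")),calc = "range") #document range counts
normed_freq = frequency(tokenize(ldcorpus("my_corpus")),normed = True) #frequency per million words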
head()
The head() function takes a dictionary of word (or n-gram) keys with a statistic (e.g., frequency) as the values and returns a sorted representation of the dictionary. The main purpose of the head() function is to quickly check any lists generated by other functions. It can print the results, save the sorted list as a Python object, or write a sorted list to file. The head() function takes six arguments (five of which have default values):
- stat_dict (dictionary) This is a dictionary of item : statistic key - value pairs (e.g., the output of the frequency() function).
- hits (integer) This is an integer indicating how many items in the sample should be printed. (Note that this is ignored if writing to a file or saving a Python object.) By default, this is set to 20.
- hsort (Boolean value) If True, the values are sorted from largest to smallest. If False, the values are sorted from smallest to largest. By default, this is set to True.
- output (Boolean value) If True, the head() function outputs a list consisting of (item, statistic) tuples, and no items are printed to the console. By default, this is set to False.
- filename (string) Providing a filename string (e.g., "my_frequency.txt") will cause the head() function to write a spreadsheet file including all items and their statistic (e.g., frequency value). By default, this is set to None, and no file is written.
- sep (string) This sets the character(s) used to separate columns when lists are written to a file. By default, this is a tab character "\t".
def head(stat_dict,hits = 20,hsort = True,output = False,filename = None, sep = "\t"):
    #first, create sorted list. Presumes that operator has been imported
    sorted_list = sorted(stat_dict.items(),key=operator.itemgetter(1),reverse = hsort)[:hits]
    if output == False and filename == None: #if we aren't writing a file or returning a list
        for x in sorted_list: #iterate through the output
            print(x[0] + "\t" + str(x[1])) #print the sorted list in a nice format
    elif filename is not None: #if a filename was provided
        outf = open(filename,"w") #create a blank file in the working directory using the filename
        for x in sorted_list: #iterate through list
            outf.write(x[0] + sep + str(x[1]) + "\n") #write each line to a file using the separator
        outf.flush() #flush the file buffer
        outf.close() #close the file
    if output == True: #if output is true
        return(sorted_list) #return the sorted list
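A minimal usage sketch, using the freq_dict created in the frequency() example above (the output filename "my_frequency.txt" is a hypothetical example):
head(freq_dict,hits = 10) #print the ten most frequent items to the console
freq_sorted = head(freq_dict,output = True) #save the sorted (item, statistic) tuples as a Python object
head(freq_dict,filename = "my_frequency.txt") #write the sorted list to a tab-delimited file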
tag()
The tag() function is a generator function that uses spaCy to tokenize, lemmatize, and tag each text in a corpus (e.g., one loaded with ldcorpus()). It yields a list of tagged tokens (word + connector + tag) for each text, and presumes that a spaCy model has been loaded as nlp. The tag() function takes nine arguments (eight of which have default values):
- corpus (list of texts) This is a list (or generator) of corpus texts (strings).
- tp (string) This is the tag type. The options are "penn" (Penn Treebank tags), "upos" (universal part of speech tags), or "dep" (dependency relationships). If set to None, tokens are not tagged. By default, this is "upos".
- lemma (Boolean value) This determines whether tokens are lemmatized. By default, this is True.
- pron (Boolean value) This determines whether spaCy's pronoun lemmatization ("-PRON-") is used. If False, the raw (lowered) form of each pronoun is used instead. By default, this is False.
- lower (Boolean value) This determines whether tokens are set to lower case when lemma = False. By default, this is True.
- connect (string) This sets the character(s) used to join each word and its tag. By default, this is "_".
- ignore (list of strings) This is a list of universal POS tags to ignore (tokens with these tags are excluded from the output). By default, this is ["PUNCT","SPACE","SYM"].
- ngram (Boolean variable or integer) This sets the n-gram length. By default, this is set to False.
- ngrm_connect (string variable) This sets the character(s) used to join words in an n-gram. By default, this is set to "__".
def tag(corpus,tp = "upos", lemma = True, pron = False, lower = True, connect = "_",ignore = ["PUNCT","SPACE","SYM"],ngram = False,ngrm_connect = "__"):
    #check to make sure a valid tag type was chosen (None is also allowed: tokens are left untagged)
    if tp not in ["penn","upos","dep",None]:
        print("Please use a valid tag type: 'penn','upos', or 'dep'")
    else:
        for text in corpus:
            doc = nlp(text) #use spacy to tokenize, lemmatize, pos tag, and parse the text
            text_list = [] #empty list for output
            for token in doc: #iterate through the tokens in the document
                if token.pos_ in ignore: #if the universal POS tag is in our ignore list, then move to next word
                    continue
                if lemma == True: #if we chose lemma (this is the default)
                    if pron == False: #if we don't want Spacy's pronoun lemmatization
                        if token.lemma_ == "-PRON-":
                            word = token.text.lower() #then use the raw form of the word
                        else:
                            word = token.lemma_ #otherwise the word form will be a lemma
                    else:
                        word = token.lemma_ #then the word form will be a lemma
                else:
                    if lower == True: #if we chose lemma = False but we want our words lowered (this is the default)
                        word = token.text.lower() #then lower the word
                    else:
                        word = token.text #if we chose lemma = False and lower = False, just give us the word
                if tp == None: #if tp = None, then just give the tokenized word (and nothing else)
                    text_list.append(word)
                else:
                    if tp == "penn":
                        tagged = token.tag_ #modified penn tag
                    elif tp == "upos":
                        tagged = token.pos_ #universal pos tag
                    elif tp == "dep":
                        tagged = token.dep_ #dependency relationship
                    tagged_token = word + connect + tagged #add word, connector ("_" by default), and tag
                    text_list.append(tagged_token) #add to list
            if ngram != False:
                text_list = ngrammer(text_list,ngram,ngrm_connect)
            yield(text_list) #yield text list
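A minimal usage sketch with a hypothetical folder "my_corpus". Note that tag() takes untokenized texts (such as those produced by ldcorpus()) and presumes that a spaCy model has been loaded as nlp:
tagged_corp = tag(ldcorpus("my_corpus")) #lemmatized, upos-tagged texts (e.g., "run_VERB")
tagged_freq = frequency(tagged_corp) #frequency of the word_tag items
head(tagged_freq,hits = 10) #print the ten most frequent word_tag items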
write_corpus()
The write_corpus() function writes a tokenized corpus (such as one generated by the tokenize() or tag() functions) to a folder, with one file per corpus document. The write_corpus() function takes four arguments (two of which have default values):
- new_dirname (string) This is the name of the folder that the new corpus files will be written to (it is created if it does not already exist).
- corpus (list of lists) This is a list (or generator) of tokenized texts.
- dirname (string or False) This is the name of the folder that the original corpus files were loaded from. If provided, the original filenames are reused for the new files. By default, this is False, and the new files are numbered sequentially (e.g., "1.txt", "2.txt").
- ending (string) This is the filename ending used when searching the original folder and when naming the new files. By default, this is "txt".
def write_corpus(new_dirname,corpus, dirname = False, ending = "txt"):
    name_list = []
    if dirname != False:
        for x in glob.glob(dirname + "/*" + ending):
            simple_name = x.split(dirsep)[-1] #split the long directory name by the file separator and take the last item (the short filename)
            name_list.append(simple_name)
    try:
        os.mkdir(new_dirname + "/") #make the new folder
    except FileExistsError: #if folder already exists, then print message
        print("Writing files to existing folder")
    for i, document in enumerate(corpus): #use enumerate to iterate through the corpus list
        if dirname == False:
            new_filename = new_dirname + "/" + str(i+1) + "." + ending #create new filename
        else:
            new_filename = new_dirname + "/" + name_list[i] #create new filename
        outf = open(new_filename,"w") #create outfile with new filename
        corpus_string = " ".join(document) #turn corpus list into string
        outf.write(corpus_string) #write corpus list
        outf.flush()
        outf.close()
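A minimal usage sketch; the folder names "my_corpus" and "my_corpus_lemmatized" are hypothetical. Passing the original dirname preserves the original filenames in the new folder:
write_corpus("my_corpus_lemmatized",tokenize(ldcorpus("my_corpus")),dirname = "my_corpus")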
dep_bigram()
The dep_bigram() function takes a corpus of (untokenized) texts, uses spaCy to parse each text, and collects dependency bigrams (dependent_head pairs) for a given dependency relationship (e.g., "dobj"). It returns a dictionary of dictionaries that includes bigram frequency ("bi_freq"), dependent frequency ("dep_freq"), head frequency ("head_freq"), bigram range ("range"), and a list of example sentences ("samples"). The dep_bigram() function takes nine arguments (seven of which have default values):
- corpus (list of texts) This is a list (or generator) of corpus texts (strings).
- dep (string) This is the dependency relationship to search for (e.g., "dobj").
- lemma (Boolean value) This determines whether the lemma form of the dependent and head is used. By default, this is True.
- lower (Boolean value) This determines whether the dependent and head are set to lower case when lemma = False. By default, this is True.
- pron (Boolean value) This determines whether spaCy's pronoun lemmatization ("-PRON-") is used. By default, this is False.
- dep_upos (string) If provided, hits are restricted to dependents with this universal POS tag. By default, this is None.
- head_upos (string) If provided, hits are restricted to heads with this universal POS tag. By default, this is None.
- dep_text (string) If provided, hits are restricted to dependents that match this string. By default, this is None.
- head_text (string) If provided, hits are restricted to heads that match this string. By default, this is None.
def dep_bigram(corpus,dep,lemma = True, lower = True, pron = False, dep_upos = None, head_upos = None, dep_text = None, head_text = None):
    bi_freq = {} #holder for dependency bigram frequency
    dep_freq = {} #holder for dependent frequency
    head_freq = {} #holder for head frequency
    range_freq = {} #holder for range information
    match_sentences = [] #holder for sentences that include matches
    def dicter(item,d): #d is a dictionary
        if item not in d:
            d[item] = 1
        else:
            d[item] += 1
    textid = 0
    for text in corpus: #iterate through corpus texts
        textid += 1
        range_list = [] #for range information
        doc = nlp(text) #tokenize, tag, and parse text using spaCy
        for sentence in doc.sents: #iterate through sentences
            index_start = 0 #for identifying sentence-level indexes later
            sent_text = [] #holder for sentence
            dep_headi = [] #list for storing [dep,head] indexes
            first_token = True #for identifying index of first token
            for token in sentence: #iterate through tokens in the sentence
                if first_token == True:
                    index_start = token.i #if this is the first token, set the index start number
                    first_token = False #then set first token to False
                sent_text.append(token.text) #add word to sentence
                if token.dep_ == dep: #if the token's dependency tag matches the one designated
                    dep_tg = token.pos_ #get upos tag for the dependent (only used if dep_upos is specified)
                    head_tg = token.head.pos_ #get upos tag for the head (only used if head_upos is specified)
                    if lemma == True: #if lemma is true, use lemma form of dependent and head
                        if pron == False: #if we don't want Spacy's pronoun lemmatization
                            if token.lemma_ == "-PRON-":
                                dependent = token.text.lower() #then use the raw form of the word
                                headt = token.head.text.lower()
                            else:
                                dependent = token.lemma_
                                headt = token.head.lemma_
                        else:
                            dependent = token.lemma_
                            headt = token.head.lemma_
                    if lemma == False: #if lemma is false, use the token form
                        if lower == True: #if lower is true, lower it
                            dependent = token.text.lower()
                            headt = token.head.text.lower()
                        else: #if lower is false, don't lower
                            dependent = token.text
                            headt = token.head.text
                    if dep_upos != None and dep_upos != dep_tg: #if dependent tag is specified and upos doesn't match, skip item
                        continue
                    if head_upos != None and head_upos != head_tg: #if head tag is specified and upos doesn't match, skip item
                        continue
                    if dep_text != None and dep_text != dependent: #if dependent text is specified and text doesn't match, skip item
                        continue
                    if head_text != None and head_text != headt: #if head text is specified and text doesn't match, skip item
                        continue
                    dep_headi.append([token.i-index_start,token.head.i-index_start]) #add sentence-level index numbers for dependent and head
                    dep_bigram = dependent + "_" + headt #create dependency bigram
                    range_list.append(dep_bigram) #add to document-level range list
                    dicter(dep_bigram,bi_freq) #add values to frequency dictionary
                    dicter(dependent,dep_freq) #add values to frequency dictionary
                    dicter(headt,head_freq) #add values to frequency dictionary
            ### this section is for creating a list of sentences that include our hits ###
            for x in dep_headi: #iterate through hits
                temp_sent = sent_text.copy() #because there may be multiple hits in each sentence (but we only want to display one hit at a time), we make a temporary copy of the sentence that we will modify
                depi = sent_text[x[0]] + "_" + dep + "_dep" #e.g., word_dobj_dep
                headi = sent_text[x[1]] + "_" + dep + "_head" #e.g., word_dobj_head
                temp_sent[x[0]] = depi #change dependent word to depi in temporary sentence
                temp_sent[x[1]] = headi #change head word to headi in temporary sentence
                temp_sent.append(str(textid)) #add text index to the sentence to indicate where the example originated
                match_sentences.append(temp_sent) #add temporary sentence to match_sentences for output
        for x in list(set(range_list)): #create a type list of the dep_bigrams in the text
            dicter(x,range_freq) #add document counts to the range_freq dictionary
    bigram_dict = {"bi_freq":bi_freq,"dep_freq":dep_freq,"head_freq": head_freq, "range":range_freq, "samples":match_sentences} #create a dictionary of dictionaries
    return(bigram_dict) #return dictionary of dictionaries
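A minimal usage sketch with a hypothetical folder "my_corpus", collecting verb + direct object ("dobj") bigrams:
bg_dict = dep_bigram(ldcorpus("my_corpus"),"dobj") #collect direct object + verb bigrams
head(bg_dict["bi_freq"],hits = 10) #print the ten most frequent dependent_head pairs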