Skip to the content.

Corpus-toolkit

The corpus-toolkit package grew out of courses in corpus linguistics and learner corpus research. The toolkit attempts to balance simplicity of use, broad application, and scalability. Common corpus analyses such as the calculation of word and n-gram frequency and range, keyness, and collocation are included. In addition, more advanced analyses such as the identification of dependency bigrams (e.g., verb-direct object combinations) and their frequency, range, and strength of association are also included.

More details on each function in the package (including various option settings) can be found on the corpus-toolkit resource page.

Install corpus-toolkit

The package can be downloaded using pip

pip install corpus-toolkit

Dependencies

The corpus-toolkit package makes use of Spacy for tagging and parsing. However, the package also includes a tokenization and lemmatization function that does not require Spacy. If you want to tag or parse your files, you will need to install Spacy (and an appropriate Spacy language model).

pip install -U spacy
python -m spacy download en_core_web_sm

Quickstart guide

There are three corpus pre-processing options. The first is to use the tokenize() function, which does not rely on a part of speech tagger. The second is to use the tag() function, which uses Spacy to tokenize and tag the corpus. The third option is to pre-process the corpus in any way you like before using the other functions of the corpus-toolkit package.

This tutorial presumes that you have downloaded and extracted the brown_single.zip, which is a version of the Brown corpus. The folder “brown_single” should be in your working directory.

Load, tokenize, and generate a frequency list

from corpus_toolkit import corpus_tools as ct
brown_corp = ct.ldcorpus("brown_single") #load and read corpus
tok_corp = ct.tokenize(brown_corp) #tokenize corpus - by default this lemmatizes as well
brown_freq = ct.frequency(tok_corp) #creates a frequency dictionary
#note that range can be calculated instead of frequency using the argument calc = "range"
ct.head(brown_freq, hits = 10) #print top 10 items

the     69836
be      37689
of      36365
a       30475
and     28826
to      26126
in      21318
he      19417
have    11938
it      10932

The functions ldcorpus() and tokenize() are Python generators, which means that they must be re-declared each time they are used (iterated over). A slightly messier (but more appropriate) way to achieve the results above is to nest the commands.

brown_freq = ct.frequency(ct.tokenize(ct.ldcorpus("brown_single")))
ct.head(brown_freq, hits = 10)
the     69836
be      37689
of      36365
a       30475
and     28826
to      26126
in      21318
he      19417
have    11938
it      10932

Note that the frequency() function can also calculate range and normalized frequency figures. See the resource page for details.

Generate concordance lines

Concordance lines can be generated using the concord() function. By default, a random sample of 25 hits will be generated, with 10 tokens of left and right context.

conc_results1 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False),["run","ran","running","runs"],nhits = 10)
for x in conc_results1:
	print(x)
[['buckle', 'drag', 'the', 'wagons', 'to', 'the', 'spring', 'lew', 'durkin', 'yelled'], 'run', ['em', 'right', 'into', 'the', 'spring', 'hustle', 'one', 'of', 'the', 'wagons']]
[['his', 'sweater', 'soaking', 'into', 'a', 'dark', 'streak', 'of', 'dirt', 'that'], 'ran', ['diagonally', 'across', 'the', 'white', 'wool', 'on', 'his', 'shoulder', 'as', 'though']]
[['took', 'a', 'hasty', 'shot', 'then', 'fled', 'without', 'knowing', 'the', 'result'], 'ran', ['until', 'breath', 'was', 'a', 'pain', 'in', 'his', 'chest', 'and', 'his']]
[['back', 'to', 'new', 'york', 'as', 'maude', 'suggested', 'she', 'would', 'nt'], 'run', ['like', 'a', 'scared', 'cat', 'but', 'well', 'she', 'd', 'be', 'very']]
[['with', 'that', 'soap', 'i', 'was', 'loaded', 'with', 'suds', 'when', 'i'], 'ran', ['away', 'and', 'i', 'have', 'nt', 'had', 'a', 'chance', 'to', 'wash']]
[['conditions', 'of', 'international', 'law', 'are', 'met', 'countries', 'that', 'try', 'to'], 'run', ['the', 'blockade', 'do', 'so', 'at', 'their', 'own', 'risk', 'blockade', 'runners']]
[['produce', 'something', 'which', 'has', 'not', 'previously', 'existed', 'thus', 'creativity', 'may'], 'run', ['all', 'the', 'way', 'from', 'making', 'a', 'cake', 'building', 'a', 'chicken']]
[['from', 'the', 'school', 'he', 'did', 'nt', 'look', 'back', 'and', 'he'], 'ran', ['until', 'he', 'was', 'out', 'of', 'sight', 'of', 'the', 'schoolhouse', 'and']]
[['in', 'my', 'body', 'i', 'could', 'light', 'all', 'the', 'lights', 'and'], 'run', ['all', 'the', 'factories', 'in', 'the', 'entire', 'united', 'states', 'for', 'some']]
[['in', 'any', 'time', 'they', 'please', 'sergeant', 'no', 'sir', 'running', 'in'], 'running', ['out', 'ca', 'nt', 'have', 'it', 'makes', 'for', 'confusion', 'and', 'congestion']]

Collocates can also be added as secondary search terms:

conc_results2 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False),["run","ran","running","runs"],collocates = ["quick","quickly"], nhits = 10)
for x in conc_results2:
	print(x)
[['range', 'and', 'in', 'marlin', 's', 'underground', 'test', 'gallery', 'we', 'quickly'], 'ran', ['into', 'the', 'same', 'trouble', 'that', 'plagued', 'bill', 'ruger', 'in', 'his']]
[['s', 'nest', 'to', 'the', 'rocky', 'ribs', 'of', 'the', 'canyonside', 'russ'], 'ran', ['up', 'the', 'steps', 'quickly', 'to', 'the', 'plank', 'porch', 'the', 'front']]
[['hands', 'and', 'feet', 'keeping', 'the', 'hands', 'in', 'the', 'starting', 'position'], 'run', ['in', 'place', 'to', 'a', 'quick', 'rhythm', 'after', 'this', 'has', 'become']]
[['engine', 'up', 'to', 'operating', 'temperature', 'quickly', 'and', 'to', 'keep', 'it'], 'running', ['at', 'its', 'most', 'efficient', 'temperature', 'through', 'the', 'proper', 'circulation', 'of']]

Search terms (and collocate search terms) can also be interpreted as regular expressions:

conc_results3 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False),["run.*","ran"],collocates = ["quick.*"], nhits = 10, regex = True)
for x in conc_results3:
	print(x)
[['impact', 'we', 'fired', 'this', 'little', '20-inch-barrel', 'job', 'on', 'my', 'home'], 'range', ['and', 'in', 'marlin', 's', 'underground', 'test', 'gallery', 'we', 'quickly', 'ran']]
[['range', 'and', 'in', 'marlin', 's', 'underground', 'test', 'gallery', 'we', 'quickly'], 'ran', ['into', 'the', 'same', 'trouble', 'that', 'plagued', 'bill', 'ruger', 'in', 'his']]
[['minutes', 'the', 'gallery', 'leaders', 'had', 'given', 'the', 'students', 'a', 'quick'], 'rundown', ['on', 'art', 'from', 'the', 'renaissance', 'to', 'the', 'late', '19th', 'century']]
[['s', 'nest', 'to', 'the', 'rocky', 'ribs', 'of', 'the', 'canyonside', 'russ'], 'ran', ['up', 'the', 'steps', 'quickly', 'to', 'the', 'plank', 'porch', 'the', 'front']]
[['hands', 'and', 'feet', 'keeping', 'the', 'hands', 'in', 'the', 'starting', 'position'], 'run', ['in', 'place', 'to', 'a', 'quick', 'rhythm', 'after', 'this', 'has', 'become']]
[['engine', 'up', 'to', 'operating', 'temperature', 'quickly', 'and', 'to', 'keep', 'it'], 'running', ['at', 'its', 'most', 'efficient', 'temperature', 'through', 'the', 'proper', 'circulation', 'of']]

Concordance lines can also be written to a file for easier analysis (e.g., using spreadsheet software). By default, items are separated by tab characters (“\t”).

#write concordance lines to a file called "run_25.txt"
conc_results4 = ct.concord(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False),["run","ran","running","runs"], outname = "run_25.txt")

Create a tagged version of your corpus

The most efficient way to conduct multiple analyses with a tagged corpus is to write a tagged version of your corpus to file and then conduct subsequent analyses with the tagged files. If this is not possible for some reason, one can always run the tagger each time an analysis is conducted.

tagged_brown = ct.tag(ct.ldcorpus("brown_single"))
ct.write_corpus("tagged_brown_single",tagged_brown) #the first argument is the folder where the tagged files will be written

The function tag() is also a Python generator, so the preferred way to write a corpus is:

ct.write_corpus("tagged_brown_single",ct.tag(ct.ldcorpus("brown_single")))

Now, we can reload our tagged corpus using the reload() function and generate a part of speech sensitive frequency list.

tagged_freq = ct.frequency(ct.reload("tagged_brown_single"))
ct.head(tagged_freq, hits = 10)
the_DET 69861
be_VERB 37800
of_ADP  36322
and_CCONJ       28889
a_DET   23069
in_ADP  20967
to_PART 15409
have_VERB       11978
to_ADP  10800
he_PRON 9801

Collocation

Use the collocator() function to find collocates for a particular word.

collocates = ct.collocator(ct.tokenize(ct.ldcorpus("brown_single")),"go",stat = "MI")
#stat options include: "MI", "T", "freq", "left", and "right"

ct.head(collocates, hits = 10)
downstairs      7.875170389265524
upstairs        6.915812373762869
bedroom 6.627242875821938
abroad  6.273134375185426
re      6.21620730710059
m       6.211322724303333
forever 6.174730671124432
stanley 6.174730671124432
let     5.938347287580174
wrong   5.868744120106091

Keyness

Keyness is calculated using two frequency dictionaries (consisting of raw frequency values). Only effect sizes are reported (p values are arguably not particularly useful for keyness analyses). Keyness calculation options include “log-ratio”, “%diff”, and “odds-ratio”.

#First, generate frequency lists for each corpus
corp1freq = ct.frequency(ct.tokenize(ct.ldcorpus("corp1")))
corp2freq = ct.frequency(ct.tokenize(ct.ldcorpus("corp2")))

#then calculate Keyness
corp_key = ct.keyness(corp1freq,corp2freq, effect = "log-ratio")
ct.head(corp_key, hits = 10) #to display top hits

N-grams

N-grams are contiguous sequences of n words. The tokenize() function can be used to create an n-gram version of a corpus by employing the ngram argument. By default, words in an n-gram are separated by two underscores “__”

trigramfreq = ct.frequency(ct.tokenize(ct.ldcorpus("brown_single"),lemma = False, ngram = 3))
ct.head(trigramfreq, hits = 10)
one__of__the    404
the__united__states     339
as__well__as    237
some__of__the   179
out__of__the    172
the__fact__that 167
i__do__nt       162
the__end__of    149
part__of__the   144
it__was__a      143

Dependency bigrams

Dependency bigrams consist of two words that are syntactically connected via a head-dependent relationship. For example, in the clause “The player kicked the ball”, the main verb kicked is connected to the noun ball via a direct object relationship, wherein kicked is the head and ball is the dependent.

The function dep_bigram() generates frequency dictionaries for the dependent, the head, and the dependency bigram. In addition, range is calculated along with a complete list of sentences in which the relationship occurs.

bg_dict = ct.dep_bigram(ct.ldcorpus("brown_single"),"dobj")
ct.head(bg_dict["bi_freq"], hits = 10)
#other keys include "dep_freq", "head_freq", and "range"
#also note that the key "samples" can be used to obtain a list of sample sentences
#but, this is not compatible with the ct.head() function (see ct.dep_conc() instead)
#all dependency bigrams are formatted as dependent_head
what_do 247
place_take      84
what_say        80
him_told        67
it_do   63
that_do 51
time_have       49
what_mean       46
this_do 46
what_call       42

Strength of association

Various measures of strength of association can calculated between dependents and heads. The soa() function takes a dictionary generated by the dep_bigram() function and calculates the strength of association for each dependency bigram.

soa_mi = ct.soa(bg_dict,stat = "MI")
#other stat options include: "T", "faith_dep", "faith_head","dp_dep", and "dp_head"
ct.head(soa_mi, hits = 10)
radiation_ionize        12.037110123486007
B_paragraph     12.037110123486007
suicide_commit  10.648544835568353
nose_scratch    10.39700606857239
calendar_adjust 9.972979786066292
imagination_capture     9.774075717652213
nose_blow       9.672113306706759
English_speak   9.496541742123304
throat_clear    9.367258725178337
expense_deduct  9.256227412789594

Concordance lines for dependency bigrams

A number of excellent cross-platform GUI- based concordancers such as AntConc are freely available, and are likely the preferred method for most concordancing.

However, it is difficult to get concordance lines for dependency bigrams without a more advanced program. The dep_conc() function takes the samples generated by the dep_bigram() function and creates a random sample of hits (50 hits by default) formatted as an html file.

The following example will write an html file named “dobj_results.html” to your working directory.

ct.dep_conc(bg_dict["samples"],"dobj_results")

When opened, the resulting file will include the following:

A fringe of housing and gardens bearded_dobj_head the top_dobj_dep of the heights , and behind it were sandy roads leading past farms and hayfields . 39

A man with insomnia had better avoid_dobj_head bad dreams_dobj_dep of that kind if he knew what was good for him . 241

He simply would not work_dobj_head his arithmetic problems_dobj_dep when the teacher held his class . 192

You may be sure he marries her in the end and has_dobj_head a fine old knockdown fight_dobj_dep with the brother , and that there are plenty of minor scraps along the way to ensure that you understand what the word Donnybrook means . 198

Anyone familiar with the details of the McClellan hearings must at once realize that the sweetheart arrangements augmented_dobj_head employer profits_dobj_dep far more than they augmented the earnings of the corruptible labor leaders . 407

If the transferor has_dobj_head substantial assets_dobj_dep other than the claim , it seems reasonable to assume no corporation would be willing to acquire all of its properties in the dim hope of collecting a claim for refund of taxes . 433

For the first few months of their marriage she had tried to be nice about Gunny , going out with him to watch_dobj_head this pearl_dobj_dep without price stamp imperiously around in her stall . 441

If the site is on a reservoir , the level of the water at various seasons as it affects_dobj_head recreation_dobj_dep should be studied . 471

She thrust forward through the shadows and the trees that resisted_dobj_head her_dobj_dep and tried to fling her back . 226

The most infamous of all was launched by the explosion of the island of Krakatoa in 1883 ; ; it raced across the Pacific at 300 miles an hour , devastated_dobj_head the coasts_dobj_dep of Java and Sumatra with waves 100 to 130 feet high , and pounded the shore as far away as San Francisco . 40