Introduction to POS Tagging (Part 6 - Perceptron Tagger)

(Kristopher Kyle - Updated 2021-05-21)

In this tutorial, we will work with the Perceptron Tagger, which can approach state-of-the-art accuracy, trains and tags quickly, and is reasonably simple. The original creator of this implementation (Matthew Honnibal, who created spaCy) wrote up a nice explanation of how the Perceptron Tagger works back in 2013.

The TLDR version is that a perceptron is a very simple neural net (see more in-depth explanations here and here).
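To make that idea a bit more concrete, below is a minimal sketch (illustrative only - not the implementation we will actually use, which also averages its weights over time) of how a multi-class perceptron scores a feature set and nudges its weights when it guesses wrong.

from collections import defaultdict

weights = defaultdict(lambda: defaultdict(float)) #weights[feature][tag] = weight

def predict(features, classes):
	scores = defaultdict(float)
	for feat, value in features.items(): #each feature votes for each tag according to its weight
		for tag in classes:
			scores[tag] += value * weights[feat][tag]
	return max(classes, key=lambda tag: scores[tag]) #highest-scoring tag wins

def update(features, truth, guess):
	if truth == guess: #no change if the guess was correct
		return
	for feat in features: #otherwise, reward the correct tag and penalize the wrong guess
		weights[feat][truth] += 1.0
		weights[feat][guess] -= 1.0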

In this tutorial, we will format our training data, examine the features the Perceptron Tagger uses, train two versions of the tagger, and evaluate their accuracy.

All data for this tutorial can be downloaded here. In order to follow along with the tutorial, you will want to unzip the folder, create a new .py file in it (e.g., “part_6.py”), and set your working directory.
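Setting the working directory can be done through your IDE or directly in your script (the path below is only a placeholder - replace it with the location of your unzipped folder).

import os
os.chdir("/path/to/unzipped/folder") #placeholder path - use the location of your own folder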

Getting data ready

Perceptron Tagger

Features

Two versions of the Perceptron Tagger program are included in the downloaded folder. The versions are identical except for the feature set. The simple version, which includes all of the features from tutorial 5, is shown below. As we can see from the lines that are commented out, we can add many more features. Note that this function is embedded inside the PerceptronTagger() class - it is reproduced here for illustrative purposes.

def _get_features(self, i, word, context, prev, prev2):
	'''Map tokens into a feature representation, implemented as a
	{hashable: int} dict. If the features change, a new model must be
	trained.
	'''
	def add(name, *args):
		features[' '.join((name,) + tuple(args))] += 1

	i += len(self.START)
	features = defaultdict(int)
	# It's useful to have a constant feature, which acts sort of like a prior
	#From Kris: prev2 (the second previous tag) is also available here (e.g., DT if our target tag is NN in the sequence DT JJ NN)
	add('bias')
	add('i suffix', context[i][-3:]) #current word suffix
	add('i-1 tag', prev) #previous tag
	add('i tag+i-2 tag', prev, prev2) #previous tag + second previous tag bigram
	add('i word', context[i]) #current word
# 		add('i-1 tag+i word', prev, context[i]) #previous tag+word bigram
# 		add('i-1 word', context[i-1]) #previous word
# 		add('i+1 word', context[i+1]) #next word
# 		add('i pref1', word[0]) #first letter prefix
# 		add('i pref3', context[i][:3]) #first three letters prefix
# 		add('i-2 tag', prev2) # second previous tag
# 		add('i-1 suffix', context[i-1][-3:]) #previous suffix
# 		add('i-2 word', context[i-2]) #second previous word
# 		add('i+1 suffix', context[i+1][-3:]) #next word suffix
# 		add('i+2 word', context[i+2]) #second next word

	return features
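To see what these features actually look like, here is a standalone mock-up (not part of the tagger files) that reproduces the feature-building pattern for a single token.

from collections import defaultdict

def demo_features(context, i, prev, prev2): #illustrative only
	features = defaultdict(int)
	def add(name, *args):
		features[' '.join((name,) + tuple(args))] += 1
	add('bias')
	add('i suffix', context[i][-3:]) #current word suffix
	add('i-1 tag', prev) #previous tag
	add('i tag+i-2 tag', prev, prev2) #previous tag + second previous tag bigram
	add('i word', context[i]) #current word
	return features

print(dict(demo_features(["-START-", "-START2-", "The", "quick", "fox"], 4, "JJ", "DT")))
#{'bias': 1, 'i suffix fox': 1, 'i-1 tag JJ': 1, 'i tag+i-2 tag JJ DT': 1, 'i word fox': 1}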

Other Variables We Can Adjust

The Perceptron Tagger takes advantage of an unambiguous word dictionary, which can be tuned based on the acceptable ambiguity level (by default, a word's most frequent tag must account for at least 97% of its occurrences) and the number of occurrences required before a word can be considered “unambiguous” (by default, 20). These parameters are set in the _make_tagdict() function of the PerceptronTagger() class.

def _make_tagdict(self, sentences):
	'''
	Make a tag dictionary for single-tag words.
	:param sentences: A list of list of (word, tag) tuples.
	'''
	counts = defaultdict(lambda: defaultdict(int))
	for sentence in sentences:
		self._sentences.append(sentence)
		for word, tag in sentence:
			counts[word][tag] += 1
			self.classes.add(tag)
	freq_thresh = 20 #frequency threshold
	ambiguity_thresh = 0.97 #ambiguity threshold
	for word, tag_freqs in counts.items():
		tag, mode = max(tag_freqs.items(), key=lambda item: item[1])
		n = sum(tag_freqs.values())
		# Don't add rare words to the tag dictionary
		# Only add quite unambiguous words
		if n >= freq_thresh and (mode / n) >= ambiguity_thresh:
			self.tagdict[word] = tag
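As a quick illustration of those thresholds (using made-up counts rather than real Brown data): a word seen 1,000 times, 995 of which were tagged DT, clears both hurdles and would be stored in the tag dictionary.

tag_freqs = {"DT": 995, "JJ": 5} #illustrative only: made-up counts for a hypothetical word
mode = max(tag_freqs.values()) #995 (count of the most frequent tag)
n = sum(tag_freqs.values()) #1000 (total occurrences)
print(n >= 20 and (mode / n) >= 0.97) #True -> the word would be added to self.tagdict as "DT"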

Formatting Data

To train the Perceptron Tagger, data should be formatted as a list of sentences of (word, tag) tuples. We will import the data we have used in previous tutorials, which is formatted as a list of sentences of {“word”: word, “pos”: tag} dictionaries, and then convert it.

import random
import pickle
random.seed(10) #set seed so we get the same results each time

def tupler(lolod):
	outlist = [] #output
	for sent in lolod: #iterate through sentences
		outsent = []
		for token in sent: #iterate through tokens
			outsent.append((token["word"],token["pos"])) #create tuples
		outlist.append(outsent)
	return(outlist) #return list of lists of tuples

### load data: ###
full_data = tupler(pickle.load(open("brown_sents_list.pickle","rb")))

### create training and test sets ###
train_data = random.sample(full_data,34912) #create training set with 67% of sentences

test_data = [sent for sent in full_data if sent not in train_data]
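If you want to confirm that the conversion worked, you can peek at the first converted sentence and the size of each split (this check is optional and not part of the original script).

### optional sanity check ###
print(train_data[0][:5]) #first five (word, tag) tuples of the first training sentence
print(len(train_data),len(test_data)) #number of sentences in each split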

Training and Testing the Perceptron Tagger

We will import two versions of the Perceptron Tagger that differ with regard to the number of predictor features used. Note that each model is saved in its own .py file - we can import .py files that are in our working directory by using the .py filename (without the .py extension). The first version of the tagger is saved in the file “simple_perceptron.py”.

Training the Perceptron Tagger (SimpleTron)

from simple_perceptron import PerceptronTagger as SimpleTron #import PerceptronTagger from simple_perceptron.py as SimpleTron

tagger = SimpleTron(load=False) #define tagger

tagger.train(train_data,save_loc = "small_feature_Browntrain_perceptron.pickle") #train tagger on train_data, save the model as "small_feature_Browntrain_perceptron.pickle"

#load pretrained model (if needed)
tagger = SimpleTron(load = True, PICKLE = "small_feature_Browntrain_perceptron.pickle")

Testing the Perceptron Tagger (SimpleTron)

To test the tagger, we will strip the tags from our test set and then tag one sentence at a time. As we see below, the simple version of the tagger achieves an overall (token-level) accuracy of 94.5%.

### strip tags if necessary, apply tagger
def test_tagger(test_sents,model,tag_strip = False, word_loc = 0):

	if tag_strip == True:
		sent_words = []
		for sent in test_sents:
			ws = []
			for token in sent:
				ws.append(token[word_loc])
			sent_words.append(ws)
	else:
		sent_words = test_sents

	tagged_sents = []

	for sent in sent_words:
		tagged_sents.append(model.tag(sent))

	return(tagged_sents)

def simple_accuracy_sent(gold,test): #takes a hand-tagged list of sentences (gold) and a machine-tagged version of the same sentences (test) and calculates the simple accuracy
	correct = 0 #holder for correct count
	nwords = 0 #holder for total words count

	for sent_id, sents in enumerate(gold): #iterate through sentences; enumerate() adds the index, so "sent_id" is the sentence index and "sents" is the sentence itself
		for word_id, (gold_word, gold_tag) in enumerate(sents): #iterate through the (word, tag) tuples in each sentence; "word_id" is the token index, "gold_word" is the word, and "gold_tag" is the gold-standard tag
			nwords += 1
			if gold_tag == test[sent_id][word_id][1]: #if the tag is correct, add one to the correct score
				correct +=1

	return(correct/nwords)

tagged_test = test_tagger(test_data,tagger,tag_strip = True)
print(simple_accuracy_sent(test_data,tagged_test)) #0.9453921817719865

Below, we also adapt previously used code to check the by-tag accuracy of our tagger:

def prec_rec(accuracy_dict):
	accuracy_dict["TC"] = accuracy_dict["TP"] + accuracy_dict["FN"]
	if accuracy_dict["TP"] + accuracy_dict["FN"] == 0:
		accuracy_dict["recall"] = 0
	else:
		accuracy_dict["recall"] = accuracy_dict["TP"]/(accuracy_dict["TP"] + accuracy_dict["FN"])

	if accuracy_dict["TP"] + accuracy_dict["FP"] == 0:
		accuracy_dict["precision"] = 0
	else:
		accuracy_dict["precision"] = accuracy_dict["TP"]/(accuracy_dict["TP"] + accuracy_dict["FP"])
	if accuracy_dict["precision"] == 0 and accuracy_dict["recall"] == 0:
		accuracy_dict["f1"] = 0
	else:
		accuracy_dict["f1"] = 2 * ((accuracy_dict["precision"] * accuracy_dict["recall"])/(accuracy_dict["precision"] + accuracy_dict["recall"]))

def tag_prec_rec(tested,gold):
	tag_d = {}

	for sent_id, sent in enumerate(gold):
		for idx, (word, tag) in enumerate(sent):
			### update tag dictionary as needed ###
			tested_tag = tested[sent_id][idx][1]

			if tag not in tag_d:
				tag_d[tag] = {"TP":0,"FP":0,"FN":0}
			if tested_tag not in tag_d:
				tag_d[tested_tag] = {"TP":0,"FP":0,"FN":0}

			### tabulate accuracy ###
			if tag == tested_tag:
				tag_d[tag]["TP"] += 1
			else:
				tag_d[tag]["FN"] += 1
				tag_d[tested_tag]["FP"] += 1

	for x in tag_d:
		prec_rec(tag_d[x])

	return(tag_d)
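Before applying these functions to the full test set, here is a quick check of the arithmetic using made-up counts for a single hypothetical tag (TP = 90, FP = 10, FN = 20).

demo = {"TP":90,"FP":10,"FN":20} #illustrative only: made-up counts
prec_rec(demo)
print(demo["precision"]) #0.9 (90/100)
print(demo["recall"]) #0.818... (90/110)
print(demo["f1"]) #0.857... (harmonic mean of precision and recall)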

cmplx_acc_simp = tag_prec_rec(tagged_test,test_data)

from operator import getitem #used to sort tags by their total count ("TC")

#output F1 score of top ten most frequent tags
for x in sorted(cmplx_acc_simp.items(),key=lambda x:getitem(x[1],'TC'), reverse = True)[:10]:
	print(x[0], x[1]["f1"])
NN 0.9260796954740957
IN 0.9674883035869
DT 0.9862300154718927
JJ 0.8778548600826253
NNP 0.8577973386868453
, 0.9999479220914488
NNS 0.9604092289657377
. 0.9999725101025374
RB 0.9083532219570406
PRP 0.9856693014587752

Training and Testing the Perceptron Tagger (FullTron)

Now, we will use the “fully” featured version of the Perceptron Tagger to increase the accuracy of our system. To reach fully state-of-the-art accuracy (97-98%), we would need more training data - but our model below comes very close (96.27% overall accuracy).

from full_perceptron import PerceptronTagger as FullTron

tagger2 = FullTron(load=False)

tagger2.train(train_data,save_loc = "full_feature_Browntrain_perceptron.pickle")

#load pretrained model (if needed)
tagger2 = FullTron(load = True, PICKLE = "full_feature_Browntrain_perceptron.pickle")

tagged_test2 = test_tagger(test_data,tagger2,tag_strip = True)

print(simple_accuracy_sent(test_data,tagged_test2)) #test 1 (small set): 0.9627081205245337

cmplx_acc = tag_prec_rec(tagged_test2,test_data)

for x in sorted(cmplx_acc.items(),key=lambda x:getitem(x[1],'TC'), reverse = True)[:10]:
	print(x[0], x[1]["f1"])
NN 0.9523881229118629
IN 0.9777645882107253
DT 0.9907638251973182
JJ 0.9085173501577287
NNP 0.9416097300964286
, 0.9999479220914488
NNS 0.9753486863444698
. 0.9999725101025374
RB 0.929699795977849
PRP 0.9927594529364441

Implementing a Fully Functional Tagger

We can easily implement a full tagging pipeline by borrowing a sentence tokenizer and word tokenizer from the Natural Language Toolkit (NLTK). There are certainly more accurate tokenizers out there, but the default implementations in NLTK are both quick and reasonably accurate.
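Note that NLTK's default tokenizers rely on the "punkt" models; if you have not used them before, you may need to download them once.

import nltk
nltk.download('punkt') #one-time download of the tokenizer models used by sent_tokenize()/word_tokenize()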

from nltk.tokenize import sent_tokenize, word_tokenize

def tag_strings(input_string,trained_tagger):
	tagged_sents = []

	sents = sent_tokenize(input_string) #use nltk sent tokenize to separate strings into sentences
	for sent in sents:
		tagged_sents.append(trained_tagger.tag(word_tokenize(sent)))

	return(tagged_sents)

for sents in tag_strings("I really love pizza. Do you love pizza?", tagger2):
	print(sents)
[('I', 'PRP'), ('really', 'RB'), ('love', 'VBP'), ('pizza', 'NN'), ('.', '.')]
[('Do', 'VBP'), ('you', 'PRP'), ('love', 'VB'), ('pizza', 'NN'), ('?', '.')]