Introduction to POS Tagging (Part 7 - Beyond English)
(Kristopher Kyle - Updated 2021-05-26)
In this tutorial we apply the Perceptron Tagger to new datasets and languages. We will use data from https://universaldependencies.org/, which includes a large number of datasets in .conllu format across a wide range of languages.
It should be noted that the universal dependency datasets are tagged with “universal” part of speech tags, which tend to be less specific than language-specific tagsets. In many cases, however, the datasets also include language-specific tags (such as Penn Treebank tags for English).
Dealing with .conllu files
In order to use these datasets, we will need to extract pertinent information from .conllu files, which are formatted as follows:
# sent_id = weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000-0006
# text = The third was being run by the head of an investment firm.
1 The the DET DT Definite=Def|PronType=Art 2 det 2:det _
2 third third ADJ JJ Degree=Pos|NumType=Ord 5 nsubj:pass 5:nsubj:pass _
3 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 5 aux 5:aux _
4 being be AUX VBG VerbForm=Ger 5 aux:pass 5:aux:pass _
5 run run VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root 0:root _
6 by by ADP IN _ 8 case 8:case _
7 the the DET DT Definite=Def|PronType=Art 8 det 8:det _
8 head head NOUN NN Number=Sing 5 obl 5:obl:by _
9 of of ADP IN _ 12 case 12:case _
10 an a DET DT Definite=Ind|PronType=Art 12 det 12:det _
11 investment investment NOUN NN Number=Sing 12 compound 12:compound _
12 firm firm NOUN NN Number=Sing 8 nmod 8:nmod:of SpaceAfter=No
13 . . PUNCT . _ 5 punct 5:punct _
# sent_id = weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000-0007
# text = You wonder if he was manipulating the market with his bombing targets.
1 You you PRON PRP Case=Nom|Person=2|PronType=Prs 2 nsubj 2:nsubj _
2 wonder wonder VERB VBP Mood=Ind|Tense=Pres|VerbForm=Fin 0 root 0:root _
3 if if SCONJ IN _ 6 mark 6:mark _
4 he he PRON PRP Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 6 nsubj 6:nsubj _
5 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 6 aux 6:aux _
6 manipulating manipulate VERB VBG Tense=Pres|VerbForm=Part 2 ccomp 2:ccomp _
7 the the DET DT Definite=Def|PronType=Art 8 det 8:det _
8 market market NOUN NN Number=Sing 6 obj 6:obj _
9 with with ADP IN _ 12 case 12:case _
10 his he PRON PRP$ Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs 12 nmod:poss 12:nmod:poss _
11 bombing bombing NOUN NN Number=Sing 12 compound 12:compound _
12 targets target NOUN NNS Number=Plur 6 obl 6:obl:with SpaceAfter=No
13 . . PUNCT . _ 2 punct 2:punct _
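For reference, the ten fields in each token line are: ID, FORM (the word form), LEMMA, UPOS (the universal POS tag), XPOS (the language-specific POS tag), FEATS (morphological features), HEAD, DEPREL, DEPS, and MISC.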
With the exception of lines that start with “#”, each item in a line is separated by a tab character ("\t"), which makes it easy to extract the information we need. The function below extracts the word and universal POS tag (upos) from each line and exports a list of sentences of token dictionaries:
def conllu_dicter(text, splitter="\t"):
    output_list = [] #list of sentences, each a list of token dictionaries
    sents = text.split("\n\n") #split text into sentences
    for sent in sents:
        if sent == "":
            continue
        sent_anno = [] #sentence-level list for tokens
        lines = sent.split("\n")
        for line in lines:
            if len(line) == 0:
                continue
            if line[0] == "#": #skip comment lines, which don't include target annotation
                continue
            token = {}
            anno = line.split(splitter) #split by the splitting character (in our example, a tab character)
            #now, we grab the relevant information and add it to our token dictionary:
            #token["idx"] = anno[0]
            token["word"] = anno[1] #get the word
            token["upos"] = anno[3] #get the universal pos tag
            #token["xpos"] = anno[4] #get the xpos tag(s) - in English these are usually Penn tags
            #token["dep"] = anno[7] #dependency relationship
            #token["head_idx"] = anno[6] #id of dependency head
            sent_anno.append(token) #append token dictionary to sentence-level list
        output_list.append(sent_anno) #append sentence-level list to output list
    return(output_list)
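As a quick sanity check, we can run conllu_dicter() on a small sample. Note that the two-token sample string below is invented for illustration (real .conllu files are tab-delimited, so we use "\t" in the string):
sample = "# text = Hello world\n1\tHello\thello\tINTJ\tUH\t_\t0\troot\t0:root\t_\n2\tworld\tworld\tNOUN\tNN\t_\t1\tvocative\t1:vocative\t_"
print(conllu_dicter(sample))
#[[{'word': 'Hello', 'upos': 'INTJ'}, {'word': 'world', 'upos': 'NOUN'}]]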
Because we are going to use the Perceptron Tagger, we will need to further convert the data into a list of sentences of (word, upos) tuples. To do so, we will use a slightly modified version of the tupler() function from Tutorial 6:
def tupler(lolod, posname = "pos"): #lolod = list of lists of token dictionaries
    outlist = [] #output
    for sent in lolod: #iterate through sentences
        outsent = []
        for token in sent: #iterate through tokens
            outsent.append((token["word"], token[posname])) #create (word, tag) tuples
        outlist.append(outsent)
    return(outlist) #return list of lists of tuples
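Continuing the invented sample from above, tupler() converts the token dictionaries into (word, upos) tuples:
print(tupler(conllu_dicter(sample), posname = "upos"))
#[[('Hello', 'INTJ'), ('world', 'NOUN')]]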
Loading the Perceptron Tagger and Related Functions
Below, we load the Perceptron Tagger and some related functions from Tutorial 6. Don’t forget to put simple_perceptron.py and full_perceptron.py in your working directory!
from simple_perceptron import PerceptronTagger as SimpleTron #import PerceptronTagger from simple_perceptron.py as SimpleTron
from full_perceptron import PerceptronTagger as FullTron
### strip tags if necessary, then apply tagger
def test_tagger(test_sents, model, tag_strip = False, word_loc = 0):
    if tag_strip == True: #strip the gold tags, keeping only the words
        sent_words = []
        for sent in test_sents:
            ws = []
            for token in sent:
                ws.append(token[word_loc]) #get the word from each (word, tag) tuple
            sent_words.append(ws)
    else:
        sent_words = test_sents
    tagged_sents = []
    for sent in sent_words:
        tagged_sents.append(model.tag(sent)) #tag each sentence
    return(tagged_sents)
def simple_accuracy_sent(gold, test): #takes a hand-tagged list (gold) and a machine-tagged list (test) and calculates simple accuracy
    correct = 0 #holder for correct count
    nwords = 0 #holder for total word count
    for sent_id, sents in enumerate(gold): #iterate through sentences; enumerate() adds the index, so "sent_id" is the index and "sents" is the sentence
        for word_id, (gold_word, gold_tag) in enumerate(sents): #iterate through the (word, tag) tuples in each sentence; "word_id" is the index
            nwords += 1
            if gold_tag == test[sent_id][word_id][1]: #if the gold tag matches the machine-assigned tag, add one to the correct count
                correct += 1
    return(correct/nwords)
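To see how simple_accuracy_sent() behaves, here is a toy example with invented tags. One of the four test tags is wrong, so the accuracy is 0.75:
toy_gold = [[("the","DET"),("dog","NOUN")],[("run","VERB"),("!","PUNCT")]]
toy_test = [[("the","DET"),("dog","VERB")],[("run","VERB"),("!","PUNCT")]]
print(simple_accuracy_sent(toy_gold,toy_test)) #0.75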
Testing the Perceptron Tagger on New Datasets
We will test the “simple” and the “full” versions of the Perceptron tagger on three languages:
- English (using the 254k-word English Web Treebank)
- Korean (using the 350k-word KAIST Treebank)
- Thai (using the 22k-word Parallel Universal Dependencies (PUD) treebank)
English
First, we will train a model based on the simple feature set. This model achieves 90% accuracy for universal pos tags on the test set.
#load English data
english_train = conllu_dicter(open("UD_English-EWT/en_ewt-ud-train.conllu").read())
english_test = conllu_dicter(open("UD_English-EWT/en_ewt-ud-test.conllu").read())
#convert from token dictionaries to token tuples:
english_train = tupler(english_train,posname = "upos")
english_test = tupler(english_test,posname = "upos")
#initialize and train tagger:
tagger_en = SimpleTron(load=False) #define tagger
tagger_en.train(english_train,save_loc = "small_feature_EWT_train_perceptron.pickle") #train tagger on the training data and save the model as "small_feature_EWT_train_perceptron.pickle"
#load pretrained model (if needed)
#tagger_en = SimpleTron(load = True, PICKLE = "small_feature_EWT_train_perceptron.pickle")
#check macro accuracy
tagged_en_test = test_tagger(english_test,tagger_en,tag_strip = True)
print(simple_accuracy_sent(english_test,tagged_en_test)) #0.9000589275191514
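Once trained, the model can also be used to tag new (pre-tokenized) sentences directly via its tag() method (the same method test_tagger() uses internally). The sentence below is just an invented example, and the predicted tags may vary:
print(tagger_en.tag(["This","is","a","sample","sentence","."]))
#a list of (word, tag) tuples, e.g., [('This', 'PRON'), ('is', 'AUX'), ...]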
Next, we will test the “full” version of the tagger. This model achieves 93.6% accuracy on the test set. While this is lower than the accuracy we achieved with the Brown corpus, we are also using a substantially smaller training set.
tagger2_en = FullTron(load=False)
tagger2_en.train(english_train,save_loc = "full_feature_EWT_train_perceptron.pickle")
#load pretrained model (if needed)
#tagger2_en = FullTron(load = True, PICKLE = "full_feature_EWT_train_perceptron.pickle")
tagged_en_test2 = test_tagger(english_test,tagger2_en,tag_strip = True)
print(simple_accuracy_sent(english_test,tagged_en_test2)) #0.936436849341976
Korean
First, we will train a model based on the simple feature set. This model achieves 84.7% accuracy for universal pos tags on the test set. Despite having more training data, the model is less accurate on the Korean data than on the English data. This may mean that the feature set (or prediction algorithm) needs to be tuned for Korean. Note that if you get an encoding error, you will have to explicitly set “utf-8” as the encoding when using the open() function (e.g., open(filename, encoding = "utf-8")).
#load Korean data and convert to token tuples
korean_train = tupler(conllu_dicter(open("UD_Korean-Kaist/ko_kaist-ud-train.conllu").read()),posname = "upos")
korean_test = tupler(conllu_dicter(open("UD_Korean-Kaist/ko_kaist-ud-test.conllu").read()),posname = "upos")
tagger_ko = SimpleTron(load=False) #define tagger
tagger_ko.train(korean_train,save_loc = "small_feature_Kaist_train_perceptron.pickle") #train tagger on the training data and save the model as "small_feature_Kaist_train_perceptron.pickle"
#load pretrained model (if needed)
#tagger_ko = SimpleTron(load = True, PICKLE = "small_feature_Kaist_train_perceptron.pickle")
tagged_ko_test = test_tagger(korean_test,tagger_ko,tag_strip = True)
print(simple_accuracy_sent(korean_test,tagged_ko_test)) #0.847846012832264
Next, we will test the “full” version of the tagger. This model achieves 85.9% accuracy on the test set data.
tagger2_ko = FullTron(load=False)
tagger2_ko.train(korean_train,save_loc = "full_feature_Kaist_train_perceptron.pickle")
#load pretrained model (if needed)
#tagger2_ko = FullTron(load = True, PICKLE = "full_feature_Kaist_train_perceptron.pickle")
tagged_ko_test2 = test_tagger(korean_test,tagger2_ko,tag_strip = True)
print(simple_accuracy_sent(korean_test,tagged_ko_test2)) #test 1 (small set): 0.8593386448565183
Thai
First, we will train a model based on the simple feature set. This model achieves 85.8% accuracy for universal pos tags on the test set, which is rather impressive given the (relatively) tiny training set.
#Thai only comes with a single dataset, so we will have to divide it ourselves
thai_full = tupler(conllu_dicter(open("UD_Thai-PUD/th_pud-ud-test.conllu").read()),posname = "upos")
thai_train = thai_full[:667] #first 667 sentences (roughly two thirds) for training
thai_test = thai_full[667:] #remaining sentences for testing
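Note that we simply use the first two thirds of the sentences for training and the rest for testing. If you prefer a randomized split, you could shuffle the sentences first, as in the sketch below (note that this would change the accuracy figures reported here):
import random
random.seed(10) #set a seed so the split is reproducible; the seed value is arbitrary
random.shuffle(thai_full) #shuffle the sentences in place before slicing
thai_train = thai_full[:667]
thai_test = thai_full[667:]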
tagger_th = SimpleTron(load=False) #define tagger
tagger_th.train(thai_train,save_loc = "small_feature_ThaiPUD_train_perceptron.pickle") #train tagger on the training data and save the model as "small_feature_ThaiPUD_train_perceptron.pickle"
#load pretrained model (if needed)
#tagger_th = SimpleTron(load = True, PICKLE = "small_feature_ThaiPUD_train_perceptron.pickle")
tagged_th_test = test_tagger(thai_test,tagger_th,tag_strip = True)
print(simple_accuracy_sent(thai_test,tagged_th_test)) #0.8582687683676196
Next, we will test the “full” version of the tagger. This model achieves 88.1% accuracy on the test set data, which again is impressive given the small size of the training data.
tagger2_th = FullTron(load=False)
tagger2_th.train(thai_train,save_loc = "full_feature_ThaiPUD_train_perceptron.pickle")
#load pretrained model (if needed)
#tagger2_th = FullTron(load = True, PICKLE = "full_feature_ThaiPUD_train_perceptron.pickle")
tagged_th_test2 = test_tagger(thai_test,tagger2_th,tag_strip = True)
print(simple_accuracy_sent(thai_test,tagged_th_test2)) #test 1 (small set): 0.8815121560245792
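In sum, the full feature set outperformed the simple feature set on all three datasets:
- English (EWT): 90.0% (simple) vs. 93.6% (full)
- Korean (KAIST): 84.7% (simple) vs. 85.9% (full)
- Thai (PUD): 85.8% (simple) vs. 88.1% (full)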