L2-Annotation-Project

L2 Speech POS and Dependency Annotation Project

Part Of Speech (POS) tag annotation manual

This document provides an initial explanation of the POS tags used. In this project, we will be using the Penn Tagset.

If you have questions when tagging, please follow this procedure:

POS tagging Scheme

Each tag is described below, with examples.

Note that the following style is used for examples.

| Penn Tagset | Description | Example | Notes |
| --- | --- | --- | --- |
| CC | coordinating conjunction | and, but, yet | |
| CD | cardinal number | 1, two | Cardinal numbers (but not ordinal numbers such as “first”) are almost always tagged as CD. The Penn guidelines indicate that numbers can be tagged as JJ when that number is synonymous with an adjective (e.g., a 50-3 [wide/easy/handy] victory) or as an RB when the number is synonymous with an adverb (e.g., they won 50-3 [easily/handily]). These cases, however, are exceedingly rare. When in doubt, tag numbers as CD. |
| DT | determiner | the, a(n), no, every, another, any, that, these | both and all are DT when they occupy the determiner position as in all roads or both times (see PDT). |
| EX | existential there | there is | |
| FW | foreign word | d'oeuvre | |
| IN | preposition | in, of, like | IN can be categorized as either ADP or SCONJ in UPOS. |
| IN | subordinating conjunction | although, when | IN can be categorized as either ADP or SCONJ in UPOS. |
| JJ | adjective | green | |
| JJR | adjective, comparative | greener | |
| JJS | adjective, superlative | greenest | |
| LS | list marker | 1) a) B) | |
| MD | modal | could, will, may, would, shall, etc. | |
| NN | noun, singular or mass | table | |
| NNS | noun, plural | tables | |
| NNP | proper noun, singular | John, CNN | Includes acronyms (CNN, BBC, NATO, etc.) |
| NNPS | proper noun, plural | Vikings | Includes acronyms (CNN, BBC, NATO, etc.) |
| PDT | predeterminer (i.e., determiner-like elements that precede an article or possessive pronoun) | both the boys, all his marbles, both the girls, half the time, such a good time, quite a mess, rather a nuisance | See DT carefully |
| POS | possessive ending | friend's, John's, the parents' | |
| PRP | personal pronoun (includes reflexive) | I, he, it, mine, yours | Be careful not to confuse PRON (pronouns) with PROPN (proper nouns) in UPOS. |
| PRP$ | possessive pronoun | my, his | |
| RB | adverb (typically ends with -ly, but also includes degree words) | however, usually, naturally, here, well, very, too | The negative particle “not” is an RB in the Penn tagset, but it should be tagged as PART in Universal POS. |
| RBR | adverb, comparative | better | |
| RBS | adverb, superlative | best | |
| RP | particle | give up | Verbal particles are categorized as ADP in UPOS, not PART. |
| SYM | symbol (mathematical or scientific) | π, ˚C | |
| TO | to (all instances of “to”) | to go, to him | Note that in some cases, “to” is tagged as “IN” in the OntoNotes corpus. This is incorrect because “to” should always be tagged as “TO”. |
| UH | interjection | uhhuhhuhh | |
| VB | verb, base form (subsumes imperatives, infinitives, subjunctives) | take | |
| VBD | verb, past tense | took | “D” represents “-ed” |
| VBG | verb, gerund/present participle | taking | “G” represents “-ing” |
| VBN | verb, past participle | taken | “N” represents “-en” |
| VBP | verb, present, non-3rd person sing. | You take, They take, I take, etc. | “P” represents “Present tense” |
| VBZ | verb, present, 3rd person sing. | takes | “Z” represents the morpheme “-z” in 3rd person singular “s” |
| WDT | wh-determiner | which | |
| WP | wh-pronoun | who, what, whom | |
| WP$ | possessive wh-pronoun | whose | |
| WRB | wh-adverb | where, when, why | |
| $ | dollar sign | $ | |
|  | quotes | ' " | If there is a quote mark at the end of a sentence that is not previously matched (e.g., This pizza is delicious.") then do not tag it (it will be cleaned from the data later). |
| ( | Right Facing Bracket | ( | |
| ) | Left Facing Bracket | ) | |
| , | comma | , | |
| . | end sentence punctuation | . ! ? | |
| : | colons, semi-colons, ellipses, and hyphens | : ; - ... | |
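
As a rough aid for checking annotated text against the tagset above, the following Python sketch parses annotations written in the word_TAG style used in this manual's examples (e.g., T_JJ shirt_NN) and reports tags that are not in the tagset. The names PENN_TAGS, parse_tagged, and invalid_tags are hypothetical helpers introduced here for illustration; they are not part of the project's tooling, and the word_TAG input format is an assumption about how annotations are stored.

```python
# Tags listed in the table above. The quote row is omitted because its tag
# is not shown there, and ")" is assumed to be the closing-bracket tag.
PENN_TAGS = {
    "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
    "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
    "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
    "WDT", "WP", "WP$", "WRB", "$", "(", ")", ",", ".", ":",
}


def parse_tagged(text):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs.

    The tag is everything after the LAST underscore, so a word that
    itself contains an underscore is kept intact.
    """
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("_")
        if not sep:
            # No underscore at all: treat the token as untagged.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


def invalid_tags(pairs):
    """Return the (word, tag) pairs whose tag is missing or not in PENN_TAGS."""
    return [(word, tag) for word, tag in pairs if tag not in PENN_TAGS]
```

For example, invalid_tags(parse_tagged("cell_NN phone_NN")) returns an empty list, while a token with a misspelled or missing tag is returned for review. Tokens that are intentionally left untagged (such as transcription errors) would also be reported, so the output is a list of items to inspect rather than a list of confirmed mistakes.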

Dealing with L2 usage

Often, utterances will include “errors”. Following the procedures in Berzak et al. (2016), words will be tagged based on their realized form, and not on the intended one (with a few caveats). Guidelines for tagging such instances are provided below:

Some problematic tags

The following section outlines some frequent problematic cases. This list is NOT exhaustive; if your question is not answered here, refer to the following manual: PennTag POS tagging guideline.

Adverb (RB) or Particle (RP)?

The Penn tagging guidelines provide a number of tests that can be used to distinguish prepositions (IN), adverbs (RB), and particles (RP) (see pages 9, 10, 11, and 21). A very small set of these is included below:

Prepositions (IN) are directly associated with a noun phrase, while particles (RP) and adverbs (RB) are not.

You cannot insert manner adverbs (e.g., calmly) between a verb and a particle.

Hyphenated words

Hyphenated words with nouns as the root should be tagged as JJ NN such as T_JJ shirt_NN, according to the tagging guidelines (page 12). This is the case even if the unhyphenated version of the utterance should be tagged as NN NN. In the case of a transcribed corpus, this becomes arbitrary, but we will base our tags on how the utterance was transcribed.

Proper noun NNP or NN?

As per the Penn tagging guidelines, capitalized words should only be counted as proper nouns when clearly referring to a proper noun (e.g., March, New York Times). When this is unclear, tag nouns as common nouns.

Words such as “state” (as in Washington S/state) and “prefecture” should be counted as common nouns except when used as part of a name (e.g., “State Department”, “Secretary of State”).

Some problematic words/phrases

“'s”

's should be tagged as POS if used as a possessive and VBZ if used as a verb.

If 's is incorrectly used as a plural, leave the tag blank and add it to the transcription errors sheet.

“about” (RB, IN, or RP?)

between

Between should be tagged as IN.

“both”, “either”, “all”, etc. (CC, DT, or PDT?)

If both or either directly modifies a noun, it is a determiner (DT).

If it precedes an article or possessive pronoun, it is a predeterminer (PDT).

If both or either is used with a coordinating conjunction, it is CC.

“cell phone”

The guidelines are a bit unclear on how “cell phone” should be tagged. Based on the corpus, however, cell phone should always be tagged as cell_NN phone_NN.

everyday (JJ, NN, or RB?)

Everyday should be tagged as NN and not as RB following the Penn Guidelines on pages 18-19. If “everyday” directly modifies a noun, then it should be tagged as JJ. The utterance “every day” should be tagged as every_DT day_NN.
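
Some entries in this section fix a tag sequence regardless of context: between is IN, cell phone is cell_NN phone_NN, and every day is every_DT day_NN. A minimal sketch of a consistency check for such fixed cases follows. FIXED_PHRASES and check_fixed_phrases are hypothetical names introduced for illustration, and the input is assumed to be a list of (word, tag) pairs.

```python
# Fixed tag sequences stated in this manual
# (lowercased word sequence -> expected tag sequence).
FIXED_PHRASES = {
    ("between",): ("IN",),
    ("cell", "phone"): ("NN", "NN"),
    ("every", "day"): ("DT", "NN"),
}


def check_fixed_phrases(pairs):
    """Return (start_index, phrase, found_tags, expected_tags) for each mismatch.

    `pairs` is a list of (word, tag) tuples for one utterance.
    Matching is case-insensitive on the words.
    """
    words = [word.lower() for word, _ in pairs]
    tags = [tag for _, tag in pairs]
    problems = []
    for phrase, expected in FIXED_PHRASES.items():
        n = len(phrase)
        for i in range(len(words) - n + 1):
            found = tuple(tags[i:i + n])
            if tuple(words[i:i + n]) == phrase and found != expected:
                problems.append((i, phrase, found, expected))
    return problems
```

The table is deliberately small. Words whose tag depends on context (such as so, that, or first) do not belong in it, since a lookup of this kind cannot apply the tests described in this manual.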

“first” (JJ, RB or LS)

first (and other ordinal numbers) is most commonly tagged as an adjective JJ as in the first issue. When used to introduce a sentence, first is almost always tagged as RB as in First, the president was .... It can also be tagged as LS when used in a list (but this is rare and confined to short, focused lists).

“go out” (IN, RB, or RP)

Potential phrasal verbs are generally tagged inconsistently in the corpora between IN, RB, and RP.

In the construction go out, some of this inconsistency can be eliminated by the following guidelines. Note: These guidelines should not be extended to all phrasal verbs. In these guidelines, the RB is excluded for the sake of consistency. For other phrasal verbs, the RB tag should be considered.

If part of the sentence with the word out can be replaced with the word there, and the sentence retains the same meaning, then out is tagged as IN. In this context, out modifies a noun phrase.

If replacing part of the sentence containing the word out with the word there either changes the meaning of the sentence, or makes an ungrammatical sentence, then out is tagged as RP. In this context, out is a particle of a phrasal verb.

“have” (VB* or MD)

“have” has three uses, and all should be tagged with the appropriate VB* tag for its use (following the Penn guidelines, page 17 [19 in .pdf]):

Hyphenated word special mentions.

The following guidelines concern niche usage of hyphens. The goal of these guidelines is to make hyphen usage consistent in these niche situations.

In the word good - bye, both good and bye should be tagged as UH.

Ordinal numbers (e.g., first, second, third) that are preceded by a CD and a hyphen should be tagged as JJ.

If so-so has a meaning akin to okay, then both instances of so should be tagged as JJ.

“like” (VB*, IN, or UH)

“little” (JJ or RB?)

“much” (JJ or RB?)

“now” (RB or UH)

“of course” (IN NN or RB RB?)

Following the WSJ texts and the ESL corpus, “of course” should be tagged as of_RB course_RB when used as an adverbial phrase as in “Of_RB course_RB, Washington had n’t…”

In fairly rare circumstances, “of course” can also be a preposition + noun construction, as in the phrase “As a matter of_IN course_NN, we check the corpus”.

“one” (CD or NN?)

Sometimes it is unclear whether one is a cardinal number or a noun. In general, it should be tagged as a cardinal number (CD), even when it is not clearly used as a numeral.

only (JJ or RB?)

only should be tagged as JJ if it directly modifies a noun or noun phrase. If it modifies a sentence (or a verb), it should be tagged as RB.

“police” (NN or NNS?)

Some nouns like police have identical singular and plural forms. See NN vs NNS in the POS Tagging Manual for general rules in testing NN vs NNS. Below are some examples of tests.

In the construction call the police, if the word police can be replaced by the word them, it is tagged as NNS.

If police is a subject of a VBZ, then it is tagged as NN.

If police is part of a singular compound noun that can be replaced by a singular object pronoun like him, her, or it, then police is tagged as NN. The word there also works as a test for locations.

“so” (RB, CC, or IN?)

So is quite versatile and can therefore be difficult to tag. Note that so is not tagged consistently in the corpus.

So will often be used as an adverb (RB) as in “that pizza was so_RB good”

So can also be used as a coordinating (CC) or subordinating (IN) conjunction.

so is tagged as an adverb (RB) when it is the first part of a sentence.

“sort of” and “kind of”

sort of and kind of should be tagged as NN + IN in cases such as “They had some kind_NN of_IN tool”.

However, when used as an adverbial, they should be tagged as RB + RB as in “They sort_RB of_RB ran away”.

“that”

that is tagged as DT in the following circumstances:

1) that is a determiner of a noun.

2) that is a subject, and could be replaced by “it”.

When that introduces a relative clause, it is tagged as WDT. A relative clause is a clause which modifies a noun. As a result, that is most commonly tagged as WDT when it is preceded by a noun.

If the word that can be replaced by the word which, whom, or who, then it should be tagged as WDT.

that is sometimes tagged as RB when it can be replaced by the word very.

When that doesn’t meet the requirements to be tagged as DT, WDT, or RB, then it should be tagged as a subordinating conjunction IN.
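
The tests above can be read as a short decision cascade with IN as the fallback. The sketch below encodes that reading in Python. The function tag_that and its parameter names are hypothetical, each argument stands for the annotator's own answer to one of the tests, and the order in which DT, WDT, and RB are checked is an assumption (the manual only states that IN applies when none of the others do).

```python
def tag_that(is_determiner=False, is_subject_like_it=False,
             introduces_relative_clause=False, replaceable_by_very=False):
    """Return the Penn tag for 'that' given the annotator's test results.

    Each argument is a judgment made by the annotator; nothing here
    inspects the sentence itself.
    """
    # DT: 'that' determines a noun, or is a subject replaceable by 'it'.
    if is_determiner or is_subject_like_it:
        return "DT"
    # WDT: 'that' introduces a relative clause
    # (replaceable by 'which', 'whom', or 'who').
    if introduces_relative_clause:
        return "WDT"
    # RB: 'that' can be replaced by 'very'.
    if replaceable_by_very:
        return "RB"
    # Fallback: subordinating conjunction.
    return "IN"
```

For instance, an annotator who judges that in “it was not that good” to be replaceable by very would call tag_that(replaceable_by_very=True) and get RB, while calling tag_that() with no test satisfied gives IN.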

Informal contractions like wanna

Ideally, informal contractions like wanna will be transcribed as 2 separate tokens. This way, they can be tagged separately.

Transcribing an informal contraction as a single token is a transcription error. In this case, the token should be tagged based solely on the verb.

Other informal contractions include gotta and gonna.
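
Because single-token informal contractions are treated as transcription errors, it can help to locate them before tagging. The following is a minimal sketch under the assumption that an utterance is available as a list of token strings. INFORMAL_CONTRACTIONS and find_single_token_contractions are hypothetical names, and the set contains only the three forms named in this section.

```python
# The informal contractions named in this manual; extend as needed.
INFORMAL_CONTRACTIONS = {"wanna", "gotta", "gonna"}


def find_single_token_contractions(tokens):
    """Return (index, token) for each informal contraction kept as one token."""
    return [
        (i, token)
        for i, token in enumerate(tokens)
        if token.lower() in INFORMAL_CONTRACTIONS
    ]
```

Each hit is a token to be tagged based on the verb, as described above; the function only finds candidates and does not assign tags.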

“Yen” (NN or NNS)

Yen should be tagged as NNS as in “one million Yen_NNS”, unless it is explicitly used as NN as in “The value of the Yen_NN is increasing”