Corpus Linguistics Working Group
Spring 2021
University of Oregon, Linguistics Department
Spring 2021 Schedule
- Week 1: Getting Organized
- Week 2: Introduction to Part of Speech (POS) Tagging
  - Overview of how POS tagging works
  - Begin building a basic POS tagger
  - Access POS Tagging Tutorial 1 here
  - Note: if you need to brush up on your Python skills, check out this tutorial
- Weeks 3-4: Introduction to POS Tagging (pt. 2)
  - Building better feature sets
  - Building simple tagging algorithms
  - Access POS Tagging Tutorial 2 here
- Week 5: Introduction to POS Tagging (pt. 3)
  - Looking at accuracy (precision and recall)
  - Access POS Tagging Tutorial 3 here
- Week 6: POS Tagging with Machine Learning
- Week 7 (Kris, TBD): POS Tagging with Machine Learning (part 2)
- Week 8 (Kris, TBD): State of the Art POS Tagging with (simple) Neural Nets
- Week 9 (Kris): Dealing with Languages other than English
- Week 10 (TBD): TBD
Winter 2021 Schedule
- Week 1 (Kris): Getting Organized
  - Homework:
    - Sign up for a GitHub account
    - Create a sample repository
    - Take a look at the Markdown guidelines (I promise, it is super easy)
- Week 2 (Wesley): Prospects for a (Semi-)Automated Papuan Comparative Linguistics and Reconstruction (Hammarström, 2019)
- Week 3 (Masaki): A corpus-driven approach to formulaic language in English: Multi-word patterns in speech and writing (Biber, 2009). Note: skim this article for background; the workshop will cover extracting formulaic language beyond n-grams.
- Week 4 (Keegan): Annotating a Polysynthetic Language (Galves et al., 2017)
- Week 5 (Cece): Corpora, Databases, and Internet Resources:
  - Corpus Phonology with Speech Resources
  - Using the Internet for Collecting Phonological Data
  - Speech Manipulation, Synthesis, and Automatic Recognition in Laboratory Phonology
  - Phonotactic Patterns in Lexical Corpora
- Week 6 (Shayleen): Corpus linguistics and language documentation: challenges for collaboration (Cox, 2011)
- Week 7 (Brittany): Annotating the ICE corpora pragmatically – preliminary issues & steps (Weisser, 2017)
- Week 8 (Ksenia): Distributional Semantics article (Boleda, 2019) + word2vec coding in class
- Week 9 (Min): Min will lead us through a collexeme analysis. For more on collexeme analysis, see Gries et al. (2005), focusing on Study 2, and/or Hilpert (2006), which is a shorter paper.
- Week 10 (Sabine): Sabine will discuss how to calculate syntactic complexity, in particular using TAASSC, with reference to Kyle & Crossley (2018).
Small Project for Weeks 7-10: Part of Speech (POS) Tagging
- Week 7 - Introduction to POS Tagging
- Week 8 - POS tagging: More features, more precise accuracy figures
- Week 9 - Applying machine learning algorithms to POS tagging
Resources (readings, code, etc.)
- Markdown Cheatsheet
- GitHub Pages Information
- More to be added as the term progresses