This project supports analyses of noun and verb distributions in the CAREER corpus. To ensure accuracy in part-of-speech (POS) classification, we are using two complementary workflows:
- Human validation of CLAN’s automated POS tagging across the full corpus.
- Manual POS tagging of a subset of files.
The first workflow evaluates the accuracy of CLAN’s automated token recognition and morphological tagging. The second workflow provides a reliability check by estimating how many tokens the automated process should have tagged but missed. Together, these procedures allow us to quantify tagging accuracy and reliability for downstream analyses.
This document contains:
- Step-by-step instructions for human validation of CLAN POS tagging
- Step-by-step instructions for manual POS tagging
- Links to all relevant trackers and reference guides
Follow the appropriate workflow depending on which phase of the project you are assigned to.
Human validation of CLAN POS tagging

1. In the CAREER POS tagging > coding folder, open the CAREER POS tagging file tracker spreadsheet.
2. Locate and open the associated .xlsx file (under the `Tagged tokens URL` column) that you last worked on, or the next untagged file. If you are starting on a fresh file, add your initials to the `Annotator` column and today's date to the `First date worked on` column.
3. In the .xlsx file, most columns are pre-filled. The two columns that you will be filling out are `is_correct_tag` and `corrected_tag`.
4. For each token tagged in the .xlsx file:
   - Determine whether the `tag` produced by the automated process is correct by using context clues and reading the full utterance in the `utterance` column. If it is correct, input "y" in the `is_correct_tag` column and either input "0" in the `corrected_tag` column or leave it empty.
   - If the `tag` is not correct, input "n" in the `is_correct_tag` column and the correct tag shorthand in the `corrected_tag` column. The tags that we are using in the current workflow are 1. noun (n), 2. proper noun (n:prop), 3. plural noun (n:pt), 4. letter noun (n:let), 5. verb (v), 6. auxiliary verb (aux), and 7. modal verb (mod). See CLAN morphology tags for the full list of morphological tags.
     - For some tricky cases of noun and verb tags, see the Noun/verb special cases guide for examples.
   - If the correct morphological tag of a token cannot be inferred from context clues, input "u" in the `is_correct_tag` column and either input "0" in the `corrected_tag` column or leave it empty. Please use this option sparingly!
   - Double check your spelling.
5. Work through the entire .xlsx file via step 4. When finished, go back to the CAREER POS tagging file tracker spreadsheet and add today's date to the `Last date worked on` column. Add comments if needed.
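The rules in step 4 can be checked mechanically before a file is marked as finished. The following is a minimal sketch, not part of the project's tooling. It assumes each sheet row has been loaded as a dictionary keyed by the column names above; the function name `check_validation_row` and the helper `_cell` are hypothetical. Because a corrected tag may legitimately come from the full CLAN list, a tag outside the seven workflow tags is reported only as something to double check.

```python
# The seven tag shorthands named in step 4 of this workflow.
WORKFLOW_TAGS = {"n", "n:prop", "n:pt", "n:let", "v", "aux", "mod"}


def _cell(value):
    """Normalize a spreadsheet cell to a stripped string (None becomes "")."""
    return "" if value is None else str(value).strip()


def check_validation_row(row):
    """Return a list of problems found in one annotated row (empty if none).

    `row` is assumed to be a dict with the keys "is_correct_tag" and
    "corrected_tag"; this layout is an assumption of the sketch.
    """
    problems = []
    verdict = _cell(row.get("is_correct_tag"))
    corrected = _cell(row.get("corrected_tag"))

    if verdict not in {"y", "n", "u"}:
        problems.append(f"is_correct_tag must be y, n, or u (got {verdict!r})")
    elif verdict == "n":
        # An "n" row needs a replacement tag.
        if corrected not in WORKFLOW_TAGS:
            problems.append(
                f"corrected_tag {corrected!r} is not one of the seven workflow "
                "tags; check it against the full CLAN list"
            )
    elif corrected not in {"", "0"}:
        # "y" and "u" rows take "0" or an empty corrected_tag.
        problems.append(
            f"corrected_tag should be 0 or empty when is_correct_tag is {verdict!r}"
        )
    return problems
```

Running this over every row and reviewing the non-empty results is one way to catch misspelled shorthands such as "verb" in place of "v".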
Manual POS tagging

1. In the CAREER POS tagging > coding folder, open the CAREER manual POS file tracker spreadsheet.
2. Locate and open the associated .xlsx file (under the `File URL` column) that you last worked on, or the next untagged file. If you are starting on a fresh file, add your initials to the `Annotator` column and today's date to the `First date worked on` column.
3. For each utterance in the .xlsx file:
   - Identify all nouns (including proper nouns) and verbs present.
   - In the `token` column, enter each token exactly as it appears in the original utterance.
   - Double check your spelling.
   - Provide the appropriate tag in the corresponding `tag` column. The tags that we are using in the current workflow are 1. noun (n), 2. proper noun (n:prop), 3. plural noun (n:pt), 4. letter noun (n:let), 5. verb (v), 6. auxiliary verb (aux), and 7. modal verb (mod). See CLAN morphology tags for the full list of morphological tags.
     - For some tricky cases of noun and verb tags, see the Noun/verb special cases guide for examples (note that bullet point 5 is irrelevant to this workflow).
   - If an utterance doesn't contain any nouns or verbs, input "0" in the last two columns. Never leave a cell empty!
4. If multiple nouns or verbs occur in a single utterance:
   - Add an additional row for each extra token (e.g., if there are 4 total nouns and verbs in one utterance, you need to add 3 additional rows below the original row).
   - In the additional row(s), copy over the first three columns (`file_name`, `speaker`, `utterance`) from the original row.
   - Never edit the contents of the first three columns; only copy and paste them when needed!
5. Repeat steps 3 & 4 until you have finished the .xlsx file.
6. Before you close the file, do a systematic check of contractions to ensure you didn't miss any tokens (use ctrl/command + `F` to search for all cases of apostrophes).
7. When finished, go back to the CAREER manual POS file tracker spreadsheet and add today's date to the `Last date worked on` column.
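The "never leave a cell empty" rule and the apostrophe check in the steps above lend themselves to a quick automated pass. The following is a minimal sketch under stated assumptions: rows are dictionaries keyed by the column names `utterance`, `token`, and `tag`, and the function name `find_manual_tagging_issues` is hypothetical. It does not replace the manual search; it only lists what to look at.

```python
# Straight and curly apostrophes, since either may appear in transcripts.
APOSTROPHES = ("'", "\u2019")


def _cell(value):
    """Normalize a spreadsheet cell to a stripped string (None becomes "")."""
    return "" if value is None else str(value).strip()


def find_manual_tagging_issues(rows):
    """Return (issues, utterances_to_review) for a list of row dicts.

    Row numbers in the messages count data rows from 1, excluding any header.
    """
    issues = []
    to_review = []
    for number, row in enumerate(rows, start=1):
        token = _cell(row.get("token"))
        tag = _cell(row.get("tag"))
        utterance = _cell(row.get("utterance"))

        if token == "" or tag == "":
            issues.append(f"row {number}: token or tag cell is empty")
        elif (token == "0") != (tag == "0"):
            # "0" marks an utterance with no nouns or verbs, so it goes in both cells.
            issues.append(f"row {number}: 0 must appear in both token and tag")

        # Collect each utterance with an apostrophe once, for the contraction check.
        if any(mark in utterance for mark in APOSTROPHES) and utterance not in to_review:
            to_review.append(utterance)
    return issues, to_review
```

The second return value is a worklist: each listed utterance still needs a human to confirm that every contracted verb has its own row.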
- Q: Do I parse contractions? A: Yes. (e.g., "I'll" in "I'll do it" is tokenized as "will" (mod), "don't" in "I don't like it" is tokenized as "do" (aux), etc.)
- Q: How do I parse "be" verbs? A: Tokenize all of them as "be" (e.g., "I'm" in "I'm happy" is tokenized as "be" (v), "were" in "they were there" is tokenized as "be" (v), etc.).
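The two answers above amount to a small lookup from surface form to the entry for the `token` column. The sketch below is illustrative only and the function name `token_for` is hypothetical. The "'ll", "don't", "'m", and "were" cases come from the examples above; treating "'re" and negated forms such as "weren't" as "be" is an inference from the same two rules. Anything the sketch is unsure about (for example "'s", "won't", or "can't") returns `None` so that a person decides. The tag is left to the annotator.

```python
# Forms of "be"; per the FAQ, every one of them is entered as "be".
BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}

# Contracted endings mapped to the token to enter.
# "'ll" and "'m" follow the FAQ examples; "'re" is an assumed extension.
CLITIC_TOKENS = {"'ll": "will", "'m": "be", "'re": "be"}


def token_for(word):
    """Suggest the token-column entry for one word form, or None if unsure."""
    form = word.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    if form in BE_FORMS:
        return "be"
    if form.endswith("n't"):
        stem = form[:-3]
        if stem in BE_FORMS:
            return "be"
        if stem == "do":
            return "do"  # the FAQ example: "don't" is entered as "do"
        return None  # irregular or unlisted stems need a human decision
    for ending, token in CLITIC_TOKENS.items():
        if form.endswith(ending):
            return token
    return None  # no rule applies; decide by hand


print(token_for("I'll"), token_for("don't"), token_for("I'm"), token_for("were"))
```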