This is the code base for an MSc Artificial Intelligence dissertation at the University of Edinburgh. The best readme for the code is the dissertation itself, so I'll be concise.
Cognates are words in different languages that have evolved from the same proto-form. They often look and sound similar (e.g., Danish nat and Lithuanian naktis, both meaning night). Sometimes, however, they seem to be completely unrelated (e.g., Faroese hjarta and Lithuanian širdis for heart). Historical linguists use the comparative method (a bunch of heuristics, really) to figure out whether a pair of words is genetically related. The method is very precise, but also extremely time-consuming. With around 3,000 languages (out of 7,000+ currently spoken in the world) expected to be dead by the end of the century, manual analysis is too slow to cover them all. This project is an attempt to automate cognate identification, focusing specifically on little-known, under-resourced languages.
Data is stored in the input directory:
- input.txt: the raw Comparative Indo-European Database (Dyen, Kruskal, & Black, 1992). The only difference from the original is the removal of the introduction to the file.
- clean.txt: a cleaned-up version of the same file. Not used in the project, but a similar representation is stored in memory whenever the raw file is read. Good for use in other projects or manual browsing.
- POS.txt: part-of-speech (POS) tags for each of the 200 meanings from the Comparative Indo-European Database. Assigned manually by the author.
- consonants.txt: a list of Roman characters considered to be consonants.
- dolgo.txt: Dolgopolsky's (1986) sound classes and their corresponding Roman characters.
The code consists of the following modules:
- script.py: controls the flow of the program.
- constants.py: exactly that.
- reader.py: reads the Comparative Indo-European Database, performs data cleaning.
- pairer.py: pairs words within each meaning, creating positive and negative examples for classification. Divides the paired data into training, development, and test sets.
- extractor.py: given a pair of words, extracts various features (string similarity, letter correspondences, POS tags, and language groups).
- learner.py: implements SVM and logistic regression classifiers, hierarchical agglomerative clustering, and a number of evaluation metrics.
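To give a rough idea of what the pairing and string similarity steps involve, here is a minimal sketch. It is not the code from pairer.py or extractor.py: the function names, the tuple layout, and the cognate class labels ("night-1", "heart-1") are all assumptions made for illustration, and the split into training, development, and test sets is left out. The four words are the examples mentioned above.

```python
from itertools import combinations


def pair_words(entries):
    """Pair up all words that share a meaning.

    entries: iterable of (meaning, language, word, cognate_class) tuples.
    This layout is an assumption, not the format used by the project.
    Returns (meaning, word1, word2, is_cognate) tuples, where is_cognate
    is True when both words carry the same cognate class label.
    """
    by_meaning = {}
    for meaning, language, word, cognate_class in entries:
        by_meaning.setdefault(meaning, []).append((word, cognate_class))

    pairs = []
    for meaning, items in by_meaning.items():
        for (word1, class1), (word2, class2) in combinations(items, 2):
            pairs.append((meaning, word1, word2, class1 == class2))
    return pairs


def edit_distance(a, b):
    """Levenshtein distance between two strings (unit costs)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalised_similarity(a, b):
    """Edit distance scaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Hypothetical entries; the cognate class labels are made up for this sketch.
entries = [
    ("NIGHT", "Danish", "nat", "night-1"),
    ("NIGHT", "Lithuanian", "naktis", "night-1"),
    ("HEART", "Faroese", "hjarta", "heart-1"),
    ("HEART", "Lithuanian", "širdis", "heart-1"),
]

for meaning, word1, word2, is_cognate in pair_words(entries):
    print(meaning, word1, word2, is_cognate,
          round(normalised_similarity(word1, word2), 2))
```

Both pairs here are positive examples. The second one (hjarta and širdis) scores very low on surface similarity, which is why string similarity alone is not enough and the extractor also looks at letter correspondences, POS tags, and language groups.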
Ernesta Orlovaitė