Cognates

This is the code base for an MSc Artificial Intelligence dissertation at the University of Edinburgh. The best readme for the code is the dissertation itself, so I'll be concise.

Cognates are words in different languages that have evolved from the same proto-form. They often look and sound similar (e.g., Danish nat and Lithuanian naktis, both meaning night). Sometimes, however, they seem to be completely unrelated (e.g., Faroese hjarta and Lithuanian širdis for heart). Historical linguists use the comparative method (a bunch of heuristics, really) to figure out whether a pair of words is genetically related. The method is very precise, but also extremely time-consuming. With around 3,000 languages (out of 7,000+ currently spoken in the world) expected to be dead by the end of the century, manual analysis is too slow. This project is an attempt to automate cognate identification, focusing specifically on little-known, under-resourced languages.

Data

Data is stored in the input directory:

input.txt: the raw Comparative Indo-European Database (Dyen, Kruskal, & Black, 1992). The only difference from the original is the removal of the introduction to the file.
clean.txt: a cleaned-up version of the same file. Not used in the project, but a similar representation is stored in memory whenever the raw file is read. Good for use in other projects or manual browsing.
POS.txt: POS tags for each of the 200 meanings from the Comparative Indo-European Database. Assigned manually by the author.
consonants.txt: a list of Roman characters considered to be consonants.
dolgo.txt: Dolgopolsky's (1986) sound classes and their corresponding Roman characters.
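
To give a feel for what sound classes do, here is a minimal sketch. The mapping below is a toy, partial one written for illustration only; the actual classes live in dolgo.txt, whose format is not reproduced here. The names TOY_SOUND_CLASSES and to_sound_classes are hypothetical and do not come from the project code.

```python
# Toy, partial mapping from sound class to Roman characters.
# Assumption: for illustration only; the real classes are defined in dolgo.txt.
TOY_SOUND_CLASSES = {
    "P": "pbf",
    "T": "td",
    "S": "sz",
    "K": "kg",
    "M": "m",
    "N": "n",
    "R": "rl",
    "W": "wv",
    "J": "j",
}

# Invert the mapping so each character points at its class label.
CHAR_TO_CLASS = {
    ch: cls for cls, chars in TOY_SOUND_CLASSES.items() for ch in chars
}


def to_sound_classes(word):
    """Replace each mapped consonant with its class label; drop everything else."""
    return "".join(CHAR_TO_CLASS[ch] for ch in word.lower() if ch in CHAR_TO_CLASS)


if __name__ == "__main__":
    for word in ("nat", "naktis"):
        print(word, "->", to_sound_classes(word))
```

With this toy mapping, nat becomes NT and naktis becomes NKTS, so the shared N and T stand out even though the spellings differ.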

Code

script.py: controls the flow of the program.
constants.py: exactly that.
reader.py: reads the Comparative Indo-European Database, performs data cleaning.
pairer.py: pairs words within each meaning, creating positive and negative examples for classification. Divides the paired data into training, development, and test sets.
extractor.py: given a pair of words, extracts various features (string similarity, letter correspondences, POS tags, and language groups).
learner.py: implements SVM and logistic regression classifiers, hierarchical agglomerative clustering, and a number of evaluation metrics.
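
The pairing and feature extraction steps can be sketched roughly as follows. This is not the code from pairer.py or extractor.py: the data layout (language, word, cognate class), the function names, and the made-up third entry are all assumptions for illustration, and normalised edit distance stands in for whichever string similarity measures the project actually uses.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalised_similarity(a, b):
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def pair_words(entries):
    """Pair every two words of one meaning; label True if classes match.

    Assumption: each entry is a (language, word, cognate_class) tuple.
    """
    pairs = []
    for (_, word_a, class_a), (_, word_b, class_b) in combinations(entries, 2):
        pairs.append((word_a, word_b, class_a == class_b))
    return pairs


# Hypothetical entries for the meaning "night"; the class ids and the
# third word are invented for this example.
TOY_NIGHT = [
    ("Danish", "nat", 1),
    ("Lithuanian", "naktis", 1),
    ("Made-up", "zzz", 2),
]

if __name__ == "__main__":
    for word_a, word_b, is_cognate in pair_words(TOY_NIGHT):
        print(word_a, word_b, is_cognate, normalised_similarity(word_a, word_b))
```

Each labelled pair, together with its feature values, is the kind of example a classifier such as an SVM or logistic regression could then be trained on.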

Libraries

Author

Ernesta Orlovaitė
