This repository investigates the performance of state-of-the-art syntactic text similarity algorithms, such as Levenshtein, Hamming and Jaccard distance, for matching similar textual context across multiple versions of a Dutch literary work.
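To make the three metrics concrete, here is a minimal Python sketch. It is illustrative only and is not taken from the notebooks in this repo: the function names are hypothetical, and computing Jaccard distance over lowercased whitespace tokens is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a: str, b: str) -> float:
    """1 minus the overlap of the two token sets (assumed tokenisation: split on whitespace)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty texts are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(levenshtein("kitten", "sitting"))
print(jaccard_distance("de oude lamp", "de nieuwe lamp"))
```

Note that Hamming distance is only defined for strings of equal length, so applying it to passages of different lengths needs some form of padding or truncation; how the notebooks handle this is not shown here.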
├── semantic-similarity-syntacticmetrics.ipynb ** Jupyter notebook to generate similarity data **
├── analyse-syntactic-results.ipynb ** Jupyter notebook to analyse the data generated by the previous notebook in the list **
├── sampling-strategy.ipynb ** Jupyter notebook to select a sample of text pairs to provide to human validators of textual similarity (to measure the performance of the text similarity metrics) **
├── needle.py ** Python implementation of the [Needleman-Wunsch algorithm](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) for text similarity **
├── core.py ** Required by needle.py for Needleman-Wunsch algorithm **
├── preprocess.py ** basic text preprocessing script **
└── data/ ** Folder containing results data for analysis and other required utility files **
    ├── results/syntactic/collaite-text-similarity-experiment-files-v1.zip ** archive of files required to perform human validation and measure text similarity performance **
    └── stopwords-iso.json ** standard list of Dutch language stop words **

In the CollAIte project we are applying this analysis to multiple versions of the "Projectielantaarn Aladin" manuscript by Raymond Brulez. The required input data consists of four plain-text .txt files, one for each version (also called a witness) of "Projectielantaarn Aladin". The witnesses are referred to as W1, W2, W3 and W4.
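As a rough sketch of how the four witnesses and the stop word list might be loaded before comparison, consider the following. It makes several assumptions that the repo does not confirm: that the files are named `W1.txt` to `W4.txt`, that the stop word JSON maps a language code such as `"nl"` to a list of words, and that all helper names are hypothetical. The actual logic lives in `preprocess.py` and the notebooks.

```python
import json
from itertools import combinations
from pathlib import Path


def load_witnesses(folder, names=("W1", "W2", "W3", "W4")):
    """Read one plain-text file per witness; the <name>.txt naming is an assumption."""
    return {
        name: (Path(folder) / f"{name}.txt").read_text(encoding="utf-8")
        for name in names
    }


def load_dutch_stopwords(path):
    """Assumes the JSON maps language codes to word lists, e.g. {"nl": [...]}."""
    with open(path, encoding="utf-8") as handle:
        return set(json.load(handle)["nl"])


def remove_stopwords(text, stopwords):
    """Lowercase, split on whitespace and drop stop words."""
    return " ".join(t for t in text.lower().split() if t not in stopwords)


def witness_pairs(witnesses):
    """All unordered pairs of witness names: 6 pairs for 4 witnesses."""
    return list(combinations(sorted(witnesses), 2))
```

With four witnesses this yields six pairs to compare, from (W1, W2) to (W3, W4).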
Requirements:
- Python 3.7+
- Git
Steps:
- Clone this repo (`git clone [email protected]:collaite/syntactic-text-similarity.git`)
- Change into the `syntactic-text-similarity/` directory
- Run `pip install -r requirements.txt` in the terminal or console (optionally create and activate a virtual environment using something like venv before running this command)
- Ensuring that you have the required input data as indicated above, you may execute the `semantic-similarity-syntacticmetrics.ipynb` notebook to run the text similarity calculations.
- Thereafter, run the `analyse-syntactic-results.ipynb` notebook as desired to explore the results dataset.
- Run `sampling-strategy.ipynb` for selecting a manageable sample of the text similarity calculations for validation by human text scholars.
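One plausible way to draw a manageable validation sample is to stratify the scored pairs by similarity, so that low, medium and high scores are all represented. The sketch below is an assumption about how such sampling could work and is not the procedure in `sampling-strategy.ipynb`; the function name, the bin count and the requirement that scores lie in [0, 1] are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(scored_pairs, per_bin=2, n_bins=5, seed=42):
    """Pick up to `per_bin` pairs from each equal-width score bin.

    scored_pairs: iterable of (pair_id, score), with score assumed to be in [0, 1].
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    bins = defaultdict(list)
    for pair_id, score in scored_pairs:
        index = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        bins[index].append((pair_id, score))
    sample = []
    for index in sorted(bins):
        members = bins[index]
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample
```

Sampling per bin, instead of uniformly over all pairs, avoids a sample dominated by whichever score range happens to be most common.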