collaite/syntactic-text-similarity

syntactic-text-similarity

Investigates the performance of state-of-the-art syntactic text similarity algorithms, such as Levenshtein, Hamming and Jaccard distance, for matching similar textual context across multiple versions of a Dutch literary work.
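
For readers unfamiliar with the three metrics named above, here is a minimal pure-Python sketch of each. It is illustrative only and is not the code used in the notebooks; the function names and the choice to compute Jaccard distance over lowercased, whitespace-separated tokens are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over token sets (tokenisation is an assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty texts are treated as identical
    return 1 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(hamming("karolin", "kathrin"))
    print(jaccard_distance("de oude lamp", "de nieuwe lamp"))
```

Note that Hamming distance is only defined for strings of equal length, so applying it to text pairs of differing length requires some form of padding or truncation; how the notebooks handle this is not described here.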

File descriptions for this repo:

├── semantic-similarity-syntacticmetrics.ipynb ** Jupyter notebook to generate similarity data **
├── analyse-syntactic-results.ipynb ** Jupyter notebook to analyse the data generated by the previous notebook in the list **
├── sampling-strategy.ipynb ** Jupyter notebook to select a sample of text pairs to provide to human validators of textual similarity (to measure the performance of the text similarity metrics) **
├── needle.py ** Python implementation of the [Needleman-Wunsch algorithm](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) for text similarity **
├── core.py ** Required by needle.py for Needleman-Wunsch algorithm **
├── preprocess.py ** basic text preprocessing script **
└── data/ ** Folder containing results data for analysis and other required utility files **
    ├── results/syntactic/collaite-text-similarity-experiment-files-v1.zip ** archive of files required to perform human validation and measure text similarity performance **
    └── stopwords-iso.json ** standard list of Dutch language stop words **
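
To give a sense of what needle.py computes, the following is a minimal sketch of the Needleman-Wunsch global alignment score. It is not the implementation in needle.py or core.py; the function name and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of two sequences (strings or token lists)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against an empty sequence costs one gap per element.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,                 # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("lamp", "lam"))
    print(needleman_wunsch_score(["de", "lamp"], ["de", "oude", "lamp"]))
```

Because the function only compares elements for equality, it works on character strings and on lists of word tokens alike.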

Required input data:

In the CollAIte project we are applying this analysis to multiple versions of the "Projectielantaarn Aladin" manuscript by Raymond Brulez. The required input data is four plain text .txt files, one for each version (also called a witness) of "Projectielantaarn Aladin". The witnesses are referred to as W1, W2, W3 and W4.
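
As a rough sketch of how the four witnesses might be loaded and paired for comparison, consider the following. The file names (W1.txt to W4.txt), the UTF-8 encoding and both function names are assumptions for this example; the notebooks may expect different names or locations.

```python
from itertools import combinations
from pathlib import Path

WITNESS_IDS = ("W1", "W2", "W3", "W4")


def load_witnesses(directory):
    """Read one plain text file per witness; file names are hypothetical."""
    folder = Path(directory)
    return {
        witness_id: (folder / f"{witness_id}.txt").read_text(encoding="utf-8")
        for witness_id in WITNESS_IDS
    }


def witness_pairs(witnesses):
    """All unordered pairs of witness identifiers, in sorted order."""
    return list(combinations(sorted(witnesses), 2))
```

With four witnesses this yields six pairs (W1-W2, W1-W3, and so on), each of which can then be passed to a similarity metric.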

Running the script:

Requirements:

  1. Python 3.7+
  2. Git

Steps:

  1. Clone this repo (git clone git@github.com:collaite/syntactic-text-similarity.git)
  2. Change into the syntactic-text-similarity/ directory
  3. Run pip install -r requirements.txt in the terminal or console (optionally, create and activate a virtual environment with a tool such as venv before running this command)
  4. With the required input data in place (see above), run the semantic-similarity-syntacticmetrics.ipynb notebook to perform the text similarity calculations.
  5. Then run the analyse-syntactic-results.ipynb notebook to explore the resulting dataset.
  6. Run the sampling-strategy.ipynb notebook to select a manageable sample of the text similarity calculations for validation by human text scholars.
