syntactic-text-similarity

Investigates the performance of state-of-the-art syntactic text similarity algorithms, such as Levenshtein, Hamming and Jaccard distance, for matching similar textual context across multiple versions of a Dutch literary work.
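
For orientation, here is a minimal pure-Python sketch of the three distances named above. It is not the code used in the notebooks. The function names are hypothetical, and computing Jaccard distance over lowercased whitespace tokens is an assumption made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a: str, b: str) -> float:
    """1 minus the overlap of the two token sets (assumed: whitespace tokens)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty texts are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

Note that Hamming distance is only defined for strings of equal length, so passages of different lengths would need padding or truncation before it can be applied.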

File descriptions for this repo:

├── semantic-similarity-syntacticmetrics.ipynb ** Jupyter notebook to generate similarity data **
├── analyse-syntactic-results.ipynb ** Jupyter notebook to analyse the data generated by the previous notebook in the list **
├── sampling-strategy.ipynb ** Jupyter notebook to select a sample of text pairs to provide to human validators of textual similarity (to measure the performance of the text similarity metrics) **
├── needle.py ** Python implementation of the [Needleman-Wunsch algorithm](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) for text similarity **
├── core.py ** Required by needle.py for Needleman-Wunsch algorithm **
├── preprocess.py ** basic text preprocessing script **
└── data/ ** Folder containing results data for analysis and other required utility files **
    ├── results/syntactic/collaite-text-similarity-experiment-files-v1.zip ** archive of files required to perform human validation and measure text similarity performance **
    └── stopwords-iso.json ** standard list of Dutch language stop words **
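
To illustrate what needle.py is for, the following is a rough sketch of Needleman-Wunsch global alignment scoring. It is not the implementation in needle.py or core.py. The function name and the scoring values (match=1, mismatch=-1, gap=-1) are assumptions chosen for the example.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of two sequences (strings or token lists)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against an empty sequence costs one gap per element.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]
```

Because the function only indexes and compares elements, it can be called on character strings or on lists of word tokens.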

Required input data:

In the CollAIte project, we apply this analysis to multiple versions of the "Projectielantaarn Aladin" manuscript by Raymond Brulez. The required input data is four plain-text .txt files, one for each version (also called a witness) of "Projectielantaarn Aladin". The witnesses are referred to as W1, W2, W3 and W4.
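
As a sketch of how the four witnesses might be read in and paired up for comparison: the file names W1.txt to W4.txt, the UTF-8 encoding and both function names are assumptions, so adjust them to match your own copies of the input data.

```python
from itertools import combinations
from pathlib import Path

WITNESS_IDS = ("W1", "W2", "W3", "W4")


def load_witnesses(directory):
    """Read each witness into a dict keyed by its identifier."""
    texts = {}
    for witness_id in WITNESS_IDS:
        # Hypothetical naming scheme: one file per witness, e.g. W1.txt
        path = Path(directory) / f"{witness_id}.txt"
        texts[witness_id] = path.read_text(encoding="utf-8")
    return texts


def witness_pairs(witness_ids=WITNESS_IDS):
    """All unordered pairs of witnesses to compare."""
    return list(combinations(witness_ids, 2))
```

With four witnesses there are six unordered pairs to compare.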

Running the script:

Requirements:

Python 3.7+
Git

Steps:

Clone this repo (git clone [email protected]:collaite/syntactic-text-similarity.git)
Change into the syntactic-text-similarity/ directory
Run pip install -r requirements.txt in a terminal (optionally create and activate a virtual environment, for example with venv, before running this command)
With the required input data in place as described above, run the semantic-similarity-syntacticmetrics.ipynb notebook to perform the text similarity calculations.
Then run analyse-syntactic-results.ipynb to explore the results dataset.
Run sampling-strategy.ipynb to select a manageable sample of the text similarity calculations for validation by human text scholars.
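
One possible way to draw a manageable validation sample is to stratify by similarity score so that human validators see low, medium and high scoring pairs. The sketch below is not the strategy implemented in sampling-strategy.ipynb. The function name, the bin count, the per-bin size and the assumption that scores lie in [0, 1] are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(scored_pairs, n_bins=5, per_bin=2, seed=0):
    """Pick up to `per_bin` pairs from each equal-width score bin.

    scored_pairs: iterable of (pair_id, score), with score assumed in [0, 1].
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    bins = defaultdict(list)
    for pair_id, score in scored_pairs:
        index = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[index].append((pair_id, score))

    sample = []
    for index in sorted(bins):
        members = bins[index]
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample
```

Sampling per bin rather than uniformly avoids a sample dominated by whichever score range happens to be most common.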
