He who has a why to live can bear almost any how.
-- F. Nietzsche
This is a dataset I created for some experimentation with language modelling.
The source of the dataset is scanned books from the Internet Archive.
Since the source consists of scans of old books, the dataset is very noisy, and cleaning it posed several challenges.
I started writing a script to resolve the last typos and spelling mistakes by hand, but deemed finishing it too much work.
Still, this repository might be helpful as an insight into what can be done when cleaning a noisy text dataset.
The folder structure of the data is as follows:
- data
  - nietzsche
    - bad
    - pre-cleanup
    - processed
    - raw
      - nietzsche
raw keeps the raw text files from https://archive.org/.
bad contains books whose scans were so unreadable that I deemed them too much hassle to process.
processed is the processed data: the same text files as in raw, but with each sentence on its own line, wrapped in a start and an end tag. These files are produced by the script preprocess_nietzsche.py.
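The sentence-per-line format could be produced along these lines (a minimal sketch, not the actual logic of preprocess_nietzsche.py; the `<s>`/`</s>` markers and the naive punctuation-based split are assumptions, since the README does not specify the tags):

```python
import re

# Hypothetical start/end markers; the actual tags written by
# preprocess_nietzsche.py are not specified in this README.
START_TAG = "<s>"
END_TAG = "</s>"

def to_sentence_lines(text):
    """Split raw book text into one tagged sentence per line."""
    # Collapse the line breaks and stray whitespace left over from the scan.
    text = re.sub(r"\s+", " ", text).strip()
    # Naive sentence split: a ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return "\n".join(f"{START_TAG} {s} {END_TAG}" for s in sentences if s)

print(to_sentence_lines("He who has a why. Can bear any how!"))
# prints:
# <s> He who has a why. </s>
# <s> Can bear any how! </s>
```

A regex split like this will mishandle abbreviations ("Dr. Smith"), which is one reason a scanned-book corpus still needs the manual cleaning pass described below.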
pre-cleanup holds data prepared for the second, manual cleaning stage (clean_nietzsche.py). This data is not finished and the script was never run.
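A manual cleaning stage of this kind might look like the following sketch (an assumption, not the actual behaviour of clean_nietzsche.py: it flags lines containing characters that are common OCR artifacts in old book scans and lets the user retype them):

```python
import re

# Characters rarely found in clean English prose but common as OCR
# noise in old scans (a heuristic assumed for this sketch, not the
# one used by clean_nietzsche.py).
SUSPECT = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()-]")

def suspicious_lines(lines):
    """Return (line_number, line) pairs that likely contain OCR noise."""
    return [(i, ln) for i, ln in enumerate(lines, 1) if SUSPECT.search(ln)]

def clean_interactively(lines):
    """Show each flagged line and let the user correct it by hand."""
    fixed = list(lines)
    for i, ln in suspicious_lines(lines):
        print(f"line {i}: {ln}")
        answer = input("corrected line (empty to keep): ")
        if answer:
            fixed[i - 1] = answer
    return fixed
```

Even with a filter like this, a hand pass over a whole corpus of scanned books is slow, which matches the decision above to abandon the manual stage.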
The scripts train_nietzsche.py and test_nietzsche.py contain the logic to train and test (empirically, without perplexity) a language model on the data. They import from an unpublished project of mine, which will be updated in the future.