amaik/nietzsche-complete-work

Complete Work of Friedrich Nietzsche

He who has a why to live can bear almost any how.

F. Nietzsche

This is a dataset I created for some experimentation with language modelling.
The source of the dataset is scanned books from the Internet Archive. Because the source consists of scans of old books, the dataset is very noisy and there were challenges with cleaning it. I ended up writing a script to resolve the last typos and spelling mistakes by hand, but deemed it too much work.

However, this might be helpful as an insight into what can be done when cleaning a text dataset.

Structure

The folder structure of the data is as follows:

  • data
    • nietzsche
      • bad
      • pre-cleanup
      • processed
      • raw

raw keeps the raw text files from https://archive.org/.

bad contains books that were so unreadable that I deemed them too much hassle to process.

processed is the processed data. These are the same text files as in raw, but each sentence is on its own line and is marked with a pair of tags. This is the output of the script preprocess_nietzsche.py.
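A minimal sketch of what such a sentence-per-line step could look like is shown here. It is not the code in preprocess_nietzsche.py: the tag names `<s>` and `</s>`, the function names, and the punctuation-based splitting rule are all assumptions made for illustration.

```python
import re

# Hypothetical tag names; the README does not state which tags the real script writes.
START_TAG = "<s>"
END_TAG = "</s>"


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    # Collapse line breaks and repeated spaces left over from the scan.
    flat = re.sub(r"\s+", " ", text).strip()
    if not flat:
        return []
    # Split after sentence-ending punctuation, keeping the punctuation itself.
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def tag_sentences(text):
    """Return one tagged sentence per line."""
    return "\n".join(f"{START_TAG} {s} {END_TAG}" for s in split_sentences(text))
```

A naive rule like this will also split after abbreviations and initials, so real scanned text probably needs extra handling on top of it.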

pre-cleanup is data in preparation for the second, manual cleaning stage (clean_nietzsche.py). This data is not finished and the script was not run.
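One way to narrow down what needs manual correction is to list words that occur very rarely, since scanning errors tend to produce one-off spellings. The sketch below illustrates that idea only; the function name `rare_words`, the `max_count` threshold, and the frequency heuristic itself are assumptions and do not describe what clean_nietzsche.py does.

```python
import re
from collections import Counter


def rare_words(lines, max_count=1):
    """Return lowercased words seen at most max_count times, sorted alphabetically."""
    counts = Counter(
        word.lower()
        for line in lines
        for word in re.findall(r"[A-Za-z]+", line)
    )
    # Rare words are only candidates for review, not confirmed typos.
    return sorted(w for w, c in counts.items() if c <= max_count)
```

For example, `rare_words(["the will to power", "the wiII to power", "the will"])` flags the misread `wiII` (lowercased to `wiii`) while leaving the frequent words alone. Genuine rare words get flagged too, so the output still has to be checked by hand.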

Language model

The scripts train_nietzsche.py and test_nietzsche.py contain logic to train and test (empirically, without perplexity) a language model (LM) on the data. They need imports from an unpublished project of mine, which will be updated in the future.

About

A dataset of the complete work by Friedrich Nietzsche.
