He who has a why to live can bear almost any how.
-- F. Nietzsche
This is a dataset I created for some experimentation with language modelling.
The source of the dataset is scanned books from the Internet Archive.
Since the source consists of scans of old books, the dataset is very noisy, and cleaning it posed several challenges.
I started writing a script to resolve the last typos and spelling mistakes by hand, but deemed finishing it too much work.
Still, this repository might be helpful as an insight into what can be done when cleaning a noisy text dataset.
The folder structure of the data is as follows:
- data
  - nietzsche
    - bad
    - pre-cleanup
    - processed
    - raw
      - nietzsche
raw keeps the raw text files from https://archive.org/.
bad contains books whose scans were so unreadable that I deemed them too much hassle to process.
processed is the processed data: the same text files as in raw, but with each sentence on its own line, wrapped in a start and an end tag. These files are produced by the script preprocess_nietzsche.py.
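The sentence-per-line format could be produced along these lines (a minimal sketch, not the actual logic of preprocess_nietzsche.py; the `<s>`/`</s>` markers and the naive punctuation-based split are assumptions, since the README does not specify the tags):

```python
import re

# Hypothetical start/end markers; the actual tags written by
# preprocess_nietzsche.py are not specified in this README.
START_TAG = "<s>"
END_TAG = "</s>"

def to_sentence_lines(text):
    """Split raw book text into one tagged sentence per line."""
    # Collapse the line breaks and stray whitespace left over from the scan.
    text = re.sub(r"\s+", " ", text).strip()
    # Naive sentence split: a ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return "\n".join(f"{START_TAG} {s} {END_TAG}" for s in sentences if s)

print(to_sentence_lines("He who has a why. Can bear any how!"))
# prints:
# <s> He who has a why. </s>
# <s> Can bear any how! </s>
```

A regex split like this will mishandle abbreviations ("Dr. Smith"), which is one reason a scanned-book corpus still needs the manual cleaning pass described below.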
pre-cleanup holds data prepared for the second, manual cleaning stage (clean_nietzsche.py). This data is not finished and the script was never run.
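A manual cleaning stage of this kind might look like the following sketch (an assumption, not the actual behaviour of clean_nietzsche.py: it flags lines containing characters that are common OCR artifacts in old book scans and lets the user retype them):

```python
import re

# Characters rarely found in clean English prose but common as OCR
# noise in old scans (a heuristic assumed for this sketch, not the
# one used by clean_nietzsche.py).
SUSPECT = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()-]")

def suspicious_lines(lines):
    """Return (line_number, line) pairs that likely contain OCR noise."""
    return [(i, ln) for i, ln in enumerate(lines, 1) if SUSPECT.search(ln)]

def clean_interactively(lines):
    """Show each flagged line and let the user correct it by hand."""
    fixed = list(lines)
    for i, ln in suspicious_lines(lines):
        print(f"line {i}: {ln}")
        answer = input("corrected line (empty to keep): ")
        if answer:
            fixed[i - 1] = answer
    return fixed
```

Even with a filter like this, a hand pass over a whole corpus of scanned books is slow, which matches the decision above to abandon the manual stage.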
The scripts train_nietzsche.py and test_nietzsche.py contain the logic to train and test (empirically, without perplexity) a language model on the data. They import from an unpublished project of mine, which will be updated in the future.