Parseidon

Parseidon is a document-parsing and text-extraction tool written in Python. It lets the user extract strings that match a predefined format, using either regular expressions (regex) or parsing expression grammars (PEG) for pattern matching. In addition, the filter mode uses vocabulary data to filter out common words, leaving uncommon strings that might be of interest. Pattern matching and filtering can also be combined in the find mode, where the filter helps the user identify words not covered by their regexes or PEGs.

Modes

Parseidon consists of four separate modes:

regex_mode performs pattern matching on the document strings using regular expressions.
- A more detailed description can be found here: regex_mode
pegparse_mode has essentially the same functionality as regex_mode, except that it uses parsing expression grammar (PEG) rules to find matches.
- A more detailed description can be found here: pegparse_mode
filter_mode filters out common dictionary items, leaving the unrecognized, potentially interesting words for manual inspection by the user.
- A more detailed description can be found here: filter_mode
find_mode combines the functionality of filter_mode with either regex_mode or pegparse_mode, highlighting both pattern matches and unrecognized strings.
- A more detailed description can be found here: find_mode
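
To make the relationship between the modes concrete, here is a minimal sketch of the find mode idea: report pattern matches, then report tokens that are neither matches nor known vocabulary. It is an illustration only and does not use Parseidon's actual API. The names `COMMON_WORDS`, `IPV4_PATTERN` and `find_sketch` are hypothetical, the vocabulary is a tiny hand-written set, and the IPv4 pattern is a deliberately loose example.

```python
import re

# Hypothetical stand-in for vocabulary data (assumption: a plain set of
# lowercase words; the project's real data format is not shown here).
COMMON_WORDS = {"the", "server", "at", "is", "reachable", "via", "host"}

# Example pattern only: a loose IPv4 matcher that does not validate octet ranges.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_sketch(text, pattern, vocabulary):
    """Return (pattern matches, unrecognized tokens) for a piece of text."""
    # Pattern-matching step: collect every string that matches the pattern.
    matches = pattern.findall(text)
    matched = set(matches)

    # Filtering step: keep tokens that are neither matches nor common words.
    unrecognized = []
    for raw in text.split():
        token = raw.strip(".,;:!?()\"'")
        if not token or token in matched:
            continue
        if token.lower() not in vocabulary:
            unrecognized.append(token)
    return matches, unrecognized


matches, unknown = find_sketch(
    "The server at 10.0.0.5 is reachable via host xk7-gateway.",
    IPV4_PATTERN,
    COMMON_WORDS,
)
print(matches)  # pattern matches
print(unknown)  # strings the pattern missed and the vocabulary does not know
```

In this sketch the filter surfaces `xk7-gateway`, which the IPv4 pattern does not cover, which is the kind of assistance the find mode description refers to.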

Plugins

In addition to the core package, the project includes plugins. The implemented plugins are listed below.

parseidon-headings-plugin
- Removes numbered headings that could be falsely identified as IPv4 addresses
parseidon-hyphen-plugin
- Determines whether a word containing a hyphen is correctly hyphenated or whether the hyphen exists only because the word was broken across lines when it exceeded the line width.

These are described in more detail in headings_plugin and hyphen_plugin.
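
The problems these plugins address can be sketched in a few lines of Python. The code below is not the plugins' implementation, which is described in the linked documents. `strip_heading_number` and `resolve_hyphen` are hypothetical helpers, and both heuristics (a leading section number followed by a capitalized word, and a vocabulary lookup of the joined word) are assumptions made for illustration.

```python
import re

# Assumption: a heading starts with a dotted section number followed by
# whitespace and a capitalized word, e.g. "1.2.3.4 Network Setup".
HEADING_NUMBER = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+(?=[A-Z])")


def strip_heading_number(line):
    """Drop a leading section number so it cannot be read as an IPv4 address."""
    return HEADING_NUMBER.sub("", line, count=1)


def resolve_hyphen(word, vocabulary):
    """Keep the hyphen if the hyphenated form is known, else try the joined form."""
    if word.lower() in vocabulary:
        return word  # a genuinely hyphenated word
    joined = word.replace("-", "")
    if joined.lower() in vocabulary:
        return joined  # the hyphen was most likely a line-break artifact
    return word  # unknown either way: leave it for manual inspection


# Hypothetical vocabulary used only for this example.
vocabulary = {"well-known", "gateway"}

print(strip_heading_number("1.2.3.4 Network Setup"))
print(strip_heading_number("Ping 10.0.0.5 now"))
print(resolve_hyphen("gate-way", vocabulary))
print(resolve_hyphen("well-known", vocabulary))
```

A heuristic this simple would also strip a real address that happens to start a line and is followed by a capitalized word, so a production rule would need more context than this sketch uses.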

Documentation

In addition to this document, the project includes a documentation folder which contains information about installation, usage, plugins and language resources.

Contact

For questions, feedback, or general inquiries, please contact us at [email protected].

Data attribution

For attribution of language resources used in this project, please refer to third party notices. For information on how the respective sources are used, please see language resources.
