Parseidon is a document-parsing and text-extraction tool written in Python. Its purpose is to let the user extract strings that match a predefined format, using either regular expressions (regex) or parsing expression grammars (PEGs) for pattern matching. Additionally, the filter mode uses vocabulary data to filter out common words, leaving uncommon strings that might be of interest. Pattern matching and filtering can also be used together in the find mode, where the filter helps the user identify words not covered by their regexes or PEGs.
Parseidon consists of four separate modes:
- regex_mode performs pattern matching on the document strings using regular expressions.
  - A more detailed description can be found in regex_mode.
- pegparse_mode has essentially the same functionality as regex_mode, except that it uses parsing expression grammar (PEG) rules to find matches.
  - A more detailed description can be found in pegparse_mode.
- filter_mode filters out common dictionary items, leaving the unrecognized, potentially interesting words for manual inspection by the user.
  - A more detailed description can be found in filter_mode.
- find_mode combines the functionality of filter_mode with either regex_mode or pegparse_mode, highlighting both pattern matches and unrecognized strings.
  - A more detailed description can be found in find_mode.
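The ideas behind regex matching, filtering, and their combination can be illustrated with a minimal Python sketch. This is not Parseidon's actual API: the function names, the IPv4 pattern, the punctuation handling, and the tiny vocabulary are all assumptions made for the example, and PEG matching is left out because it would require a third-party parser.

```python
import re

# Hypothetical pattern and vocabulary, chosen only for this illustration.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
COMMON_WORDS = {"the", "server", "at", "was", "restarted", "by"}
PUNCTUATION = ".,;:!?()\"'"


def regex_matches(text):
    """Pattern matching: return every substring that matches the pattern."""
    return IPV4_PATTERN.findall(text)


def filter_uncommon(text, vocabulary=COMMON_WORDS):
    """Filtering: return tokens whose lowercase form is not in the vocabulary."""
    uncommon = []
    for token in text.split():
        word = token.strip(PUNCTUATION)  # drop surrounding punctuation only
        if word and word.lower() not in vocabulary:
            uncommon.append(word)
    return uncommon


def find(text, vocabulary=COMMON_WORDS):
    """Combined: report pattern matches plus the remaining unrecognized tokens."""
    matches = regex_matches(text)
    unrecognized = [w for w in filter_uncommon(text, vocabulary) if w not in matches]
    return {"matches": matches, "unrecognized": unrecognized}


sample = "The server at 10.0.0.7 was restarted by jdoe42."
print(find(sample))
```

In this sketch the address is reported as a pattern match, while the token `jdoe42` surfaces only through the filter, which is the kind of string the combined mode is meant to bring to the user's attention.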
In addition to the core, the project includes plugins. The implemented plugins are listed below.
- parseidon-headings-plugin
  - Removes numbered headings that could falsely be identified as IPv4 addresses.
- parseidon-hyphen-plugin
  - Determines whether a word containing a hyphen is correct or whether the hyphen exists only because the word exceeded the line width.
These are described in more detail in headings_plugin and hyphen_plugin.
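To make the two plugin ideas concrete, here is a rough Python sketch. It does not reproduce the plugins' real implementation: the heading heuristic (four dot-separated numbers followed by a capitalized title), the vocabulary lookup for hyphenated words, and the names `drop_numbered_headings` and `resolve_hyphen` are assumptions made for illustration.

```python
import re

# Assumed heuristic: a line starting with four dot-separated numbers followed
# by a capitalized word is treated as a numbered heading such as
# "1.2.3.4 Network Setup", which would otherwise resemble an IPv4 address.
HEADING_RE = re.compile(r"^\s*\d+(?:\.\d+){3}\.?\s+[A-Z]")


def drop_numbered_headings(lines):
    """Return the lines that do not look like numbered headings."""
    return [line for line in lines if not HEADING_RE.match(line)]


def resolve_hyphen(first, second, vocabulary):
    """Decide how to rejoin a word that was split as 'first-' + 'second'.

    If the two parts form a known word without the hyphen, the hyphen is
    assumed to come from line wrapping and is dropped; otherwise it is kept.
    """
    joined = first + second
    if joined.lower() in vocabulary:
        return joined
    return first + "-" + second


lines = ["1.2.3.4 Network Setup", "The gateway is 10.0.0.7."]
print(drop_numbered_headings(lines))

vocabulary = {"document"}  # hypothetical vocabulary
print(resolve_hyphen("docu", "ment", vocabulary))
print(resolve_hyphen("well", "known", vocabulary))
```

Both checks are deliberately simple heuristics; for instance, a sentence that begins with an address followed by a capitalized word would also be dropped by this heading rule.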
In addition to this document, the project includes a documentation folder, which contains information about installation, usage, plugins, and language resources.
For questions, feedback, or general inquiries, please contact us at [email protected].
For attribution of language resources used in this project, please refer to third party notices. For information on how the respective sources are used, please see language resources.