CrateOrg/parseidon

Parseidon

Parseidon is a document-parsing and text-extraction tool written in Python. It lets the user extract strings that match a predefined format, using either regular expressions (regex) or parsing expression grammars (PEG) for pattern matching. In addition, the filter mode uses vocabulary data to filter out common words, leaving uncommon strings that might be of interest. Pattern matching and filtering can also be combined in the find mode, where the filter helps the user identify words not covered by their regexes or PEGs.

Modes

Parseidon consists of four separate modes:

  • regex_mode performs pattern matching on the document strings using regular expressions.
    • A more detailed description can be found in regex_mode.
  • pegparse_mode has essentially the same functionality as regex_mode, except that it uses parsing expression grammar (PEG) rules to find matches.
  • filter_mode filters out common dictionary words, leaving unrecognized, potentially interesting words for manual inspection by the user.
    • A more detailed description can be found in filter_mode.
  • find_mode combines the functionality of filter_mode with either regex_mode or pegparse_mode, highlighting both pattern matches and unrecognized strings.
    • A more detailed description can be found in find_mode.
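
The idea behind the regex, filter and find modes can be illustrated with a minimal standard-library sketch. This is not Parseidon's actual implementation or API: the function names, the IPv4-like pattern and the tiny vocabulary set below are all hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical stand-ins: Parseidon's real patterns and vocabulary data are not shown here.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
COMMON_WORDS = {"the", "server", "at", "runs", "build", "on", "host"}


def regex_matches(text):
    """Return every substring that matches the predefined pattern."""
    return IPV4_PATTERN.findall(text)


def uncommon_words(text, vocabulary):
    """Return tokens starting with a letter that are not in the vocabulary."""
    tokens = re.findall(r"[A-Za-z][\w-]*", text)
    return [t for t in tokens if t.lower() not in vocabulary]


def find(text, vocabulary):
    """Combine both steps: pattern matches plus unrecognized words."""
    return {
        "matches": regex_matches(text),
        "unrecognized": uncommon_words(text, vocabulary),
    }


result = find(
    "The server at 10.0.0.12 runs build zx-4471 on host kestrel",
    COMMON_WORDS,
)
print(result)
```

In this sketch the pattern picks up the address, while the vocabulary filter surfaces `zx-4471` and `kestrel`, which no pattern was written for. That is the kind of gap the find mode is described as helping the user notice.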

Plugins

The project includes plugins in addition to the core project. Below follows a list of implemented plugins.

  • parseidon-headings-plugin
    • Removes numbered headings that could be falsely identified as IPv4 addresses.
  • parseidon-hyphen-plugin
    • Determines whether a hyphen belongs to a word or exists only because the word was broken across lines when it exceeded the line width.

These are described in more detail in headings_plugin and hyphen_plugin.
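
To make the two problems concrete, here is a rough sketch of how such preprocessing could look. It is an assumption-laden illustration, not the plugins' real logic: the function names, the heading heuristic (a four-part dotted number at the start of a line followed by a capitalized word) and the vocabulary lookup used to judge hyphens are all hypothetical.

```python
import re

# Assumption: a heading number sits at the start of a line and is followed
# by whitespace and a capitalized word, e.g. "1.2.3.4 Network layout".
HEADING_NUMBER = re.compile(
    r"^[ \t]*\d{1,3}(?:\.\d{1,3}){3}[ \t]+(?=[A-Z])", re.MULTILINE
)

# A word split across a line break: "infor-" at line end, "mation" on the next line.
LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\n(\w+)")


def strip_heading_numbers(text):
    """Drop dotted heading numbers so they are not mistaken for IPv4 addresses."""
    return HEADING_NUMBER.sub("", text)


def rejoin_hyphenated(text, vocabulary):
    """Join a line-broken word if the joined form is a known word; else keep the hyphen."""

    def repl(match):
        first, second = match.group(1), match.group(2)
        joined = first + second
        if joined.lower() in vocabulary:
            return joined  # the hyphen only marked the line break
        return f"{first}-{second}"  # treat the hyphen as part of the word

    return LINE_BREAK_HYPHEN.sub(repl, text)


cleaned = strip_heading_numbers(
    "1.2.3.4 Network layout\nThe gateway is 192.168.1.1 today."
)
rejoined = rejoin_hyphenated(
    "infor-\nmation about a well-\nknown host", {"information"}
)
print(cleaned)
print(rejoined)
```

The heading heuristic deliberately leaves `192.168.1.1` alone because it does not start a line; a real implementation would likely need more context than this to separate headings from addresses reliably.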

Documentation

In addition to this document, the project includes a documentation folder which contains information about installation, usage, plugins and language resources.

Contact

For questions, feedback, or general inquiries, please contact us at [email protected].

Data attribution

For attribution of the language resources used in this project, please refer to the third-party notices. For information on how the respective sources are used, please see language resources.

About

A tool for automating the process of extracting relevant information from text documents
