Parses seclists.org raw.html files into the following subfiles:
-
.reply.body.txt
Full content of reply (without seclists.org wrapper page html)
-
.reply.title_body.txt
Title of reply + full content of reply
-
.reply.body_no_signature.txt
Full content of reply, with attempt to strip out signature
-
.reply.title_body_no_signature.txt
Title of reply + above
-
.reply.body_tags.txt
File containing analysis of tags in raw.html file. Content is in JSON format.
* tags: HTML tag types found in reply, along with count
* sites: domains of sites referenced in reply, along with count
Example:
{"tags": {"pre": 2, "a": 1}, "sites": {"pentestmag.com": 1}}
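As an illustration of how a tags/sites summary of this shape could be produced, here is a minimal sketch using only the Python standard library. It is not the script's actual implementation: the names TagSiteCounter and analyze_tags are hypothetical, and it assumes that "sites" counts the host part of <a href> links, which the description above does not confirm.

```python
import json
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagSiteCounter(HTMLParser):
    """Hypothetical helper: counts start tags and the hosts of <a href> links."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.sites = Counter()

    def handle_starttag(self, tag, attrs):
        # Every opening tag is counted by its (lowercased) name.
        self.tags[tag] += 1
        if tag == "a":
            # Assumption: a "site" is the network location of the link target.
            href = dict(attrs).get("href") or ""
            host = urlparse(href).netloc
            if host:
                self.sites[host] += 1


def analyze_tags(html_text):
    """Return a dict with the same two keys as the JSON example above."""
    counter = TagSiteCounter()
    counter.feed(html_text)
    counter.close()
    return {"tags": dict(counter.tags), "sites": dict(counter.sites)}


# Made-up reply fragment, chosen to mirror the example output.
sample = '<pre>Hello</pre><pre>See <a href="http://pentestmag.com/x">this</a></pre>'
print(json.dumps(analyze_tags(sample)))
```

Counter keeps the counting code short, and converting to a plain dict before serializing keeps the JSON free of any Python-specific types.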
-d <directory>, parse entire directory, e.g., -d ./2011_01
-f <filename>, parse single raw file, e.g., -f ./2011_Jan_0.raw.html
Example usage: $ python seclists_reply_parse.py -d ./2011_01
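For readers who want to wrap similar behaviour in their own tool, the two options above could be declared with argparse roughly as follows. This is a sketch only: build_parser is a hypothetical name, and treating -d and -f as mutually exclusive is an assumption, not something the script is known to enforce.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the -d / -f options described above."""
    parser = argparse.ArgumentParser(
        description="Parse seclists.org raw.html reply files."
    )
    # Assumption: exactly one of -d or -f must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-d", dest="directory", metavar="<directory>",
                       help="parse entire directory, e.g. ./2011_01")
    group.add_argument("-f", dest="filename", metavar="<filename>",
                       help="parse single raw file, e.g. ./2011_Jan_0.raw.html")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["-d", "./2011_01"])
print(args.directory)
```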
For more flexibility, import this library and use the following functions:
Parse .raw.html files
Args:
- path: str, directory containing .raw.html files
Parse individual message.
Args:
- filename: str
Parses a month index raw.html file into a CSV file. This also pulls data from the referenced replies to obtain full date and author information.
The CSV file contains five columns:
- id
- title: Subject of reply
- date: ISO 8601 format, e.g. 2005-01-05T00:53:02+00:00
- author: Name and email, as supplied by author
- parent: the id of the parent thread email; blank if this is a parent thread
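To show how the five columns might be consumed, the sketch below reads such a CSV and groups rows into threads via the parent column. Several things here are assumptions: that the file has a header row with exactly these column names, that ids are simple strings, and that every reply points at its thread's first message. The sample rows and the function name group_by_thread are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Made-up sample data in the column layout described above.
sample_csv = io.StringIO(
    "id,title,date,author,parent\n"
    "0,Example subject,2005-01-05T00:53:02+00:00,Alice <alice@example.com>,\n"
    "1,Re: Example subject,2005-01-05T02:10:00+00:00,Bob <bob@example.com>,0\n"
)


def group_by_thread(csv_file):
    """Group rows by thread; a blank parent marks the start of a thread."""
    threads = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # The date column is ISO 8601, so it parses directly.
        row["date"] = datetime.fromisoformat(row["date"])
        thread_id = row["parent"] or row["id"]
        threads[thread_id].append(row)
    return dict(threads)


threads = group_by_thread(sample_csv)
for thread_id, rows in threads.items():
    print(thread_id, [r["title"] for r in rows])
```

Parsing the date into a timezone-aware datetime makes it straightforward to sort the messages in a thread chronologically.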
-f <filename>, parse single raw file, e.g. -f ./2011_Jan_0.raw.html
Example usage: $ python seclists_index_parse.py -f ./2011_Jan_0.raw.html