Skip to content

tcarney57/DownloadAll.py

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 

Repository files navigation

DownloadAll.py

Description: This Python (v3.6) program runs from the command line and downloads the HTML markup from the specified URL ("parent page") and all URLs it links to. HTML content from each URL is appended to a single file (default: download.html), saved as separate HTML files, or saved a separate PDF files, in the current directory in the order the links appear in the parent page. No "a href" links contained in the resulting file are altered, so only anchor links will work in the HTML or PDF files (in a combined HTML file, duplicate anchor names could be problematic). All other relative links will be dead. Absolute URL links should still be live.

Intent: DownloadAll.py is one of several tools for downloading and processing multipage HTML documents for either printout and/or loading onto a tablet reader. This follows the UNIX/LINUX philosophy of creating and using small single-purpose applications instead of large catchall ones. The entire collection of these tools is intended for personal "fair use" of online materials and must not be used for other purposes.

Usage:

DownloadAll.py [-h] [-o [OUTFILE]] [-s] [-p] url2open [sitepath]

Downloads all the pages from the links at the entered URL ("parent page").
Pages are saved as either HTML or PDF files in the directory/folder in which
DownloadAll is invoked. Filenames are derived from the source HTML page names
(http://.../filename.html). Files by the same name in the current directory
will be overwritten.

positional arguments:

url2open            Enter the full URL of the parent page

sitepath            Optional. If omitted, the site url path of the parent
                    page will be appended to relative links

optional arguments:

-h, --help          show this help message and exit

-o [OUTFILE], --outfile [OUTFILE]

                    Optional output file name (default is download.html).
                    Usage: --outfile=filename
					
-s, --separate      Optional flag. If present, pages are saved to separate
                    files instead of appended to one combined one. File
                    names are taken from each downloaded page
                    (http://.../filename.html).
					
-p, --pdf           Optional flag. If present, pages are saved as separate
                    PDF files (http://.../filename.html saved as
                    filename.pdf). The -o and -s options have no effect
                    when -p is used. Since external programs exist to
                    combine pdf files, there is no option here to save a
                    combined PDF file.

About

A python script that runs from the command line to download all pages linked to from a given URL. Saves the content of <body> tags on the pages as either HTML (as separate files or one combined file) or PDF. Images and other non-text objects are not included.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages