Render websites as ebooks so they are easier to read on devices.
pip install colusa
For development:
pip install -e ".[all]"
First, generate a configuration file for colusa to work with.
colusa has a built-in command that generates a template configuration as a starting point.
Run the following command to generate the configuration file:
$ colusa init new_ebook.json
colusa will generate a configuration file like the one below:
{
"title": "__fill the title__",
"author": "__fill the author__",
"version": "v1.0",
"homepage": "__fill url to home page__",
"output_dir": "__fill output dir__",
"urls": []
}
Edit the configuration file and fill in valid values for each field.
You can add URLs to the config manually, or use the add-url command to append them from the terminal:
# Append a plain URL
$ colusa add-url new_ebook.json https://fs.blog/2018/04/first-principles/
# Auto-fetch the page title
$ colusa add-url new_ebook.json https://fs.blog/2018/04/first-principles/ --fetch-title
# Supply metadata explicitly
$ colusa add-url new_ebook.json https://fs.blog/2018/04/first-principles/ \
--title "First Principles" --author "Farnam Street" --published "2018-04-01"
# For multi-part books, specify the target part
$ colusa add-url new_ebook.json https://fs.blog/2018/04/first-principles/ --part "Part 1"
Alternatively, edit the config file directly and add URLs to the urls field.
Example of a final configuration:
{
"title": "The Great Mental Models",
"author": "Farnam Street Media Inc",
"version": "v1.0",
"homepage": "https://fs.blog",
"output_dir": "fsblog",
"urls": [
"https://fs.blog/2018/04/first-principles/",
"https://fs.blog/2016/04/second-order-thinking/",
"https://fs.blog/2017/06/thought-experiment/",
"https://fs.blog/2018/05/probabilistic-thinking/",
"https://fs.blog/2019/12/survivorship-bias/"
]
}
To change the ebook's content, add or remove entries in urls; the resulting ebook changes accordingly.
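For scripted edits, the urls list can also be changed programmatically. The following is a minimal sketch, not part of colusa: `add_url` and `remove_url` are hypothetical helper names, and the sketch assumes every entry in urls is a plain string, as in the example configuration above.

```python
import json
from pathlib import Path


def _load(path):
    """Read the book config (a JSON object) from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def _save(path, config):
    """Write the book config back to disk as indented JSON."""
    text = json.dumps(config, indent=2, ensure_ascii=False) + "\n"
    Path(path).write_text(text, encoding="utf-8")


def add_url(config_path, url):
    """Append url to "urls" unless already present. Returns True if added."""
    config = _load(config_path)
    urls = config.setdefault("urls", [])
    if url in urls:
        return False
    urls.append(url)
    _save(config_path, config)
    return True


def remove_url(config_path, url):
    """Remove url from "urls" if present. Returns True if removed."""
    config = _load(config_path)
    urls = config.get("urls", [])
    if url not in urls:
        return False
    urls.remove(url)
    _save(config_path, config)
    return True
```

For example, `add_url("new_ebook.json", "https://fs.blog/2019/12/survivorship-bias/")` would append that URL once and leave the file untouched on a repeated call.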
After adding or removing URLs in urls, run colusa again to regenerate the ebook content. Run the following command in a terminal:
$ colusa generate new_ebook.json
This command downloads the web pages listed in urls, parses them, transforms them to AsciiDoc format, and saves them to output_dir. colusa also creates the necessary information for compiling the ebook in later steps.
If asciidoctor tools are installed, you can compile the ebook in one step:
# Generate AsciiDoc and immediately compile to EPUB
$ colusa generate new_ebook.json --build epub
# Build multiple formats at once
$ colusa generate new_ebook.json --build epub --build html
# Compile an already-generated book (without re-downloading)
$ colusa build new_ebook.json --format epub
$ colusa build new_ebook.json # builds html, epub, and pdf
colusa uses asciidoctor, asciidoctor-epub3, and asciidoctor-pdf. Install them from:
- HTML: https://asciidoctor.org
- EPUB: https://asciidoctor.org/docs/asciidoctor-epub3/
- PDF: https://asciidoctor.org/docs/asciidoctor-pdf/
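Before building, it can help to check that the required tools are on PATH. The following is a minimal sketch, not part of colusa: `missing_tools` is a hypothetical helper, and the mapping from format to executable is an assumption based on the tool names listed above.

```python
import shutil

# Assumed mapping from output format to the executable that builds it.
TOOLS = {
    "html": "asciidoctor",
    "epub": "asciidoctor-epub3",
    "pdf": "asciidoctor-pdf",
}


def missing_tools(formats, which=shutil.which):
    """Return the executables needed for `formats` that are not on PATH.

    `which` is injectable so the check can be exercised without the
    tools installed; by default it is shutil.which.
    """
    return [TOOLS[fmt] for fmt in formats if which(TOOLS[fmt]) is None]


if __name__ == "__main__":
    missing = missing_tools(["html", "epub", "pdf"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All asciidoctor tools found.")
```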
Use --dry-run to preview which extractor and transformer would be selected for each URL, without downloading anything or writing any files:
$ colusa generate new_ebook.json --dry-run
Example output:
[dry-run] Config: new_ebook.json
[dry-run] Output dir: fsblog/
[dry-run] Total URLs: 3
[1/3] https://staffeng.com/guides/overview
Extractor : StaffEng (plugin)
Transformer: StaffEng (plugin)
[2/3] https://medium.com/@user/some-article
Extractor : Medium (plugin)
Transformer: Transformer (base)
[3/3] https://unknown-site.com/article
Extractor : Extractor (base)
Transformer: Transformer (base)
If a website is not in the supported list, you can define CSS-selector-based parsing rules directly in your book config without touching colusa's code. Dynamic rules are evaluated before built-in plugins; the first matching rule wins.
Add a site_rules list to your config:
{
"title": "My Ebook",
"author": "Me",
"version": "v1.0",
"homepage": "https://example.com",
"output_dir": "my_ebook",
"site_rules": [
{
"pattern": "//example.com",
"content": "article.post-body",
"title": "h1.article-title",
"author": ".author-name",
"published": "time.publish-date",
"cleanup": ["div.ads", "nav.sidebar"]
}
],
"urls": [
"https://example.com/some-article"
]
}
| Field | Description |
|---|---|
| pattern | Regex matched against the full URL (e.g. //example.com) |
| content | CSS selector for the article body. Falls back to built-in detection if omitted or not found |
| title | CSS selector for the article title. Falls back to built-in defaults if omitted or not found |
| author | CSS selector for the author name. Falls back to built-in defaults if omitted or not found |
| published | CSS selector for the publish date. Falls back to built-in defaults if omitted or not found |
| cleanup | List of CSS selectors; matching elements are removed from the extracted content |
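The "first matching rule wins" behaviour can be sketched as follows. This illustrates the described semantics and is not colusa's actual code: `select_rule` is a hypothetical name, and the use of `re.search` (a match anywhere in the URL) is an assumption inferred from the `//example.com` example pattern, which does not start at the beginning of the URL.

```python
import re


def select_rule(url, site_rules):
    """Return the first rule whose "pattern" regex matches the URL.

    Rules are tried in list order, so earlier rules take priority.
    Returns None when nothing matches; a caller would then fall back
    to the built-in plugins.
    """
    for rule in site_rules:
        if re.search(rule["pattern"], url):
            return rule
    return None


rules = [
    {"pattern": "//example.com", "content": "article.post-body"},
    {"pattern": "example", "content": "div.generic"},
]
match = select_rule("https://example.com/some-article", rules)
print(match["content"] if match else "no dynamic rule")
```

Because the pattern is a regex, an unescaped `.` matches any character. Where an exact host match matters, escaping the dot (for example `//example\.com`) is the safer choice.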
Rules can also be kept in a separate YAML or JSON file and shared across multiple book configs:
{
"site_rules_file": "./my-sites.yml"
}
my-sites.yml:
- pattern: "//example.com"
  content: article.post-body
  title: h1.article-title
  cleanup:
    - div.ads
    - nav.sidebar
Inline site_rules and site_rules_file are merged; inline rules are checked first. Relative paths in site_rules_file are resolved from the directory containing the config file.
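The merge order and path resolution described above can be sketched like this. It is an illustration under stated assumptions, not colusa's implementation: `load_site_rules` is a hypothetical name, and the shared rules file is assumed to be JSON so the sketch needs only the standard library (a YAML file would need a YAML parser instead).

```python
import json
from pathlib import Path


def load_site_rules(config_path):
    """Return inline rules followed by rules from site_rules_file."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))

    # Inline rules come first, so they are checked first.
    rules = list(config.get("site_rules", []))

    rules_file = config.get("site_rules_file")
    if rules_file:
        rules_path = Path(rules_file)
        # Relative paths are resolved from the config file's directory,
        # not from the current working directory.
        if not rules_path.is_absolute():
            rules_path = config_path.parent / rules_path
        rules.extend(json.loads(rules_path.read_text(encoding="utf-8")))
    return rules
```

Resolving against the config's directory means the same book config works regardless of where the command is run from.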
Before generating the ebook, install the asciidoctor tools. Follow the installation guides on these websites:
- for generating html: https://asciidoctor.org
- for generating epub: https://asciidoctor.org/docs/asciidoctor-epub3/
- for generating pdf: https://asciidoctor.org/docs/asciidoctor-pdf/
To help generate the ebook, colusa also creates a Makefile in the ebook's root folder. The Makefile has three common targets for generating the ebook in HTML, EPUB, and PDF formats.
# to generate html
$ make html
# to generate epub
$ make epub
# to generate pdf
$ make pdf
Generated ebooks are saved to the ./output folder.
user:output/ $ ls [10:55:55]
total 3056
drwxr-xr-x 10 320B images
-rw-r--r-- 1 699K index.epub
-rw-r--r-- 1 131K index.html
-rw-r--r--@ 1 694K index.pdf
Currently colusa has built-in support for the following websites. Any other site can be handled using dynamic site rules.
- https://untools.co
- https://unintendedconsequenc.es
- https://blog.acolyer.org
- https://fs.blog
- https://increment.com
- https://slack.engineering
- https://medium.com
- https://www.cs.rutgers.edu/~pxk/
- https://www.preethikasireddy.com
- https://engineering.atspotify.com
- https://truyenfull.vn
- https://avikdas.com
- https://www.infoq.com
Contributions are welcome. You can open issues to request support for more websites, open PRs to help with those issues, or contribute in other ways, such as documentation or code.