Summary

Summary is a Python module that extracts the main content from a web page. The script was originally implemented in Ruby by Nakatani Shuyo. However, his script also extracts noise, such as the comments on a blog entry. While porting it to Python, I improved the precision of the layout block calculation so that this noise is no longer extracted. In addition, I added a function that extracts a plausible, appropriate title even when the HTML document does not follow standard coding conventions. Summary is easy to use: you only need to call one API. Now, let's dive into the web.

Install

I have prepared an easy way to install it. Run the command below in a new terminal.

sudo easy_install "summary==0.1.1"

Usage

Here is how to get a summary of a web page.

import urllib  # Python 2 standard library; provides urlopen()

from summary import extract

uri = 'URL of the web page whose main text you want to extract'
res = extract(urllib.urlopen(uri).read())
print res['title'], '\n' # header (title) of the main content
print res['body'], '\n' # main content
print res['img'], '\n' # candidates for the main image

In addition, you can override is_collection_of_links() and not_body_rate(), two functions of the extractor module, to suit your needs. is_collection_of_links() decides whether a layout block is a collection of links.

# /path/to/any.py
# define a replacement for `is_collection_of_links()`
from summary import extractor

def _is_collection_of_links(block):
    # Return True if this layout block is a collection of links.
    print block
    return False

extractor.is_collection_of_links = _is_collection_of_links
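
As an illustration of what a replacement could compute, here is a minimal sketch of a link-density check. It assumes that `block` is a string of HTML, which this README does not confirm, and the names `link_density`, `is_collection_of_links_by_density`, and the `threshold` value are hypothetical. It is not the heuristic the module itself uses. The code avoids print statements, so it runs under both Python 2 and Python 3.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')
ANCHOR_RE = re.compile(r'<a\b[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def strip_tags(html):
    """Remove HTML tags and collapse runs of whitespace."""
    return ' '.join(TAG_RE.sub(' ', html).split())


def link_density(block):
    """Return the share of the block's visible text that sits inside <a> tags."""
    total = len(strip_tags(block))
    if total == 0:
        return 0.0
    linked = sum(len(strip_tags(inner)) for inner in ANCHOR_RE.findall(block))
    return min(1.0, linked / float(total))


def is_collection_of_links_by_density(block, threshold=0.5):
    """Assumption: `block` is an HTML string. True when most text is link text."""
    return link_density(block) > threshold
```

If the assumption about `block` holds, the function could be plugged in the same way as above, with `extractor.is_collection_of_links = is_collection_of_links_by_density`.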

not_body_rate() returns a rate that indicates how strongly a layout block is judged not to be part of the body.

# /path/to/any.py
# define a replacement for `not_body_rate()`
from summary import extractor

def _not_body_rate(block):
    # Return the rate at which this layout block is judged not to be body text.
    # The return value must be a float or an integer.
    print block
    return 0

extractor.not_body_rate = _not_body_rate
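
A replacement could, for example, score a block by counting typical non-body phrases. The sketch below again assumes that `block` is a string of HTML. The keyword list, the `weight` parameter, and the name `keyword_not_body_rate` are hypothetical, and the scale the extractor expects for the returned value is not stated in this README, so the weight will likely need tuning.

```python
import re

# Hypothetical hint phrases; adjust them for the sites you process.
NOT_BODY_HINTS = ('copyright', 'all rights reserved', 'comments', 'trackback')


def keyword_not_body_rate(block, weight=1.0):
    """Assumption: `block` is an HTML string.

    Returns a number that grows with each hint phrase found in the
    block's text, and 0 when no hint phrase is present.
    """
    text = re.sub(r'<[^>]+>', ' ', block).lower()
    hits = sum(text.count(hint) for hint in NOT_BODY_HINTS)
    return hits * weight
```

Under the same assumption, it could be installed with `extractor.not_body_rate = keyword_not_body_rate`.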

Notes

Copyright of the original implementation

labs.cybozu.co.jp/blog/nakatani/2007/09/web_1.html
