python_LEARN/python/chapter01/link_crawler1.py at master · LemonLighter/python_LEARN


import re
from urllib.parse import urljoin

from common import download


def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL, following links matched by link_regex."""
    crawl_queue = [seed_url]  # the queue of URLs to download
    seen = {seed_url}  # URLs already queued, so no page is downloaded twice
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue  # assumes download() returns None when a request fails
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                # resolve relative links against the page they were found on
                link = urljoin(url, link)
                if link not in seen:
                    seen.add(link)
                    # add this link to the crawl queue
                    crawl_queue.append(link)


def get_links(html):
    """Return a list of links from html."""
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


if __name__ == '__main__':
    link_crawler('http://example.webscraping.com', '/(index|view)')
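
The link filtering step can be tried without touching the network. The following is a minimal sketch, not part of the original file: `SAMPLE_HTML` and the helper `matching_links` are hypothetical names introduced for illustration, and the sketch assumes Python 3 (`urllib.parse`). It reuses the same link-extraction regular expression and shows how matched relative links could be resolved to absolute URLs before being queued.

```python
import re
from urllib.parse import urljoin

# Hypothetical sample page; not taken from the crawled site.
SAMPLE_HTML = '''
<a href="/index/1">Next</a>
<A HREF='/view/Example-1'>Example</A>
<a class="nav" href="/user/login">Log in</a>
'''

# Same pattern as get_links() in the script above.
WEBPAGE_REGEX = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)


def matching_links(base_url, html, link_regex):
    """Return absolute URLs for links in html whose href matches link_regex."""
    links = WEBPAGE_REGEX.findall(html)
    # re.match anchors at the start of the href, so '/user/login' is rejected
    return [urljoin(base_url, link) for link in links
            if re.match(link_regex, link)]


if __name__ == '__main__':
    for link in matching_links('http://example.webscraping.com',
                               SAMPLE_HTML, '/(index|view)'):
        print(link)
```

Only the `/index/1` and `/view/Example-1` links pass the filter; the login link is dropped because its path does not start with `/index` or `/view`.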
