Design a web crawler

Use Case

  • Given a URL, the application crawls it
    • What do we mean by crawling?
    • The crawler fetches the given page, collects every link on that page, and then follows those links in turn.

  • Reverse-indexes pages based on keywords
    • What is reverse indexing?
    • While crawling a page, the crawler extracts the keywords on that page and stores them.
      These keywords are then used to find pages when a user performs a search.

  • Generates the title of the page and a small snippet of the page.

  • A user searches for a term and gets a list of pages containing that term.

  • The system needs to be highly available.
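
The crawl step described above can be sketched as a breadth-first traversal with a visited set (a minimal sketch; `fetch_links` is a hypothetical stand-in for an HTTP fetch plus HTML link extraction):

```python
from collections import deque

def crawl(seed_url, fetch_links, max_pages=1000):
    """Breadth-first crawl starting from seed_url.

    fetch_links(url) is assumed to return the list of links on that page.
    The visited set ensures no URL is crawled twice, which also breaks
    cycles in the link graph.
    """
    visited = {seed_url}
    queue = deque([seed_url])
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        links = fetch_links(url)
        pages.append((url, links))
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return pages
```

In a real system the queue and visited set would live in shared datastores so many crawler workers can cooperate.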

Constraints

  • Traffic is not evenly distributed
  • Some searches may be much more popular than others
  • Need to have low latency
  • Can we compromise on consistency?
  • Need to detect cycles (pages that link back to already-visited pages)

  • Pages need to be crawled regularly to ensure freshness
  • On average, once per week
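
Cycle detection depends on recognizing the same page even when its URL is written in slightly different forms. A minimal normalization sketch (assumptions: lowercasing scheme and host, dropping fragments, and stripping a trailing slash are acceptable canonicalization rules; real crawlers apply many more):

```python
from urllib.parse import urlsplit

def normalize_url(url):
    """Canonicalize a URL so trivially different forms compare equal.

    Lowercases the scheme and host, drops the fragment, and strips a
    trailing slash from the path before the URL goes into the visited set.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return "%s://%s%s%s" % (
        parts.scheme.lower(),
        parts.netloc.lower(),
        path,
        "?" + parts.query if parts.query else "",
    )
```

With this in place, the crawler marks a URL as visited only after normalizing it, so `http://Example.com/a/` and `http://example.com/a` count as one page.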

Scale

  • 1 billion links to crawl

  • 100 billion searches per month
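
The search volume translates into queries per second with a quick back-of-envelope calculation (assuming a 30-day month):

```python
searches_per_month = 100 * 10**9
seconds_per_month = 30 * 24 * 3600  # assuming a 30-day month
qps = searches_per_month / seconds_per_month  # roughly 38,600 QPS on average
```

Since traffic is not evenly distributed, the system should be provisioned for a peak well above this average.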

High Level design


Individual component design

Web crawler


Class Design


from datetime import datetime

class Page(object):
    def __init__(self, url, title):
        self.title = title
        self.url = url
        # When the page was last crawled; used to decide when to re-crawl
        self.timeStamp = datetime.now()
        self.childUrls = []  # links found on this page
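
The reverse index that sits alongside these Page objects can be as simple as a mapping from keyword to the URLs that contain it (a minimal in-memory sketch; a real system would shard this index across machines):

```python
from collections import defaultdict

def build_reverse_index(pages_keywords):
    """Map each keyword to the set of URLs whose page contains it.

    pages_keywords: iterable of (url, [keyword, ...]) pairs, where the
    keyword lists are assumed to come from the crawler's extraction step.
    """
    index = defaultdict(set)
    for url, keywords in pages_keywords:
        for word in keywords:
            index[word.lower()].add(url)
    return index
```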

Determining when to update the crawl results

We can have another microservice that periodically re-crawls all the pages, updating each page's timeStamp.
This service can update both the pages database and the indexes database.
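
That refresh service could select stale pages by comparing each timeStamp against the once-per-week freshness target (a minimal sketch; the `(url, last_crawled)` pairs are assumed to come from the pages database):

```python
from datetime import datetime, timedelta

def find_stale_urls(pages, now=None, max_age=timedelta(weeks=1)):
    """Return URLs of pages whose last crawl is older than max_age.

    pages: iterable of (url, last_crawled_datetime) pairs.
    """
    now = now or datetime.now()
    return [url for url, ts in pages if now - ts > max_age]
```

The stale URLs would then be fed back into the crawler's queue, refreshing both the page records and the reverse index.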


User inputs a search term and sees a list of relevant pages with titles and snippets
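
This flow reduces to a lookup in the reverse index followed by fetching each hit's title and snippet (a minimal sketch; the in-memory `index` and `pages` dicts stand in for the indexes and pages databases):

```python
def search(term, index, pages):
    """Look up a term and return (title, snippet) pairs for matching pages.

    index: dict mapping keyword -> set of URLs.
    pages: dict mapping URL -> (title, full_text).
    """
    results = []
    for url in index.get(term.lower(), ()):
        title, text = pages[url]
        results.append((title, text[:100]))  # first 100 chars as the snippet
    return results
```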