Design a web crawler

Use Case

  • Given a URL, the application crawls it
    • What do we mean by crawling?
    • The crawler fetches the given page, collects every link on that page, and then follows those links in turn.

  • Reverse-indexes pages based on keywords
    • What is reverse indexing?
    • While crawling a page, the crawler extracts the keywords on that page and stores them.
      These keywords are then used to find pages when a user performs a search.

  • Generates the title of the page and a small snippet of the page.

  • A user searches for a term and gets a list of pages containing that term.

  • The system needs to be highly available.
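
The crawl step described above can be sketched as a breadth-first traversal with a visited set (a minimal sketch; `fetch_links` is a hypothetical stand-in for an HTTP fetch plus HTML link extraction):

```python
from collections import deque

def crawl(seed_url, fetch_links, max_pages=1000):
    """Breadth-first crawl starting from seed_url.

    fetch_links(url) is assumed to return the list of links on that page.
    The visited set ensures no URL is crawled twice, which also breaks
    cycles in the link graph.
    """
    visited = {seed_url}
    queue = deque([seed_url])
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        links = fetch_links(url)
        pages.append((url, links))
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return pages
```

In a real system the queue and visited set would live in shared datastores so many crawler workers can cooperate.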

Constraints

  • Traffic is not evenly distributed
  • Some searches may be much more popular than others
  • Need to have low latency
  • Can we compromise on consistency?
  • Need to detect cycles (pages that link back to already-visited pages)

  • Pages need to be crawled regularly to ensure freshness
  • On average, once per week
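
Cycle detection depends on recognizing the same page even when its URL is written in slightly different forms. A minimal normalization sketch (assumptions: lowercasing scheme and host, dropping fragments, and stripping a trailing slash are acceptable canonicalization rules; real crawlers apply many more):

```python
from urllib.parse import urlsplit

def normalize_url(url):
    """Canonicalize a URL so trivially different forms compare equal.

    Lowercases the scheme and host, drops the fragment, and strips a
    trailing slash from the path before the URL goes into the visited set.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return "%s://%s%s%s" % (
        parts.scheme.lower(),
        parts.netloc.lower(),
        path,
        "?" + parts.query if parts.query else "",
    )
```

With this in place, the crawler marks a URL as visited only after normalizing it, so `http://Example.com/a/` and `http://example.com/a` count as one page.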

Scale

  • 1 billion links to crawl

  • 100 billion searches per month
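
The search volume translates into queries per second with a quick back-of-envelope calculation (assuming a 30-day month):

```python
searches_per_month = 100 * 10**9
seconds_per_month = 30 * 24 * 3600  # assuming a 30-day month
qps = searches_per_month / seconds_per_month  # roughly 38,600 QPS on average
```

Since traffic is not evenly distributed, the system should be provisioned for a peak well above this average.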

High Level design


Individual component design

Web crawler


Class Design


from datetime import datetime

class Page(object):
    def __init__(self, url, title):
        self.title = title
        self.url = url
        # When the page was last crawled; used to decide when to re-crawl
        self.timeStamp = datetime.now()
        self.childUrls = []  # links found on this page
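
The reverse index that sits alongside these Page objects can be as simple as a mapping from keyword to the URLs that contain it (a minimal in-memory sketch; a real system would shard this index across machines):

```python
from collections import defaultdict

def build_reverse_index(pages_keywords):
    """Map each keyword to the set of URLs whose page contains it.

    pages_keywords: iterable of (url, [keyword, ...]) pairs, where the
    keyword lists are assumed to come from the crawler's extraction step.
    """
    index = defaultdict(set)
    for url, keywords in pages_keywords:
        for word in keywords:
            index[word.lower()].add(url)
    return index
```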

Determining when to update the crawl results

We can have another microservice that periodically re-crawls all the pages, updating each page's timeStamp.
This service can update both the pages database and the indexes database.
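
That refresh service could select stale pages by comparing each timeStamp against the once-per-week freshness target (a minimal sketch; the `(url, last_crawled)` pairs are assumed to come from the pages database):

```python
from datetime import datetime, timedelta

def find_stale_urls(pages, now=None, max_age=timedelta(weeks=1)):
    """Return URLs of pages whose last crawl is older than max_age.

    pages: iterable of (url, last_crawled_datetime) pairs.
    """
    now = now or datetime.now()
    return [url for url, ts in pages if now - ts > max_age]
```

The stale URLs would then be fed back into the crawler's queue, refreshing both the page records and the reverse index.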


User inputs a search term and sees a list of relevant pages with titles and snippets
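
This flow reduces to a lookup in the reverse index followed by fetching each hit's title and snippet (a minimal sketch; the in-memory `index` and `pages` dicts stand in for the indexes and pages databases):

```python
def search(term, index, pages):
    """Look up a term and return (title, snippet) pairs for matching pages.

    index: dict mapping keyword -> set of URLs.
    pages: dict mapping URL -> (title, full_text).
    """
    results = []
    for url in index.get(term.lower(), ()):
        title, text = pages[url]
        results.append((title, text[:100]))  # first 100 chars as the snippet
    return results
```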