- Given a URL, the application crawls it
- What do we mean by crawling? The crawler visits the given page, collects every link on it, and then follows each of those links in turn.
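The link-collection step can be sketched with only the Python standard library. The names `LinkCollector` and `collect_links` are illustrative, not part of the original design:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href of every anchor tag on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def collect_links(url, html):
    """Return every link found in the page's HTML, as absolute URLs."""
    parser = LinkCollector(url)
    parser.feed(html)
    return parser.links
```

A real crawler would also fetch the page over HTTP, respect robots.txt, and handle malformed HTML, but the core of the crawl loop is exactly this extract-and-follow step.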
- Builds a reverse index of pages based on keywords
- What is reverse indexing? As the crawler goes through a webpage, it extracts the keywords on that page and stores a mapping from each keyword to the pages that contain it.
- Generates the title of the page and a small snippet of its content, as shown on a typical search results page.
- A user searches for a term and gets back a list of pages containing that term
- The system needs to be highly available.
These keywords are then used to find matching pages when a user performs a search.
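The keyword-to-page mapping described above can be sketched as a minimal in-memory reverse index. The class name and the naive lowercased whitespace tokenizer are assumptions for illustration; a production system would use a proper tokenizer and a distributed store:

```python
from collections import defaultdict

class ReverseIndex:
    """Maps each keyword to the set of URLs whose pages contain it."""
    def __init__(self):
        self.index = defaultdict(set)

    def add_page(self, url, text):
        # Naive tokenization: lowercase and split on whitespace.
        for word in text.lower().split():
            self.index[word].add(url)

    def search(self, term):
        # Return the set of pages containing the term (empty if none).
        return self.index.get(term.lower(), set())
```

Serving a search then reduces to a lookup in this map, which is what makes the low-latency requirement achievable.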
- Traffic is not evenly distributed; some searches may be much more popular than others
- Need to have low latency. Can we compromise on consistency?
- Need to detect cycles
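One common way to detect cycles is to keep a set of already-visited, normalized URLs and skip any link seen before. A minimal BFS sketch, assuming a `fetch_links` callback (a placeholder for the real page-fetching step) that returns the links on a page:

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    # Drop fragments and trailing slashes so trivially different URLs
    # (e.g. http://a.com/ vs http://a.com/#top) count as the same page.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc,
                       parts.path.rstrip("/"), parts.query, ""))

def crawl(seed, fetch_links):
    visited = set()
    queue = deque([seed])
    order = []
    while queue:
        url = normalize(queue.popleft())
        if url in visited:      # cycle or duplicate: skip
            continue
        visited.add(url)
        order.append(url)
        queue.extend(fetch_links(url))
    return order
```

At the scale stated below (1 billion links), the visited set would not fit in one process's memory; a distributed crawler typically shards it, for example by URL hash.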
- Pages need to be crawled regularly to ensure freshness, on average about once per week
- 1 billion links to crawl
- 100 billion searches per month
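These figures translate into a rough load estimate (back-of-the-envelope, assuming a 30-day month):

```python
# Back-of-the-envelope estimates derived from the stated numbers.
SECONDS_PER_MONTH = 30 * 24 * 3600   # ~2.6 million
SECONDS_PER_WEEK = 7 * 24 * 3600     # 604,800

searches_per_month = 100_000_000_000
links_to_crawl = 1_000_000_000       # refreshed roughly once per week

search_qps = searches_per_month / SECONDS_PER_MONTH   # ~40,000 searches/sec
crawl_rate = links_to_crawl / SECONDS_PER_WEEK        # ~1,650 pages/sec
```

So the read path (search) dominates the write path (crawling) by more than an order of magnitude, which supports favoring availability and low read latency over strict consistency.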
from datetime import datetime


class Page(object):

    def __init__(self, url, title):
        self.url = url
        self.title = title
        self.time_stamp = datetime.now()  # when this page was last crawled
        self.child_urls = []              # links collected from this page
We can have another microservice that periodically re-crawls all pages, updating each page's timestamp in the process.
This service can update both the pages database and the indexes database.