Commit f2f8c62 · Update README.md

Web Crawler/README.md
Something like this.<br/>
</ul>
<br/>
<h2>Constraints</h2>
<ul>
<li>Traffic is not evenly distributed</li>
<content>May have some popular searches</content>
<br/>

<li>Need to have low latency</li>
<content>Can we compromise on consistency?</content>
<br/>

<li>Need to detect cycles</li>
<br/>

<li>Pages need to be crawled regularly to ensure freshness</li>
<content>On average, once per week</content>
</ul>
<br/>
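The cycle-detection constraint is usually met by normalizing URLs and keeping a set of ones already seen, so the same page reached by two spellings is crawled once. A minimal sketch — the function names and normalization rules here are illustrative assumptions, not part of the design above:

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit

def normalize(url):
    """Canonicalize a URL so trivially different spellings of the
    same page do not create a crawl cycle."""
    url, _fragment = urldefrag(url)          # drop "#fragment"
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),                # hostnames are case-insensitive
        parts.path.rstrip("/") or "/",       # treat /a and /a/ as one page
        parts.query,
        "",
    ))

seen = set()

def should_crawl(url):
    """True only the first time a (normalized) URL is offered."""
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this, `should_crawl("http://Example.com/a/")` succeeds once, and the later variants `http://example.com/a` or `http://example.com/a#top` are rejected, breaking link cycles.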
<h2>Scale</h2>
<ul>
<li>1 billion links to crawl</li>
<br/>

<li>100 billion searches per month</li>
</ul>
<br/>
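These figures translate into rough request rates; a quick back-of-envelope check (assuming a 30-day month and the once-a-week recrawl target from the constraints):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600           # 2,592,000

searches_per_month = 100_000_000_000
search_qps = searches_per_month / SECONDS_PER_MONTH   # ~38,580 queries/sec

links = 1_000_000_000
SECONDS_PER_WEEK = 7 * 24 * 3600             # 604,800
recrawl_rate = links / SECONDS_PER_WEEK      # ~1,653 pages/sec to stay fresh
```

So the search path must sustain tens of thousands of queries per second, while the crawler needs to fetch on the order of a couple thousand pages per second just to keep every page under a week old.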
<h2>High Level Design</h2>
<img src="img/HighLevelArchitecture.PNG" />
<br/>

<h2>Individual component design</h2>
<h3>Web crawler</h3>
<img src="img/WebCrawler Component.PNG" />
<br/>
<h3>Class Design</h3>
<pre><code>
from datetime import datetime

class Page(object):
    def __init__(self, url, title):
        self.title = title
        self.url = url
        self.timeStamp = datetime.now()  # time of the last crawl
        self.childUrls = []              # outgoing links found on this page
</code></pre>
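The Page class can drive a simple breadth-first crawl of the frontier. A self-contained sketch, with a stubbed site standing in for real HTTP fetching and HTML parsing (the `SITE` dict and `crawl` helper are assumptions for illustration):

```python
from collections import deque
from datetime import datetime

class Page(object):
    def __init__(self, url, title):
        self.title = title
        self.url = url
        self.timeStamp = datetime.now()
        self.childUrls = []

# Stub: a real crawler would fetch each URL and parse out title/links.
SITE = {
    "a": ("Page A", ["b", "c"]),
    "b": ("Page B", ["a"]),      # back-link: would loop without cycle detection
    "c": ("Page C", []),
}

def crawl(seed):
    pages, frontier, seen = {}, deque([seed]), {seed}
    while frontier:
        url = frontier.popleft()
        title, links = SITE[url]
        page = Page(url, title)
        page.childUrls = list(links)
        pages[url] = page
        for child in links:
            if child not in seen:     # duplicate/cycle check
                seen.add(child)
                frontier.append(child)
    return pages
```

`crawl("a")` visits each page exactly once even though "b" links back to "a" — the `seen` set is what enforces the cycle-detection constraint above.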
<h2>Determining when to update the crawl results</h2>
<p>We can have another microservice that periodically re-crawls the pages, updating their timeStamp.<br/>
This service can update both the pages and indexes databases.
</p>
<br/>
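Under the once-a-week freshness target, that service only has to find pages whose timeStamp is older than seven days and queue them for recrawl. A sketch — the field name matches the Page class above, but the helper is an assumption:

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(weeks=1)   # freshness target from the constraints

def stale_pages(pages, now=None):
    """Return the pages whose last crawl is older than MAX_AGE."""
    now = now or datetime.now()
    return [p for p in pages if now - p.timeStamp > MAX_AGE]
```

The refresh service would run this query against the pages database on a schedule and feed the result back into the crawler's frontier.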
<h2>User inputs a search term and sees a list of relevant pages with titles and snippets</h2>
<img src="img/ClientServerInteraction.PNG" />
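In the flow shown in the diagram, the search service looks the term up in the indexes database and returns titles plus short snippets. A toy in-memory version — the index shape and the snippet rule here are assumptions, not the real service:

```python
def snippet(text, term, width=30):
    """A short excerpt of the page text starting near the search term."""
    i = text.lower().find(term.lower())
    if i < 0:
        return text[:width]
    start = max(0, i - width // 2)
    return text[start:start + width]

def search(term, index, pages):
    """index maps term -> list of URLs; pages maps URL -> (title, text)."""
    results = []
    for url in index.get(term.lower(), []):
        title, text = pages[url]
        results.append({"url": url, "title": title,
                        "snippet": snippet(text, term)})
    return results
```

A production version would shard the index, rank the URL list, and cache popular terms (the "popular searches" constraint above), but the request/response shape is the same.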
