Requirements
API
/crawl
/search
Estimation
Storage
Assume 1 billion web pages are downloaded every month and each page is 500KB. So 1 billion * 500KB = 500TB/month; 500TB * 12 = 6,000TB ≈ 6PB/year and about 30PB over 5 years.
1B pages / (30 days * 86,400 secs/day) ≈ 400 pages/sec
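A quick sketch of the same back-of-the-envelope arithmetic in Python; the 1B-pages-per-month and 500KB-per-page figures are the assumptions stated above:

```python
PAGES_PER_MONTH = 1_000_000_000      # assumed: 1B pages downloaded per month
PAGE_SIZE_BYTES = 500_000            # assumed: 500KB average page size

monthly_storage_tb = PAGES_PER_MONTH * PAGE_SIZE_BYTES / 1e12   # ~500 TB/month
yearly_storage_pb = monthly_storage_tb * 12 / 1000              # ~6 PB/year
five_year_storage_pb = yearly_storage_pb * 5                    # ~30 PB over 5 years

pages_per_sec = PAGES_PER_MONTH / (30 * 86_400)                 # ~385, rounded to ~400 pages/sec

print(monthly_storage_tb, yearly_storage_pb, five_year_storage_pb, pages_per_sec)
```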
Database
Design
- Search Engine
- Web archiving
- Web mining
- Web monitoring
DFS vs BFS for Crawling
- We can crawl sub-URLs by DFS (depth-first search) or BFS (breadth-first search), starting from the root page and traversing inward. BFS with a queue is recommended here so that we keep adding newly discovered URLs to the queue and process them FIFO (first in, first out), as sketched below.
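A minimal BFS crawl loop in Python. The `fetch_page(url)` and `extract_links(html)` helpers are assumed to be provided by the fetcher and extractor components described later:

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
    """Breadth-first crawl: URLs are processed in FIFO order starting from the seeds."""
    queue = deque(seed_urls)      # FIFO frontier
    visited = set(seed_urls)      # avoid re-crawling the same URL
    crawled = 0

    while queue and crawled < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        crawled += 1
        for link in extract_links(html):
            if link not in visited:
                visited.add(link)
                queue.append(link)   # newly discovered URLs go to the back of the queue
```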
Seed URL
A web crawler starts with a list of URLs to visit, called the seed URLs.
Crawl Frontier
Ensures politeness and prioritization.
Sending too many requests to a host is considered impolite and can even look like a DoS attack. The general idea is to download one page at a time from the same host, with a delay added between two downloads. Each downloader consumes from only one partition (as in Kafka) and waits between two messages.
Respect the Crawl-delay directive in robots.txt. If it is absent, subsequent requests to the same domain should be spaced out by a conservative number of seconds (e.g. 15 seconds) so the crawler doesn't put unexpected load on websites.
URLs are prioritized based on their usefulness (e.g. Google PageRank), and higher-priority URLs are consumed first. Larry Page once described the perfect search engine as one that understands exactly what you mean and gives you back exactly what you want.
Front queues are for prioritization and back queues are for politeness, as sketched below.
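A sketch of a two-stage frontier in Python: a priority heap stands in for the front queues and per-host FIFO queues stand in for the back queues. The 15-second default delay and the priority values are assumptions, not fixed requirements:

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class CrawlFrontier:
    """Front queues order URLs by priority; back queues enforce per-host politeness."""

    def __init__(self, delay_seconds=15):
        self.front = []                          # (priority, url) min-heap
        self.back = defaultdict(deque)           # host -> FIFO of URLs
        self.next_allowed = defaultdict(float)   # host -> earliest next fetch time
        self.delay = delay_seconds

    def add(self, url, priority):
        heapq.heappush(self.front, (priority, url))

    def _refill_back_queues(self):
        # Drain the front heap in priority order into per-host back queues.
        while self.front:
            _, url = heapq.heappop(self.front)
            self.back[urlparse(url).netloc].append(url)

    def next_url(self):
        """Return a URL whose host is past its politeness delay, or None."""
        self._refill_back_queues()
        now = time.time()
        for host, urls in self.back.items():
            if urls and now >= self.next_allowed[host]:
                self.next_allowed[host] = now + self.delay
                return urls.popleft()
        return None
```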
URL Fetcher
Given a URL from the URL Frontier, resolve DNS to get the IP address and pass it to the URL Renderer. Here we also validate against robots.txt so that parts of a website disallowed by robots.txt are not crawled (see the sketch below).
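A minimal robots.txt check using Python's standard `urllib.robotparser`. The `MyCrawler` user-agent name is an assumption; a production fetcher would also cache the parsed robots.txt per host instead of refetching it for every URL:

```python
from urllib import robotparser
from urllib.parse import urlparse

def is_allowed(url, user_agent="MyCrawler"):
    """Return True if robots.txt permits fetching this URL."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()                                   # download and parse robots.txt
    # rp.crawl_delay(user_agent) can also be consulted to honor Crawl-delay.
    return rp.can_fetch(user_agent, url)
```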
Custom DNS Server
DNS name resolution will be a big bottleneck for our crawler given the number of URLs we are working with. To avoid repeated requests, we can cache DNS results by running our own local DNS server.
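A minimal in-process stand-in for that DNS cache, assuming a simple hostname-to-IP map; a real deployment would add TTL-based expiry and run as a shared local DNS service:

```python
import socket

_dns_cache = {}  # hostname -> resolved IP address

def resolve(hostname):
    """Resolve a hostname, caching results locally to avoid repeated DNS lookups."""
    if hostname not in _dns_cache:
        _dns_cache[hostname] = socket.gethostbyname(hostname)
    return _dns_cache[hostname]
```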
URL Renderer
The URL Renderer downloads the web document corresponding to a given URL using the appropriate network protocol, such as HTTP (here we can also discuss whether other protocols like FTP are needed). Instead of downloading a document multiple times, we cache the document locally.
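A sketch of an HTTP download with a local on-disk cache, using only the standard library. The `./page_cache` directory and 10-second timeout are assumptions; a real renderer would also handle redirects, retries, and other protocols:

```python
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path("./page_cache")   # assumed local document cache location
CACHE_DIR.mkdir(exist_ok=True)

def download(url):
    """Fetch a document over HTTP, caching the raw bytes so the same URL is not downloaded twice."""
    key = hashlib.sha256(url.encode()).hexdigest()
    cached = CACHE_DIR / key
    if cached.exists():
        return cached.read_bytes()
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
    cached.write_bytes(body)
    return body
```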
Content Extractor (Document Stream Reader)
We can use an input stream reader that reads the entire content of the document downloaded by the URL Renderer and provides a way to re-read the same document for a given URL. The Content Extractor extracts the content and passes it (via RabbitMQ or Kafka) to Content Filter Detection and URL Filter Detection.
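A minimal sketch of such a re-readable document stream, keeping raw bytes in memory keyed by URL; publishing the extracted content to RabbitMQ or Kafka is left out here and represented only by the stored bytes:

```python
import io

class DocumentStreamReader:
    """Caches downloaded documents so downstream components can re-read them by URL."""

    def __init__(self):
        self._docs = {}   # url -> raw document bytes

    def put(self, url, raw_bytes):
        self._docs[url] = raw_bytes

    def open(self, url):
        """Return a fresh stream over the cached document; can be called repeatedly."""
        return io.BytesIO(self._docs[url])
```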
Content Filter Detection (Content De-dupe)
The content filter detects duplicate content and also filters sensitive content that we do not want to store in our data store. Potential similarity algorithms are the Jaccard index and cosine similarity. Here we only filter text content from the document.
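A sketch of Jaccard-based near-duplicate detection over word shingles; the shingle size of 5 and the 0.9 similarity threshold are assumptions to tune:

```python
def shingles(text, k=5):
    """Split text into overlapping word k-grams (shingles)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(text_a, text_b):
    """Jaccard index of two documents' shingle sets; close to 1.0 means near-duplicate."""
    a, b = shingles(text_a), shingles(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_duplicate(text_a, text_b, threshold=0.9):
    # Assumed policy: treat documents above the threshold as duplicates.
    return jaccard(text_a, text_b) >= threshold
```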
URL Filter Detection (URL De-dupe)
Similar to content filtering, the URL filter de-duplicates and filters the URLs extracted from a given document. We need to be careful that the web crawler doesn't get stuck in an infinite loop, which happens when the link graph contains a cycle. This component can also be used to blacklist URLs.
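A sketch of URL de-duplication with light normalization plus a host blacklist; the `spam.example.com` entry is just an illustrative placeholder:

```python
from urllib.parse import urlparse, urlunparse

seen_urls = set()                      # already-scheduled URLs
blacklist = {"spam.example.com"}       # assumed example of a blacklisted host

def normalize(url):
    """Normalize a URL so trivially different forms de-duplicate to one entry."""
    p = urlparse(url)
    path = p.path.rstrip("/") or "/"
    return urlunparse((p.scheme.lower(), p.netloc.lower(), path, "", p.query, ""))

def should_crawl(url):
    """Skip blacklisted hosts and already-seen URLs; this also breaks link cycles."""
    u = normalize(url)
    if urlparse(u).netloc in blacklist:
        return False
    if u in seen_urls:
        return False
    seen_urls.add(u)
    return True
```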
Reverse Index Search
Based on the document content, the reverse (inverted) index removes markup, breaks the text into terms, and normalizes them; queries are converted to boolean operations over those terms. Given a text query, the index can then be used to retrieve the matching pages and snippets.
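A minimal inverted index with a boolean AND query, as a sketch of the idea; snippet generation and ranking are omitted:

```python
import re
from collections import defaultdict

index = defaultdict(set)   # term -> set of document ids

def tokenize(text):
    """Strip markup, lowercase, and split into terms."""
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML tags
    return re.findall(r"[a-z0-9]+", text.lower())

def add_document(doc_id, text):
    for term in tokenize(text):
        index[term].add(doc_id)

def search(query):
    """Boolean AND over the query terms: ids of documents containing all of them."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = index[terms[0]].copy()
    for term in terms[1:]:
        result &= index[term]
    return result
```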