


Question

The lecture discussed the basic architecture of a web crawler as shown in Figure 1. This basic architecture is described in the IR Book and is based on the design of the Mercator crawler, which has formed the basis of a number of research and commercial crawlers. Describe and explain the functionality and necessity of each of the following modules of the architecture:

- The URL Frontier module
- The Fetch module
- The DNS resolution module
- The Parse module
- The module deciding if content has been seen, using document fingerprints
- The URL Filter module that utilizes robots.txt/templates
- The Duplicate URL Elimination module that utilizes a URL set

Hint: consider the input and output data of each module, and how this data is manipulated by the module. You should aim to write no more than one paragraph per module.

[Figure 1: basic crawler architecture, showing the URL Frontier, Fetch, Parse, content-seen, URL Filter, and duplicate-elimination modules alongside the DNS, document fingerprint, robots.txt/template, and URL set components.]

Explanation / Answer

NOTE: Due to time constraints and Chegg policy, I could answer only the first four parts of your question.

For the remaining three parts, please post a new question separately.

Solution:

1) The URL frontier at a node is given a URL by its crawl process (or by the host splitter of another crawl process). It maintains the URLs in the frontier and regurgitates them in some order whenever a crawler thread seeks a URL. Two important considerations govern the order in which URLs are returned by the frontier. First, high-quality pages that change frequently should be prioritized for frequent crawling. Thus, the priority of a page should be a function of both its change rate and its quality (using some reasonable quality estimate). The combination is necessary because a large number of spam pages change completely on every fetch. The second consideration is politeness: we must avoid repeated fetch requests to a host within a short time span. The likelihood of this is exacerbated because of a form of locality of reference: many URLs link to other URLs at the same host. As a result, a URL frontier implemented as a simple priority queue might result in a burst of fetch requests to a host. This might occur even if we were to constrain the crawler so that at most one thread could fetch from any single host at any time. A common heuristic is to insert a gap between successive fetch requests to a host that is an order of magnitude larger than the time taken for the most recent fetch from that host.
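The two considerations above can be illustrated with a minimal Python sketch. The class and method names are my own, and a real Mercator-style frontier uses prioritized front queues and per-host back queues rather than sleeping, but the sketch shows prioritization and the politeness gap working together:

    import heapq
    import time
    from urllib.parse import urlparse

    class URLFrontier:
        """Toy frontier: a priority queue plus a per-host politeness gap."""

        def __init__(self, politeness_factor=10.0):
            self._heap = []          # (priority, seq, url); lower priority value pops first
            self._seq = 0            # tie-breaker so heapq never has to compare URLs
            self._host_ready = {}    # host -> earliest time the host may be contacted again
            self._politeness_factor = politeness_factor

        def add(self, url, priority=1.0):
            # Priority would combine estimated change rate and page quality.
            self._seq += 1
            heapq.heappush(self._heap, (priority, self._seq, url))

        def next_url(self):
            # Hand the best URL to a crawler thread, honoring the politeness gap.
            if not self._heap:
                return None
            priority, _, url = heapq.heappop(self._heap)
            host = urlparse(url).netloc
            wait = self._host_ready.get(host, 0.0) - time.time()
            if wait > 0:
                time.sleep(wait)     # a real frontier would pick a URL from another host instead
            return url

        def record_fetch(self, url, fetch_seconds):
            # The heuristic from the paragraph above: wait roughly 10x the time
            # taken by the most recent fetch before contacting the host again.
            host = urlparse(url).netloc
            self._host_ready[host] = time.time() + self._politeness_factor * fetch_seconds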

2) The Fetch module uses the HTTP protocol to retrieve the web page at a URL.
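As a rough sketch of what this module does, using Python's standard library (the User-Agent string and timeout here are arbitrary, hypothetical choices):

    import urllib.request

    def fetch(url, timeout=10):
        """Retrieve the raw content of the web page at `url` over HTTP(S)."""
        request = urllib.request.Request(url, headers={"User-Agent": "ToyCrawler/0.1"})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()

The fetched document is handed to the parser, and the time taken by the fetch can be reported back to the frontier for the politeness gap described above.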

3) The DNS resolution module determines the web server from which to fetch the page specified by a URL. Each web server (and indeed any host connected to the internet) has a unique IP address. Translating a host name given in textual form into an IP address (in this case, 207.142.131.248) is a process known as DNS resolution or DNS lookup; here DNS stands for Domain Name Service. During DNS resolution, the program that wishes to perform this translation (in our case, a component of the web crawler) contacts a DNS server that returns the translated IP address. (In practice the entire translation may not occur at a single DNS server; rather, the DNS server contacted initially may recursively call upon other DNS servers to complete the translation.) For a more complex URL such as en.wikipedia.org/wiki/Domain_Name_System, the crawler component responsible for DNS resolution extracts the host name - in this case en.wikipedia.org - and looks up the IP address for the host en.wikipedia.org.
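A toy illustration of this lookup, assuming Python's standard socket resolver and a simple in-memory cache (real crawlers maintain their own DNS cache, since each resolution involves a round trip to a DNS server):

    import socket
    from urllib.parse import urlparse

    _dns_cache = {}   # host name -> IP address

    def resolve(url):
        """Extract the host name from a URL and translate it to an IP address."""
        host = urlparse(url).netloc
        if host not in _dns_cache:
            _dns_cache[host] = socket.gethostbyname(host)   # asks the configured DNS server(s)
        return _dns_cache[host]

    # resolve("https://en.wikipedia.org/wiki/Domain_Name_System") extracts "en.wikipedia.org"
    # and returns whatever IP address the DNS server currently reports for that host.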

4) A parsing module extracts the text and the set of links from a fetched web page.
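A minimal sketch of such a parser using Python's built-in html.parser (the class name and structure are illustrative, not taken from the lecture or Mercator):

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class PageParser(HTMLParser):
        """Collects the visible text and the set of out-links of a fetched page."""

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.text_parts = []
            self.links = set()

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.add(urljoin(self.base_url, value))  # resolve relative links

        def handle_data(self, data):
            if data.strip():
                self.text_parts.append(data.strip())

    # Usage: p = PageParser("https://example.com"); p.feed(html_text)
    # p.text_parts then holds the extracted text and p.links the extracted URLs,
    # which would flow on to the content-seen check and the URL filter respectively.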
