Crawling and Indexing the Web

This section describes how search engines crawl the Web and index the collected pages.

Crawling the Web

The crawler begins with one or more URLs that constitute a seed set. It picks a URL from this seed set and fetches the web page at that URL. The fetched page is then parsed to extract both its text and its links (each of which points to another URL).
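A minimal sketch of this fetch-and-parse step in Python, using only the standard library, is shown below; the class and function names are illustrative rather than part of any particular crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects the visible text and the anchor links of one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the URL of the page itself.
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def fetch_and_parse(url):
    """Fetch one page, then return its extracted text and its outgoing links."""
    with urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkAndTextParser(url)
    parser.feed(html)
    return " ".join(parser.text_parts), parser.links
```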

• To make Web search efficient, search engines collect web documents and index them by the words (terms) they contain.
• For the purposes of indexing, web pages are first collected and stored in a local repository.
• Web crawlers (also called spiders or robots) are programs that systematically and exhaustively browse the Web and store all visited pages.
• Crawlers follow the hyperlinks in Web documents, implementing graph search algorithms such as depth-first and breadth-first search (see the depth-limited breadth-first sketch below).
Figure: breadth-first Web crawling limited to depth 3.
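One possible realization of this depth-limited breadth-first strategy is sketched below. It keeps a FIFO frontier and a set of visited URLs, and reuses the hypothetical fetch_and_parse helper from the earlier sketch; the depth cutoff of 3 matches the figure.

```python
from collections import deque


def bfs_crawl(seed_urls, max_depth=3):
    """Breadth-first crawl: finish every page at depth d before depth d + 1."""
    visited = set(seed_urls)
    repository = {}                              # local repository: URL -> page text
    frontier = deque((url, 0) for url in seed_urls)

    while frontier:
        url, depth = frontier.popleft()
        try:
            text, links = fetch_and_parse(url)   # helper from the sketch above
        except OSError:
            continue                             # skip pages that fail to download
        repository[url] = text
        if depth < max_depth:
            for link in links:
                if link not in visited:
                    visited.add(link)
                    frontier.append((link, depth + 1))
    return repository
```

Replacing the FIFO queue with a stack (or recursion) would turn the same loop into a depth-first crawl.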

Issues in Web Crawling:
• Network latency (multithreading)
• Address resolution (DNS caching)
• Extracting URLs (use a canonical form; see the sketch after this list)
• Managing a huge web page repository
• Updating indices
• Responding to the constantly changing Web
• Interaction with Web page developers
• Advanced crawling by guided (informed) search (using web page ranks)
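For the canonical-form issue above, one possible normalization (illustrative, not a complete rule set) lowercases the scheme and host, drops default ports, and discards fragments, so that equivalent addresses map to a single repository entry.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url):
    """Reduce a URL to one canonical form so equivalent addresses are crawled once."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only if it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment never changes which document is fetched, so drop it.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

For example, canonicalize("HTTP://Example.COM:80/index.html#top") and canonicalize("http://example.com/index.html") yield the same string.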

Indexing and Keyword Search:
We need efficient content-based access to Web documents.
• Document representation:
– Term-document matrix (inverted index)
• Relevance ranking:
– Vector space model (a toy sketch of both follows this list)
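A toy illustration of both ideas, assuming the repository produced by the crawling sketches above (page text keyed by URL): the first function builds an inverted index, and the second scores documents against a query with TF-IDF weights as a simplified vector space model (document-length normalization and cosine similarity are omitted for brevity).

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(repository):
    """Map each term to the documents that contain it, with term frequencies."""
    index = defaultdict(dict)                    # term -> {URL: term frequency}
    for url, text in repository.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][url] = freq
    return index


def rank(query, index, n_docs):
    """Score documents for a query using TF-IDF weights; higher is more relevant."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))   # rarer terms carry more weight
        for url, tf in postings.items():
            scores[url] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```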