For this assignment you will build a Web crawler (a.k.a. robot) in Perl. This is the first step toward building a search engine. It is also the most delicate and probably the most difficult step, so you are strongly advised to start working early. Compared to powerful, sophisticated, massively parallel crawlers such as Googlebot, our crawler will be relatively simple and true to its name when it comes to speed.
Read and follow the directions in this document very carefully, as for this first assignment they are filled with details that are meant to facilitate your task. You may discuss the assignment with instructors and other students (in class or online), but not with people outside class. You may consult printed and/or online references, books, tutorials, etc. Most importantly, you are to write your code individually and any work you submit must be your own.
The crawler should take its input from the command line, as in:

    crawl.pl seeds.txt 200 pages

which means: crawl 200 pages from the seed URLs in the file seeds.txt and save these pages in the directory pages.
The seed file should contain URLs of pages you are interested in, with one URL per line, like this:
    http://www.thesimpsons.com/
    http://homerforpresident.tripod.com/
    http://www.snpp.com/
    http://www.springfield.com/
    http://www.imdb.com/title/tt0096697/
    http://www.hedges.org/Simpsons/
    ...

Read about the different ways in which 'newline' characters are encoded in UNIX (including MacOSX), DOS/Windows, and MacOS9 files. Use UNIX newlines to prevent problems. Make sure you chomp newlines when you read the file.
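For example, a minimal sketch of reading the seed file into the initial queue (assuming the file name arrives as the first command-line argument; the variable names are only illustrative) could look like this:

    #!/usr/bin/perl -w
    use strict;

    my ($seed_file) = @ARGV;            # eg "seeds.txt"
    open my $fh, '<', $seed_file or die "Cannot open $seed_file: $!";
    my @queue;
    while (my $url = <$fh>) {
        chomp $url;                     # strip the UNIX newline
        next unless $url;               # skip blank lines
        push @queue, $url;
    }
    close $fh;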
A good idea would be to maintain a queue of unvisited pages (URLs),
implemented as a FIFO list. You initialize this with the seeds, shift
URLs to crawl from the front of the queue, and push newly parsed URLs
into the end of the queue. Stop when the queue is empty or you have
visited the maximum number of pages, whichever occurs first. You might
find links to pages you have already crawled, and should not crawl these
again. For this purpose, implement a lookup table of visited URLs. You
might want to implement this as an associative array for efficient
access, inserting visited URLs as keys and checking if a URL exists in
the hash before inserting it into the queue. In your final project, you
will index and search the crawled pages. Therefore you will need a map
from page filenames to URLs. For this purpose, you should maintain and
store on disk a simple table to lookup the URL corresponding to a given
page filename. Hint: an easy way to do this is to build a reverse hash
from the visited URLs lookup hash, and tie it to a Berkeley DB file using
the DB_File
module.
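Putting those pieces together, one possible skeleton of the crawl loop might look like the sketch below. The fetch/parse step is elided, and the seed URL, variable names, and the pages.DB file name are only illustrative:

    #!/usr/bin/perl -w
    use strict;
    use DB_File;

    my $max_pages = 10;                        # second command-line argument in practice
    my @queue     = ('http://www.cnn.com/');   # normally initialized from the seed file
    my %visited;                               # lookup table: URL => page filename
    my $count     = 0;

    while (@queue and $count < $max_pages) {
        my $url = shift @queue;                # take the next URL from the front
        next if exists $visited{$url};         # never crawl the same URL twice
        # ... fetch the page, save it as file ($count + 1), extract its links ...
        my @links = ();                        # absolute, canonical URLs found on the page
        $count++;
        $visited{$url} = $count;               # page saved under the filename "$count"
        for my $link (@links) {
            # check the hash before inserting into the queue
            push @queue, $link unless exists $visited{$link};
        }
    }

    # Build the reverse map (page filename => URL) and store it on disk
    # as a Berkeley DB file via the DB_File module.
    my %file2url;
    tie %file2url, 'DB_File', 'pages.DB' or die "Cannot tie pages.DB: $!";
    %file2url = reverse %visited;
    untie %file2url;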
The crawler script should be world-executable and world-readable so the AI (associate instructor) can see and run it. If the results directory (to store the pages) does not exist, it should be created by your crawler. The directory and the page files in it should be world-readable so that the AI can see them. Also for this purpose, the parent directories of your results directory should be world-executable. Create a directory for this assignment under your home directory as follows:
    xavier:~> mkdir a1
    xavier:~> chmod a+rx a1
    xavier:~> cd a1
    xavier:~/a1>

Then you would work in a1, placing there your script, input file, and output subdirectory. As a convention, name the files containing your crawled pages with increasing integers as follows: 1, 2, etc. So, running your script would result in something like this:
    xavier:~/a1> ./crawl.pl seeds.txt 10 pages
    xavier:~/a1> ls pages
    1  10  2  3  4  5  6  7  8  9
    xavier:~/a1>
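As an illustration, the crawler itself could create the results directory and set the permissions mentioned above roughly as follows (a sketch; the directory name is hard-coded here only for brevity):

    #!/usr/bin/perl -w
    use strict;

    my $dir = 'pages';                 # third command-line argument in practice
    unless (-d $dir) {
        mkdir $dir, 0755 or die "Cannot create directory $dir: $!";
    }
    chmod 0755, $dir;                  # world-readable and world-executable

    # When saving page number $n, make the file world-readable as well, eg:
    #   open my $out, '>', "$dir/$n" or die "Cannot write $dir/$n: $!";
    #   print $out $content;
    #   close $out;
    #   chmod 0644, "$dir/$n";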
You will want to use the modules in the LWP library to make it easy to
build your crawler. To get started, read the documentation of LWP and LWP::Simple,
and the recipes in lwpcook (all available, of course, via perldoc). You will find examples of how to fetch a Web page given its URL, recognize broken links, and extract URLs from a page.
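For instance, a minimal sketch that fetches a page with LWP::Simple and pulls out the anchor hrefs with HTML::LinkExtor (one possible choice of parser, not required by the assignment; your real crawler should use the polite LWP::RobotUA agent described further on):

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;
    use HTML::LinkExtor;

    my $url     = 'http://www.cnn.com/';       # example URL
    my $content = get($url);                   # returns undef on a broken link
    if (defined $content) {
        my @hrefs;
        my $parser = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            push @hrefs, $attr{href} if $tag eq 'a' and defined $attr{href};
        });
        $parser->parse($content);
        $parser->eof;
        print "$_\n" for @hrefs;
    } else {
        warn "Could not fetch $url\n";
    }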
An important task for your crawler is to recognize relative and absolute
URLs, and transform the relative URLs into absolute URLs (as needed for
fetching). For example, if the page http://www.cnn.com/ contains a link whose anchor tag a has an href attribute with value /linkto/intl.html, this is a relative URL that must be combined with the base URL to obtain the absolute version http://www.cnn.com/linkto/intl.html.
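One way to perform this transformation is with the URI module discussed below; a small sketch, using the example URLs above:

    use strict;
    use URI;

    my $base = 'http://www.cnn.com/';
    my $href = '/linkto/intl.html';
    my $abs  = URI->new_abs($href, $base);
    print "$abs\n";                    # prints http://www.cnn.com/linkto/intl.html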
Another problem is that there may be many different URLs that all really
correspond to the same page. To recognize this and avoid fetching
multiple copies of the same page, you should transform absolute URLs
into a canonical form. For example, the absolute URLs http://www.cnn.com/TECH, http://WWW.CNN.COM/TECH/, http://www.cnn.com/TECH/index.html, and http://www.cnn.com/bogus/../TECH/ should all be made into the canonical form http://www.cnn.com/TECH/. Fortunately, the URI module makes it relatively easy for you to deal with all these issues. Only absolute, canonical URLs should be used by the crawler in its queue and lookup tables. (Note: there may also be mirrored pages, ie distinct pages with the same content; we will not worry about mirrored pages in this assignment.)
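For instance, a small sketch of canonicalization with the URI module: canonical() handles things like lowercasing the host and dropping an explicit default port, while choices such as adding a trailing slash, stripping index.html, or collapsing .. segments are left to you (the path and path_segments methods can help there).

    use strict;
    use URI;

    my $uri = URI->new('http://WWW.CNN.COM:80/TECH')->canonical;
    print "$uri\n";                    # prints http://www.cnn.com/TECH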
Your crawler should behave in an ethical and polite way, ie avoid
placing undue load on any one Web server's bandwidth. For this purpose,
you must avoid sending too many requests in rapid succession to a
server. Furthermore, your crawler should obey the Robots Exclusion
Protocol by which a webmaster may elect to exclude any crawler from
all or parts of a site. Fortunately, the LWP::RobotUA
module makes it easy for you to deal with all these issues. Make sure
your crawler identifies itself and its author (you) using the User-Agent and From HTTP request headers. As an agent name (User-Agent), use IUB-class-login, where class is the course number and term (eg INFO-I427-S09) and login is your IU username (network ID). As an email (From), use your full IU email address, eg login@indiana.edu. Note that
compliance with these identification steps and the Robots Exclusion
Protocol is an absolute requirement for this course's
assignments/project, as well as part of the IU IT use policy. Lack
of compliance is likely to cause serious problems for our network
administrators and consequently for the author of the non-compliant
crawler. Statistically, every time I teach a course on crawlers I have
one case of a careless student who gets in trouble. Don't let that be
you.
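A sketch of setting up such a polite, self-identifying agent follows; the course term INFO-I427-S09, the login placeholder, and the fetched URL are only examples to be replaced with your own values:

    #!/usr/bin/perl -w
    use strict;
    use LWP::RobotUA;

    my $ua = LWP::RobotUA->new(
        agent => 'IUB-INFO-I427-S09-login',   # IUB-class-login convention
        from  => 'login@indiana.edu',         # your full IU email address
    );
    $ua->delay(1);    # minimum delay between requests to the same server, in minutes

    my $response = $ua->get('http://www.cnn.com/');
    if ($response->is_success) {
        my $content = $response->decoded_content;
        # ... save and parse $content ...
    } else {
        warn 'Fetch failed: ', $response->status_line, "\n";
    }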
There are a few other decisions to make when you write your crawler; for example:

- whether to crawl only URLs that use the http scheme;
- whether to fetch only HTML pages, ie those with Content-Type value text/html and file extensions html, htm, HTML, etc.;
- whether to follow HTTP redirections;
- whether to skip dynamic pages, eg URLs containing cgi-bin or the character ?, although this is a somewhat futile exercise. (Why?)
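Whatever you decide, it helps to funnel every candidate URL through a single filter and to document your choices in the readme. A purely illustrative sketch of one possible set of choices (the function name and rules are my own assumptions, not requirements):

    #!/usr/bin/perl -w
    use strict;
    use URI;

    # Hypothetical filter: http scheme only, no obviously dynamic URLs,
    # and only extensions that look like HTML. The Content-Type check
    # (text/html) can only be made on the HTTP response after fetching.
    sub want_url {
        my ($url) = @_;
        my $uri = URI->new($url);
        return 0 unless defined $uri->scheme and $uri->scheme eq 'http';
        return 0 if defined $uri->query or $uri->path =~ m{/cgi-bin/};
        my ($ext) = $uri->path =~ m{\.([^./]+)$};   # last file extension, if any
        return 0 if defined $ext and lc($ext) !~ /^html?$/;
        return 1;
    }

    print want_url('http://www.cnn.com/TECH/'), "\n";               # 1
    print want_url('http://www.cnn.com/cgi-bin/search?q=x'), "\n";  # 0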
Make sure your code is thoroughly debugged and tested. Always use the
-w
warning flag and the strict
module.
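In practice this means starting your script along these lines:

    #!/usr/bin/perl -w
    use strict;

    # ... rest of the crawler ...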
Make sure your code is thoroughly legible, understandable, and commented. Cryptic or uncommented code is not acceptable.
Crawl a total of 200 pages, using your 10 favorite URLs as seeds. If your script terminates (runs out of URLs) before you crawl 200 pages, add some seeds and retry. Name the results subdirectory a1/pages (this is the directory with the page files 1 ... 200).
In addition to your crawler script, seed file, URL lookup table (DB)
file, and results subdirectory, place in the a1
directory
an HTML file named a1-readme.html
with concise but detailed
documentation on your crawler: how you implemented it (eg what data
structures), what choices you made (eg with respect to file types and
redirection), how you complied with the Robots Exclusion Protocol, what
is the name of your User-Agent, etc. Give credit to any source of
assistance (students with whom you discussed your assignments,
instructors, books, online sources, etc.) -- note there is no penalty
for credited sources, but there is a penalty for uncredited sources even if admissible. Include your full name and IU network ID. Make sure this
documentation is properly formatted as valid XHTML. Hint 1: if your code is
properly commented, writing the readme file should be a simple matter of
collating all the comments from your code and formatting. All the
documentation required in the readme file should already be contained in
your code comments. Hint 2: use a text editor to write your
documentation, not a WYSIWYG HTML editor, and especially not M$Word.
(Re)move any other temporary files from the a1
directory.
Create a gzipped archive of the a1
directory using the
tar
command as follows:
    xavier:~/a1> ls
    a1-readme.html  crawl.pl  pages  pages.DB  seeds.txt
    xavier:~/a1> cd ..
    xavier:~> tar czf a1-login.tgz a1
    xavier:~>

where
login
is your username. Now upload the file
a1-login.tgz
(in your home directory) to the A1 Drop Box on
Oncourse by the deadline. Note: if the TGZ file is more than 10 MB,
there may be a problem; see the instructor or AI or post a message on
the discussion board well ahead of the deadline.
The assignment will be graded based on the following criteria: