For this assignment you will build a Web crawler (a.k.a. robot) in Perl. This is the first step toward building a search engine. It is also the most delicate and probably the most difficult step, so you are strongly advised to start working early. Compared to powerful, sophisticated, massively parallel crawlers such as Googlebot, our crawler will be relatively simple and true to its name when it comes to speed.
Read and follow the directions in this document very carefully, as for this first assignment they are filled with details that are meant to facilitate your task. You may discuss the assignment with instructors and other students (in class or online), but not with people outside class. You may consult printed and/or online references, books, tutorials, etc. Most importantly, you are to write your code individually and any work you submit must be your own.
The crawler should take its input from the command line, as in:

    crawl.pl seeds.txt 200 pages

which means: crawl 200 pages from the seed URLs in the file seeds.txt and save these pages in the directory pages.
The seed file should contain URLs of pages you are interested in, with one URL per line, like this:
    http://www.thesimpsons.com/
    http://homerforpresident.tripod.com/
    http://www.snpp.com/
    http://www.springfield.com/
    http://www.imdb.com/title/tt0096697/
    http://www.hedges.org/Simpsons/
    ...

Read about the different ways in which 'newline' characters are encoded in UNIX (including MacOSX), DOS/Windows, and MacOS9 files. Use UNIX newlines to prevent problems. Make sure you chomp newlines when you read the file.
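For example, a minimal sketch of reading the seed file into the initial queue (assuming the file name arrives as the first command-line argument; the variable names are only illustrative) could look like this:

    #!/usr/bin/perl -w
    use strict;

    my ($seed_file) = @ARGV;            # eg "seeds.txt"
    open my $fh, '<', $seed_file or die "Cannot open $seed_file: $!";
    my @queue;
    while (my $url = <$fh>) {
        chomp $url;                     # strip the UNIX newline
        next unless $url;               # skip blank lines
        push @queue, $url;
    }
    close $fh;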
A good idea would be to maintain a queue of unvisited pages (URLs),
implemented as a FIFO list. You initialize this with the seeds, shift
URLs to crawl from the front of the queue, and push newly parsed URLs
into the end of the queue. Stop when the queue is empty or you have
visited the maximum number of pages, whichever occurs first. You might
find links to pages you have already crawled, and should not crawl these
again. For this purpose, implement a lookup table of visited URLs. You
might want to implement this as an associative array for efficient
access, inserting visited URLs as keys and checking if a URL exists in
the hash before inserting it into the queue. In your final project, you
will index and search the crawled pages. Therefore you will need a map
from page filenames to URLs. For this purpose, you should maintain and
store on disk a simple table to lookup the URL corresponding to a given
page filename. Hint: an easy way to do this is to build a reverse hash
from the visited URLs lookup hash, and tie it to a Berkeley DB file using
the DB_File
module.
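Putting those pieces together, one possible skeleton of the crawl loop might look like the sketch below. The fetch/parse step is elided, and the seed URL, variable names, and the pages.DB file name are only illustrative:

    #!/usr/bin/perl -w
    use strict;
    use DB_File;

    my $max_pages = 10;                        # second command-line argument in practice
    my @queue     = ('http://www.cnn.com/');   # normally initialized from the seed file
    my %visited;                               # lookup table: URL => page filename
    my $count     = 0;

    while (@queue and $count < $max_pages) {
        my $url = shift @queue;                # take the next URL from the front
        next if exists $visited{$url};         # never crawl the same URL twice
        # ... fetch the page, save it as file ($count + 1), extract its links ...
        my @links = ();                        # absolute, canonical URLs found on the page
        $count++;
        $visited{$url} = $count;               # page saved under the filename "$count"
        for my $link (@links) {
            # check the hash before inserting into the queue
            push @queue, $link unless exists $visited{$link};
        }
    }

    # Build the reverse map (page filename => URL) and store it on disk
    # as a Berkeley DB file via the DB_File module.
    my %file2url;
    tie %file2url, 'DB_File', 'pages.DB' or die "Cannot tie pages.DB: $!";
    %file2url = reverse %visited;
    untie %file2url;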
The crawler script should be world-executable and world-readable so the AI (associate instructor) can see and run it. If the results directory (to store the pages) does not exist, it should be created by your crawler. The directory and the page files in it should be world-readable so that the AI can see them. Also for this purpose, the parent directories of your results directory should be world-executable. Create a directory for this assignment under your home directory as follows:
    xavier:~> mkdir a1
    xavier:~> chmod a+rx a1
    xavier:~> cd a1
    xavier:~/a1>

Then you would work in a1, placing there your script, input file, and output subdirectory. As a convention, name the files containing your crawled pages with increasing integers as follows: 1, 2, etc. So, running your script would result in something like this:
    xavier:~/a1> ./crawl.pl seeds.txt 10 pages
    xavier:~/a1> ls pages
    1  10  2  3  4  5  6  7  8  9
    xavier:~/a1>
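As an illustration, the crawler itself could create the results directory and set the permissions mentioned above roughly as follows (a sketch; the directory name is hard-coded here only for brevity):

    #!/usr/bin/perl -w
    use strict;

    my $dir = 'pages';                 # third command-line argument in practice
    unless (-d $dir) {
        mkdir $dir, 0755 or die "Cannot create directory $dir: $!";
    }
    chmod 0755, $dir;                  # world-readable and world-executable

    # When saving page number $n, make the file world-readable as well, eg:
    #   open my $out, '>', "$dir/$n" or die "Cannot write $dir/$n: $!";
    #   print $out $content;
    #   close $out;
    #   chmod 0644, "$dir/$n";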
You will want to use the modules in the LWP library to make it easy to
build your crawler. To get started, read the documentation of LWP and LWP::Simple,
and the recipes in lwpcook (all available, of course, via perldoc). You will find examples of how to fetch a Web page given its URL, recognize broken links, and extract URLs from a page.
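For instance, a minimal sketch that fetches a page with LWP::Simple and pulls out the anchor hrefs with HTML::LinkExtor (one possible choice of parser, not required by the assignment; your real crawler should use the polite LWP::RobotUA agent described further on):

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;
    use HTML::LinkExtor;

    my $url     = 'http://www.cnn.com/';       # example URL
    my $content = get($url);                   # returns undef on a broken link
    if (defined $content) {
        my @hrefs;
        my $parser = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            push @hrefs, $attr{href} if $tag eq 'a' and defined $attr{href};
        });
        $parser->parse($content);
        $parser->eof;
        print "$_\n" for @hrefs;
    } else {
        warn "Could not fetch $url\n";
    }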
An important task for your crawler is to recognize relative and absolute
URLs, and transform the relative URLs into absolute URLs (as needed for
fetching). For example, if the page http://www.cnn.com/ contains a link whose anchor tag a has an href attribute with value /linkto/intl.html, this is a relative URL that must be combined with the base URL to obtain the absolute version http://www.cnn.com/linkto/intl.html.
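One way to perform this transformation is with the URI module discussed below; a small sketch, using the example URLs above:

    use strict;
    use URI;

    my $base = 'http://www.cnn.com/';
    my $href = '/linkto/intl.html';
    my $abs  = URI->new_abs($href, $base);
    print "$abs\n";                    # prints http://www.cnn.com/linkto/intl.html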
Another problem is that there may be many different URLs that all really
correspond to the same page. To recognize this and avoid fetching
multiple copies of the same page, you should transform absolute URLs
into a canonical form. For example, the absolute URLs http://www.cnn.com/TECH, http://WWW.CNN.COM/TECH/, http://www.cnn.com/TECH/index.html, and http://www.cnn.com/bogus/../TECH/ should all be made into the canonical form http://www.cnn.com/TECH/. Fortunately, the URI module makes it relatively easy for you to deal with all these issues. Only absolute, canonical URLs should be used by the crawler in its queue and lookup tables. (Note: there may also be mirrored pages, ie distinct pages with the same content; we will not worry about mirrored pages in this assignment.)
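For instance, a small sketch of canonicalization with the URI module: canonical() handles things like lowercasing the host and dropping an explicit default port, while choices such as adding a trailing slash, stripping index.html, or collapsing .. segments are left to you (the path and path_segments methods can help there).

    use strict;
    use URI;

    my $uri = URI->new('http://WWW.CNN.COM:80/TECH')->canonical;
    print "$uri\n";                    # prints http://www.cnn.com/TECH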
Your crawler should behave in an ethical and polite way, ie avoid
placing undue load on any one Web server's bandwidth. For this purpose,
you must avoid sending too many requests in rapid succession to a
server. Furthermore, your crawler should obey the Robots Exclusion
Protocol by which a webmaster may elect to exclude any crawler from
all or parts of a site. Fortunately, the LWP::RobotUA
module makes it easy for you to deal with all these issues. Make sure
your crawler identifies itself and its author (you) using the User-Agent and From HTTP request headers. As an agent name (User-Agent), use IUB-class-login, where class is the course number and term (eg INFO-I427-S09) and login is your IU username (network ID). As an email (From), use your full IU email address, eg login@indiana.edu. Note that
compliance with these identification steps and the Robots Exclusion
Protocol is an absolute requirement for this course's
assignments/project, as well as part of the IU IT use policy. Lack
of compliance is likely to cause serious problems for our network
administrators and consequently for the author of the non-compliant
crawler. Statistically, every time I teach a course on crawlers I have
one case of a careless student who gets in trouble. Don't let that be
you.
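A sketch of setting up such a polite, self-identifying agent follows; the course term INFO-I427-S09, the login placeholder, and the fetched URL are only examples to be replaced with your own values:

    #!/usr/bin/perl -w
    use strict;
    use LWP::RobotUA;

    my $ua = LWP::RobotUA->new(
        agent => 'IUB-INFO-I427-S09-login',   # IUB-class-login convention
        from  => 'login@indiana.edu',         # your full IU email address
    );
    $ua->delay(1);    # minimum delay between requests to the same server, in minutes

    my $response = $ua->get('http://www.cnn.com/');
    if ($response->is_success) {
        my $content = $response->decoded_content;
        # ... save and parse $content ...
    } else {
        warn 'Fetch failed: ', $response->status_line, "\n";
    }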
There are a few other decisions to make when you write your crawler; for example:

- whether to crawl only URLs that use the http scheme;
- whether to fetch only HTML pages, ie those with Content-Type value text/html and file extensions html, htm, HTML, etc.;
- whether to follow HTTP redirections;
- whether to skip dynamic pages, eg URLs containing cgi-bin or the character ?, although this is a somewhat futile exercise. (Why?)
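Whatever you decide, it helps to funnel every candidate URL through a single filter and to document your choices in the readme. A purely illustrative sketch of one possible set of choices (the function name and rules are my own assumptions, not requirements):

    #!/usr/bin/perl -w
    use strict;
    use URI;

    # Hypothetical filter: http scheme only, no obviously dynamic URLs,
    # and only extensions that look like HTML. The Content-Type check
    # (text/html) can only be made on the HTTP response after fetching.
    sub want_url {
        my ($url) = @_;
        my $uri = URI->new($url);
        return 0 unless defined $uri->scheme and $uri->scheme eq 'http';
        return 0 if defined $uri->query or $uri->path =~ m{/cgi-bin/};
        my ($ext) = $uri->path =~ m{\.([^./]+)$};   # last file extension, if any
        return 0 if defined $ext and lc($ext) !~ /^html?$/;
        return 1;
    }

    print want_url('http://www.cnn.com/TECH/'), "\n";               # 1
    print want_url('http://www.cnn.com/cgi-bin/search?q=x'), "\n";  # 0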
Make sure your code is thoroughly debugged and tested. Always use the
-w
warning flag and the strict
module.
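In practice this means starting your script along these lines:

    #!/usr/bin/perl -w
    use strict;

    # ... rest of the crawler ...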
Make sure your code is thoroughly legible, understandable, and commented. Cryptic or uncommented code is not acceptable.
Crawl a total of 200 pages, using your 10 favorite URLs as seeds. If your script terminates (runs out of URLs) before you crawl 200 pages, add some seeds and retry. Name the results subdirectory a1/pages (this is the directory with the page files 1 ... 200).
In addition to your crawler script, seed file, URL lookup table (DB)
file, and results subdirectory, place in the a1
directory
an HTML file named a1-readme.html
with concise but detailed
documentation on your crawler: how you implemented it (eg what data
structures), what choices you made (eg with respect to file types and
redirection), how you complied with the Robots Exclusion Protocol, what
is the name of your User-Agent, etc. Give credit to any source of
assistance (students with whom you discussed your assignments,
instructors, books, online sources, etc.) -- note there is no penalty
for credited sources, but there is a penalty for uncredited sources even if admissible. Include your full name and IU network ID. Make sure this
documentation is properly formatted as valid XHTML. Hint 1: if your code is
properly commented, writing the readme file should be a simple matter of
collating all the comments from your code and formatting. All the
documentation required in the readme file should already be contained in
your code comments. Hint 2: use a text editor to write your
documentation, not a WYSIWYG HTML editor, and especially not M$Word.
(Re)move any other temporary files from the a1
directory.
Create a gzipped archive of the a1
directory using the
tar
command as follows:
    xavier:~/a1> ls
    a1-readme.html  crawl.pl  pages  pages.DB  seeds.txt
    xavier:~/a1> cd ..
    xavier:~> tar czf a1-login.tgz a1
    xavier:~>

where
login
is your username. Now upload the file
a1-login.tgz
(in your home directory) to the A1 Drop Box on
Oncourse by the deadline. Note: if the TGZ file is more than 10 MB,
there may be a problem; see the instructor or AI or post a message on
the discussion board well ahead of the deadline.
The assignment will be graded based on the following criteria: