For this assignment you will build a Web parser and indexer in Perl. This is the second step toward building a search engine. You will build upon the crawler developed for the first assignment, and in particular you will build an index of the pages already crawled and stored. You are strongly advised to start working early. Compared to the very efficient and robust indexing algorithms and data structures employed by commercial search engines, our system will be relatively simple.
Read and follow the directions in this document very carefully, as they are filled with details that are meant to facilitate your task. You may discuss the assignment with instructors and other students (in class or online), but not with people outside class. You may consult printed and/or online references, books, tutorials, etc. Most importantly, you are to write your code individually and any work you submit must be your own.
The indexer should take as command-line input:

index.pl pages pages.db stop.txt index.db tf.db idf.db links.db

to mean: parse and index the pages in the directory pages, using the file-to-url map in pages.db and the stopwords in stop.txt; save the index, term frequencies, inverse document frequencies, and links in the Berkeley DB files index.db, tf.db, idf.db, and links.db, respectively.
The file-to-url lookup table should have been created in Assignment 1, and be a simple DB_File tied hash with filenames as keys and URLs as values. The stopword file should list the noise words, one per line (remember to chomp!); you could for example use the one at http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words. Then you could create a hash with these words as keys, to quickly check if a particular term is a stopword.
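As a rough sketch of this setup (variable names and argument handling here are illustrative, not required):

#!/usr/bin/perl -w
use strict;
use DB_File;
use Fcntl;

# Tie the file-to-URL map from Assignment 1 (read-only) and load the
# stopword list into a hash for constant-time lookups.
my ($pagedir, $pagedb, $stopfile) = @ARGV[0..2];

my %file2url;
tie %file2url, 'DB_File', $pagedb, O_RDONLY, 0644, $DB_HASH
    or die "Cannot tie $pagedb: $!";

my %stop;
open(my $fh, '<', $stopfile) or die "Cannot open $stopfile: $!";
while (my $word = <$fh>) {
    chomp $word;            # remove the trailing newline
    $stop{lc $word} = 1;    # presence of the key marks a stopword
}
close $fh;

# Later, skip noise words with:  next if exists $stop{lc $term};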
Your program should consist of a simple loop over the files to be parsed and indexed (use glob, for example). For each file, you have to parse the HTML code to extract terms and links. The module HTML::TokeParser (see class notes) should make this relatively easy. Remember to decode HTML entities with the module HTML::Entities.
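Here is a minimal parsing sketch, assuming one subroutine per file; the subroutine name, the word-splitting regex, and the decision to lowercase everything are choices of this example, not requirements:

use HTML::TokeParser;
use HTML::Entities;

# Extract terms and href links from one HTML file.
sub parse_file {
    my ($file) = @_;
    my (@terms, @links);
    my $p = HTML::TokeParser->new($file) or die "Cannot parse $file: $!";
    while (my $token = $p->get_token) {
        if ($token->[0] eq 'S' && $token->[1] eq 'a') {
            # start tag <a ...>: collect the href attribute, if any
            push @links, $token->[2]{href} if defined $token->[2]{href};
        }
        elsif ($token->[0] eq 'T') {
            # text token: decode entities, lowercase, split into words
            my $text = lc decode_entities($token->[1]);
            push @terms, grep { /\w/ } split /[^a-z0-9]+/, $text;
        }
    }
    return (\@terms, \@links);
}

Such a routine could be called from the main loop, e.g. foreach my $file (glob "$pagedir/*") { my ($terms, $links) = parse_file($file); ... }.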
As you extract terms and links from a page, you have to maintain a number of data structures. For links, you have to make sure the URLs are absolute and canonical (see Assignment 1). Then you would save them in a tied hash so that for each page you can later get a list of outgoing URLs. We will use this information to compute PageRank in a later assignment. Suppose you have the URLs from a page in a list. If you want to store a reference to this list as a value of a tied hash, this complex data structure must first be flattened.
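For example, assuming the outgoing URLs of a page are already absolute and canonical, one way to flatten and store them (the file and variable names here are hypothetical) is:

use DB_File;
use Fcntl;

# Flatten a page's outgoing URLs into one string keyed by the page file.
my %links;
tie %links, 'DB_File', 'links.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie links.db: $!";

my $file = 'pages/0042.html';                     # hypothetical page file
my @urls = ('http://example.org/', 'http://example.org/a.html');
$links{$file} = join($;, @urls);                  # flatten the list

my @back = split /$;/, $links{$file};             # recover the list later
untie %links;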
The content (text) of the page needs to be further analyzed. Each term must first be screened to make sure it is not a noise word (see above), and then conflated. You can use the Lingua::Stem::En module, or another script implementing the Porter stemmer as discussed in class. The resulting stems should be used to maintain three data structures: term frequencies, inverse document frequencies, and the inverted index.
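A possible stemming step, assuming the hash-reference interface documented for Lingua::Stem::En and the %stop hash and @terms list from the steps above, might look like:

use Lingua::Stem::En;

# Drop noise words first, then conflate what remains.
my @keep  = grep { !exists $stop{$_} } @terms;
my $stems = Lingua::Stem::En::stem({ -words => \@keep, -locale => 'en' });
# $stems is a reference to a list of stems, in the same order as @keep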
Inverse document frequency (IDF), which will be used in a later assignment, is computed from the number of pages in which a term occurs. To measure this you can use a global hash, tied to a DB_File. The keys are terms, and the values are counters of pages. Make sure to count each page only once if a term occurs multiple times in that page!
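One way to do this (again a sketch; the file and variable names are illustrative) is to remember which stems have already been counted for the current page:

use DB_File;
use Fcntl;

# Count, for each stem, the number of distinct pages containing it.
my %idf;
tie %idf, 'DB_File', 'idf.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie idf.db: $!";

# @stems holds the stems of the current page
my %seen;
foreach my $stem (@stems) {
    next if $seen{$stem}++;    # count each page at most once per term
    $idf{$stem}++;             # one more page contains this term
}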
Term frequency (TF), which will also be used in later assignments, is the frequency of each term in a particular page. This should be normalized appropriately as discussed in class. It could be stored in a hash (for each page), with terms as keys and relative frequencies as values. To store this in a DB_File, the hash needs to be flattened as discussed above for links. Then the tied hash would have page filenames as keys and the flattened frequency hashes as values.
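A sketch of this step, normalizing here by the total number of term occurrences in the page (one possible choice; use whatever normalization was discussed in class):

use DB_File;
use Fcntl;

# Relative term frequencies for one page, flattened before storage.
my %tf;
tie %tf, 'DB_File', 'tf.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie tf.db: $!";

# @stems holds the stems of the page stored in file $file
my %freq;
$freq{$_}++ foreach @stems;                 # raw counts
my $total = scalar @stems;
$freq{$_} /= $total foreach keys %freq;     # normalize by page length

# flatten the hash (a list of key/value pairs) with the $; separator
$tf{$file} = join($;, %freq);

# to recover it later:  my %freq = split /$;/, $tf{$file};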
The inverted index is the data structure that will eventually be used to quickly find the pages containing any given term. For each term we want to keep track of all the pages in which the term occurs, and for each page, of all the positions where it occurs. One way to do this would be to maintain a hash with terms as keys. For an in-memory hash, the values would be references to hashes whose keys are pages and whose values are references to lists of positions. Again, since this complex data structure must be stored permanently into a DB_File, it must be flattened as discussed above.
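As an illustration, here is one possible manual flattening scheme, matching the example record format shown in the flattening discussion below (the MLDBM alternative is also described there):

use DB_File;
use Fcntl;

# For each term, store a string of "page pos1 pos2 ..." records
# separated by $;.
my %index;
tie %index, 'DB_File', 'index.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie index.db: $!";

# %pos maps each stem of page $file to a reference to its list of positions
foreach my $stem (keys %pos) {
    my $record = join(' ', $file, @{ $pos{$stem} });
    $index{$stem} = defined $index{$stem}
                  ? join($;, $index{$stem}, $record)
                  : $record;
}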
The index script should be world-executable and world-readable so the AI can see and run it. All the Berkeley DB files created should be world-readable so that the AI can inspect them. Also for this purpose the parent directories of your working directory should be world-executable. Create a directory for this assignment under your home directory as follows:
xavier:~> mkdir a2
xavier:~> chmod a+rx a2
xavier:~> cd a2
xavier:~/a2>

Then you would work in a2, placing there your output files. So, running your script would result in something like this:
xavier:~/a2> ./index.pl ../a1/pages ../a1/pages.db stop.txt index.db tf.db idf.db links.db
xavier:~/a2> ls
idf.db  index.db  index.pl  links.db  stop.txt  tf.db
xavier:~/a2>
One way to flatten complex data structures is with the MLDBM module; read the documentation of this useful module to see how it works and how to use it. Another approach is to do the flattening yourself; for example, you can flatten a list by joining its elements into a single string with a special separator character (e.g., $;). Flattening a hash can be done in a similar way, since a hash is like a list of pairs. Flattening more complex data structures such as a hash of arrays is a bit more complicated and requires additional special characters. For example, a record might look like this: "$page1 $pos1 $pos2 ... $posN$;$page2 ...". This is all done automatically for you by MLDBM.
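For instance, a sketch of the MLDBM approach (the term, page name, and positions are hypothetical; note the read-modify-reassign pattern, since MLDBM does not write back in-place modifications of nested values):

use MLDBM qw(DB_File Storable);   # serialize values with Storable
use Fcntl;

# MLDBM transparently flattens nested data structures for you.
my %index;
tie %index, 'MLDBM', 'index.db', O_RDWR|O_CREAT, 0644
    or die "Cannot tie index.db: $!";

my $entry = $index{'perl'} || {};
push @{ $entry->{'pages/0042.html'} }, 17, 42;    # hypothetical positions
$index{'perl'} = $entry;                          # reassign to store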
In later assignments, it will be useful to have quick access, in addition to TF and IDF data, also to the TF-IDF vector size of each page. You might consider doing this at indexing time and storing this information together with the TF data. For example, you might store the vector size (a scalar) as the first element of the TF data structure; the rest would be the actual flattened TF hash. Or, you could store the vector size in a separate DB_File. This is all optional for this particular assignment, as you can do it later from the TF and IDF data.
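If you do choose to precompute it, here is a sketch of the first option. It assumes the %freq, %idf, %tf, and $file variables from the sketches above, that the document counts are complete (e.g., in a second pass over the pages), that $N holds the total number of pages indexed, and it uses the common log(N/n) definition of IDF; adapt it to whatever weighting was discussed in class.

# Optional: prepend the TF-IDF vector size to the flattened TF data.
my $size = 0;
foreach my $stem (keys %freq) {
    my $w = $freq{$stem} * log($N / $idf{$stem});   # TF-IDF weight
    $size += $w * $w;
}
$size = sqrt($size);

# the vector size comes first, followed by the flattened TF hash
$tf{$file} = join($;, $size, %freq);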
The IDF calculations for each term will be done in a later assignment from the document counts stored in the IDF data structure. You could even extract IDF data directly from the inverted index once this is complete; storing it explicitly is just a matter of efficiency.
You are not expected to worry about augmenting your index with anchor text from incoming links (from pages you crawled), although you are welcome to do this for extra credit if you wish.
Make sure your code is thoroughly debugged and tested. Always use the -w warning flag and the strict module.
Make sure your code is thoroughly legible, understandable, and commented. Cryptic or uncommented code is not acceptable.
Parse and index a total of 200 pages; these should be the pages crawled in the first assignment, or else 200 freshly crawled pages.
In addition to your index script, stop file, and output Berkeley DB files, place in the a2 directory an HTML file named a2-readme.html with concise but detailed documentation on your parser/indexer: how you implemented it (e.g., what data structures), what choices you made (e.g., for flattening), etc. Give credit to any source of assistance (students with whom you discussed your assignments, instructors, books, online sources, etc.) -- note there is no penalty for credited sources but there is a penalty for uncredited sources even if admissible. Include your full name and IU network ID. Make sure this documentation is properly formatted as valid XHTML. Hint 1: if your code is properly commented, writing the readme file should be a simple matter of collating all the comments from your code and formatting them; all the documentation required in the readme file should already be contained in your code comments. Hint 2: use a text editor to write your documentation, not a WYSIWYG HTML editor, and especially not M$Word. (Re)move any other temporary files from the a2 directory.
Create a gzipped archive of the a2 directory using the tar czf command as usual (see Assignment 1), naming the resulting file a2-login.tgz where login is your username. Now upload the file a2-login.tgz (in your home directory) to the A2 Drop Box on Oncourse by the deadline. Note: if the TGZ file is more than 10 MB, there may be a problem; see the instructor or AI or post a message on the discussion board well ahead of the deadline.
The assignment will be graded based on the following criteria: