INFO I427 Search Informatics (3 CR)
Google under the hood

Assignment 2: Indexing the Web

For this assignment you will build a Web parser and indexer in Perl. This is the second step toward building a search engine. You will build upon the crawler developed for the first assignment, and in particular you will build an index of the pages already crawled and stored. You are strongly advised to start working early. Compared to the very efficient and robust indexing algorithms and data structures employed by commercial search engines, our system will be relatively simple.

Read and follow the directions in this document very carefully, as they are filled with details that are meant to facilitate your task. You may discuss the assignment with instructors and other students (in class or online), but not with people outside class. You may consult printed and/or online references, books, tutorials, etc. Most importantly, you are to write your code individually and any work you submit must be your own.

Input, output and data structures

The indexer should take as command-line input:

  1. the name of a directory containing the crawled pages to be parsed and indexed, one per file
  2. the name of the Berkeley DB file mapping page filenames to URLs
  3. the name of a file containing a list of stopwords
  4. the name of a Berkeley DB file to be created for storing the inverted index
  5. the name of a Berkeley DB file to be created for storing term frequencies
  6. the name of a Berkeley DB file to be created for storing inverse document frequencies
  7. the name of a Berkeley DB file to be created for storing outlinks

For example, you might run your indexer like this:

	index.pl pages pages.db stop.txt index.db tf.db idf.db links.db

This means: parse and index the pages in the directory pages, using the file-to-URL map in pages.db and the stopwords in stop.txt; save the index, term frequencies, inverse document frequencies, and links in the Berkeley DB files index.db, tf.db, idf.db, and links.db respectively.
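For concreteness, here is a minimal sketch of how the arguments might be read and the DB files tied (the variable names are illustrative, not required):

	#!/usr/bin/perl -w
	use strict;
	use DB_File;
	use Fcntl;

	# The seven command-line arguments described above.
	die "usage: index.pl pagedir pages.db stopfile index.db tf.db idf.db links.db\n"
	    unless @ARGV == 7;
	my ($page_dir, $pages_db, $stop_file,
	    $index_db, $tf_db, $idf_db, $links_db) = @ARGV;

	# Existing file-to-URL map from Assignment 1 (opened read-only).
	tie my %file2url, 'DB_File', $pages_db, O_RDONLY, 0644, $DB_HASH
	    or die "cannot open $pages_db: $!";

	# New DB files for this assignment (created if missing); tie
	# %index, %tf, and %links to the other three files the same way.
	tie my %idf, 'DB_File', $idf_db, O_CREAT|O_RDWR, 0644, $DB_HASH
	    or die "cannot create $idf_db: $!";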

The file-to-URL lookup table should have been created in Assignment 1, and be a simple DB_File tied hash with filenames as keys and URLs as values. The stopword file could contain the noise words, say, one per line (remember to chomp!); you could for example use the one at http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words. Then you could create a hash with these words as keys, to quickly check whether a particular term is a stopword.
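For example, the stopword hash could be built like this (a sketch, reusing the $stop_file argument from the sketch above):

	# Build a stopword lookup hash: keys are the noise words.
	open my $stop_fh, '<', $stop_file or die "cannot open $stop_file: $!";
	my %stopword;
	while (my $word = <$stop_fh>) {
	    chomp $word;                           # strip the trailing newline
	    $stopword{lc $word} = 1 if $word =~ /\S/;
	}
	close $stop_fh;

	# Later, skipping noise words is a constant-time check:
	#   next if $stopword{lc $term};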

Your program should consist of a simple loop over the files to be parsed and indexed (use glob for example). For each file, you have to parse the HTML code to extract terms and links. The module HTML::TokeParser (see class notes) should make this relatively easy. Remember to decode HTML entities with the module HTML::Entities.
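The main loop could be structured roughly as follows (a sketch only; it collects link URLs and lowercase terms for each page, leaving the indexing steps discussed below to be filled in):

	use HTML::TokeParser;
	use HTML::Entities;

	foreach my $file (glob "$page_dir/*") {
	    my $parser = HTML::TokeParser->new($file)
	        or die "cannot parse $file: $!";
	    my (@links, @terms);
	    while (my $token = $parser->get_token) {
	        if ($token->[0] eq 'S' and $token->[1] eq 'a') {
	            # start tag <a ...>: collect the href attribute, if present
	            push @links, $token->[2]{href} if defined $token->[2]{href};
	        }
	        elsif ($token->[0] eq 'T') {
	            # text token: decode entities, lowercase, split into terms
	            my $text = decode_entities($token->[1]);
	            push @terms, grep { /\w/ } split /\W+/, lc $text;
	        }
	    }
	    # ... screen stopwords, stem, and update the links, TF, IDF,
	    # and inverted-index data structures here (see below) ...
	}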

As you extract terms and links from a page, you have to maintain a number of data structures. For links, you have to make sure the URLs are absolute and canonical (see Assignment 1). Then you would save them in a tied hash so that for each page you can later get a list of outgoing URLs. We will use this information to compute PageRank in a later assignment. Suppose you have the URLs from a page in a list. If you want to store a reference to this list as a value of a tied hash, this complex data structure must first be flattened.
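For instance, assuming %links is the hash tied to the links DB file and @links holds the absolute, canonical URLs extracted from the current page, the flattening could look like this:

	use File::Basename;

	my $page = basename($file);        # key used consistently across the DB files
	$links{$page} = join $;, @links;   # flatten the URL list with the $; separator

	# To recover the list later:
	#   my @outlinks = split /$;/, $links{$page};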

The content (text) of the page needs to be further analyzed. Each term must first be screened to make sure it is not a noise word (see above), and then conflated. You can use the Lingua::Stem::En module, or another script implementing the Porter stemmer as discussed in class. The resulting stems should be used to maintain three data structures: term frequencies, inverse document frequencies, and the inverted index.
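A possible stopping-and-stemming step, assuming the @terms list and %stopword hash from the sketches above (check the module's documentation for its exact calling convention):

	use Lingua::Stem::En;

	# Drop noise words, then conflate the survivors with the Porter stemmer.
	my @kept  = grep { !$stopword{$_} } @terms;
	my $stems = Lingua::Stem::En::stem({ -words  => \@kept,
	                                     -locale => 'en' });
	# @$stems holds one stem per kept term, in order; one simple choice is
	# to use the index in this filtered list as a term's position.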

Inverse document frequency (IDF), which will be used in a later assignment, is computed from the number of pages in which a term occurs. To measure this you can use a global hash, tied to a DB_File. The keys are terms, and the values are counters of pages. Make sure to count each page only once if a term occurs multiple times in that page!
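For example, with %idf tied to the IDF DB file and $stems holding the current page's stems:

	# Collapse the page's stems to a set so each page is counted at most once per term.
	my %seen;
	@seen{@$stems} = ();
	foreach my $stem (keys %seen) {
	    $idf{$stem} = ($idf{$stem} || 0) + 1;   # document count, not term count
	}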

Term frequency (TF), which will also be used in later assignments, is the frequency of each term in a particular page. This should be normalized appropriately as discussed in class. It could be stored in a hash (for each page), with terms as keys and relative frequencies as values. To store this in a DB_File, the hash needs to be flattened as discussed above for links. Then the tied hash would have page filenames as keys and the flattened frequency hashes as values.
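One way to compute and store the TF data, assuming %tf is tied to the TF DB file and normalizing by the most frequent term's count (substitute whatever normalization was discussed in class):

	# Raw counts for this page (skip the page if it has no indexable terms).
	my %count;
	$count{$_}++ for @$stems;
	next unless %count;

	# Normalize by the maximum count, then flatten the per-page hash.
	my ($max) = sort { $b <=> $a } values %count;
	my %freq  = map { $_ => $count{$_} / $max } keys %count;
	$tf{$page} = join $;, %freq;       # a hash flattens as a key/value list

	# To recover it later:
	#   my %freq = split /$;/, $tf{$page};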

The inverted index is the data structure that will eventually be used to quickly find the pages containing any given term. For each term we want to keep track of all the pages in which the term occurs, and for each page, of all the positions where it occurs. One way to do this would be to maintain a hash with terms as keys. For an in-memory hash, the values would be references to hashes whose keys are pages and whose values are references to lists of positions. Again, since this complex data structure must be stored permanently into a DB_File, it must be flattened as discussed above.
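A sketch of the per-page update, flattening by hand into the "$page1 $pos1 ... $posN$;$page2 ..." format described in the next section (with %index tied to the index DB file):

	# In-memory structure for this page: $inv{$stem}{$page} = [ positions ... ]
	my %inv;
	for my $pos (0 .. $#$stems) {
	    push @{ $inv{ $stems->[$pos] }{$page} }, $pos;
	}

	# Flatten each term's record: page name and space-separated positions,
	# with $; between pages, and append it to anything already stored.
	while (my ($stem, $pages) = each %inv) {
	    my $flat = join $;, map { join ' ', $_, @{ $pages->{$_} } } keys %$pages;
	    $index{$stem} = defined $index{$stem} ? $index{$stem} . $; . $flat : $flat;
	}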

The index script should be world-executable and world-readable so the AI can see and run it. All the Berkeley DB files created should be world-readable so that the AI can inspect them. Also for this purpose the parent directories of your working directory should be world-executable. Create a directory for this assignment under your home directory as follows:

	xavier:~> mkdir a2
	xavier:~> chmod a+rx a2
	xavier:~> cd a2
	xavier:~/a2> 
	
Then you would work in a2, placing your output files there. So, running your script would result in something like this:
	xavier:~/a2> ./index.pl	../a1/pages ../a1/pages.db stop.txt index.db tf.db idf.db links.db
	xavier:~/a2> ls 
	idf.db index.db index.pl links.db stop.txt tf.db 
	xavier:~/a2>
	

Problems, solutions, suggestions, warnings, tricks

One way to flatten complex data structures is with the MLDBM module; read the documentation of this useful module to see how it works and how to use it. Another approach is to do the flattening yourself; for example, you can flatten a list by joining its elements into a single string with a special separator character (eg, $;). Flattening a hash can be done in a similar way, since a hash is like a list of pairs. Flattening more complex data structures such as a hash of arrays is a bit more complicated and requires additional special characters. For example, a record might look like this: "$page1 $pos1 $pos2 ... $posN$;$page2 ...". This is all done automatically for you by MLDBM.
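As an illustration, the same kind of inverted-index record could be stored through MLDBM like this (a sketch; note that nested values cannot be modified in place through the tie):

	use MLDBM qw(DB_File Storable);    # DB_File underneath, Storable to serialize
	use Fcntl;

	tie my %index, 'MLDBM', $index_db, O_CREAT|O_RDWR, 0644
	    or die "cannot create $index_db: $!";

	# Nested structures can be stored directly...
	$index{'informat'} = { 'file12.html' => [3, 17, 42] };

	# ...but to update one you must fetch a copy, change it, and store it back.
	my $entry = $index{'informat'};
	push @{ $entry->{'file37.html'} }, 8;
	$index{'informat'} = $entry;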

In later assignments, it will be useful to have quick access, in addition to TF and IDF data, also to the TF-IDF vector size of each page. You might consider doing this at indexing time and storing this information together with the TF data. For example, you might store the vector size (a scalar) as the first element of the TF data structure; the rest would be the actual flattened TF hash. Or, you could store the vector size in a separate DB_File. This is all optional for this particular assignment, as you can do it later from the TF and IDF data.
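For instance, a hypothetical helper for the vector size might look like this; it assumes idf(t) = log(N/n_t), with N the number of pages indexed and n_t the document count from the IDF data (use the exact formula from class):

	# Hypothetical helper: Euclidean length of a page's TF-IDF vector.
	sub vector_size {
	    my ($freq_ref, $idf_ref, $num_pages) = @_;
	    my $sum = 0;
	    while (my ($term, $tf) = each %$freq_ref) {
	        next unless $idf_ref->{$term};
	        my $w = $tf * log($num_pages / $idf_ref->{$term});
	        $sum += $w * $w;
	    }
	    return sqrt $sum;
	}

	# For example, stored in front of the flattened TF data:
	#   $tf{$page} = join $;, vector_size(\%freq, \%idf, $num_pages), %freq;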

The IDF calculations for each term will be done in a later assignment from the document counts stored in the IDF data structure. You could even extract IDF data directly from the inverted index once this is complete; storing it explicitly is just a matter of efficiency.

You are not expected to worry about augmenting your index with anchor text from incoming links (from pages you crawled), although you are welcome to do this for extra credit if you wish.

What to turn in

Make sure your code is thoroughly debugged and tested. Always use the -w warning flag and the strict module.

Make sure your code is thoroughly legible, understandable, and commented. Cryptic or uncommented code is not acceptable.

Parse and index a total of 200 pages; these should be the pages crawled in the first assignment, or else 200 freshly crawled pages.

In addition to your index script, stop file, and output Berkeley DB files, place in the a2 directory an HTML file named a2-readme.html with concise but detailed documentation on your parser/indexer: how you implemented it (eg what data structures), what choices you made (eg for flattening), etc. Give credit to any source of assistance (students with whom you discussed your assignments, instructors, books, online sources, etc.) -- note there is no penalty for credited sources but there is a penalty for uncredited sources even if admissible. Include your full name and IU network ID. Make sure this documentation is properly formatted as valid XHTML. Hint 1: if your code is properly commented, writing the readme file should be a simple matter of collating all the comments from your code and formatting them. All the documentation required in the readme file should already be contained in your code comments. Hint 2: use a text editor to write your documentation, not a WYSIWYG HTML editor, and especially not M$Word.

(Re)move any other temporary files from the a2 directory. Create a gzipped archive of the a2 directory using the tar czf command as usual (see Assignment 1), naming the resulting file a2-login.tgz where login is your username. Now upload the file a2-login.tgz (in your home directory) to the A2 Drop Box on Oncourse by the deadline. Note: if the TGZ file is more than 10 MB, there may be a problem; see the instructor or AI or post a message on the discussion board well ahead of the deadline.

The assignment will be graded based on the following criteria:

Correctness
Does the code work as expected? Does it use reasonable algorithms and data structures as discussed in class? Does it produce reasonable results? Was it tested thoroughly? Does it check for appropriate input and fail gracefully?
Style
Is the code legible and understandable? Does it use subroutines for clarity, reuse and encapsulation?
Documentation
Is the code thoroughly and clearly commented? Is there an adequate readme file?
May Laziness, Impatience, and Hubris be with you.