For this assignment you will build a Web retrieval and ranking system in Perl. This is the third step toward building a search engine. You will build upon the crawler and indexer developed in the first two assignments, and in particular you will retrieve and rank relevant hits among those already crawled and stored, using the index already built. You are strongly advised to start working early. Compared to the very efficient and sophisticated retrieval and ranking algorithms and data structures employed by commercial search engines, our system will be relatively simple.
Read and follow the directions in this document very carefully, as they are filled with details that are meant to facilitate your task. You may discuss the assignment with instructors and other students (in class or online), but not with people outside class. You may consult printed and/or online references, books, tutorials, etc. Most importantly, you are to write your code individually and any work you submit must be your own.
First, write a script that computes PageRank for the set of pages you crawled and indexed. This script should take two command-line inputs:

    pagerank.pl links.db pr.db

to mean: use the outlinks in links.db, compute PageRanks, and store them in pr.db.
This script will have to determine the inlinks of each page from other pages in the crawl set. This means, first, eliminating dangling links to external (uncrawled) pages; second, creating for each page a list of the other pages linking to it. With this data structure, you can run the iterative PageRank algorithm discussed in class until a suitable convergence criterion is met. For the values of the jump and convergence parameters, use those discussed in class. Finally, store the PageRanks in a DB_File tied hash.
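To make the above concrete, here is a minimal sketch of the iteration in Perl. The %inlinks and %outdegree structures are assumed to have been built as just described, and the $ALPHA and $EPSILON values below are only placeholders for the jump and convergence parameters from class:

    use strict;
    use warnings;

    # Assumed already built from links.db (dangling links removed):
    #   %inlinks   maps each page to a reference to the list of pages linking to it
    #   %outdegree maps each crawled page to its number of outlinks within the crawl set
    my (%inlinks, %outdegree);

    my $ALPHA   = 0.15;      # jump parameter -- substitute the value from class
    my $EPSILON = 0.0001;    # convergence threshold -- substitute the value from class

    my @pages = keys %outdegree;
    my $N     = scalar @pages;
    die "no pages\n" unless $N;
    my %pr = map { ( $_ => 1 / $N ) } @pages;    # start from a uniform distribution

    while (1) {
        my %new;
        my $delta = 0;
        for my $p (@pages) {
            my $sum = 0;
            $sum += $pr{$_} / $outdegree{$_} for @{ $inlinks{$p} || [] };
            $new{$p} = $ALPHA / $N + (1 - $ALPHA) * $sum;
            $delta  += abs( $new{$p} - $pr{$p} );
        }
        %pr = %new;
        last if $delta < $EPSILON;    # stop when the total change is small enough
    }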
The retrieval and ranking system should take as command-line input a query, composed of a list of keywords (query terms), as follows:

    retrieve.pl kw1 kw2 kw3 ...

where kw1, ..., kwN are the N query terms.
Your script will also need access to a number of files and directories developed in prior assignments (file-to-url lookup table, index DB file, stopword list, TF and IDF DB files, etc.). These parameters can be specified inside the script, but make sure they are clearly declared near the top, so that the AI can change them when grading.
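For example, the top of retrieve.pl might declare something like the following (all file names and paths here are placeholders; use whatever names you chose in the earlier assignments):

    # --- adjustable parameters: edit these paths to match your own setup ---
    my $FILE2URL_DB = '../a1/file2url.db';    # file-to-url lookup table
    my $INDEX_DB    = '../a2/index.db';       # inverted index
    my $TF_DB       = '../a2/tf.db';          # term frequency data
    my $IDF_DB      = '../a2/idf.db';         # inverse document frequencies
    my $STOPFILE    = '../a2/stoplist.txt';   # stopword list
    my $PR_DB       = 'pr.db';                # PageRanks from pagerank.pl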
The first step is to remove stopwords from the query and stem the query terms (see Assignment 2). Then use the index to retrieve the set of pages (hits) matching the query. For extra credit, you could keep track of how many of the query terms each hit matches, so that pages containing more of the query terms can later be given higher priority. However, this is not required. You can assume that a page is a hit if it contains at least one of the query terms.
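A minimal sketch of this step, assuming a %stop hash loaded from your stopword file, a stem() routine as in Assignment 2, and an inverted index whose values are space-separated page lists (all of these are assumptions to adapt to your own layout):

    use strict;
    use warnings;

    my %stop;     # assumed loaded from your stopword file
    my %index;    # assumed tied with DB_File to your inverted index

    # lowercase, drop stopwords, then stem what remains
    my @terms = grep { !$stop{$_} } map { lc } @ARGV;
    @terms = map { stem($_) } @terms;    # stem() as in Assignment 2

    # a page is a hit if it contains at least one query term
    my %hits;
    for my $t (@terms) {
        next unless exists $index{$t};
        $hits{$_} = 1 for split /\s+/, $index{$t};    # assumes space-separated page lists
    }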
Once you have a list of hits, create a vector representation for each hit page as well as for the query. For the hits, you can use the TF and IDF databases built in the second assignment to construct the TF-IDF vectors. A TF-IDF vector can be represented as a hash with terms (stems) as keys and weights as values. A hash of hashes is therefore a suitable data structure for the hit vectors.
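Continuing the sketch above, the hit vectors might be assembled along these lines (the "page term" key layout for the TF data is only an assumption; adapt it to your Assignment 2 format):

    # %tf and %idf are assumed tied with DB_File to your TF and IDF files
    my (%tf, %idf);

    # hash of hashes: $vec{$page}{$term} holds the TF-IDF weight
    my %vec;
    for my $page (keys %hits) {
        for my $t (@terms) {
            my $tf = $tf{"$page $t"} || 0;    # key layout is an assumption -- use your own
            $vec{$page}{$t} = $tf * ( $idf{$t} || 0 );
        }
    }

    # the query vector: one IDF weight per (stemmed) query term
    my %qvec = map { ( $_ => $idf{$_} || 0 ) } @terms;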
For each hit page, (i) compute the cosine similarity between the hit vector and the query vector and (ii) look up the PageRank. Then combine these two numbers. You could do a simple linear combination of the two, of the type

    score = alpha * sim + (1 - alpha) * pagerank

where alpha is a number between 0 and 1. Or you could multiply the two, as in

    score = sim * pagerank

Make sure your combination function is well documented. This is what will differentiate your search engine from others!
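One possible shape for this computation, continuing the earlier sketches: %size holds the precomputed TF-IDF vector sizes discussed further below, %pr is assumed tied to pr.db, and the value of $A is only an arbitrary example.

    my (%size, %pr);    # assumed tied to your vector-size data and to pr.db

    # cosine similarity between the query vector and one hit vector
    sub cosine {
        my ($qvec, $dvec, $dsize) = @_;
        my ($dot, $qnorm) = (0, 0);
        for my $t (keys %$qvec) {
            $dot   += $qvec->{$t} * ( $dvec->{$t} || 0 );
            $qnorm += $qvec->{$t} ** 2;
        }
        return 0 unless $qnorm && $dsize;
        return $dot / ( sqrt($qnorm) * $dsize );
    }

    # combine text similarity and PageRank, here as a linear combination
    my $A = 0.8;    # example weight on similarity -- choose and document your own
    my %score;
    for my $page (keys %hits) {
        my $sim = cosine( \%qvec, $vec{$page}, $size{$page} );
        $score{$page} = $A * $sim + (1 - $A) * ( $pr{$page} || 0 );
    }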
Finally, sort the hits by the resulting score and display them to the user (print to standard output). For each hit, display the rank, the score, and the URL (which you will look up from the file-to-url hash table).
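For example, continuing the sketch (%file2url is assumed tied to your file-to-url table):

    my %file2url;    # assumed tied to your file-to-url lookup table

    my $rank = 0;
    for my $page ( sort { $score{$b} <=> $score{$a} } keys %score ) {
        printf "%d. (%.2f) %s\n", ++$rank, $score{$page}, $file2url{$page};
    }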
Both scripts should be world-executable and world-readable so the AI can see and run them. All the Berkeley DB files created should be world-readable so that the AI can inspect them. Also for this purpose the parent directories of your working directory should be world-executable. Create a directory for this assignment under your home directory as follows:
    xavier:~> mkdir a3
    xavier:~> chmod a+rx a3
    xavier:~> cd a3
    xavier:~/a3>

Then you would work in a3, placing your output files there. So, running your scripts would result in something like this:

    xavier:~/a3> ./pagerank.pl ../a2/links.db pr.db
    xavier:~/a3> ./retrieve.pl bart simpson
    1. (0.73) http://www.thesimpsons.com/
    2. (0.68) http://www.springfield.com/
    3. (0.41) http://www.hedges.org/Simpsons/
    ...
    xavier:~/a3>
For IDF, depending on what you have already computed and stored in the second assignment, you may need to do some work to transform the number of documents in which a term occurs into the actual IDF value of the term. For efficiency, compute the IDF value of each term once (offline) and store it in a tied hash.
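For instance, if you stored the raw document count df(t) for each term t, the standard transformation is idf(t) = log(N / df(t)), where N is the total number of crawled documents. The offline step might look like this sketch (names, file paths, and the document count are placeholders):

    use strict;
    use warnings;
    use DB_File;

    my %df;    # assumed tied to the DB file holding per-term document counts
    my %idf;
    tie %idf, 'DB_File', 'idf.db' or die "cannot tie idf.db: $!";

    my $N = 1000;    # total number of crawled documents -- substitute your own count
    $idf{$_} = log( $N / $df{$_} ) for keys %df;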
For the computation of cosine similarities, you will need the TF-IDF vector size of each page. You may have computed this information at indexing time and stored it with the TF data. If not, do it now. It is less efficient to compute vector sizes of hit pages each time you compute cosine similarity.
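The vector size is just the Euclidean norm: the square root of the sum of the squared TF-IDF weights. A one-time pass might look like this sketch (assuming %vec holds the full TF-IDF vector of each page, as available at indexing time):

    # for each page: size = sqrt( sum over terms of weight^2 )
    my %size;
    for my $page (keys %vec) {
        my $sum = 0;
        $sum += $_ ** 2 for values %{ $vec{$page} };
        $size{$page} = sqrt($sum);
    }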
Make sure your code is thoroughly debugged and tested. Always use the -w warning flag and the strict module.
Make sure your code is thoroughly legible, understandable, and commented. Cryptic or uncommented code is not acceptable.
Make sure your system works well. Test it for several queries for which relevant pages are included in your index.
In addition to your PageRank and retrieval/ranking scripts, inlink Berkeley DB file, and any other data files created or used by your scripts, place in the a3 directory an HTML file named a3-readme.html with concise but detailed documentation on your PageRank and retrieval/ranking scripts: how you implemented them (e.g., what data structures), what parameters you used, what choices you made (e.g., for combining PageRank and text similarity), etc. It is important to also list a handful of sample queries that can be answered by your system! Give credit to any source of assistance (students with whom you discussed your assignments, instructors, books, online sources, etc.) -- note there is no penalty for credited sources, but there is a penalty for uncredited sources even if admissible. Include your full name and IU network ID. Make sure this documentation is properly formatted as valid XHTML. Hint 1: if your code is properly commented, writing the readme file should be a simple matter of collating all the comments from your code and formatting them. All the documentation required in the readme file should already be contained in your code comments. Hint 2: use a text editor to write your documentation, not a WYSIWYG HTML editor, and especially not M$Word.
(Re)move any unnecessary temporary files from the a3 directory. Create a gzipped archive of the a3 directory using the tar czf command as usual (see Assignment 1 and the example below), naming the resulting file a3-login.tgz, where login is your username. Now upload the file a3-login.tgz (in your home directory) to the A3 Drop Box on Oncourse by the deadline.
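For example, the archiving step would look like this from your home directory (with login replaced by your own username):

    xavier:~> tar czf a3-login.tgz a3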
The assignment will be graded based on the following criteria: