For this assignment you will build a Web retrieval and ranking system in Perl. This is the third step toward building a search engine. You will build upon the crawler and indexer developed in the first two assignments, and in particular you will retrieve and rank relevant hits among those already crawled and stored, using the index already built. You are strongly advised to start working early. Compared to the very efficient and sophisticated retrieval and ranking algorithms and data structures employed by commercial search engines, our system will be relatively simple.
Read and follow the directions in this document very carefully, as they are filled with details that are meant to facilitate your task. You may discuss the assignment with instructors and other students (in class or online), but not with people outside class. You may consult printed and/or online references, books, tutorials, etc. Most importantly, you are to write your code individually and any work you submit must be your own.
First, write a script that computes PageRank for the set of pages you crawled and indexed. This script should take two command-line inputs:

    pagerank.pl links.db pr.db

to mean: use the outlinks in links.db, compute PageRanks, and store them in pr.db.
This script will have to determine the inlinks of each page from other pages in the crawl set. This means, first, eliminating dangling links to external (uncrawled) pages; second, creating for each page a list of the other pages linking to it. With this data structure, you can run the iterative PageRank algorithm discussed in class until a suitable convergence criterion is met. For the values of the jump and convergence parameters, use those discussed in class. Finally, store the PageRanks in a DB_File tied hash.
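To make the above concrete, here is a minimal sketch of the iteration in Perl. The %inlinks and %outdegree structures are assumed to have been built as just described, and the $ALPHA and $EPSILON values below are only placeholders for the jump and convergence parameters from class:

    use strict;
    use warnings;

    # Assumed already built from links.db (dangling links removed):
    #   %inlinks   maps each page to a reference to the list of pages linking to it
    #   %outdegree maps each crawled page to its number of outlinks within the crawl set
    my (%inlinks, %outdegree);

    my $ALPHA   = 0.15;      # jump parameter -- substitute the value from class
    my $EPSILON = 0.0001;    # convergence threshold -- substitute the value from class

    my @pages = keys %outdegree;
    my $N     = scalar @pages;
    die "no pages\n" unless $N;
    my %pr = map { ( $_ => 1 / $N ) } @pages;    # start from a uniform distribution

    while (1) {
        my %new;
        my $delta = 0;
        for my $p (@pages) {
            my $sum = 0;
            $sum += $pr{$_} / $outdegree{$_} for @{ $inlinks{$p} || [] };
            $new{$p} = $ALPHA / $N + (1 - $ALPHA) * $sum;
            $delta  += abs( $new{$p} - $pr{$p} );
        }
        %pr = %new;
        last if $delta < $EPSILON;    # stop when the total change is small enough
    }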
The retrieval and ranking system should take as command-line input a query, composed of a list of keywords (query terms), as follows:

    retrieve.pl kw1 kw2 kw3 ...

where kw1, ..., kwN are the N query terms.
Your script will also need access to a number of files and directories developed in prior assignments (file-to-url lookup table, index DB file, stopword list, TF and IDF DB files, etc.). These parameters can be specified inside the script, but make sure they are clearly declared near the top, so that the AI can change them when grading.
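For example, the top of retrieve.pl might declare something like the following (all file names and paths here are placeholders; use whatever names you chose in the earlier assignments):

    # --- adjustable parameters: edit these paths to match your own setup ---
    my $FILE2URL_DB = '../a1/file2url.db';    # file-to-url lookup table
    my $INDEX_DB    = '../a2/index.db';       # inverted index
    my $TF_DB       = '../a2/tf.db';          # term frequency data
    my $IDF_DB      = '../a2/idf.db';         # inverse document frequencies
    my $STOPFILE    = '../a2/stoplist.txt';   # stopword list
    my $PR_DB       = 'pr.db';                # PageRanks from pagerank.pl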
The first step is to remove stopwords from the query and stem the query terms (see Assignment 2). Then use the index to retrieve the set of pages (hits) matching the query. For extra credit, you could keep track of how many of the query terms each hit matches, so that pages containing more of the query terms can later be given higher priority. However, this is not required. You can assume that a page is a hit if it contains at least one of the query terms.
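A minimal sketch of this step, assuming a %stop hash loaded from your stopword file, a stem() routine as in Assignment 2, and an inverted index whose values are space-separated page lists (all of these are assumptions to adapt to your own layout):

    use strict;
    use warnings;

    my %stop;     # assumed loaded from your stopword file
    my %index;    # assumed tied with DB_File to your inverted index

    # lowercase, drop stopwords, then stem what remains
    my @terms = grep { !$stop{$_} } map { lc } @ARGV;
    @terms = map { stem($_) } @terms;    # stem() as in Assignment 2

    # a page is a hit if it contains at least one query term
    my %hits;
    for my $t (@terms) {
        next unless exists $index{$t};
        $hits{$_} = 1 for split /\s+/, $index{$t};    # assumes space-separated page lists
    }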
Once you have a list of hits, create a vector representation for each hit page as well as for the query. For the hits, you can use the TF and IDF databases built in the second assignment to construct the TF-IDF vectors. A TF-IDF vector can be represented as a hash with terms (stems) as keys and weights as values. A hash of hashes is therefore a suitable data structure for the hit vectors.
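Continuing the sketch above, the hit vectors might be assembled along these lines (the "page term" key layout for the TF data is only an assumption; adapt it to your Assignment 2 format):

    # %tf and %idf are assumed tied with DB_File to your TF and IDF files
    my (%tf, %idf);

    # hash of hashes: $vec{$page}{$term} holds the TF-IDF weight
    my %vec;
    for my $page (keys %hits) {
        for my $t (@terms) {
            my $tf = $tf{"$page $t"} || 0;    # key layout is an assumption -- use your own
            $vec{$page}{$t} = $tf * ( $idf{$t} || 0 );
        }
    }

    # the query vector: one IDF weight per (stemmed) query term
    my %qvec = map { ( $_ => $idf{$_} || 0 ) } @terms;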
For each hit page, (i) compute the cosine similarity between the hit vector and the query vector and (ii) look up the PageRank. Then combine these two numbers. You could do a simple linear combination of the two, of the type

    score = alpha * sim + (1 - alpha) * pagerank

where alpha is a number between 0 and 1. Or you could multiply the two, as in

    score = sim * pagerank

Make sure your combination function is well documented. This is what will differentiate your search engine from others!
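One possible shape for this computation, continuing the earlier sketches: %size holds the precomputed TF-IDF vector sizes discussed further below, %pr is assumed tied to pr.db, and the value of $A is only an arbitrary example.

    my (%size, %pr);    # assumed tied to your vector-size data and to pr.db

    # cosine similarity between the query vector and one hit vector
    sub cosine {
        my ($qvec, $dvec, $dsize) = @_;
        my ($dot, $qnorm) = (0, 0);
        for my $t (keys %$qvec) {
            $dot   += $qvec->{$t} * ( $dvec->{$t} || 0 );
            $qnorm += $qvec->{$t} ** 2;
        }
        return 0 unless $qnorm && $dsize;
        return $dot / ( sqrt($qnorm) * $dsize );
    }

    # combine text similarity and PageRank, here as a linear combination
    my $A = 0.8;    # example weight on similarity -- choose and document your own
    my %score;
    for my $page (keys %hits) {
        my $sim = cosine( \%qvec, $vec{$page}, $size{$page} );
        $score{$page} = $A * $sim + (1 - $A) * ( $pr{$page} || 0 );
    }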
Finally, sort the hits by the resulting score and display them to the user (print to standard output). For each hit, display the rank, the score, and the URL (which you will look up from the file-to-url hash table).
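For example, continuing the sketch (%file2url is assumed tied to your file-to-url table):

    my %file2url;    # assumed tied to your file-to-url lookup table

    my $rank = 0;
    for my $page ( sort { $score{$b} <=> $score{$a} } keys %score ) {
        printf "%d. (%.2f) %s\n", ++$rank, $score{$page}, $file2url{$page};
    }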
Both scripts should be world-executable and world-readable so the AI can see and run them. All the Berkeley DB files created should be world-readable so that the AI can inspect them. Also for this purpose the parent directories of your working directory should be world-executable. Create a directory for this assignment under your home directory as follows:
    xavier:~> mkdir a3
    xavier:~> chmod a+rx a3
    xavier:~> cd a3
    xavier:~/a3>

Then you would work in a3, placing your output files there. So, running your scripts would result in something like this:

    xavier:~/a3> ./pagerank.pl ../a2/links.db pr.db
    xavier:~/a3> ./retrieve.pl bart simpson
    1. (0.73) http://www.thesimpsons.com/
    2. (0.68) http://www.springfield.com/
    3. (0.41) http://www.hedges.org/Simpsons/
    ...
    xavier:~/a3>
For IDF, depending on what you have already computed and stored in the second assignment, you may need to do some work to transform the number of documents in which a term occurs into the actual IDF value of the term. For efficiency, compute the IDF value of each term once (offline) and store it in a tied hash.
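For instance, if you stored the raw document count df(t) for each term t, the standard transformation is idf(t) = log(N / df(t)), where N is the total number of crawled documents. The offline step might look like this sketch (names, file paths, and the document count are placeholders):

    use strict;
    use warnings;
    use DB_File;

    my %df;    # assumed tied to the DB file holding per-term document counts
    my %idf;
    tie %idf, 'DB_File', 'idf.db' or die "cannot tie idf.db: $!";

    my $N = 1000;    # total number of crawled documents -- substitute your own count
    $idf{$_} = log( $N / $df{$_} ) for keys %df;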
For the computation of cosine similarities, you will need the TF-IDF vector size of each page. You may have computed this information at indexing time and stored it with the TF data. If not, do it now. It is less efficient to compute vector sizes of hit pages each time you compute cosine similarity.
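The vector size is just the Euclidean norm: the square root of the sum of the squared TF-IDF weights. A one-time pass might look like this sketch (assuming %vec holds the full TF-IDF vector of each page, as available at indexing time):

    # for each page: size = sqrt( sum over terms of weight^2 )
    my %size;
    for my $page (keys %vec) {
        my $sum = 0;
        $sum += $_ ** 2 for values %{ $vec{$page} };
        $size{$page} = sqrt($sum);
    }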
Make sure your code is thoroughly debugged and tested. Always use the -w warning flag and the strict module.
Make sure your code is thoroughly legible, understandable, and commented. Cryptic or uncommented code is not acceptable.
Make sure your system works well. Test it for several queries for which relevant pages are included in your index.
In addition to your PageRank and retrieval/ranking scripts, inlink Berkeley DB file, and any other data files created or used by your scripts, place in the a3 directory an HTML file named a3-readme.html with concise but detailed documentation on your PageRank and retrieval/ranking scripts: how you implemented them (e.g., what data structures), what parameters you used, what choices you made (e.g., for combining PageRank and text similarity), etc. It is important to also list a handful of sample queries that can be answered by your system! Give credit to any source of assistance (students with whom you discussed your assignments, instructors, books, online sources, etc.) -- note there is no penalty for credited sources, but there is a penalty for uncredited sources even if admissible. Include your full name and IU network ID. Make sure this documentation is properly formatted as valid XHTML. Hint 1: if your code is properly commented, writing the readme file should be a simple matter of collating all the comments from your code and formatting them. All the documentation required in the readme file should already be contained in your code comments. Hint 2: use a text editor to write your documentation, not a WYSIWYG HTML editor, and especially not M$Word.
(Re)move any unnecessary temporary files from the a3 directory. Create a gzipped archive of the a3 directory using the tar czf command as usual (see Assignment 1 and the example below), naming the resulting file a3-login.tgz, where login is your username. Now upload the file a3-login.tgz (in your home directory) to the A3 Drop Box on Oncourse by the deadline.
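For example, the archiving step would look like this from your home directory (with login replaced by your own username):

    xavier:~> tar czf a3-login.tgz a3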
The assignment will be graded based on the following criteria: