INFO I427 Search Informatics (3 CR)
Google under the hood

Assignment 4: Evaluating search engines

For this assignment you will evaluate two Web search engines and compare their performance: (1) Yahoo, and (2) the search engine you have built for this course. Evaluation is the last step in building a search engine; you will build upon the crawler, indexer, and retrieval systems developed in the first three assignments. You are strongly advised to start working early. The relevance assessments will be done by your fellow students, and you will submit the results. Our evaluation methodology will be relatively simple.

Read and follow the directions in this document very carefully, as they are filled with details that are meant to facilitate your task. You may discuss the assignment with instructors and other students (in class or online), but not with people outside class. You may consult printed and/or online references, books, tutorials, etc. Most importantly, you are to write your code individually and any work you submit must be your own.

Experiment design and management

All the scripts you create should be world-executable and world-readable so that the AI (associate instructor) can see and run them, and in particular the CGI scripts, so that your fellow students can evaluate your search engines. All the text and Berkeley DB files created should be world-readable so that the AI can inspect them. For the same reason, the parent directories of your working directory should be world-executable. Work in your CGI directory for this assignment, as explained under How do I run a CGI script over the Web? in the FAQ page. Remember that your data files need to reside in your directory under /var/www/data/.

Install the CGI script evaluate.cgi (available via Oncourse), which shows a set of queries and hits to a user (subject). The queries should be contained in a text file (call it queries.txt), one per line; the name of the file is a parameter in the script. For each query, the script displays a set of hits, and for each hit it collects a relevance assessment from the user. Note that the hits are shuffled to avoid biasing the subject: it must be impossible during evaluation to determine the source of a page. The user can click on hit URLs, which open in a new window, in order to evaluate the relevance of each page.
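To illustrate the shuffling step, here is a minimal sketch, assuming the hits from the two engines are already in the hypothetical arrays @hits1 and @hits2, that uses the shuffle function from the core List::Util module:

    #!/usr/bin/perl -w
    use strict;
    use List::Util qw(shuffle);

    # Hypothetical arrays of hit URLs from the two engines.
    my @hits1 = ('http://example.edu/a.html', 'http://example.edu/b.html');
    my @hits2 = ('http://example.com/x.html', 'http://example.com/y.html');

    # Remember which engine each hit came from, then shuffle the combined
    # list so the subject cannot tell the source of any page.
    my %source = ((map { $_ => 'engine1' } @hits1),
                  (map { $_ => 'engine2' } @hits2));
    my @display = shuffle(@hits1, @hits2);

    foreach my $url (@display) {
        # The real CGI script would emit an HTML form row here; %source is
        # only consulted later, when tabulating the assessments.
        print "$url\n";
    }

Keeping the source in a separate hash, rather than in the displayed page, is one simple way to ensure that the subject cannot determine where a hit came from.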

Modify the evaluate.cgi script to enter your query file name, your search engine name, and the names of the DB_Files in which to store hits and relevance assessments. You can also modify the set of relevance scores and labels. Finally, fix the search1 and search2 subroutines so that they return a ranked list of hits from your search engine and from the Yahoo API, respectively. Note: since this script will be executed by the web server user (typically apache, www, or httpd) rather than by you, the server must have permission to read and write any files and directories that the script accesses. Also, the files created by the script will be owned by the web server user, so if you want access to those files for later analysis, make sure that the script creates them with appropriate permissions.
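As a sketch of the permissions issue, the following (with the hypothetical file name hits.db) shows how a DB_File could be created with mode 0644, so that the file, although owned by the web server user, remains readable by you for later analysis; note that the server's umask may further restrict these bits:

    #!/usr/bin/perl -w
    use strict;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    # Hypothetical file name; the real script takes this from a parameter.
    my $hits_file = 'hits.db';

    # Create the Berkeley DB file with mode 0644 (world-readable) so that,
    # even though it will be owned by the web server user, you can still
    # read it later when analyzing the results.
    my %hits;
    tie %hits, 'DB_File', $hits_file, O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot open $hits_file: $!";

    $hits{'query1:rank1'} = 'http://example.edu/some/page.html';

    untie %hits;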

Prepare 5 queries in your queries file. These should be selected to make sure that your search engine covers relevant pages, i.e., there are some relevant pages among those you crawled.

For each query, show the subject a total of at most 10 pages, half retrieved by your search engine and half obtained from Yahoo. To obtain the Yahoo hits, use Yahoo! Search Web Services via the Yahoo::Search Perl module (you will need an AppID).
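As one possible shape for the search2 subroutine, here is a minimal sketch based on the Yahoo::Search documentation; 'YourAppIdHere' is a placeholder for your own AppID, and the query and hit count are passed in as arguments:

    #!/usr/bin/perl -w
    use strict;
    use Yahoo::Search;

    # Sketch of search2: query the Yahoo Search API and return a ranked
    # list of hit URLs.
    sub search2 {
        my ($query, $count) = @_;
        my @results = Yahoo::Search->Results(
            Doc   => $query,
            AppId => 'YourAppIdHere',   # placeholder: use your own AppID
            Count => $count,
        );
        warn $@ if $@;                  # report any API errors
        return map { $_->Url } @results;
    }

    my @yahoo_hits = search2('open source search engine', 5);
    print "$_\n" foreach @yahoo_hits;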

At least 3 fellow students should be enlisted to evaluate your search engines. You may use the discussion board to post a request for volunteer subjects. It is your responsibility to clearly instruct your subjects on how to run your evaluation. To ensure that sufficient assessments are obtained, each student is required to evaluate 3 fellow students' search engines; respond to a request for subjects on the discussion board to indicate that you are evaluating that system. All of this will work only if everyone posts their request well ahead of the deadline, so everyone is required to post requests 72 hours before the assignment's due date. Doing this is crucial in order to complete the assignment and be graded. All assessments must be completed within one day, no later than 48 hours before the due date. To ensure that all assessments are completed in time, each student's raw score for this assignment will be discounted by 30% for each assessment not completed 48 hours before the due date. For example, if you complete only 1 of your 3 evaluations in time, you will get a 60% penalty!

Use the script rel_sets.pl (available via Oncourse), which uses the relevance assessments collected from the subjects to construct a relevant set and tabulate the results. The DB_Files with hits and relevance feedback must be passed to the script as command line arguments. Modify the rel_sets.pl script based on your chosen definition of consensus relevance --- for example, is a hit relevant if all, or a majority of, the subjects label it as somewhat relevant? Or if just one subject labels it as very relevant?
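For example, here is a minimal sketch of one possible consensus rule (a strict majority of subjects judging the page at least somewhat relevant), assuming the assessments are stored as numeric scores with 0 = not relevant, 1 = somewhat relevant, 2 = very relevant; both the rule and the scale are assumptions that you are free to change:

    #!/usr/bin/perl -w
    use strict;

    # A page is deemed relevant if a strict majority of the subjects gave
    # it a score of at least 1 ("somewhat relevant" on the assumed scale).
    sub is_relevant {
        my @scores = @_;                      # one score per subject
        return 0 unless @scores;
        my $votes = grep { $_ >= 1 } @scores;
        return $votes * 2 > scalar(@scores);  # strict majority
    }

    # Example: two of three subjects found the page at least somewhat relevant.
    print is_relevant(2, 1, 0) ? "relevant\n" : "not relevant\n";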

Finally, modify the rel_sets.pl script and/or write a separate script that reads the output of rel_sets.pl (or use a spreadsheet application) to first compute precision and recall for each query as a function of rank. For example, the output of the modified script might look like this for some query, where nrel is the size of the query's relevant set and the e1 and e2 columns indicate whether the hit at that rank from engine 1 and engine 2, respectively, is relevant (a small sketch of this computation follows the table):

	rank    nrel    e1      prec    recl    e2      prec    recl
	----    ----    --      ----    ----    --      ----    ----
	1       2       0       0.00    0.00    1       1.00    0.50  
	2       2       1       0.50    0.50    1       1.00    1.00
	3       2       0       0.33    0.50    0       0.67    1.00
	...
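Here is a minimal sketch of that computation for a single query and a single engine, with hypothetical hard-coded values chosen to reproduce the e2 columns above (nrel = 2; the hits at ranks 1 and 2 are relevant, the hit at rank 3 is not):

    #!/usr/bin/perl -w
    use strict;

    # Size of the relevant set and a 0/1 relevance flag for the hit at
    # each rank (hypothetical values matching the example table).
    my $nrel = 2;
    my @rel  = (1, 1, 0);

    my $found = 0;
    for my $rank (1 .. @rel) {
        $found += $rel[$rank - 1];
        my $prec = $found / $rank;              # precision at this rank
        my $recl = $nrel ? $found / $nrel : 0;  # recall at this rank
        printf "%d\t%.2f\t%.2f\n", $rank, $prec, $recl;
    }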
	
Then aggregate the results across queries. Save the results to a text file and use a plotting application (e.g., Excel or gnuplot) to build a plot with the precision-recall curves for the two engines. You may use 5- or 11-point averages with interpolation, or per-rank averages across queries, as discussed in class. The final plot should be exported to a file in PNG format and posted on your capricorn website (see the instructions for the readme file below).
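If you choose the 11-point option, the following sketch shows one way to compute interpolated precision at the standard recall levels 0.0, 0.1, ..., 1.0 and average it across queries; the input data below are hypothetical (recall, precision) pairs observed at successive ranks for each query:

    #!/usr/bin/perl -w
    use strict;

    # For each query, the [recall, precision] pairs observed at successive
    # ranks (hypothetical values for two queries).
    my @queries = (
        [ [0.50, 1.00], [1.00, 1.00], [1.00, 0.67] ],
        [ [0.00, 0.00], [0.50, 0.50], [0.50, 0.33] ],
    );

    my @levels = map { $_ / 10 } 0 .. 10;   # 0.0, 0.1, ..., 1.0
    my @avg    = (0) x 11;

    foreach my $q (@queries) {
        for my $i (0 .. 10) {
            # Interpolated precision at level r: the maximum precision
            # observed at any recall >= r (0 if recall never reaches r).
            my $p = 0;
            foreach my $pair (@$q) {
                $p = $pair->[1] if $pair->[0] >= $levels[$i] && $pair->[1] > $p;
            }
            $avg[$i] += $p / scalar(@queries);
        }
    }

    # One line per recall level, ready to save and plot.
    printf "%.1f\t%.3f\n", $levels[$_], $avg[$_] for 0 .. 10;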

What to turn in

Make sure your code is thoroughly debugged and tested. Always use the -w warning flag and the strict module.

Make sure your code is thoroughly legible, understandable, and commented. Cryptic or uncommented code is not acceptable.

Make sure your system works well. Test it thoroughly yourself before engaging your fellow students for the actual assessments.

Place your scripts, query and data files, plot file, and any other support files created or used by your scripts in a directory named a4. Further, place in this directory an HTML file named a4-readme.html with concise but detailed documentation on your evaluation system: how you implemented it (e.g., what data structures), what parameters you used, what choices you made (e.g., what consensus rule you used to define relevance), etc. Link to your plot file with the precision-recall curves, and add one paragraph commenting on the results of your evaluation. It is important to also list (1) your 5 evaluation queries, (2) the names of the subjects who performed the relevance assessments to evaluate your system, and (3) the names of the students for whom you performed relevance assessments. Give credit to any source of assistance (students with whom you discussed your assignments, instructors, books, online sources, etc.) -- note that there is no penalty for credited sources, but there is a penalty for uncredited sources even if admissible. Include your full name and IU network ID. Make sure this documentation is properly formatted as valid XHTML. Hint 1: if your code is properly commented, writing the readme file should be a simple matter of collating all the comments from your code and formatting them; all the documentation required in the readme file should already be contained in your code comments. Hint 2: use a text editor to write your documentation, not a WYSIWYG HTML editor, and especially not M$Word.

(Re)move any unnecessary temporary files from the a4 directory. Create a gzipped archive of the a4 directory using the tar czf command as usual (see Assignment 1), naming the resulting file a4-login.tgz where login is your username. Now upload the file a4-login.tgz to the A4 Drop Box on Oncourse by the deadline.

The assignment's raw score will be based on the following criteria:

Correctness
Does the code work as expected? Does it use reasonable algorithms and data structures as discussed in class? Did it produce reasonable results? Was it tested thoroughly? Does it check for appropriate input and fail gracefully? Did you produce a plot with the two precision-recall curves?
Style
Is the code legible and understandable? Does it use subroutines for clarity, reuse and encapsulation?
Documentation
Is the code thoroughly and clearly commented? Is there an adequate readme file?
May Laziness, Impatience, and Hubris be with you.