INFO I427 Search Informatics (3 CR)
Google under the hood

Final Project: A complete search engine!

For the final project, you will pull together the components that you have developed in the assignments throughout the semester and build a real search engine, in Perl. If your components work well, the project should consist mostly of gluing them together and adding a Web interface. Compared to commercial search engines, yours will be quite simplistic and small-scale, but it should work.

You may work in pairs; no team can have more than two people. It is your responsibility to find a partner if you decide to work in a team. If you work in a team, you may integrate the best (most efficient, most robust, etc.) components developed by the team members throughout the semester. Each team should identify a primary team member for the purpose of turning in the assignment.

You are strongly advised to start working early. Read and follow the directions in this document very carefully, as they are filled with details that are meant to facilitate your task. You may discuss the assignment with instructors and other students (in class or online), but not with people outside class. You may consult printed and/or online references, books, tutorials, etc. Most importantly, you are to write your code individually and any work you submit must be your own.

Components

Crawler

Use the crawler you developed in the first assignment to crawl a small portion of the Web (say, 1000 pages). You are encouraged to modify your crawler to make it more efficient. For example, you might shuffle the frontier to minimize the delay that the RobotUA module imposes on successive requests to the same server. As seed pages, use some pages about a topic of your interest. For example, if you start from informatics websites, you can name your search engine something like Informoogle.
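
As an illustration, a minimal skeleton along these lines might look like the following; the seed URL, agent name, and e-mail address are placeholders to replace with your own:

	use strict;
	use warnings;
	use LWP::RobotUA;
	use List::Util qw(shuffle);

	# Placeholder seed; use pages about your chosen topic.
	my @frontier = ('http://www.informatics.indiana.edu/');

	my $ua = LWP::RobotUA->new('Informoogle/0.1', 'you@indiana.edu');
	$ua->delay(10/60);    # wait at least 10 seconds between hits to one server

	my $fetched = 0;
	while (@frontier and $fetched < 1000) {
	    @frontier = shuffle @frontier;    # spread successive requests across servers
	    my $url      = shift @frontier;
	    my $response = $ua->get($url);
	    next unless $response->is_success;
	    $fetched++;
	    # ... save the page, extract links, push unseen URLs onto @frontier ...
	}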

Indexer

Use the indexer you developed in the second assignment to parse your crawled pages, filter stop words, stem, extract links, and build the inverted index and the various other data structures needed for text analysis. Additionally, you should extract the titles of the crawled pages. This requires only a small extension to the parser, which saves the title of each page in a tied (DB_File) hash. If a page has no title, use untitled or the page's URL as its title.
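
For instance, the title extraction could be sketched as follows; the hash name, DB file name, and sample page are placeholders:

	use strict;
	use warnings;
	use Fcntl;
	use DB_File;

	tie my %title, 'DB_File', 'titles.db', O_CREAT | O_RDWR, 0644, $DB_HASH
	    or die "Cannot tie titles.db: $!";

	my $url  = 'http://www.informatics.indiana.edu/';
	my $html = '<html><head><title>Example Page</title></head><body></body></html>';

	# Grab the title while parsing; fall back to 'untitled' if it is missing.
	my ($t) = $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is;
	$title{$url} = (defined $t and $t ne '') ? $t : 'untitled';

	untie %title;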

Link analysis

Use the PageRank algorithm (developed in the third assignment) to compute the PageRank of each page in your index, and store the scores in a Berkeley DB file.
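
For example, assuming %pagerank holds the URL-to-score mapping computed by your PageRank code, storing it might look like this (the file name pagerank.db and the sample entry are placeholders):

	use strict;
	use warnings;
	use Fcntl;
	use DB_File;

	# Made-up sample datum; in your code this comes from the PageRank computation.
	my %pagerank = ('http://www.informatics.indiana.edu/' => 0.15);

	tie my %pr_db, 'DB_File', 'pagerank.db', O_CREAT | O_RDWR, 0644, $DB_HASH
	    or die "Cannot tie pagerank.db: $!";
	$pr_db{$_} = $pagerank{$_} for keys %pagerank;
	untie %pr_db;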

Web interface

Use the CGI module to develop a simple, clean, friendly Web site that will be the front end of your search engine. It will run in your CGI directory on Capricorn. The interface should have a logo, a text field for the query, a search button, and a link to online (XHTML) documentation. When users submit a query, they are presented with a ranked list of hits. Each hit should display the title of the page, its URL, and a link to the page itself; this information is available from the Berkeley DB files created by your crawler and indexer. If there are too many hits (say, more than 100), display only the first 100. For extra credit, you can display just 10 hits per screen if there are more than 10 hits; the user would then browse through the results by clicking appropriate navigational links on each screen.
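
A bare-bones skeleton of such a front end might look like the following; the engine name, field name, and documentation link are placeholder choices:

	#!/usr/bin/perl -w
	use strict;
	use CGI qw(:standard);

	print header(),
	      start_html(-title => 'Informoogle'),
	      h1('Informoogle'),
	      start_form(-method => 'GET'),
	      textfield(-name => 'query'),
	      submit(-value => 'Search'),
	      end_form(),
	      p(a({ -href => 'proj-readme.html' }, 'Documentation'));

	if (defined(my $query = param('query'))) {
	    # ... retrieve and rank the hits, then print at most 100 of them,
	    # each as a linked title followed by its URL ...
	}
	print end_html();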

Retrieval and ranking

When the user submits a query to your search engine, use the scripts developed in the third assignment to build a vector representation of the query (as you did for pages) and then present the user with a ranked list of hits. In particular, use the index to retrieve the hit list (the list of pages matching the query). For each page in the hit list, compute the text (cosine) similarity between the page and query TFIDF vectors. Then combine this similarity with the PageRank of each hit; how you do this is the "secret recipe" of your search engine, evaluated in the fourth assignment (not so secret, since you have to describe it in your documentation). Finally, sort the hits by the resulting score and present the ranked list to the user as a dynamic (CGI) Web page.
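
By way of illustration, here is a sketch of the cosine computation over TFIDF vectors stored as hashes, followed (in comments) by one conceivable way to combine it with PageRank; the combination shown is just an example, not the prescribed recipe:

	use strict;
	use warnings;

	# Cosine similarity between two TFIDF vectors represented as hash
	# references of the form { term => weight }.
	sub cosine {
	    my ($v1, $v2) = @_;
	    my ($dot, $n1, $n2) = (0, 0, 0);
	    $n1 += $_ ** 2 for values %$v1;
	    $n2 += $_ ** 2 for values %$v2;
	    for my $t (keys %$v1) {
	        $dot += $v1->{$t} * $v2->{$t} if exists $v2->{$t};
	    }
	    return 0 unless $n1 and $n2;
	    return $dot / (sqrt($n1) * sqrt($n2));
	}

	# Tiny demonstration with made-up vectors:
	printf "similarity = %.3f\n",
	       cosine({ web => 0.5, search => 0.5 }, { web => 1 });

	# One conceivable recipe: a linear combination with a tunable weight, e.g.
	#   $score{$url} = $alpha * cosine($query_vec, $page_vec{$url})
	#                + (1 - $alpha) * $pagerank{$url};
	#   my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;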

What to turn in

Make sure your code is thoroughly debugged and tested. Always use the -w warning flag and the strict module.
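
For example, every script can start with:

	#!/usr/bin/perl -w
	use strict;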

Make sure your code is thoroughly legible, understandable, and commented. Cryptic or uncommented code is not acceptable.

Test your code extensively. Assume users will input meaningless junk text. Your engine should not crash.
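
For instance, a minimal defensive check on the query parameter might look like this; the exact policy is your choice:

	use strict;
	use warnings;
	use CGI qw(param);

	my $query = param('query');
	$query = '' unless defined $query;
	$query =~ s/[^\w\s]//g;           # drop characters your index never contains
	if ($query =~ /^\s*$/) {
	    # print a friendly "please enter a query" page instead of crashing
	}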

Place your working engine in your CGI directory (if working as a team, use that of the primary member). Your search engine's main script should be named index.cgi, so your engine should be publicly available at the URL:

	http://capricorn.informatics.indiana.edu/cgi-bin/login/index.cgi
	
where login is your username, or the username of the primary member. The equivalent URL for the other (non-primary) team member should point to a simple script that redirects to the engine page. For redirection, you can use the HTTP Location header.
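
A minimal redirect script could be as simple as the following, with login replaced by the primary member's username:

	#!/usr/bin/perl -w
	use strict;

	print "Location: http://capricorn.informatics.indiana.edu/cgi-bin/login/index.cgi\n\n";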

Copy to a proj directory your script file(s), all Berkeley DB files with your data structures, and an HTML file named proj-readme.html with concise but detailed documentation of all implementation aspects of your search engine. This should be the same as the online documentation page for your search engine. Give credit to any source of assistance (students with whom you discussed your assignments, instructors, books, online sources, etc.) -- note that there is no penalty for credited sources, but there is a penalty for uncredited sources even if they were admissible. Include your full name and IU network ID (for both team members if working in a team). Include the URL of your search engine. Make sure this documentation is properly formatted as valid XHTML.

Hint 1: if your code is properly commented, writing the readme file should be a simple matter of collating the comments from your code and formatting them; all the documentation required in the readme file should already be contained in your code comments.

Hint 2: use a text editor to write your documentation, not a WYSIWYG HTML editor, and especially not M$Word.

Create a gzipped archive of the proj directory using the tar czf command as usual (see Assignment 1 and the example below), naming the resulting file proj-login.tgz where login is your username. If working in a team, name the archive proj-login1+login2.tgz where login1 and login2 are the team members' usernames. Do not include all your crawled pages in this archive -- we can access them on Capricorn. Now upload the archive file to the Project Drop Box of the primary on Oncourse by the deadline. Upload just the proj-readme.html file to the Project Drop Box of the other (non-primary) team member, also by the deadline.
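
For example, for a single-member team the command would be:

	tar czf proj-login.tgz proj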

The assignment will be graded based on the following criteria:

Correctness
Does the search engine work as expected? Does it use reasonable algorithms and data structures as discussed in class? Does it produce reasonable results? Was it tested thoroughly? Does it check for appropriate input and fail gracefully?
Style
Is the code legible and understandable? Does it use subroutines for clarity, reuse and encapsulation?
Documentation
Is the code thoroughly and clearly commented? Is there an adequate readme file?
May Laziness, Impatience, and Hubris be with you.