For the final project, you will pull together the components that you have developed in the assignments throughout the semester and build a real search engine, in Perl. If the components you have work well, the project should consist mostly of gluing them together and adding a Web interface. Compared to commercial search engines, yours will be quite simplistic and small scale, but it should work.
You may work in pairs. No team can have more than two people. It is your responsibility to find a partner if you decide to work in a team. If you work in a team, you can integrate the best (most efficient, robust, etc.) components developed by the team members throughout the semester. Each team should identify a primary team member for the purpose of turning in the assignment.
You are strongly advised to start working early. Read and follow the directions in this document very carefully, as they are filled with details that are meant to facilitate your task. You may discuss the assignment with instructors and other students (in class or online), but not with people outside class. You may consult printed and/or online references, books, tutorials, etc. Most importantly, you are to write your code individually and any work you submit must be your own.
Use the crawler you developed in the first assignment to crawl a small portion of the Web (say, 1000 pages). You are encouraged to modify your crawler to make it more efficient. For example, you might shuffle the frontier to minimize the delay imposed by the LWP::RobotUA module between successive requests to the same server. As seed pages, use some pages about a topic of your interest. For example, if you start from informatics websites, you can name your search engine something like Informoogle.
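A minimal sketch of the shuffling idea, assuming the frontier is a plain array of URLs (the seed URLs and agent name below are illustrative):

    #!/usr/bin/perl -w
    use strict;
    use List::Util qw(shuffle);
    use LWP::RobotUA;

    # Hypothetical frontier of URLs still to be crawled.
    my @frontier = (
        'http://www.example.edu/a.html',
        'http://www.example.edu/b.html',
        'http://www.example.org/c.html',
    );

    # Shuffling interleaves URLs from different servers, so the
    # per-host politeness delay enforced by LWP::RobotUA stalls the
    # crawler less often on consecutive requests.
    @frontier = shuffle(@frontier);

    my $ua = LWP::RobotUA->new('Informoogle/0.1', 'login@indiana.edu');
    $ua->delay(1/60);    # delay is in minutes: wait 1 second per host

    foreach my $url (@frontier) {
        my $response = $ua->get($url);
        # ... parse the page and add new links to the frontier ...
    }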
Use the indexer you developed in the second assignment to parse your crawled pages, filter stop words, stem, extract links, and build the inverted index and various other data structures needed for text analysis. Additionally, you should extract the titles from the crawled pages. This is done by a small extension to the parser, which would save the title of each page in a tied (DB_File) hash. If the title is missing in a page, use untitled or the URL of the page as a title.
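A sketch of the title-extraction extension, assuming the page source is already in a scalar; the regular expression is a simple illustration and your parser may extract the title more robustly:

    #!/usr/bin/perl -w
    use strict;
    use DB_File;

    # Tie a hash to a Berkeley DB file mapping URL => title.
    tie my %title, 'DB_File', 'titles.db'
        or die "Cannot tie titles.db: $!";

    # Call this from the parser for each crawled page.
    sub save_title {
        my ($url, $html) = @_;
        if ($html =~ m{<title[^>]*>(.*?)</title>}is) {
            my $t = $1;
            $t =~ s/\s+/ /g;        # collapse whitespace
            $title{$url} = $t;
        } else {
            # No title in the page: fall back as the assignment allows.
            $title{$url} = 'untitled';    # or $url
        }
    }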
Use the PageRank algorithm (developed in the third assignment) to compute the PageRank of each page in your index, and store this in a Berkeley DB file.
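A minimal sketch of persisting the PageRank values, where %pagerank stands in for whatever your Assignment 3 code computes (URL => PageRank):

    #!/usr/bin/perl -w
    use strict;
    use DB_File;

    # Tie a hash to a Berkeley DB file; assignments persist to disk.
    tie my %pr_db, 'DB_File', 'pagerank.db'
        or die "Cannot tie pagerank.db: $!";

    # Placeholder value; in your engine this hash comes from the
    # PageRank computation of the third assignment.
    my %pagerank = ('http://www.example.edu/' => 0.012);

    $pr_db{$_} = $pagerank{$_} for keys %pagerank;
    untie %pr_db;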
Use the CGI module to develop a simple, clean, friendly Web site that will be the front-end of your search engine. This will run in your CGI directory on Capricorn. The interface should have a logo, a text field for the query, a search button, and a link to online (XHTML) documentation. When the user submits a query, s/he will be presented with a ranked list of hits. Each hit should display the title of the page, its URL, and a link to the page itself. This information is available from the Berkeley DB files created by your crawler and indexer. If there are too many hits (say more than 100), display only the first 100. For extra credit, you can display just 10 hits per screen if there are more than 10 hits. The user would then browse through the results by clicking on appropriate navigational links on each screen.
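A minimal sketch of such a front page using CGI.pm's HTML shortcuts (logo.png and the page title are placeholders):

    #!/usr/bin/perl -w
    use strict;
    use CGI qw(:standard);

    # Front page: logo, query field, search button, documentation link.
    print header(),
          start_html(-title => 'Informoogle'),
          img({-src => 'logo.png', -alt => 'Informoogle logo'}),
          start_form(-method => 'GET'),
          textfield(-name => 'query', -size => 40),
          submit(-name => 'search', -value => 'Search'),
          end_form(),
          p(a({-href => 'proj-readme.html'}, 'Documentation')),
          end_html();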
When the user submits a query to your search engine, you should use the scripts developed in the third assignment to build a vector representation of the query (as you did for pages) and then present to the user a ranked list of hits. In particular, use the index to retrieve the hit list (list of pages matching the query). For each page in the hit list, compute the text (cosine) similarity between the page and query TFIDF vectors. Then combine this similarity with the PageRank of each hit; how you do this is the "secret recipe" of your search engine, evaluated in the fourth assignment (not so secret because you have to describe it in your documentation). Finally, sort the hits by the resulting score and present the ranked list back to the user as a dynamic (CGI) Web page.
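As an illustration only (this is not the required recipe), the two signals could be combined as a weighted sum; the weight below is a made-up example:

    #!/usr/bin/perl -w
    use strict;

    # Cosine similarity between two TFIDF vectors, each a hash
    # reference mapping term => weight.
    sub cosine {
        my ($v1, $v2) = @_;
        my ($dot, $n1, $n2) = (0, 0, 0);
        $n1 += $_ ** 2 for values %$v1;
        $n2 += $_ ** 2 for values %$v2;
        foreach my $term (keys %$v1) {
            $dot += $v1->{$term} * $v2->{$term} if exists $v2->{$term};
        }
        return ($n1 && $n2) ? $dot / sqrt($n1 * $n2) : 0;
    }

    # Assumed filled in by your retrieval code: @hits (matching URLs),
    # %sim (URL => cosine similarity), %pr (URL => PageRank).
    my (@hits, %sim, %pr);

    # Hypothetical recipe: weighted sum of similarity and PageRank.
    my $alpha = 0.75;
    my %score = map { $_ => $alpha * $sim{$_} + (1 - $alpha) * $pr{$_} } @hits;
    my @ranked = sort { $score{$b} <=> $score{$a} } @hits;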
Make sure your code is thoroughly debugged and tested. Always use the -w warning flag and the strict module.
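Concretely, every script should begin along these lines:

    #!/usr/bin/perl -w
    use strict;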
Make sure your code is thoroughly legible, understandable, and commented. Cryptic or uncommented code is not acceptable.
Test your code extensively. Assume users will input meaningless junk text. Your engine should not crash.
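For example, a minimal defensive step is to normalize and HTML-escape the raw query before using or echoing it (the cleanup regex is just one possibility):

    use CGI qw(:standard escapeHTML);

    my $raw = param('query');
    $raw = '' unless defined $raw;          # no query parameter at all
    $raw =~ s/[^\w\s]//g;                   # drop junk characters
    my $safe = escapeHTML($raw);            # safe to echo in the page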
Place your working engine in your CGI directory (if working as a team, use that of the primary team member). Your search engine's (main) script should be named index.cgi. So your engine should be publicly available at the URL http://capricorn.informatics.indiana.edu/cgi-bin/login/index.cgi, where login is your username, or the username of the primary. The equivalent URL for the other (non-primary) team member should have a simple script redirecting to the engine page. For redirection, you can use the HTTP header Location.
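A sketch of the redirecting script, assuming the primary's username is login1 (substitute the real one):

    #!/usr/bin/perl -w
    use strict;

    # Redirect to the primary team member's engine via the Location header.
    print "Location: http://capricorn.informatics.indiana.edu/cgi-bin/login1/index.cgi\n\n";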
Copy to a proj directory your script file(s), all Berkeley DB files with your data structures, and an HTML file named proj-readme.html with concise but detailed documentation on all implementation aspects of your search engine. This should be the same as the online documentation page for your search engine. Give credit to any source of assistance (students with whom you discussed your assignments, instructors, books, online sources, etc.) -- note there is no penalty for credited sources, but there is a penalty for uncredited sources, even if admissible. Include your full name and IU network ID (for both team members if working in a team). Include the URL of your search engine. Make sure this documentation is properly formatted as valid XHTML. Hint 1: if your code is properly commented, writing the readme file should be a simple matter of collating all the comments from your code and formatting them; all the documentation required in the readme file should already be contained in your code comments. Hint 2: use a text editor to write your documentation, not a WYSIWYG HTML editor, and especially not M$Word.
Create a gzipped archive of the proj directory using the tar czf command as usual (see Assignment 1), naming the resulting file proj-login.tgz, where login is your username. If working in a team, name the archive proj-login1+login2.tgz, where login1 and login2 are the team members' usernames. Do not include all your crawled pages in this archive -- we can access them on Capricorn. Now upload the archive file to the Project Drop Box of the primary on Oncourse by the deadline. Upload just the proj-readme.html file to the Project Drop Box of the other (non-primary) team member, also by the deadline.
The assignment will be graded based on the following criteria: