Scalable Search Engines
Via Adaptive Topic-Driven Crawlers

Filippo Menczer, Principal Investigator
Department of Management Sciences
The University of Iowa
Iowa City, IA 52242
Phone: (319) 335-0884
Fax: (319) 335-0297
Email: filippo-menczer at uiowa.edu
URL: http://dollar.biz.uiowa.edu/~fil/

WWW PAGE

Project URL: http://dollar.biz.uiowa.edu/~fil/CAREER/

Supported Students

Collaborators

Project Award Information*

Keywords

Search engines, scalability, topic-driven crawlers, intelligent agents, adaptation, personalization, distributed algorithms, collaborative P2P search.

Project Summary

This research project will develop adaptive, personalized, topic-driven crawlers to complement search engines. Current search engines do not scale well in a dynamic environment such as the Web, and the proposed research aims to overcome this limitation by using an autonomous agent-based approach to build new scalable algorithms. The career development plan includes integrating the results into the undergraduate and graduate curricula. Students will also carry out projects using the results of this research.

Publications and Products

Papers with preliminary results
Tools

Project Impact

The project is just getting started.

Goals, Objectives, and Targeted Activities

Ongoing activities
Targets for 2002-2003
Goals for subsequent years

Project References

Area Background

The model behind search engines assumes a static collection, as was the case for earlier information retrieval systems. But since the Web is highly dynamic, indexes are reduced to inaccurate and incomplete "snapshots" of the Web. As a result, users are faced with poor recall, precision, and recency. Precision has been improved by ranking algorithms, such as Google's PageRank, that perform link analysis in addition to traditional lexical analysis. There is much ongoing research to better understand the Web's link topology and how links can give search engines cues about the meaning of pages.

Another shortcoming of the disjoint processes of crawling/indexing on one hand, and querying on the other, is that the crawl process is not informed by the users. Since search engines cannot cover the whole Web, they make choices as to how to bias their crawling algorithms in favor of certain information resources over others. It would seem preferable to use information gathered from users to guide the crawling algorithms. This requires closing the loop from user queries back to crawling.

These factors point to a need for efficient topic-driven or personalized crawling algorithms. Such crawlers would not run into stale information, and would use knowledge about the topic or the user as context to interpret lexical and linkage clues during their search. Intelligent topic-driven crawlers have several potential applications: replacing "context-free" search engine robots, complementing robots to keep search engine indexes up to date for topics with large user bases, performing add-on searches upon user request, building indexes for topical search engines or portals, and performing client-based personalized searches.
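As an illustration of what "topic-driven" can mean in practice, the sketch below shows a simple best-first crawler. It is not the project's algorithm: it is a generic textbook-style strategy, run over a small in-memory toy graph instead of the real Web. The names `topic_score`, `best_first_crawl`, and `TOY_WEB`, the term-frequency scoring, and the rule that a link inherits the score of the page pointing to it are all assumptions made for this example.

```python
import heapq


def topic_score(text, topic_terms):
    """Fraction of words in `text` that are topic terms (a crude lexical cue)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def best_first_crawl(web, seeds, topic_terms, max_pages):
    """Visit up to `max_pages` pages, always expanding the most promising link.

    `web` maps a URL to (page_text, outgoing_urls). It is a hypothetical
    stand-in for real HTTP fetching and HTML parsing.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue  # unknown page: nothing to fetch in the toy graph
        text, links = web[url]
        score = topic_score(text, topic_terms)
        visited.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: a link is as promising as the page that cites it.
                heapq.heappush(frontier, (-score, link))
    return visited


# Hypothetical toy graph: pages "a".."e" with text and outgoing links.
TOY_WEB = {
    "a": ("web crawler search index", ["b", "c"]),
    "b": ("cooking recipes pasta sauce", ["d"]),
    "c": ("topic driven crawler agents search", ["e"]),
    "d": ("more pasta recipes", []),
    "e": ("adaptive crawler search engines", []),
}

crawl_order = [url for url, _ in
               best_first_crawl(TOY_WEB, ["a"], {"crawler", "search"}, 4)]
```

With a budget of four pages, the crawler in this toy run fetches "a", "b", "c", and "e" and never spends a fetch on "d", because "d" is reachable only through an off-topic page. An adaptive crawler of the kind proposed here would presumably go further, for example by learning its link-scoring rule from experience or from user feedback, which this sketch does not attempt.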

Area References


*All award information can be found on the NSF on-line Awards Abstracts system: http://www.fastlane.nsf.gov/a6/A6Start.htm.