CAREER: Scalable Search Engines Via Adaptive Topic-Driven Crawlers

Sponsored by the National Science Foundation
Information and Data Management Program

PI: Filippo Menczer

The goal of this project is to build build intelligent, adaptive, personalized, topic-driven crawlers to help search engines overcome their current limitations in scaling with the dynamic Web. Today's search engines cannot keep up with the swift changes and growth of the Web because they are based on a static model of the environment and therefore they separate spidering and querying for the sake of efficiency. Closing the loop from user queries back to crawling will allow the crawling process to be informed by the interests and preferences of the users.

We are developing an agent-based approach to building more scalable search algorithms. We are particularly interested in novel distributed algorithms and representations used to construct populations of adaptive crawlers. These InfoSpiders will browse networked information environments online in search of pages relevant to the user, by traversing hyperlinks in an autonomous and intelligent fashion. Each crawler will adapt to the spatial and temporal regularities of its local context thanks to a combination of machine learning techniques inspired by ecological models: evolutionary adaptation with local selection, reinforcement learning and selective query expansion by internalization of environmental signals, and optional relevance feedback.

A variable number of crawlers will collaborate to carry out the search process. This multi-agent collective will be biased by the success of crawlers situated in relevant portions of the Web, and by possible feedback from the user. Interactions among crawler are limited to the sharing of finite environmental resources. Such an algorithm lends itself to concurrent and distributed implementations. Eventually, mobile crawlers will migrate to remote servers and execute near the data. We envision the use of peer-to-peer protocols to facilitate searching across indexes built by distributed crawlers by means of agent collaboration, mediation and referral.

Such algorithms inspired by multi-agent ecologies are being evaluated along with other crawling algorithms in a framework developed to ensure use of equal resources and to benchmark algorithmic complexity. This way crawler performance can be evaluated using quality/cost analyses and the scalability of the algorithms can be tested by varying their available resources.

We plan to integrate these crawling algorithms with search engines. Adaptive crawlers can be employed to replace existing blind robots for indexing, to perform add-on query-driven crawls whose results are to enhance the index, and to build indexes for topical search engines. The integration of adaptive topic-driven crawlers with the current search engine technology will lead to new search tools that will scale better with the dynamic Web and achieve better coverage and recency than today's engines, thus helping users to find relevant information as soon as it appears. Integration will also allow to amortize the costs of query-driven crawls.

Empirical measures are being carried out to quantify the correlation between document proximity in word vector space, hypertext space, and semantic space. The results will shed light on the viability of online navigation by autonomous agents and on the value of lexical and linkage cues to guide topic-driven crawlers. Preliminary results suggest that both lexical and linkage cues are valuable to guide online search toward relevant pages.

NSF logo This material is based upon work supported by the National Science Foundation under award N. IIS-0133124 and IIS-0348940. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.