Project Report for IDM Workshop 2003

CAREER: Scalable Search Engines Via Adaptive Topic-Driven Crawlers

IIS-0133124

Principal Investigator

Filippo Menczer
School of Informatics and Department of Computer Science
Indiana University
Bloomington IN 47408
fil@indiana.edu
http://informatics.indiana.edu/fil/

Collaborator

Gautam Pant
Department of Management Sciences
University of Iowa
Iowa City IA 52242
gautam-pant@uiowa.edu
http://dollar.biz.uiowa.edu/~pant/

Keywords

topical web crawlers; scalable search engines; distributed, collaborative, adaptive search agents; lexical, link, semantic topology; web growth models

Project Summary

This research project is developing adaptive, personalized, distributed, topical crawlers to build search engines that scale with the dynamic nature of the Web. The project has a theoretical/empirical aspect aimed at the development of realistic models to characterize the growth and topologies of the Web, and an applied aspect to utilize this knowledge in building adaptive systems for scalable search and retrieval algorithms. The career development plan includes integrating the results into the undergraduate and graduate curriculum. This plan has been strengthened by the PI's recent transfer to the School of Informatics and Department of Computer Science at Indiana University.

Publications

P. Srinivasan, F. Menczer, G. Pant: Defining Evaluation Methodologies for Topical Crawlers. SIGIR 2003 Workshop on Defining Evaluation Methodologies for Terabyte-Scale Collections
F. Menczer, G. Pant, P. Srinivasan: Topical Web Crawlers: Evaluating Adaptive Algorithms. To appear in ACM TOIT
G. Pant, S. Bradshaw, F. Menczer: Search engine-crawler symbiosis: Adapting to Community Interests. Proc. ECDL 2003
G. Pant, F. Menczer: Topical crawling for business intelligence. Proc. ECDL 2003
F. Menczer: Complementing Search Engines with Online Web Mining Agents. Decision Support Systems 35(2): 195-212, 2003
G. Pant, P. Srinivasan, F. Menczer: Crawling the Web. To appear in M. Levene and A. Poulovassilis, eds.: Web Dynamics, Springer, 2003
F. Menczer: Growing and Navigating the Small World Web by Local Content. Proc. Natl. Acad. Sci. USA 99(22): 14014-14019, 2002
G. Pant, F. Menczer: MySpiders: Evolve your own intelligent Web crawlers. Autonomous Agents and Multi-Agent Systems 5(2): 221-229, 2002
P. Srinivasan, J. Mitchell, O. Bodenreider, G. Pant, F. Menczer: Web Crawling Agents for Retrieving Biomedical Information. Proc. Int. Workshop on Agents in Bioinformatics (NETTAB 2002)
G. Pant, P. Srinivasan, F. Menczer: Exploration versus Exploitation in Topic Driven Crawlers. WWW 2002 Workshop on Web Dynamics

Project Impact

Gautam Pant, PhD student in the Management Sciences Department at the University of Iowa, has been funded in year 1 of the project. His thesis proposal on Topical Crawling: Infrastructure, Algorithms, applications, Evaluation has been successfully defended. A new RA is currently being selected in the CS department at Indiana University.
A collaborations with a large pharmaceutical company to build a specialized search engine has proven successful. The work started as an application under this project, and continued as a summer internship for Gautam Pant in 2002. Novel topical crawling algorithms were developed for competitive intelligence applications.
Collaborations have been very active and fruitful with Padmini Srinivasan and Shannon Bradshaw at Iowa. New collaborations are being initiated at Indiana with faculty in Informatics, Computer Science, SLIS, and other units.
Papers resulting from this project have been presented at various conferences and workshops: FOCS Workshop on Algorithms for the Web Graph (WAW 2002), Workshop on Agents in Bioinformatics (NETTAB 2002), ECDL 2003, SIGIR 2003 Workshop on Defining Evaluation Methodologies for Terabyte-Scale Collections, Conference on Growing Networks and Graphs in Statistical Physics, Finance, Biology and Social Systems
Work funded by the project has been cited in the press (Technology Research News, Complexity Digest, PNAS News Archive, Insight, Ascribe, etc.). The PI was interviewed on BBC World Service.
Two new courses related to the project are being developed: an undergratuate seminar in informatics (I400) on search engines and a graduate seminar in computer science (B659) on web mining. Both will be offered by the PI at IU in Spring 2004.

Goals, Objectives and Targeted Activities

Completed/ongoing activities

Deployed multi-threaded MySpiders applet; soliciting user feedback.
Developed and disseminated tasks and metrics to evaluate crawlers; proposing a topical crawling task in the TREC Web Track.
Developed new InfoSpiders algorithm for topical crawling; outperformed other crawlers in literature.
Analyzed complexity and performance/cost trade-offs of various crawler algorithms.
Studied role of different adaptive techniques (reinforcement learning, evolutionary bias, query expansion) on crawling performance.
Studying relationship between topologies induced by linkage, lexical, and semantic similiarity across Web pages.
Studied conditions for localized Web crawling algorithms to find short paths.
Developing Web growth model to obtain realistic distribution of textual similarity across linked pages in addition to realistic degree distribution.

New targets for 2003-2004

Study various aspects of possible integration with search engines
Explore synergistic/symbiotic relationship between topical crawlers and indexing algorithms for vertical search engines
Begin design of peer-based distributed, collaborative crawling framework
Consider effect of resources available to crawlers on performance

Area Background

The model behind search engines assumes a static collection, as was the case for earlier information retrieval systems. But since the Web is highly dynamic, indexes are reduced to inaccurate and incomplete "snapshots" of the Web. As a result users are faced with poor recall, precision, and recency. Precision has been improved by ranking algorithms that perform link analysis in addition to traditional lexical analysis, such as Google's PageRank. There is much ongoing research to better understand the Web's link topology and how links can give search engines cues about the meaning of pages.

Another shortcoming of the disjoint processes of crawling/indexing on one hand, and querying on the other, is that the crawl process is not informed by the users. Since search engines cannot cover the whole Web, they make choices as to how to bias their crawling algorithms in favor of certain information resources over others. It would seem preferable to use information gathered from users to guide the crawling algorithms.

These factors point to a need for efficient topic-driven or personalized crawling algorithms. Such crawlers would not run into stale information, and would use knowledge about the topic or the user as context to interpret lexical and linkage cues during their search.

Area Reference

S. Chakrabarti: Mining the Web: Discovering Knowledge From Hypertext Data. Morgan Kaufmann, 2002

Project Website

http://informatics.indiana.edu/fil/CAREER/ Summary of project with links to reports, papers, software and other related pages

Illustrations

http://informatics.indiana.edu/fil/IS/ Description of applied part of the project related to topical crawling algorithms, with links to demo applet, articles in the press, and some demo QuickTime movies
http://informatics.indiana.edu/fil/Web/ Description of theoretical/empirical part of project, with links to related presentations and articles in the press, and a graphical representation of the semantic maps obtained by studying the relationship between different Web topologies.

Online Software

http://myspiders.biz.uiowa.edu/ MySpiders applet allows users to launch adaptive Web crawlers in real time and observe their performance as well as their evolving internal representations; user can select between two crawling algorithms and number of pages to be visited

Online Data

http://informatics.indiana/edu/fil/IS/Framework/ Crawler evaluation framework includes script to create targets and seeds from Open Directory (http://dmoz.org) and sample target and seed data files