Project Report for IDM Workshop 2004

CAREER: Scalable Search Engines Via Adaptive Topic-Driven Crawlers

IIS-0348940

Principal Investigator

Filippo Menczer
School of Informatics and Department of Computer Science
Indiana University
Bloomington
IN
47408
812-856-1377
812-856-4764
fil at indiana.edu
http://informatics.indiana.edu/fil/

Collaborator (supported PhD student)

Ruj Akavipat
Department of Computer Science
Indiana University
Bloomington
IN
47405
rakavipa at cs.indiana.edu
http://www.cs.indiana.edu/~rakavipa/

Collaborator (supported PhD student)

Le-Shin Wu
Department of Computer Science
Indiana University
Bloomington
IN
47405
lewu at cs.indiana.edu
http://www.cs.indiana.edu/~lewu/

Keywords

topical web crawlers; scalable peer-based search engines; distributed, collaborative, adaptive search agents; lexical, link, semantic topology; web growth models

Project Summary

This research aims to study ways to build more scalable and personalizable search engines by developing adaptive, topical crawling algorithms and distributed, coolaborative, peer based retrieval networks with intelligent query routing via online mining of search data. The project builds on a theoretical/empirical framework aimed at the development of realistic models to characterize the dynamics and topological features of the Web, and their relationships with appropriate notions of relevance. The career development plan includes integrating the results into the undergraduate and graduate curriculum. This plan has been strengthened by the PI's transfer to the School of Informatics and Department of Computer Science at Indiana University.

Publications

F. Menczer: The Evolution of Document Networks. Proc. Natl. Acad. Sci. USA 101: 5261-5265, 2004
F. Menczer: Correlated topologies in citation networks and the Web. Eur. Phys. J. B 38(2): 211-221, 2004
F. Menczer, G. Pant, P. Srinivasan: Topical Web Crawlers: Evaluating Adaptive Algorithms. To appear in ACM TOIT
F. Menczer: Lexical and Semantic Clustering by Web Links. To appear in JASIST
P. Srinivasan, F. Menczer, G. Pant: A General Evaluation Framework for Topical Crawlers. Final version to appear in Information Retrieval
F. Menczer: Combining Link and Content Analysis to Estimate Semantic Similarity. Proc. 13th Intl. WWW Conf. Alt. Track Papers and Posters, pp. 452-453, 2004
R. Akavipat, L.-S. Wu, F. Menczer: Small World Networks in Distributed Web Search. Proc. 13th Intl. WWW Conf. Alt. Track Papers and Posters, pp. 396-397, 2004
G. Pant, P. Srinivasan, F. Menczer: Crawling the Web. In M. Levene and A. Poulovassilis, eds.: Web Dynamics, Springer, 2004
M.S. Aktas, M.A. Nacar, F. Menczer: Personalizing PageRank Based on Domain Profiles. Proc. 6th SIGKDD Workshop on Web Mining and Web Usage Analysis (WebKDD 2004)
A. Maguitman, D. Leake, T. Reichherzer, F. Menczer: Dynamic Extraction of Topic Descriptors and Discriminators: Towards Automatic Context-Based Topic Search. Proc. 13th Conf. on Information and Knowledge Management (CIKM 2004)
P. Srinivasan, F. Menczer, G. Pant: Defining Evaluation Methodologies for Topical Crawlers. SIGIR 2003 Workshop on Defining Evaluation Methodologies for Terabyte-Scale Collections
G. Pant, S. Bradshaw, F. Menczer: Search engine-crawler symbiosis: Adapting to Community Interests. Proc. ECDL 2003
G. Pant, F. Menczer: Topical crawling for business intelligence. Proc. ECDL 2003
F. Menczer: Complementing Search Engines with Online Web Mining Agents. Decision Support Systems 35(2): 195-212, 2003
F. Menczer: Growing and Navigating the Small World Web by Local Content. Proc. Natl. Acad. Sci. USA 99(22): 14014-14019, 2002
G. Pant, F. Menczer: MySpiders: Evolve your own intelligent Web crawlers. Autonomous Agents and Multi-Agent Systems 5(2): 221-229, 2002
P. Srinivasan, J. Mitchell, O. Bodenreider, G. Pant, F. Menczer: Web Crawling Agents for Retrieving Biomedical Information. Proc. Int. Workshop on Agents in Bioinformatics (NETTAB 2002)
G. Pant, P. Srinivasan, F. Menczer: Exploration versus Exploitation in Topic Driven Crawlers. WWW 2002 Workshop on Web Dynamics

Project Impact

Gautam Pant was funded in year 1 of the project while a PhD student in the Management Sciences Department at the University of Iowa. Dr. Pant obtained his PhD in July 2004 with a dissertation titled Learning to Crawl: Classifier-guided Topical Crawlers and is currently an assistant professor at the University of Utah.
A collaborations with a large pharmaceutical company to build a specialized search engine has proven successful. The work started as an application under this project, and continued as a summer internship for Gautam Pant in 2002. Novel topical crawling algorithms were developed for competitive intelligence applications.
Collaborations have been very active and fruitful with Padmini Srinivasan and Shannon Bradshaw at Iowa. New collaborations are under way at Indiana with faculty in Informatics, Computer Science, and SLIS.
Papers resulting from this project have been/will be presented at various conferences and workshops: FOCS Workshop on Algorithms for the Web Graph (WAW 2002), Workshop on Agents in Bioinformatics (NETTAB 2002), ECDL 2003, SIGIR 2003 (Workshop on Defining Evaluation Methodologies for Terabyte-Scale Collections), Conference on Growing Networks and Graphs in Statistical Physics, Finance, Biology and Social Systems, CIKM 2004, WebKDD 2004, and WWW 2004.
The PI has been invited to present work funded by this project at several universities and organizations including IUPUI, Notre Dame, U. Michigan, Sigma Xi, PennState, U. Rochester, OGI, U. Oregon, and U. Pittsburgh.
Work funded by the project has been cited in the press (Technology Research News, COMPUTERWORLD, Complexity Digest, ScienceDaily, PNAS News Archive, Insight, Ascribe, etc.). The PI was interviewed on BBC World Service and MPR's Future Tense.
Two new courses related to the project have been developed: a computer science graduate seminar titled B659 Web mining and an informatics undergratuate course on search engines titled I400 Google under the Hood. The former was offered by the PI in Spring 2004; it was very successful in terms of enrollment and student evaluations, and will be offered again the future. The latter is being offered by the PI in Fall 2004, also with very good enrollment.
The PI gave a lecture on search engines during an informatics summer camp for women and minority high school students from Indiana. An online resource developed on this occasion to illustrate how search engines and PageRank work to high school students has subsequently been employed by other IU faculty in presentations to junior and senior high schools.

Goals, Objectives and Targeted Activities

Completed/ongoing activities

Deployed multi-threaded MySpiders applet.
Developed and disseminated tasks and metrics to evaluate crawlers; proposed a topical crawling task in the TREC Web Track.
Experimented with various extensions and variations of InfoSpiders crawler algorithm.
Developed InfoSpiders algorithm for topical crawling that outperformed other crawlers in literature.
Analyzed complexity and performance/cost trade-offs of various crawler algorithms.
Studied role of different adaptive techniques (reinforcement learning, evolutionary bias, query expansion) on crawling performance.
Studied conditions for localized Web crawling algorithms to find short paths.
Developed Web growth model to obtain realistic distribution of textual similarity across linked pages in addition to realistic degree distribution.
Exploring various aspects of possible integration with search engines, eg, synergistic/symbiotic relationship between topical crawlers and indexing algorithms for vertical search engines.
Studying relationship between topologies induced by link and lexical similarity across Web pages and ways to combine them for approximating semantic relevance.
Simulated peer-based collaborative crawling/searching algorithms.
Designing and implementing peer-based distributed, collaborative crawling framework (6S).

New targets for 2004-2005

Consider effect of resources available to distributed vs. centralized crawlers on performance.
Simulate search engines to compare scalability (consider coverage, recency, and network traffic).
Compare recall, precision and recency of retrieved pages with actual search engines on full Web.
Begin deploying 6S; evaluate with both automatic metrics and human subject experiments.
Develop algorithms for streaming search data mining, adaptive query routing, and result integration.

Area Background

The model behind search engines assumes a static collection, as was the case for earlier information retrieval systems. But since the Web is highly dynamic, indexes are reduced to inaccurate and incomplete "snapshots" of the Web. As a result users are faced with poor recall, precision, and recency. Precision has been improved by ranking algorithms that perform link analysis in addition to traditional lexical analysis, such as Google's PageRank. There is much ongoing research to better understand the Web's link topology and how links can give search engines cues about the meaning of pages.

Another shortcoming of the disjoint processes of crawling/indexing on one hand, and querying on the other, is that the crawl process is not informed by the users. Since search engines cannot cover the whole Web, they make choices as to how to bias their crawling algorithms in favor of certain information resources over others. It would seem preferable to use information gathered from users to guide the crawling algorithms.

These factors point to a need for efficient topic-driven or personalized crawling algorithms. Such crawlers would not run into stale information, and would use knowledge about the topic or the user as context to interpret lexical and linkage cues during their search. Many micro-search engines built from personalized crawlers can collaborate in a peer network to achieve better scalability and coverage than one-size-fits-all, cetralized search engines.

Area References

S. Chakrabarti: Mining the Web: Discovering Knowledge From Hypertext Data. Morgan Kaufmann, 2002
R.K. Belew: Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge University Press, 2000

Potential Related Projects

Peer-to-Peer Networks for Self-Organizing Virtual Communities

Project Website

http://informatics.indiana.edu/fil/CAREER/ Summary of project with links to reports, papers, software and other related pages.

Illustrations

http://informatics.indiana.edu/fil/IS/ Description of applied part of the project related to topical crawling algorithms, with links to demo applet, articles in the press, and some demo QuickTime movies.
http://informatics.indiana.edu/fil/Web/ Description of theoretical/empirical part of project, with links to related presentations and articles in the press, and a graphical representation of the semantic maps obtained by studying the relationship between different Web topologies.
http://myspiders.informatics.indiana.edu/ MySpiders applet allows users to launch adaptive Web crawlers in real time and observe their performance as well as their evolving internal representations; user can select between two crawling algorithms and number of pages to be visited.
http://homer.informatics.indiana.edu/cgi-bin/pagerank/network.cgi Demo of search engines' use of content and link analysis for ranking results; aimed at high school students.

Online Software and Data

http://informatics.indiana.edu/fil/IS/Framework/ Crawler evaluation framework includes script to create targets and seeds from Open Directory (http://dmoz.org) and sample target and seed data files.
http://informatics.indiana.edu/fil/IS/JavaCrawlers/ Java library for multi-threaded topical web crawlers implementing BreadthFirst and BestNFirst algorithms.