Scalable Search Engines
Via Adaptive Topic-Driven Crawlers

Filippo Menczer, Principal Investigator
Department of Management Sciences
The University of Iowa
Iowa City, IA 52242
Phone: (319) 335-0884
Fax : (319) 335-0297
Email: filippo-menczer at uiowa.edu
URL: http://dollar.biz.uiowa.edu/~fil/

WWW PAGE

Project URL: http://dollar.biz.uiowa.edu/~fil/CAREER/

Supported Students

Gautam Pant, PhD student

Collaborators

Padmini Srinivasan, School of Library and Information Science, The University of Iowa
David Eichmann, School of Library and Information Science, The University of Iowa
Nick Street, Department of Management Sciences, The University of Iowa

Project Award Information^*

Award Number: IIS-0133124
Duration: 04/01/2002 -- 02/28/2007
Title: CAREER: Scalable Search Engines via Adaptive Topic-driven Crawlers

Keywords

Search engines, scalability, topic-driven crawlers, intelligent agents, adaptation, personalization, distributed algorithms, collaborative P2P search.

Project Summary

This research project will develop adaptive, personalized, topic-driven crawlers to help search engines. The current search engines do not scale in a dynamic environment. The proposed research will overcome this limitation. The research will utilize an autonomous agent-based approach to building new scalable algorithms. The career development plan will include integrating the results into the undergraduate and graduate curriculum. The students will also carry out projects using the results of this research.

Publications and Products

Papers with preliminary results

F. Menczer: Links tell us about lexical and semantic Web content. CoRR cs.IR/0108004
G. Pant, P. Srinivasan, F. Menczer: Exploration versus Exploitation in Topic Driven Crawlers. To appear in Proc. WWW 2002 Workshop on Web Dynamics
G. Pant, F. Menczer: MySpiders: Evolve your own intelligent Web crawlers. Autonomous Agents and Multi-Agent Systems 5(2): 221-229, 2002
F. Menczer: Complementing Search Engines with Online Web Mining Agents. Decision Support Systems Journal, in press
F. Menczer, G. Pant, M. Ruiz, P. Srinivasan: Evaluating Topic-Driven Web Crawlers. Proc. ACM SIGIR 2001: 241-249
M. Degeratu, G. Pant, F. Menczer: Latency-dependent fitness in evolutionary multithreaded Web agents. Proc. GECCO 2001 EcoMAS Workshop

Tools

MySpiders Applet

Project Impact

The project is just getting started.

A PhD student has been selected as RA with funding to start Fall 2002
A new PhD course on "Research Topics in Web Search" is being developed and will be offered 2002-2003
A collaborations has been initiated with a large pharmaceutical company to build a specialized search engine employing topic-driven crawlers for competitive intelligence applications
An analysis of Web lexical and linkage topology appeared in a preliminary report has received some attention in the press (Technology Research News)

Goals, Objectives, and Targeted Activities

Ongoing activities

Deployed multi-threaded MySpiders applet; engaging many users for evaluation
Analyzing relationship between linkage and lexical topologies on the Web
Studying Web semantic similarity distributions as functions of lexical and linkage similarity distributions
Studying effective metrics to evaluate crawler performance

Targets for 2002-2003

Evaluate crawler algorithms on TREC Web Track collections and on full Web
Analyze complexity and performance/cost trade-offs of various crawler algorithms
Consider effect of resources available to crawlers on performance

Goals for subsequent years

Study conditions for localized online algorithms to find short paths
Experiment with extensions and variations of basic crawler algorithms
benefit() and cost() functions based on similarity, latency, authority, novelty, recency, etc.
Look-ahead and other machine learning methods to improve navigation performance
Relevance functions (e.g., weighting schemes with localized versions of IDF)
Different reinforcement learning and relevance feedback schemes
Study various aspects of possible integration with search engines
Simulate search engines to compare scalability (consider coverage, recency, and network traffic)
Compare recall, precision and recency of retrieved pages with actual search engines on full Web
Get starting points, refer pages back to engines, incorporate adaptive agent algorithms into crawlers
Extend adaptive crawler algorithms to use mobile/distributed/p2p agent protocols
Consider extensions of existing HTTP protocol to allow for query-driven interaction with servers
Consider peer-to-peer protocols such as Gnutella to build collaborative topic-driven crawlers
Address distributed system implementation issues (e.g., security, control, communication, caching)
Contrast efficiency gains from distributing crawling computation randomly vs. near-data

Project References

F. Menczer, R.K. Belew: Adaptive Retrieval Agents: Internalizing Local Context and Scaling up to the Web. Machine Learning 39 (2/3): 203-242, 2000
F. Menczer, A.E. Monge: Scalable Web Search by Adaptive Online Agents: An InfoSpiders Case Study. In M. Klusch, ed., Intelligent Information Agents, Springer, 1999
F. Menczer, R.K. Belew: Adaptive Information Agents in Distributed Textual Environments. Proceedings of the 2nd International Conference on Autonomous Agents, 1998
F. Menczer: ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery. Proceedings of the 14th International Conference on Machine Learning, 1997

Area Background

The model behind search engines assumes a static collection, as was the case for earlier information retrieval systems. But since the Web is highly dynamic, indexes are reduced to inaccurate and incomplete "snapshots" of the Web. As a result users are faced with poor recall, precision, and recency. Precision has been improved by ranking algorithms that perform link analysis in addition to traditional lexical analysis, such as Google's PageRank. There is much ongoing research to better understand the Web's link topology and how links can give search engines cues about the meaning of pages.

Another shortcoming of the disjoint processes of crawling/indexing on one hand, and querying on the other, is that the crawl process is not informed by the users. Since search engines cannot cover the whole Web, they make choices as to how to bias their crawling algorithms in favor of certain information resources over others. It would seem preferable to use information gathered from users to guide the crawling algorithms. This requires closing the loop from user queries back to crawling.

These factors point to a need for efficient topic-driven or personalized crawling algorithms. Such crawlers would not run into stale information, and would use knowledge about the topic or the user as context to interpret lexical and linkage clues during their search. Intelligent topic-driven crawlers have several potential applications: replacing "context-free" search engine robots, complementing robots to upkeep search engine indexes for topics with large user bases, performing add-on searches upon user request, building indexes for topical search engines or portals, and performing client-based personalized searches.

Area References

S. Chakrabarti: Mining the Web: Discovering Knowledge From Hypertext Data. Morgan Kaufmann, 2002
G. Flake, S. Lawrence, C.L. Giles and F. Coetzee: Self-Organization of the Web and Identification of Communities. IEEE Computer 35(3): 66-71, 2002
B. Huberman: The Laws of the Web: Patterns in the Ecology of Information. MIT Press, 2001
J. Kleinberg and Steve Lawrence: The Structure of the Web. Science 294(5548): 1849-1850, 2001
A. Broder, S.R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins and J. Wiener: Graph structure in the Web. Computer Networks 33(1-6): 309-320, 2000
J. Kleinberg: Navigation in a Small World. Nature 406: 845, 2000
M. Henzinger: Link Analysis in Web Information Retrieval. IEEE Bull. Data Engineering 23(3): 3-8, 2000
B. Amento, L. Terveen and W. Hill: Does "authority" mean quality? Predicting expert quality ratings of Web documents. Proc. ACM SIGIR 2000: 296-303
A. Albert, H. Jeong and A.-L. Barabasi: Diameter of the World Wide Web. Nature 401: 130-131, 1999
J. Kleinberg: Authoritative sources in a hyperlinked environment. Journal of the ACM 46: 604-632, 1999
S. Lawrence and C.L. Giles: Accessibility of Information on the Web. Nature 400(6740): 107-109, 1999
S. Chakrabarti, B. Dom, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson and J. Kleinberg: Mining the Web's Link Structure. IEEE Computer 32(8): 60-69, 1999
S. Chakrabarti, M. van den Berg and B. Dom: Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. Computer Networks 31(11-16): 1623-1640, 1999
J. Dean and M. Henzinger: Finding Related Pages in the World Wide Web. Computer Networks 31(11-16): 1467-1479, 1999
S. Brin and L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30(1-7): 107-117, 1998
J. Cho, H. Garcia-Molina and L. Page: Efficient Crawling Through URL Ordering. Computer Networks 30(1-7): 161-172, 1998

^*All award information can be found on the on the NSF on-line 
Awards Abstracts system: http://www.fastlane.nsf.gov/a6/A6Start.htm.

Scalable Search EnginesVia Adaptive Topic-Driven Crawlers