Scalable Search Engines
Via Adaptive Topic-Driven Crawlers
Filippo Menczer, Principal Investigator
Department of Management Sciences
The University of Iowa
Iowa City, IA 52242
Phone: (319) 335-0884
Fax : (319) 335-0297
Email: filippo-menczer
at uiowa.edu
URL: http://dollar.biz.uiowa.edu/~fil/
WWW PAGE
Project URL:
http://dollar.biz.uiowa.edu/~fil/CAREER/
Supported Students
Collaborators
- Padmini Srinivasan, School of Library and Information Science, The University of Iowa
- David Eichmann, School of Library and Information Science, The University of Iowa
- Nick Street, Department of Management Sciences, The University of Iowa
Project Award Information*
- Award Number: IIS-0133124
- Duration: 04/01/2002 -- 02/28/2007
- Title: CAREER: Scalable Search Engines via Adaptive Topic-driven Crawlers
Keywords
Search engines, scalability, topic-driven crawlers, intelligent agents,
adaptation, personalization, distributed algorithms, collaborative P2P search.
Project Summary
This research project will develop adaptive, personalized, topic-driven
crawlers to help search engines. The current search engines do not scale
in a dynamic environment. The proposed research will overcome this limitation.
The research will utilize an autonomous agent-based approach to building
new scalable algorithms. The career development plan will include integrating
the results into the undergraduate and graduate curriculum. The students
will also carry out projects using the results of this research.
Publications and Products
Papers with preliminary results
- F. Menczer: Links tell us about lexical and semantic Web content. CoRR cs.IR/0108004
- G. Pant, P. Srinivasan, F. Menczer: Exploration versus Exploitation in Topic Driven Crawlers. To appear in Proc. WWW 2002 Workshop on Web Dynamics
- G. Pant, F. Menczer: MySpiders: Evolve your own intelligent Web crawlers. Autonomous Agents and Multi-Agent Systems 5(2): 221-229, 2002
- F. Menczer: Complementing Search Engines with Online Web Mining Agents. Decision Support Systems Journal, in press
- F. Menczer, G. Pant, M. Ruiz, P. Srinivasan: Evaluating Topic-Driven Web Crawlers. Proc. ACM SIGIR 2001: 241-249
- M. Degeratu, G. Pant, F. Menczer: Latency-dependent fitness in evolutionary multithreaded Web agents. Proc. GECCO 2001 EcoMAS Workshop
Tools
Project Impact
The project is just getting started.
- A PhD student has been selected as RA with funding to start Fall 2002
- A new PhD course on "Research Topics in Web Search" is being developed and will be offered 2002-2003
- A collaborations has been initiated with a large pharmaceutical company to build a specialized search
engine employing topic-driven crawlers for competitive intelligence applications
- An analysis of Web lexical and linkage topology appeared in a preliminary report has received some attention in
the press (Technology
Research News)
Goals, Objectives, and Targeted Activities
Ongoing activities
- Deployed multi-threaded MySpiders applet; engaging many users for evaluation
- Analyzing relationship between linkage and lexical topologies on the Web
- Studying Web semantic similarity distributions as functions of lexical and
linkage similarity distributions
- Studying effective metrics to evaluate crawler performance
Targets for 2002-2003
- Evaluate crawler algorithms on TREC Web Track collections and on full Web
- Analyze complexity and performance/cost trade-offs of various crawler algorithms
- Consider effect of resources available to crawlers on performance
Goals for subsequent years
- Study conditions for localized online algorithms to find short paths
- Experiment with extensions and variations of basic crawler algorithms
- benefit() and cost() functions based on similarity, latency, authority, novelty, recency, etc.
- Look-ahead and other machine learning methods to improve navigation performance
- Relevance functions (e.g., weighting schemes with localized versions of IDF)
- Different reinforcement learning and relevance feedback schemes
- Study various aspects of possible integration with search engines
- Simulate search engines to compare scalability (consider coverage, recency, and network traffic)
- Compare recall, precision and recency of retrieved pages with actual search engines on full Web
- Get starting points, refer pages back to engines, incorporate adaptive agent algorithms into crawlers
- Extend adaptive crawler algorithms to use mobile/distributed/p2p agent protocols
- Consider extensions of existing HTTP protocol to allow for query-driven interaction with servers
- Consider peer-to-peer protocols such as Gnutella to build collaborative topic-driven crawlers
- Address distributed system implementation issues (e.g., security, control, communication, caching)
- Contrast efficiency gains from distributing crawling computation randomly vs. near-data
Project References
- F. Menczer, R.K. Belew: Adaptive Retrieval Agents: Internalizing Local Context and Scaling up to the Web. Machine Learning 39 (2/3): 203-242, 2000
- F. Menczer, A.E. Monge: Scalable Web Search by Adaptive Online Agents: An InfoSpiders Case Study. In M. Klusch, ed., Intelligent Information Agents, Springer, 1999
- F. Menczer, R.K. Belew: Adaptive Information Agents in Distributed Textual Environments. Proceedings of the 2nd International Conference on Autonomous Agents, 1998
- F. Menczer: ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery. Proceedings of the 14th International Conference on Machine Learning, 1997
Area Background
The model behind search engines assumes a static collection, as was the case for earlier information retrieval systems. But since the Web is highly dynamic, indexes are reduced to inaccurate and incomplete "snapshots" of the Web. As a result users are faced with poor recall, precision, and recency. Precision has been improved by ranking algorithms that perform link analysis in addition to traditional lexical analysis, such as Google's PageRank. There is much ongoing research to better understand the Web's link topology and how links can give search engines cues about the meaning of pages.
Another shortcoming of the disjoint processes of crawling/indexing on one hand, and querying on the other, is that the crawl process is not informed by the users. Since search engines cannot cover the whole Web, they make choices as to how to bias their crawling algorithms in favor of certain information resources over others. It would seem preferable to use information gathered from users to guide the crawling algorithms. This requires closing the loop from user queries back to crawling.
These factors point to a need for efficient topic-driven or personalized crawling algorithms. Such crawlers would not run into stale information, and would use knowledge about the topic or the user as context to interpret lexical and linkage clues during their search. Intelligent topic-driven crawlers have several potential applications: replacing "context-free" search engine robots, complementing robots to upkeep search engine indexes for topics with large user bases, performing add-on searches upon user request, building indexes for topical search engines or portals, and performing client-based personalized searches.
Area References
- S. Chakrabarti: Mining the Web: Discovering Knowledge From Hypertext Data. Morgan Kaufmann, 2002
- G. Flake, S. Lawrence, C.L. Giles and F. Coetzee: Self-Organization of the Web and Identification of Communities. IEEE Computer 35(3): 66-71, 2002
- B. Huberman: The Laws of the Web: Patterns in the Ecology of Information. MIT Press, 2001
- J. Kleinberg and Steve Lawrence: The Structure of the Web. Science 294(5548): 1849-1850, 2001
- A. Broder, S.R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins and J. Wiener: Graph structure in the Web. Computer Networks 33(1-6): 309-320, 2000
- J. Kleinberg: Navigation in a Small World. Nature 406: 845, 2000
- M. Henzinger: Link Analysis in Web Information Retrieval. IEEE Bull. Data Engineering 23(3): 3-8, 2000
- B. Amento, L. Terveen and W. Hill: Does "authority" mean quality? Predicting expert quality ratings of Web documents. Proc. ACM SIGIR 2000: 296-303
- A. Albert, H. Jeong and A.-L. Barabasi: Diameter of the World Wide Web. Nature 401: 130-131, 1999
- J. Kleinberg: Authoritative sources in a hyperlinked environment. Journal of the ACM 46: 604-632, 1999
- S. Lawrence and C.L. Giles: Accessibility of Information on the Web. Nature 400(6740): 107-109, 1999
- S. Chakrabarti, B. Dom, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson and J. Kleinberg: Mining the Web's Link Structure. IEEE Computer 32(8): 60-69, 1999
- S. Chakrabarti, M. van den Berg and B. Dom: Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. Computer Networks 31(11-16): 1623-1640, 1999
- J. Dean and M. Henzinger: Finding Related Pages in the World Wide Web. Computer Networks 31(11-16): 1467-1479, 1999
- S. Brin and L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30(1-7): 107-117, 1998
- J. Cho, H. Garcia-Molina and L. Page: Efficient Crawling Through URL Ordering. Computer Networks 30(1-7): 161-172, 1998
*All award information can be found on the on the NSF on-line
Awards Abstracts system: http://www.fastlane.nsf.gov/a6/A6Start.htm.