Class Summary |
ActiveThreads |
Keeps track of crawler threads so that they can be stopped when none of them has further URLs to crawl |
BadExtensions |
A list of bad file extensions; URLs ending in them need not be kept in the frontier |
BasicCrawler |
Performs breadth-first crawling. |
BestFirst |
Best-first crawler (extended from a breadth-first crawler) |
Cache |
|
DOMCrawler |
A crawler that builds the context of a URL from the DOM tree representation of an HTML page |
FetcherPool |
A pool of multi-threaded fetchers that can be used to fetch many pages at the same time |
Frontier |
|
FrontierElement |
|
Globals |
Global parameters that are used by the crawlers |
History |
Helps maintain a timestamped history of a crawl |
HubSeeker |
|
Statistics |
Maintains statistics related to a crawl |
Tester |
Example code for running a crawler. Make sure you supply a valid e-mail address. |
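
The summary above mentions a frontier, a list of bad extensions, and breadth-first crawling, but does not show how they fit together. The following is a minimal sketch of that idea only, not the package's actual API: the class name `FrontierSketch`, its methods, and the extension list are all hypothetical, chosen for illustration. It assumes a breadth-first frontier can be modelled as a FIFO queue that rejects URLs with unwanted extensions and URLs it has already seen.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration; not the Frontier or BadExtensions class from the package.
public class FrontierSketch {
    // Assumed extension list; the real BadExtensions contents are not given in the summary.
    private static final List<String> BAD_EXTENSIONS =
            Arrays.asList(".jpg", ".gif", ".pdf", ".zip");

    // FIFO order gives breadth-first traversal: URLs found earlier are crawled earlier.
    private final Queue<String> queue = new ArrayDeque<>();
    // Every URL ever accepted, so that nothing is queued twice.
    private final Set<String> seen = new HashSet<>();

    // True if the URL ends with one of the unwanted extensions (case-insensitive).
    static boolean hasBadExtension(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String ext : BAD_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    // Queues the URL unless it has a bad extension or was seen before.
    // Returns true only when the URL was actually added.
    boolean add(String url) {
        if (hasBadExtension(url) || !seen.add(url)) {
            return false;
        }
        queue.add(url);
        return true;
    }

    // Returns the next URL to crawl, or null when the frontier is empty.
    String next() {
        return queue.poll();
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }

    public static void main(String[] args) {
        FrontierSketch frontier = new FrontierSketch();
        frontier.add("http://example.org/");
        frontier.add("http://example.org/logo.gif");   // rejected: bad extension
        frontier.add("http://example.org/about.html");
        frontier.add("http://example.org/");           // rejected: duplicate
        while (!frontier.isEmpty()) {
            System.out.println(frontier.next());
        }
    }
}
```

A best-first variant could, under the same assumptions, replace the FIFO queue with a priority queue ordered by a URL score; the summary does not say how the package scores URLs.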