spider.crawl
Class BasicCrawler

java.lang.Object
  |
  +--spider.crawl.BasicCrawler
Direct Known Subclasses:
BestFirst, DOMCrawler

public class BasicCrawler
extends java.lang.Object

Performs breadth-first crawling. The frontier is treated as a FIFO queue: the next URL to be crawled is the one that was added earliest. The crawler does not visit a page that it has already visited.
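
A minimal usage sketch (the seed URL, page limit, output directory, and wrapper class below are placeholders, not values taken from this API):

    import spider.crawl.BasicCrawler;

    public class CrawlExample {
        public static void main(String[] args) {
            // Seed URLs are the starting points of the breadth-first crawl.
            String[] seeds = { "http://www.example.com/" };

            // Fetch at most 1000 pages and store the results under "crawl-data"
            // (the directory is created if it does not exist).
            BasicCrawler crawler = new BasicCrawler(seeds, 1000, "crawl-data");

            boolean finished = crawler.startCrawl();
            System.out.println("Crawl finished: " + finished);
        }
    }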

Author:
Gautam Pant

Constructor Summary
BasicCrawler(java.lang.String[] seeds, long maxPages, java.lang.String dir)
          Constructs the crawler from the given seed URLs.
 
Method Summary
 long getMaxFrontier()
          Returns the maxFrontier.
 long getMaxPages()
          Returns the maxPages.
 int getMaxThreads()
          Returns the maxThreads.
 java.lang.String getStorageFile()
          Returns the storageFile.
 int getTopN()
          Returns the topN.
 boolean reStartCrawl()
          Restarts the crawler from the last saved state of the crawl history.
 void setFrontierAdd(boolean b)
          Sets whether the frontier allows (true) or disallows (false) the addition of new URLs.
 void setMaxFrontier(int maxFrontier)
          Sets the maxFrontier - maximum size of the frontier.
 void setMaxThreads(int maxThreads)
          Sets the maxThreads - maximum number of threads.
 void setStatFile(java.lang.String statFile)
          Sets the statFile.
 void setStorageFile(java.lang.String storageFile)
          Sets the storageFile.
 void setTopN(int topN)
          Sets the topN.
 boolean startCrawl()
          Starts the crawl.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BasicCrawler

public BasicCrawler(java.lang.String[] seeds,
                    long maxPages,
                    java.lang.String dir)
Constructs the crawler from the given seed URLs.

Parameters:
seeds - URLs that are the starting points for the crawl
maxPages - maximum number of pages to be fetched
dir - the directory in which to store the results (the directory is created if it does not exist)
Method Detail

startCrawl

public boolean startCrawl()
Starts the crawl.

getMaxFrontier

public long getMaxFrontier()
Returns the maxFrontier.

Returns:
long

setMaxFrontier

public void setMaxFrontier(int maxFrontier)
Sets the maxFrontier - maximum size of the frontier.

Parameters:
maxFrontier - The maxFrontier to set
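
A short sketch of capping the frontier before starting a crawl; the cap of 10000 is an arbitrary example, and the crawler setup mirrors the sketch in the class description:

    BasicCrawler crawler = new BasicCrawler(
            new String[] { "http://www.example.com/" }, 1000, "crawl-data");
    crawler.setMaxFrontier(10000);  // limit the queue of URLs waiting to be crawled
    crawler.startCrawl();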

getMaxThreads

public int getMaxThreads()
Returns the maxThreads.

Returns:
int

setMaxThreads

public void setMaxThreads(int maxThreads)
Sets the maxThreads - maximum number of threads.

Parameters:
maxThreads - The maxThreads to set
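
A sketch of raising the number of crawler threads; the value 10 is arbitrary, and any politeness or bandwidth considerations are left to the caller:

    BasicCrawler crawler = new BasicCrawler(
            new String[] { "http://www.example.com/" }, 1000, "crawl-data");
    crawler.setMaxThreads(10);  // fetch with up to 10 threads in parallel
    crawler.startCrawl();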

getMaxPages

public long getMaxPages()
Returns the maxPages.

Returns:
long

getTopN

public int getTopN()
Returns the topN.

Returns:
int

setTopN

public void setTopN(int topN)
Sets the topN.

Parameters:
topN - The topN to set

getStorageFile

public java.lang.String getStorageFile()
Returns the storageFile.

Returns:
String

setStorageFile

public void setStorageFile(java.lang.String storageFile)
Sets the storageFile.

Parameters:
storageFile - The storageFile to set

setStatFile

public void setStatFile(java.lang.String statFile)
Sets the statFile.

Parameters:
statFile - The statFile to set
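
A sketch of pointing the crawler at specific output files; the assumption that storageFile holds fetched content and statFile holds crawl statistics follows only from the names, and the paths are placeholders:

    BasicCrawler crawler = new BasicCrawler(
            new String[] { "http://www.example.com/" }, 1000, "crawl-data");
    crawler.setStorageFile("crawl-data/pages.dat");  // presumed store of fetched pages
    crawler.setStatFile("crawl-data/crawl.stats");   // presumed crawl statistics log
    crawler.startCrawl();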

reStartCrawl

public boolean reStartCrawl()
Restarts the crawler from the last saved state of the crawl history: loads the history and fills the frontier accordingly.
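
A sketch of resuming an interrupted crawl from the saved history rather than starting from the seeds again; the constructor arguments are placeholders, and calling startCrawl() afterwards to continue fetching is an assumption of this sketch:

    BasicCrawler crawler = new BasicCrawler(
            new String[] { "http://www.example.com/" }, 1000, "crawl-data");
    if (crawler.reStartCrawl()) {   // reload the history and refill the frontier
        crawler.startCrawl();       // continue crawling from the restored state
    }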


setFrontierAdd

public void setFrontierAdd(boolean b)
Sets whether the frontier allows (true) or disallows (false) the addition of new URLs.

Parameters:
b - true to allow new URLs to be added to the frontier; false to disallow additions
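
A sketch of freezing the frontier so that newly discovered links are not queued; the timing shown is illustrative and the constructor arguments are placeholders:

    BasicCrawler crawler = new BasicCrawler(
            new String[] { "http://www.example.com/" }, 1000, "crawl-data");
    crawler.setFrontierAdd(false);  // stop accepting new URLs into the frontier
    crawler.startCrawl();           // only URLs already in the frontier are crawled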