spider.crawl
Class BasicCrawler

java.lang.Object
  |
  +--spider.crawl.BasicCrawler
Direct Known Subclasses:
BestFirst, DOMCrawler

public class BasicCrawler
extends java.lang.Object

Performs breadth-first crawling. The frontier is treated as a FIFO queue: the next URL to be crawled is the one that was added earliest. The crawler does not visit a page that it has already visited.
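
A minimal usage sketch (the seed URL, page limit, output directory, and wrapper class below are placeholders, not values taken from this API):

    import spider.crawl.BasicCrawler;

    public class CrawlExample {
        public static void main(String[] args) {
            // Seed URLs are the starting points of the breadth-first crawl.
            String[] seeds = { "http://www.example.com/" };

            // Fetch at most 1000 pages and store the results under "crawl-data"
            // (the directory is created if it does not exist).
            BasicCrawler crawler = new BasicCrawler(seeds, 1000, "crawl-data");

            boolean finished = crawler.startCrawl();
            System.out.println("Crawl finished: " + finished);
        }
    }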

Author:
Gautam Pant

Constructor Summary
BasicCrawler(java.lang.String[] seeds, long maxPages, java.lang.String dir)
          Constructs the crawler from the given seed URLs.
 
Method Summary
 long getMaxFrontier()
          Returns the maxFrontier.
 long getMaxPages()
          Returns the maxPages.
 int getMaxThreads()
          Returns the maxThreads.
 java.lang.String getStorageFile()
          Returns the storageFile.
 int getTopN()
          Returns the topN.
 boolean reStartCrawl()
          Restarts the crawler from the last saved state of the crawl history.
 void setFrontierAdd(boolean b)
          Sets whether the frontier allows (true) or disallows (false) the addition of new URLs.
 void setMaxFrontier(int maxFrontier)
          Sets the maxFrontier - maximum size of the frontier.
 void setMaxThreads(int maxThreads)
          Sets the maxThreads - maximum number of threads.
 void setStatFile(java.lang.String statFile)
          Sets the statFile.
 void setStorageFile(java.lang.String storageFile)
          Sets the storageFile.
 void setTopN(int topN)
          Sets the topN.
 boolean startCrawl()
          Starts the crawl.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BasicCrawler

public BasicCrawler(java.lang.String[] seeds,
                    long maxPages,
                    java.lang.String dir)
Constructs the crawler from the given seed URLs.

Parameters:
seeds - URLs that are the starting points for the crawl
maxPages - maximum number of pages to be fetched
dir - the directory in which to store the results (the directory is created if it does not exist)
Method Detail

startCrawl

public boolean startCrawl()
Starts the crawl.

getMaxFrontier

public long getMaxFrontier()
Returns the maxFrontier.

Returns:
long

setMaxFrontier

public void setMaxFrontier(int maxFrontier)
Sets the maxFrontier - maximum size of the frontier.

Parameters:
maxFrontier - The maxFrontier to set
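
A short sketch of capping the frontier before starting a crawl; the cap of 10000 is an arbitrary example, and the crawler setup mirrors the sketch in the class description:

    BasicCrawler crawler = new BasicCrawler(
            new String[] { "http://www.example.com/" }, 1000, "crawl-data");
    crawler.setMaxFrontier(10000);  // limit the queue of URLs waiting to be crawled
    crawler.startCrawl();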

getMaxThreads

public int getMaxThreads()
Returns the maxThreads.

Returns:
int

setMaxThreads

public void setMaxThreads(int maxThreads)
Sets the maxThreads - maximum number of threads.

Parameters:
maxThreads - The maxThreads to set
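
A sketch of raising the number of crawler threads; the value 10 is arbitrary, and any politeness or bandwidth considerations are left to the caller:

    BasicCrawler crawler = new BasicCrawler(
            new String[] { "http://www.example.com/" }, 1000, "crawl-data");
    crawler.setMaxThreads(10);  // fetch with up to 10 threads in parallel
    crawler.startCrawl();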

getMaxPages

public long getMaxPages()
Returns the maxPages.

Returns:
long

getTopN

public int getTopN()
Returns the topN.

Returns:
int

setTopN

public void setTopN(int topN)
Sets the topN.

Parameters:
topN - The topN to set

getStorageFile

public java.lang.String getStorageFile()
Returns the storageFile.

Returns:
String

setStorageFile

public void setStorageFile(java.lang.String storageFile)
Sets the storageFile.

Parameters:
storageFile - The storageFile to set

setStatFile

public void setStatFile(java.lang.String statFile)
Sets the statFile.

Parameters:
statFile - The statFile to set
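
A sketch of pointing the crawler at specific output files; the assumption that storageFile holds fetched content and statFile holds crawl statistics follows only from the names, and the paths are placeholders:

    BasicCrawler crawler = new BasicCrawler(
            new String[] { "http://www.example.com/" }, 1000, "crawl-data");
    crawler.setStorageFile("crawl-data/pages.dat");  // presumed store of fetched pages
    crawler.setStatFile("crawl-data/crawl.stats");   // presumed crawl statistics log
    crawler.startCrawl();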

reStartCrawl

public boolean reStartCrawl()
Restarts the crawler from the last saved state of the crawl history: loads the history and fills the frontier accordingly.
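
A sketch of resuming an interrupted crawl from the saved history rather than starting from the seeds again; the constructor arguments are placeholders, and calling startCrawl() afterwards to continue fetching is an assumption of this sketch:

    BasicCrawler crawler = new BasicCrawler(
            new String[] { "http://www.example.com/" }, 1000, "crawl-data");
    if (crawler.reStartCrawl()) {   // reload the history and refill the frontier
        crawler.startCrawl();       // continue crawling from the restored state
    }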


setFrontierAdd

public void setFrontierAdd(boolean b)
Sets whether the frontier allows (true) or disallows (false) the addition of new URLs.

Parameters:
b - true to allow new URLs to be added to the frontier; false to disallow additions
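
A sketch of freezing the frontier so that newly discovered links are not queued; the timing shown is illustrative and the constructor arguments are placeholders:

    BasicCrawler crawler = new BasicCrawler(
            new String[] { "http://www.example.com/" }, 1000, "crawl-data");
    crawler.setFrontierAdd(false);  // stop accepting new URLs into the frontier
    crawler.startCrawl();           // only URLs already in the frontier are crawled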