spider.util
Class HTMLParser

java.lang.Object
  |
  +--javax.swing.text.html.HTMLEditorKit.ParserCallback
        |
        +--spider.util.HTMLParser

public class HTMLParser
extends javax.swing.text.html.HTMLEditorKit.ParserCallback

Description: Provides methods to parse an HTML page and convert it into an XML format.

Author:
Gautam Pant

Field Summary
 
Fields inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
IMPLIED
 
Constructor Summary
HTMLParser()
           
HTMLParser(Stopper stop)
          Constructor that takes a stopper; the stopper allows stop words to be removed.
 
Method Summary
 void handleEndTag(javax.swing.text.html.HTML.Tag tag, int pos)
          Handle the end tag.
 void handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attribs, int pos)
          Notes the start of a tag and pushes the new state onto the state stack.
 void handleText(char[] text, int pos)
          Handle text.
 java.lang.String htmlToXML(java.lang.String html, java.lang.String url)
          Converts the HTML into a (naive) XML format. Currently all HTML tags are kept (some are corrected), but the only attribute stored is href.
 boolean isStemmer()
          Returns the stemmer flag.
 void setStemmer(boolean stemmer)
          Sets the stemmer flag.
 
Methods inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
flush, handleComment, handleEndOfLineString, handleError, handleSimpleTag
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HTMLParser

public HTMLParser(Stopper stop)
Constructor that takes a stopper; the stopper allows stop words to be removed.
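
The Stopper API is not described on this page, so the following is only a rough sketch of what stop-word removal typically looks like. The class SimpleStopper and its methods isStopWord and removeStopWords are hypothetical names introduced for illustration; they are not the actual spider.util.Stopper interface.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a stopper; NOT the real spider.util.Stopper.
public class SimpleStopper {
    private final Set<String> stopWords = new HashSet<>();

    public SimpleStopper(Collection<String> words) {
        // Store stop words in lower case so matching is case-insensitive.
        for (String w : words) {
            stopWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    public boolean isStopWord(String word) {
        return stopWords.contains(word.toLowerCase(Locale.ROOT));
    }

    // Splits on whitespace, drops stop words, and rejoins with single spaces.
    public String removeStopWords(String text) {
        StringBuilder kept = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty() || isStopWord(token)) {
                continue;
            }
            if (kept.length() > 0) {
                kept.append(' ');
            }
            kept.append(token);
        }
        return kept.toString();
    }
}
```

Under these assumptions, a parser constructed with a stopper would presumably apply a filter of this kind to the text it receives before storing it.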


HTMLParser

public HTMLParser()


Method Detail

handleStartTag

public void handleStartTag(javax.swing.text.html.HTML.Tag tag,
                           javax.swing.text.MutableAttributeSet attribs,
                           int pos)
Notes the start of a tag and pushes the new state onto the state stack.

Overrides:
handleStartTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback

handleEndTag

public void handleEndTag(javax.swing.text.html.HTML.Tag tag,
                         int pos)
Handles the end tag by pushing it onto the tag stack. The implied calls to both handleStartTag and handleEndTag help correct bad or missing HTML tags.

Overrides:
handleEndTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback

handleText

public void handleText(char[] text,
                       int pos)
Handles text by pushing it onto the tag stack.

Overrides:
handleText in class javax.swing.text.html.HTMLEditorKit.ParserCallback
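
To illustrate how the three callbacks above are driven, here is a minimal sketch that records the order in which the JDK's own parser (javax.swing.text.html.parser.ParserDelegator) invokes them. The class CallbackTrace and its trace method are hypothetical and are not part of spider.util.HTMLParser; the sketch only shows the callback mechanism that HTMLParser overrides.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical tracer; records each callback as a short string.
public class CallbackTrace extends HTMLEditorKit.ParserCallback {
    private final List<String> events = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attribs, int pos) {
        events.add("start:" + tag);
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        events.add("end:" + tag);
    }

    @Override
    public void handleText(char[] text, int pos) {
        events.add("text:" + new String(text));
    }

    // Parses the given HTML and returns the recorded callback events in order.
    public static List<String> trace(String html) {
        CallbackTrace callback = new CallbackTrace();
        try {
            // The third argument tells the parser to ignore charset directives.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.events;
    }

    public static void main(String[] args) {
        // Deliberately incomplete markup: no html or body tags, nothing closed.
        for (String event : trace("<p>Hello <b>world")) {
            System.out.println(event);
        }
    }
}
```

Running this on incomplete markup shows start and end events for tags that never appear in the input; those extra events are the implied calls that make it possible to repair bad or missing tags.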

htmlToXML

public java.lang.String htmlToXML(java.lang.String html,
                                  java.lang.String url)
Converts the HTML into a (naive) XML format. Currently all HTML tags are kept (some are corrected), but the only attribute stored is href.

Parameters:
html - the HTML, as a String
url - the URL, as a String
Returns:
the XML string, or a failure value (unspecified here, possibly null) if the conversion failed
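
As a rough illustration of the conversion described above, the following sketch writes every tag back out and keeps only the href attribute. It is an assumption-laden approximation built on the JDK's ParserCallback, not the actual htmlToXML implementation: the class NaiveHtmlToXml and its convert and escape methods are hypothetical, and the url parameter is omitted because its role is not documented here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical converter; NOT the source of spider.util.HTMLParser.
public class NaiveHtmlToXml extends HTMLEditorKit.ParserCallback {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attribs, int pos) {
        out.append('<').append(tag);
        // Keep href and silently drop every other attribute.
        Object href = attribs.getAttribute(HTML.Attribute.HREF);
        if (href != null) {
            out.append(" href=\"").append(escape(href.toString())).append('"');
        }
        out.append('>');
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attribs, int pos) {
        // Elements without content (for example br) become self-closing.
        out.append('<').append(tag).append("/>");
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        out.append("</").append(tag).append('>');
    }

    @Override
    public void handleText(char[] text, int pos) {
        out.append(escape(new String(text)));
    }

    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static String convert(String html) {
        NaiveHtmlToXml callback = new NaiveHtmlToXml();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.out.toString();
    }
}
```

For example, NaiveHtmlToXml.convert("<a href=\"http://example.com/page\" class=\"x\">link</a>") yields output in which the a element retains its href but not its class attribute. This sketch throws on an I/O error rather than returning a failure value.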

isStemmer

public boolean isStemmer()
Returns the stemmer flag.

Returns:
the current value of the stemmer flag

setStemmer

public void setStemmer(boolean stemmer)
Sets the stemmer flag.

Parameters:
stemmer - the stemmer flag value to set