Package org.apache.tika.parser.html
Class BoilerpipeContentHandler
java.lang.Object
  de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
    org.apache.tika.parser.html.BoilerpipeContentHandler

All Implemented Interfaces:
org.xml.sax.ContentHandler
public class BoilerpipeContentHandler extends de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Uses the boilerpipe library to automatically extract the main content from a web page. Use this as a ContentHandler object passed to HtmlParser.parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext).
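A minimal usage sketch of the pattern described above, assuming the Tika HTML parser module and the boilerpipe library are on the classpath. The class name MainContentExample and the method extractMainContent are hypothetical, introduced only for illustration; the handler is built with the Writer constructor listed in the Constructor Summary.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;

// Hypothetical helper class, not part of Tika.
public class MainContentExample {

    // Parses the given HTML and returns whatever text the handler wrote out.
    public static String extractMainContent(String html) throws Exception {
        StringWriter out = new StringWriter();
        // The Writer constructor sends the extracted character events to 'out'.
        BoilerpipeContentHandler handler = new BoilerpipeContentHandler(out);
        try (InputStream in =
                 new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Demo</title></head><body>"
            + "<div>Home | About | Contact</div>"
            + "<p>This paragraph stands in for the main article text of the page.</p>"
            + "</body></html>";
        System.out.println(extractMainContent(html));
    }
}
```

What counts as main content is decided by the boilerpipe extraction rules, so the output for a very short page like this one may differ from what a full article would produce.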
-
-
Constructor Summary
Constructors
BoilerpipeContentHandler(java.io.Writer writer)
    Creates a content handler that writes XHTML body character events to the given writer.
BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate)
    Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
    Creates a new boilerpipe-based content extractor, using the given extraction rules.
-
Method Summary
Modifier and Type, Method, Description
void characters(char[] chars, int offset, int length)
void endDocument()
void endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName)
de.l3s.boilerpipe.document.TextDocument getTextDocument()
    Retrieves the built TextDocument.
boolean isIncludeMarkup()
void setIncludeMarkup(boolean includeMarkup)
void startDocument()
void startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, org.xml.sax.Attributes atts)
void startPrefixMapping(java.lang.String prefix, java.lang.String uri)
-
-
-
Constructor Detail
-
BoilerpipeContentHandler
public BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate)
Creates a new boilerpipe-based content extractor, using the DefaultExtractor extraction rules and "delegate" as the content handler.
Parameters:
delegate - The ContentHandler object
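The delegate can be any SAX ContentHandler. As a hedged sketch using only JDK classes, the hypothetical TextCollector below (not part of Tika or boilerpipe) accumulates the character events it receives, which is the role a delegate plays for the extracted content.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical delegate: collects all character events into one string.
public class TextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append only the slice of the buffer that SAX reports as valid.
        text.append(ch, start, length);
    }

    // Returns everything received so far.
    public String getText() {
        return text.toString();
    }
}
```

Under these assumptions, an instance could be passed as `new BoilerpipeContentHandler(new TextCollector())` and read back with `getText()` once parsing has finished.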
-
BoilerpipeContentHandler
public BoilerpipeContentHandler(java.io.Writer writer)
Creates a content handler that writes XHTML body character events to the given writer.
Parameters:
writer - the writer that receives the character events
-
BoilerpipeContentHandler
public BoilerpipeContentHandler(org.xml.sax.ContentHandler delegate, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
Creates a new boilerpipe-based content extractor, using the given extraction rules. The extracted main content will be passed to the delegate content handler.
Parameters:
delegate - The ContentHandler object
extractor - Extraction rules to use, e.g. ArticleExtractor
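A short sketch of choosing the extraction rules explicitly, assuming Tika and boilerpipe are on the classpath. ArticleExtractor.INSTANCE comes from boilerpipe and BodyContentHandler from org.apache.tika.sax; the class name ArticleExtractionExample and the method extractArticle are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import de.l3s.boilerpipe.extractors.ArticleExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, not part of Tika.
public class ArticleExtractionExample {

    // Runs the parser with article-oriented rules and returns the delegate's text.
    public static String extractArticle(String html) throws Exception {
        // The delegate receives the content that the extractor keeps.
        BodyContentHandler delegate = new BodyContentHandler();
        BoilerpipeContentHandler handler =
            new BoilerpipeContentHandler(delegate, ArticleExtractor.INSTANCE);
        try (InputStream in =
                 new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
        }
        return delegate.toString();
    }
}
```

Any other BoilerpipeExtractor implementation could be passed in the same position; which rules suit a given site is a judgment call that this sketch does not settle.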
-
-
Method Detail
-
isIncludeMarkup
public boolean isIncludeMarkup()
-
setIncludeMarkup
public void setIncludeMarkup(boolean includeMarkup)
-
getTextDocument
public de.l3s.boilerpipe.document.TextDocument getTextDocument()
Retrieves the built TextDocument.
Returns:
TextDocument
-
startDocument
public void startDocument() throws org.xml.sax.SAXException
Specified by:
startDocument in interface org.xml.sax.ContentHandler
Overrides:
startDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
org.xml.sax.SAXException
-
startPrefixMapping
public void startPrefixMapping(java.lang.String prefix, java.lang.String uri) throws org.xml.sax.SAXException
Specified by:
startPrefixMapping in interface org.xml.sax.ContentHandler
Overrides:
startPrefixMapping in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
org.xml.sax.SAXException
-
startElement
public void startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, org.xml.sax.Attributes atts) throws org.xml.sax.SAXException
Specified by:
startElement in interface org.xml.sax.ContentHandler
Overrides:
startElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
org.xml.sax.SAXException
-
characters
public void characters(char[] chars, int offset, int length) throws org.xml.sax.SAXException
Specified by:
characters in interface org.xml.sax.ContentHandler
Overrides:
characters in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
org.xml.sax.SAXException
-
endElement
public void endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName) throws org.xml.sax.SAXException
Specified by:
endElement in interface org.xml.sax.ContentHandler
Overrides:
endElement in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
org.xml.sax.SAXException
-
endDocument
public void endDocument() throws org.xml.sax.SAXException
Specified by:
endDocument in interface org.xml.sax.ContentHandler
Overrides:
endDocument in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Throws:
org.xml.sax.SAXException
-
-