Class OOXMLWordAndPowerPointTextHandler
- java.lang.Object
-
- org.xml.sax.helpers.DefaultHandler
-
- org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler
-
- All Implemented Interfaces:
org.xml.sax.ContentHandler
,org.xml.sax.DTDHandler
,org.xml.sax.EntityResolver
,org.xml.sax.ErrorHandler
public class OOXMLWordAndPowerPointTextHandler extends org.xml.sax.helpers.DefaultHandler
This class is intended to handle anything that might contain IBodyElements: main document, headers, footers, notes, slides, etc. This class does not generally check for namespaces, and it can be applied to PPTX and DOCX for text extraction. This can be used to scrape content from charts. It currently ignores formula (<c:f/>) elements This does not work with .xlsx or .vsdx. TODO: move this into POI?
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
OOXMLWordAndPowerPointTextHandler.EditType
static interface
OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler
-
Field Summary
Fields Modifier and Type Field Description static java.lang.String
W_NS
-
Constructor Summary
Constructors Constructor Description OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, java.util.Map<java.lang.String,java.lang.String> hyperlinks)
OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, java.util.Map<java.lang.String,java.lang.String> hyperlinks, boolean includeTextBox, boolean concatenatePhoneticRuns)
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description void
characters(char[] ch, int start, int length)
void
endDocument()
void
endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName)
void
endPrefixMapping(java.lang.String prefix)
void
ignorableWhitespace(char[] ch, int start, int length)
void
startDocument()
void
startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, org.xml.sax.Attributes atts)
void
startPrefixMapping(java.lang.String prefix, java.lang.String uri)
-
-
-
Field Detail
-
W_NS
public static final java.lang.String W_NS
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
OOXMLWordAndPowerPointTextHandler
public OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, java.util.Map<java.lang.String,java.lang.String> hyperlinks)
-
OOXMLWordAndPowerPointTextHandler
public OOXMLWordAndPowerPointTextHandler(OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler bodyContentsHandler, java.util.Map<java.lang.String,java.lang.String> hyperlinks, boolean includeTextBox, boolean concatenatePhoneticRuns)
-
-
Method Detail
-
startDocument
public void startDocument() throws org.xml.sax.SAXException
- Specified by:
startDocument
in interfaceorg.xml.sax.ContentHandler
- Overrides:
startDocument
in classorg.xml.sax.helpers.DefaultHandler
- Throws:
org.xml.sax.SAXException
-
endDocument
public void endDocument() throws org.xml.sax.SAXException
- Specified by:
endDocument
in interfaceorg.xml.sax.ContentHandler
- Overrides:
endDocument
in classorg.xml.sax.helpers.DefaultHandler
- Throws:
org.xml.sax.SAXException
-
startPrefixMapping
public void startPrefixMapping(java.lang.String prefix, java.lang.String uri) throws org.xml.sax.SAXException
- Specified by:
startPrefixMapping
in interfaceorg.xml.sax.ContentHandler
- Overrides:
startPrefixMapping
in classorg.xml.sax.helpers.DefaultHandler
- Throws:
org.xml.sax.SAXException
-
endPrefixMapping
public void endPrefixMapping(java.lang.String prefix) throws org.xml.sax.SAXException
- Specified by:
endPrefixMapping
in interfaceorg.xml.sax.ContentHandler
- Overrides:
endPrefixMapping
in classorg.xml.sax.helpers.DefaultHandler
- Throws:
org.xml.sax.SAXException
-
startElement
public void startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, org.xml.sax.Attributes atts) throws org.xml.sax.SAXException
- Specified by:
startElement
in interfaceorg.xml.sax.ContentHandler
- Overrides:
startElement
in classorg.xml.sax.helpers.DefaultHandler
- Throws:
org.xml.sax.SAXException
-
endElement
public void endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName) throws org.xml.sax.SAXException
- Specified by:
endElement
in interfaceorg.xml.sax.ContentHandler
- Overrides:
endElement
in classorg.xml.sax.helpers.DefaultHandler
- Throws:
org.xml.sax.SAXException
-
characters
public void characters(char[] ch, int start, int length) throws org.xml.sax.SAXException
- Specified by:
characters
in interfaceorg.xml.sax.ContentHandler
- Overrides:
characters
in classorg.xml.sax.helpers.DefaultHandler
- Throws:
org.xml.sax.SAXException
-
ignorableWhitespace
public void ignorableWhitespace(char[] ch, int start, int length) throws org.xml.sax.SAXException
- Specified by:
ignorableWhitespace
in interfaceorg.xml.sax.ContentHandler
- Overrides:
ignorableWhitespace
in classorg.xml.sax.helpers.DefaultHandler
- Throws:
org.xml.sax.SAXException
-
-