Package org.apache.tika.parser.microsoft
Class OfficeParser
- java.lang.Object
  - org.apache.tika.parser.AbstractParser
    - org.apache.tika.parser.microsoft.AbstractOfficeParser
      - org.apache.tika.parser.microsoft.OfficeParser
All Implemented Interfaces: java.io.Serializable, Parser
public class OfficeParser extends AbstractOfficeParser
Defines a Microsoft document content extractor.
See Also: Serialized Form
-
Nested Class Summary
Modifier and Type    Class                             Description
static class         OfficeParser.POIFSDocumentType
-
Constructor Summary
Constructor       Description
OfficeParser()
-
Method Summary
static void extractMacros(POIFSFileSystem fs, org.xml.sax.ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor)
    Helper to extract macros from an NPOIFS/vbaProject.bin. As of POI-3.15-final, there are still some bugs in VBAMacroReader.
java.util.Set<MediaType> getSupportedTypes(ParseContext context)
    Returns the set of media types supported by this parser when used with the given parse context.
void parse(java.io.InputStream stream, org.xml.sax.ContentHandler handler, Metadata metadata, ParseContext context)
    Extracts properties and text from an MS Document input stream.
-
Methods inherited from class org.apache.tika.parser.microsoft.AbstractOfficeParser
configure, getExtractAllAlternativesFromMSG, getExtractMacros, getIncludeDeletedContent, getIncludeMoveFromContent, getUseSAXDocxExtractor, setByteArrayMaxOverride, setConcatenatePhoneticRuns, setDateFormatOverride, setExtractAllAlternativesFromMSG, setExtractMacros, setIncludeDeletedContent, setIncludeMoveFromContent, setIncludeShapeBasedContent, setUseSAXDocxExtractor, setUseSAXPptxExtractor
-
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
-
Method Detail
-
getSupportedTypes
public java.util.Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
Parameters:
context - parse context
Returns:
immutable set of media types
-
parse
public void parse(java.io.InputStream stream, org.xml.sax.ContentHandler handler, Metadata metadata, ParseContext context) throws java.io.IOException, org.xml.sax.SAXException, TikaException
Extracts properties and text from an MS Document input stream.
Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
Throws:
java.io.IOException - if the document stream could not be read
org.xml.sax.SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
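The handler argument receives the extracted content as XHTML SAX events. The following is a minimal sketch of a handler that collects the text of those events. TextCollectingHandler and its collect method are hypothetical names introduced for illustration, and the events are produced here by the JDK's own SAX parser from an XHTML string rather than by OfficeParser, so the sketch runs without Tika on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates every character event it receives.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character data may arrive in several chunks; append each one in order.
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // Stand-in for a parser: feeds the XHTML string to the handler as SAX events.
    static String collect(String xhtml) throws Exception {
        TextCollectingHandler handler = new TextCollectingHandler();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                handler);
        return handler.text();
    }
}
```

With Tika available, a handler of this kind would presumably be passed as the second argument, following the signature above: new OfficeParser().parse(stream, handler, new Metadata(), new ParseContext()).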
-
extractMacros
public static void extractMacros(POIFSFileSystem fs, org.xml.sax.ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor) throws java.io.IOException, org.xml.sax.SAXException
Helper to extract macros from an NPOIFS/vbaProject.bin. As of POI-3.15-final, there are still some bugs in VBAMacroReader. For now, we are swallowing NPE and other runtime exceptions.
Parameters:
fs - NPOIFS to extract from
xhtml - SAX writer
embeddedDocumentExtractor - extractor for embedded documents
Throws:
java.io.IOException - if an I/O error occurs during extraction of the embedded document
org.xml.sax.SAXException - if writing to xhtml fails
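The error handling described above (runtime exceptions swallowed, IOException still thrown) can be sketched as follows. This is not Tika's actual code: LenientMacroStep, Step, and runLeniently are hypothetical names, and the sketch only illustrates the general pattern of tolerating unchecked failures while letting checked I/O errors reach the caller.

```java
import java.io.IOException;

// Hypothetical illustration of "swallow runtime exceptions, propagate I/O errors".
class LenientMacroStep {

    /** A unit of extraction work that may fail with a checked I/O error. */
    interface Step {
        void run() throws IOException;
    }

    /**
     * Runs the step. Returns true if it completed normally, false if a
     * runtime exception (for example a NullPointerException) was swallowed.
     * An IOException is not caught and propagates to the caller.
     */
    static boolean runLeniently(Step step) throws IOException {
        try {
            step.run();
            return true;
        } catch (RuntimeException e) {
            // Swallowed on purpose: a buggy reader should not abort the whole parse.
            return false;
        }
    }
}
```

One consequence of this pattern is that a document whose macros trigger a reader bug still parses, but its macros may be silently absent from the output.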