Package org.apache.tika.parser.microsoft
Class OfficeParser
- java.lang.Object
  - org.apache.tika.parser.AbstractParser
    - org.apache.tika.parser.microsoft.AbstractOfficeParser
      - org.apache.tika.parser.microsoft.OfficeParser
All Implemented Interfaces:
java.io.Serializable, Parser
public class OfficeParser extends AbstractOfficeParser
Defines a Microsoft document content extractor.
See Also:
- Serialized Form
Nested Class Summary
- static class OfficeParser.POIFSDocumentType
Constructor Summary
- OfficeParser()
Method Summary
- static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, org.xml.sax.ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor): Helper to extract macros from an NPOIFS/vbaProject.bin. As of POI-3.15-final, there are still some bugs in VBAMacroReader.
- java.util.Set<MediaType> getSupportedTypes(ParseContext context): Returns the set of media types supported by this parser when used with the given parse context.
- void parse(java.io.InputStream stream, org.xml.sax.ContentHandler handler, Metadata metadata, ParseContext context): Extracts properties and text from an MS Document input stream.
Methods inherited from class org.apache.tika.parser.microsoft.AbstractOfficeParser
configure, getExtractAllAlternativesFromMSG, getExtractMacros, getIncludeDeletedContent, getIncludeMoveFromContent, getUseSAXDocxExtractor, setByteArrayMaxOverride, setConcatenatePhoneticRuns, setDateFormatOverride, setExtractAllAlternativesFromMSG, setExtractMacros, setIncludeDeletedContent, setIncludeMoveFromContent, setIncludeShapeBasedContent, setUseSAXDocxExtractor, setUseSAXPptxExtractor
-
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
Method Detail
-
getSupportedTypes
public java.util.Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
Parameters:
context - parse context
Returns:
immutable set of media types
-
parse
public void parse(java.io.InputStream stream, org.xml.sax.ContentHandler handler, Metadata metadata, ParseContext context) throws java.io.IOException, org.xml.sax.SAXException, TikaException
Extracts properties and text from an MS Document input stream.
Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
Throws:
java.io.IOException - if the document stream could not be read
org.xml.sax.SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
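Because parse reports the document body as XHTML SAX events, the handler argument decides what the caller gets back. The following is a minimal sketch of one possible handler, using only the JDK's SAX classes. The class name TextCollector and its getText method are hypothetical and not part of Tika, and treating the end of a p element as a line break is an assumption about how the caller wants the text laid out.

```java
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical helper (not part of Tika): gathers the character data
 * delivered through SAX events into a single plain-text string.
 */
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append only the slice of the buffer that the event describes.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Assumption: the end of a paragraph element becomes a line break.
        if ("p".equals(localName)) {
            text.append('\n');
        }
    }

    /** Returns all text collected so far. */
    public String getText() {
        return text.toString();
    }
}
```

Since TextCollector is an org.xml.sax.ContentHandler, an instance could be passed as the second argument, for example `new OfficeParser().parse(stream, collector, new Metadata(), new ParseContext())`, and the extracted text read afterwards with `collector.getText()`.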
-
extractMacros
public static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, org.xml.sax.ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor) throws java.io.IOException, org.xml.sax.SAXException
Helper to extract macros from an NPOIFS/vbaProject.bin. As of POI-3.15-final, there are still some bugs in VBAMacroReader. For now, NullPointerException and other runtime exceptions are swallowed.
Parameters:
fs - NPOIFS to extract from
xhtml - SAX writer
embeddedDocumentExtractor - extractor for embedded documents
Throws:
java.io.IOException - if an I/O error occurs during the extraction of the embedded document
org.xml.sax.SAXException - if a SAX error occurs while writing to xhtml