Class RecursiveParserWrapper
- java.lang.Object
  - org.apache.tika.parser.AbstractParser
    - org.apache.tika.parser.ParserDecorator
      - org.apache.tika.parser.RecursiveParserWrapper
-
- All Implemented Interfaces:
java.io.Serializable, Parser
public class RecursiveParserWrapper extends ParserDecorator
This is a helper class that wraps a parser in a recursive handler. It takes care of setting the embedded parser in the ParseContext and handling the embedded path calculations.
After parsing a document, call getMetadata() to retrieve a list of Metadata objects, one for each embedded resource. The first item in the list will contain the Metadata for the outer container file.
Content can also be extracted and stored in the TIKA_CONTENT field of a Metadata object. Select the type of content to be stored at initialization.
If a WriteLimitReachedException (WLRE) is encountered, the wrapper will stop processing the current resource, and it will not process any of the child resources for the given resource. However, it will try to parse as much as it can. If a WLRE is reached in the parent document, no child resources will be parsed.
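The method summary on this page marks getMetadata() as deprecated in favor of a RecursiveParserWrapperHandler. A minimal sketch of that handler-based flow follows. It assumes the Tika 1.x class locations org.apache.tika.sax.RecursiveParserWrapperHandler and org.apache.tika.sax.BasicContentHandlerFactory, and it assumes that the TIKA_CONTENT property is stored under the key "X-TIKA:content". The class name RecursiveParseExample and the method parseAll are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;

public class RecursiveParseExample {

    /** Parses the stream and returns one Metadata object per resource, container first. */
    public static List<Metadata> parseAll(InputStream stream) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());
        // HANDLER_TYPE.TEXT selects plain-text content; a write limit below 0 stores all characters.
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
        wrapper.parse(stream, handler, new Metadata(), new ParseContext());
        return handler.getMetadataList();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "hello recursive parsing".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            for (Metadata m : parseAll(in)) {
                // Assumed key of the TIKA_CONTENT property; check the constant in your Tika version.
                System.out.println(m.get("X-TIKA:content"));
            }
        }
    }
}
```

Because the handler holds the collected Metadata list in this flow, a fresh handler is created for every parse rather than reusing one across documents.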
The implementation is based on Jukka's RecursiveMetadataParser and Nick's additions. See: RecursiveMetadataParser.
Note that this wrapper holds all data in memory and is not appropriate for files with content too large to be held in memory.
Note, too, that this wrapper is not thread safe because it stores state. The client must initialize a new wrapper for each thread, and the client is responsible for calling reset() after each parse.
The unit tests for this class are in the tika-parsers module.
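One way to follow the one-wrapper-per-thread rule is a ThreadLocal, sketched below. The class name PerThreadWrapper is hypothetical, and the use of AutoDetectParser as the wrapped parser is an assumption; any Parser could be substituted.

```java
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.RecursiveParserWrapper;

public class PerThreadWrapper {

    // Each thread lazily creates and then reuses its own wrapper instance.
    private static final ThreadLocal<RecursiveParserWrapper> WRAPPER =
            ThreadLocal.withInitial(() -> new RecursiveParserWrapper(new AutoDetectParser()));

    /** Returns the wrapper that belongs to the calling thread. */
    public static RecursiveParserWrapper get() {
        return WRAPPER.get();
    }
}
```

A caller obtains the wrapper with PerThreadWrapper.get() inside the worker thread. The sketch only isolates instances per thread; clearing state between parses on the same thread remains the caller's job, as described above.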
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes
Modifier and Type    Class    Description
static class    RecursiveParserWrapper.WriteLimitReached
-
Field Summary
Fields
Modifier and Type    Field    Description
static Property    EMBEDDED_EXCEPTION    Deprecated.
static Property    EMBEDDED_RESOURCE_LIMIT_REACHED
static Property    EMBEDDED_RESOURCE_PATH    Deprecated.
static Property    PARSE_TIME_MILLIS    Deprecated.
static Property    TIKA_CONTENT    Deprecated.
static Property    WRITE_LIMIT_REACHED    Deprecated.
-
Constructor Summary
Constructors
Constructor    Description
RecursiveParserWrapper(Parser wrappedParser)    Initialize the wrapper with catchEmbeddedExceptions set to true as default.
RecursiveParserWrapper(Parser wrappedParser, boolean catchEmbeddedExceptions)
RecursiveParserWrapper(Parser wrappedParser, ContentHandlerFactory contentHandlerFactory)    Deprecated.
RecursiveParserWrapper(Parser wrappedParser, ContentHandlerFactory contentHandlerFactory, boolean catchEmbeddedExceptions)    Deprecated.
-
Method Summary
Methods
Modifier and Type    Method    Description
java.util.List<Metadata>    getMetadata()    Deprecated. Use a RecursiveParserWrapperHandler instead.
java.util.Set<MediaType>    getSupportedTypes(ParseContext context)    Delegates the method call to the decorated parser.
void    parse(java.io.InputStream stream, org.xml.sax.ContentHandler recursiveParserWrapperHandler, Metadata metadata, ParseContext context)    Acts like a regular parser except it ignores the ContentHandler and it automatically sets/overwrites the embedded Parser in the ParseContext object.
void    reset()    Deprecated. Use a RecursiveParserWrapperHandler instead.
void    setMaxEmbeddedResources(int max)    Deprecated. Set this on a RecursiveParserWrapperHandler.
-
Methods inherited from class org.apache.tika.parser.ParserDecorator
getDecorationName, getWrappedParser, withFallbacks, withoutTypes, withTypes
-
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
-
-
-
-
Field Detail
-
TIKA_CONTENT
@Deprecated public static final Property TIKA_CONTENT
Deprecated.
-
PARSE_TIME_MILLIS
@Deprecated public static final Property PARSE_TIME_MILLIS
Deprecated.
-
WRITE_LIMIT_REACHED
@Deprecated public static final Property WRITE_LIMIT_REACHED
Deprecated.
-
EMBEDDED_RESOURCE_LIMIT_REACHED
@Deprecated public static final Property EMBEDDED_RESOURCE_LIMIT_REACHED
-
EMBEDDED_EXCEPTION
@Deprecated public static final Property EMBEDDED_EXCEPTION
Deprecated.
-
EMBEDDED_RESOURCE_PATH
@Deprecated public static final Property EMBEDDED_RESOURCE_PATH
Deprecated.
-
-
Constructor Detail
-
RecursiveParserWrapper
public RecursiveParserWrapper(Parser wrappedParser)
Initialize the wrapper with catchEmbeddedExceptions set to true as default.
- Parameters:
wrappedParser - parser to use for the container documents and the embedded documents
-
RecursiveParserWrapper
public RecursiveParserWrapper(Parser wrappedParser, boolean catchEmbeddedExceptions)
- Parameters:
wrappedParser - parser to wrap
catchEmbeddedExceptions - whether or not to catch and record embedded exceptions. If set to false, embedded exceptions will be thrown and the rest of the file will not be parsed. The following will not be ignored: CorruptedFileException, RuntimeException
-
RecursiveParserWrapper
@Deprecated public RecursiveParserWrapper(Parser wrappedParser, ContentHandlerFactory contentHandlerFactory)
Deprecated.
Initialize the wrapper with catchEmbeddedExceptions set to true as default.
- Parameters:
wrappedParser - parser to use for the container documents and the embedded documents
contentHandlerFactory - factory to use to generate a new content handler for the container document and each embedded document
-
RecursiveParserWrapper
@Deprecated public RecursiveParserWrapper(Parser wrappedParser, ContentHandlerFactory contentHandlerFactory, boolean catchEmbeddedExceptions)
Deprecated.
Initialize the wrapper.
- Parameters:
wrappedParser - parser to use for the container documents and the embedded documents
contentHandlerFactory - factory to use to generate a new content handler for the container document and each embedded document
catchEmbeddedExceptions - whether or not to catch the embedded exceptions. If set to true, the stack traces will be stored in the metadata object with key: EMBEDDED_EXCEPTION.
-
-
Method Detail
-
getSupportedTypes
public java.util.Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from class: ParserDecorator
Delegates the method call to the decorated parser. Subclasses should override this method (and use super.getSupportedTypes() to invoke the decorated parser) to implement extra decoration.
- Specified by:
getSupportedTypes in interface Parser
- Overrides:
getSupportedTypes in class ParserDecorator
- Parameters:
context - parse context
- Returns:
- immutable set of media types
-
parse
public void parse(java.io.InputStream stream, org.xml.sax.ContentHandler recursiveParserWrapperHandler, Metadata metadata, ParseContext context) throws java.io.IOException, org.xml.sax.SAXException, TikaException
Acts like a regular parser except it ignores the ContentHandler and it automatically sets/overwrites the embedded Parser in the ParseContext object.
To retrieve the results of the parse, use getMetadata().
Make sure to call reset() after each parse.
- Specified by:
parse in interface Parser
- Overrides:
parse in class ParserDecorator
- Parameters:
stream - the document stream (input)
recursiveParserWrapperHandler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
- Throws:
java.io.IOException - if the document stream could not be read
org.xml.sax.SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
-
getMetadata
@Deprecated public java.util.List<Metadata> getMetadata()
Deprecated. Use a RecursiveParserWrapperHandler instead.
The first element in the returned list represents the data from the outer container file. There is no guarantee about the ordering of the list after that.
- Returns:
- list of Metadata objects that were gathered during the parse
- Throws:
java.lang.IllegalStateException - if you've used a RecursiveParserWrapperHandler in your last call to parse(InputStream, ContentHandler, Metadata, ParseContext)
-
setMaxEmbeddedResources
@Deprecated public void setMaxEmbeddedResources(int max)
Deprecated. Set this on a RecursiveParserWrapperHandler.
Set the maximum number of embedded resources to store. If the max is hit during parsing, the EMBEDDED_RESOURCE_LIMIT_REACHED property will be added to the container document's Metadata.
If this value is < 0 (the default), the wrapper will store all Metadata.
- Parameters:
max - maximum number of embedded resources to store
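Since the deprecation note points to RecursiveParserWrapperHandler, the limit can be sketched on the handler side as follows. This assumes the two-argument constructor RecursiveParserWrapperHandler(ContentHandlerFactory, int) and the constant AbstractRecursiveParserWrapperHandler.EMBEDDED_RESOURCE_LIMIT_REACHED are available in your Tika version. The class name LimitedRecursiveParse and its methods are hypothetical.

```java
import java.io.InputStream;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.AbstractRecursiveParserWrapperHandler;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;

public class LimitedRecursiveParse {

    /** Parses the stream, storing at most maxEmbeddedResources embedded resources. */
    public static List<Metadata> parseWithLimit(InputStream stream, int maxEmbeddedResources)
            throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());
        // HANDLER_TYPE.IGNORE skips content extraction, so only metadata is collected.
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1),
                maxEmbeddedResources);
        wrapper.parse(stream, handler, new Metadata(), new ParseContext());
        return handler.getMetadataList();
    }

    /** Returns true if the container Metadata (first list item) carries the limit flag. */
    public static boolean limitReached(List<Metadata> metadataList) {
        Metadata container = metadataList.get(0);
        String flag = container.get(AbstractRecursiveParserWrapperHandler.EMBEDDED_RESOURCE_LIMIT_REACHED);
        return Boolean.parseBoolean(flag);
    }
}
```

Checking limitReached after a parse tells the caller whether the returned list may be incomplete.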
-
reset
@Deprecated public void reset()
Deprecated. Use a RecursiveParserWrapperHandler instead.
This clears the last parser state (metadata list, unknown count, hit embedded resource count).
- Throws:
java.lang.IllegalStateException - if you used a RecursiveParserWrapperHandler in your call to parse(InputStream, ContentHandler, Metadata, ParseContext)
-
-