Class RecursiveParserWrapper

  • All Implemented Interfaces:
    java.io.Serializable, Parser

    public class RecursiveParserWrapper
    extends ParserDecorator
    This is a helper class that wraps a parser in a recursive handler. It takes care of setting the embedded parser in the ParseContext and handling the embedded path calculations.

    After parsing a document, call getMetadata() to retrieve a list of Metadata objects, one for each embedded resource. The first item in the list will contain the Metadata for the outer container file.
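The shape of the returned list can be pictured with a small stand-alone model. This is only a sketch: `Resource`, `FlattenSketch`, and the `embeddedPath` key are hypothetical stand-ins rather than Tika types, and the order of the entries after the first is a choice made by this sketch, not something stated above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document that may contain embedded documents.
final class Resource {
    final String name;
    final List<Resource> children = new ArrayList<>();

    Resource(String name, Resource... embedded) {
        this.name = name;
        for (Resource r : embedded) {
            children.add(r);
        }
    }
}

final class FlattenSketch {
    // Returns one metadata-like map per resource; index 0 is the container.
    static List<Map<String, String>> flatten(Resource container) {
        List<Map<String, String>> out = new ArrayList<>();
        visit(container, "", out);
        return out;
    }

    private static void visit(Resource r, String path, List<Map<String, String>> out) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("name", r.name);
        // The container gets an empty path; embedded items get "/parent/child".
        metadata.put("embeddedPath", path);
        out.add(metadata);
        for (Resource child : r.children) {
            visit(child, path + "/" + child.name, out);
        }
    }

    public static void main(String[] args) {
        Resource zip = new Resource("outer.zip",
                new Resource("report.docx", new Resource("image1.png")),
                new Resource("notes.txt"));
        for (Map<String, String> m : flatten(zip)) {
            System.out.println(m);
        }
    }
}
```

Reading index 0 of the result gives the container's entry, and every later index corresponds to one embedded resource.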

    Content can also be extracted and stored in the TIKA_CONTENT field of a Metadata object. Select the type of content to be stored at initialization.

    If a WriteLimitReachedException is encountered, the wrapper will stop processing the current resource, and it will not process any of the child resources of that resource. Otherwise, it will try to parse as much as it can. If a WriteLimitReachedException is reached in the parent document, no child resources will be parsed.
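One way to read the write-limit rule is sketched below with stand-in types. `WriteLimitSketch` and `Node` are hypothetical, the limit is assumed to apply per resource, and the sketch assumes that siblings of a truncated resource are still processed; none of these details are confirmed by the text above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WriteLimitSketch {
    // Hypothetical stand-in for a resource with text content and embedded children.
    static final class Node {
        final String name;
        final String text;
        final List<Node> children;

        Node(String name, String text, Node... children) {
            this.name = name;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    // Returns "name=storedText" for each resource that was processed.
    static List<String> parse(Node root, int writeLimit) {
        List<String> out = new ArrayList<>();
        visit(root, writeLimit, out);
        return out;
    }

    private static void visit(Node node, int writeLimit, List<String> out) {
        boolean limitReached = node.text.length() > writeLimit;
        // Keep whatever fit under the limit (assumed per-resource budget).
        String stored = limitReached ? node.text.substring(0, writeLimit) : node.text;
        out.add(node.name + "=" + stored);
        if (limitReached) {
            return; // children of a truncated resource are skipped
        }
        for (Node child : node.children) {
            visit(child, writeLimit, out);
        }
    }
}
```

In this model, hitting the limit in the root means only the root's truncated entry is produced, which mirrors the last sentence of the paragraph above.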

    The implementation is based on Jukka's RecursiveMetadataParser and Nick's additions. See: RecursiveMetadataParser.

    Note that this wrapper holds all data in memory and is not appropriate for files with content too large to be held in memory.

    Note, too, that this wrapper is not thread safe because it stores state. The client must initialize a new wrapper for each thread, and the client is responsible for calling reset() after each parse.
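The one-wrapper-per-thread rule and the reset-after-parse rule can be combined in a small pattern. The sketch below uses a hypothetical `StatefulWrapper` stand-in rather than the real class, so the method names `parse`, `getResults`, and `reset` here are illustrative only; the `ThreadLocal` and `try`/`finally` structure is the point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a wrapper that accumulates results between calls.
final class StatefulWrapper {
    private final List<String> results = new ArrayList<>();

    void parse(String document) {
        results.add("metadata for " + document);
    }

    List<String> getResults() {
        return new ArrayList<>(results); // defensive copy, survives reset()
    }

    void reset() {
        results.clear();
    }
}

final class PerThreadSketch {
    // One wrapper per thread, so no two threads share the accumulated state.
    private static final ThreadLocal<StatefulWrapper> WRAPPER =
            ThreadLocal.withInitial(StatefulWrapper::new);

    static List<String> parseOne(String document) {
        StatefulWrapper wrapper = WRAPPER.get();
        try {
            wrapper.parse(document);
            return wrapper.getResults();
        } finally {
            // Clear state even if parsing throws, so the next parse starts clean.
            wrapper.reset();
        }
    }

    static List<List<String>> parseAll(List<String> documents, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String doc : documents) {
                Callable<List<String>> task = () -> parseOne(doc);
                futures.add(pool.submit(task));
            }
            List<List<String>> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.add(f.get());
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseAll(Arrays.asList("a.zip", "b.docx", "c.pdf"), 2));
    }
}
```

Copying the results before `reset()` matters in this model, because the reset clears the list that the results were read from.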

    The unit tests for this class are in the tika-parsers module.

    See Also:
    Serialized Form
    • Constructor Detail

      • RecursiveParserWrapper

        public RecursiveParserWrapper(Parser wrappedParser)
        Initialize the wrapper with catchEmbeddedExceptions set to true as default.
        Parameters:
        wrappedParser - parser to use for the container documents and the embedded documents
      • RecursiveParserWrapper

        public RecursiveParserWrapper(Parser wrappedParser,
                                      boolean catchEmbeddedExceptions)
        Parameters:
        wrappedParser - parser to wrap
        catchEmbeddedExceptions - whether or not to catch and record embedded exceptions. If set to false, embedded exceptions will be thrown and the rest of the file will not be parsed. The following are never ignored, regardless of this setting: CorruptedFileException, RuntimeException
      • RecursiveParserWrapper

        @Deprecated
        public RecursiveParserWrapper(Parser wrappedParser,
                                      ContentHandlerFactory contentHandlerFactory)
        Initialize the wrapper with catchEmbeddedExceptions set to true as default.
        Parameters:
        wrappedParser - parser to use for the container documents and the embedded documents
        contentHandlerFactory - factory to use to generate a new content handler for the container document and each embedded document
      • RecursiveParserWrapper

        @Deprecated
        public RecursiveParserWrapper(Parser wrappedParser,
                                      ContentHandlerFactory contentHandlerFactory,
                                      boolean catchEmbeddedExceptions)
        Initialize the wrapper.
        Parameters:
        wrappedParser - parser to use for the container documents and the embedded documents
        contentHandlerFactory - factory to use to generate a new content handler for the container document and each embedded document
        catchEmbeddedExceptions - whether or not to catch the embedded exceptions. If set to true, the stack traces will be stored in the metadata object with key: EMBEDDED_EXCEPTION.
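The effect of the catchEmbeddedExceptions flag can be sketched without Tika on the classpath. Everything below is a hypothetical model: `EmbeddedParse` stands in for parsing one embedded document, `IOException` stands in for a catchable parse failure, and the string key is a placeholder for whatever EMBEDDED_EXCEPTION resolves to. Runtime exceptions are deliberately not caught, in line with the constructor notes above.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EmbeddedExceptionSketch {
    // Placeholder key; the real key is defined by the library, not by this sketch.
    static final String EMBEDDED_EXCEPTION_KEY = "EMBEDDED_EXCEPTION";

    // Hypothetical stand-in for parsing one embedded document.
    interface EmbeddedParse {
        void run(Map<String, String> metadata) throws IOException;
    }

    static List<Map<String, String>> parseAll(List<EmbeddedParse> embedded,
                                              boolean catchEmbeddedExceptions)
            throws IOException {
        List<Map<String, String>> results = new ArrayList<>();
        for (EmbeddedParse parse : embedded) {
            Map<String, String> metadata = new LinkedHashMap<>();
            try {
                parse.run(metadata);
            } catch (IOException e) {
                if (!catchEmbeddedExceptions) {
                    throw e; // abort: remaining embedded documents are not parsed
                }
                // Record the stack trace on this document's metadata and continue.
                StringWriter trace = new StringWriter();
                e.printStackTrace(new PrintWriter(trace, true));
                metadata.put(EMBEDDED_EXCEPTION_KEY, trace.toString());
            }
            results.add(metadata);
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        List<EmbeddedParse> docs = Arrays.asList(
                m -> m.put("name", "ok.txt"),
                m -> { throw new IOException("simulated failure"); });
        System.out.println(parseAll(docs, true).get(1).keySet());
    }
}
```

With the flag set to true, the failing document still gets an entry in the result list, and the documents after it are still parsed.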
    • Method Detail

      • getSupportedTypes

        public java.util.Set<MediaType> getSupportedTypes(ParseContext context)
        Description copied from class: ParserDecorator
        Delegates the method call to the decorated parser. Subclasses should override this method (and use super.getSupportedTypes() to invoke the decorated parser) to implement extra decoration.
        Specified by:
        getSupportedTypes in interface Parser
        Overrides:
        getSupportedTypes in class ParserDecorator
        Parameters:
        context - parse context
        Returns:
        immutable set of media types
      • parse

        public void parse(java.io.InputStream stream,
                          org.xml.sax.ContentHandler recursiveParserWrapperHandler,
                          Metadata metadata,
                          ParseContext context)
                   throws java.io.IOException,
                          org.xml.sax.SAXException,
                          TikaException
        Acts like a regular parser, except that it ignores the ContentHandler and automatically sets the embedded Parser in the ParseContext object, overwriting any value already there.

        To retrieve the results of the parse, use getMetadata().

        Make sure to call reset() after each parse.

        Specified by:
        parse in interface Parser
        Overrides:
        parse in class ParserDecorator
        Parameters:
        stream - the document stream (input)
        recursiveParserWrapperHandler - handler for the XHTML SAX events (output)
        metadata - document metadata (input and output)
        context - parse context
        Throws:
        java.io.IOException - if the document stream could not be read
        org.xml.sax.SAXException - if the SAX events could not be processed
        TikaException - if the document could not be parsed
      • setMaxEmbeddedResources

        @Deprecated
        public void setMaxEmbeddedResources(int max)
        Deprecated.
        Set the maximum number of embedded resources to store. If the max is hit during parsing, the EMBEDDED_RESOURCE_LIMIT_REACHED property will be added to the container document's Metadata.

        If this value is < 0 (the default), the wrapper will store all Metadata.

        Parameters:
        max - maximum number of embedded resources to store
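The limit semantics described for this setter can be modelled in a few lines. This is a sketch under assumptions: `MaxEmbeddedSketch` is hypothetical, the string key is a placeholder for the EMBEDDED_RESOURCE_LIMIT_REACHED property, the container is assumed not to count toward the limit, and the flag is assumed to be set only when at least one embedded resource is actually dropped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MaxEmbeddedSketch {
    // Placeholder key standing in for the EMBEDDED_RESOURCE_LIMIT_REACHED property.
    static final String LIMIT_REACHED_KEY = "EMBEDDED_RESOURCE_LIMIT_REACHED";

    static List<Map<String, String>> collect(String containerName,
                                             List<String> embeddedNames,
                                             int max) {
        Map<String, String> container = new LinkedHashMap<>();
        container.put("name", containerName);

        List<Map<String, String>> stored = new ArrayList<>();
        stored.add(container); // the container's entry is always first
        int storedEmbedded = 0;
        for (String name : embeddedNames) {
            // A negative max means "no limit": store everything.
            if (max >= 0 && storedEmbedded >= max) {
                container.put(LIMIT_REACHED_KEY, "true");
                break;
            }
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("name", name);
            stored.add(metadata);
            storedEmbedded++;
        }
        return stored;
    }
}
```

A caller in this model would check the container entry at index 0 for the flag to learn whether the list is complete.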