Class ParserContainerExtractor

  • All Implemented Interfaces:
    java.io.Serializable, ContainerExtractor

    public class ParserContainerExtractor
    extends java.lang.Object
    implements ContainerExtractor
    An implementation of ContainerExtractor powered by the regular Parser API. It extracts all the embedded resources from container files supported by the normal Tika parsers. By default the AutoDetectParser is used, which allows extraction from the widest range of containers.
    See Also:
    Serialized Form
    • Constructor Detail

      • ParserContainerExtractor

        public ParserContainerExtractor()
      • ParserContainerExtractor

        public ParserContainerExtractor(TikaConfig config)
      • ParserContainerExtractor

        public ParserContainerExtractor(Parser parser,
                                        Detector detector)
    • Method Detail

      • extract

        public void extract(TikaInputStream stream,
                            ContainerExtractor recurseExtractor,
                            EmbeddedResourceHandler handler)
                     throws java.io.IOException,
                            TikaException
        Description copied from interface: ContainerExtractor
        Processes a container file, and extracts all the embedded resources from within it.

        The EmbeddedResourceHandler you supply will be called for each embedded resource in the container. It is up to you whether you process the contents of the resource or not.

        The given document stream is consumed but not closed by this method. Responsibility for closing the stream remains with the caller.

        If required, nested containers (such as a .docx within a .zip) can be recursed into automatically and processed inline. If no recurseExtractor is given, nested containers are treated like any other embedded resource.
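
The contract described above (one handler call per embedded resource, a stream that is consumed but not closed, and optional recursion into nested containers) can be illustrated with a small standalone sketch. It deliberately uses java.util.zip rather than Tika, so it runs without Tika on the classpath. It is not how ParserContainerExtractor is implemented. The names RecursionSketch, ResourceHandler, walk, zipOf and names are hypothetical and exist only for this illustration, and the boolean recurse flag stands in for passing or omitting a recurseExtractor. It assumes Java 9 or later for readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class RecursionSketch {

    /** Hypothetical stand-in for a per-resource callback; not a Tika type. */
    interface ResourceHandler {
        void handle(String name, byte[] content) throws IOException;
    }

    /**
     * Calls the handler once per ZIP entry. The input stream is consumed but
     * not closed. When recurse is true, entries whose names end in ".zip" are
     * walked inline instead of being passed to the handler.
     */
    public static void walk(InputStream in, boolean recurse, ResourceHandler handler)
            throws IOException {
        ZipInputStream zip = new ZipInputStream(in); // deliberately left open
        for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
            byte[] content = zip.readAllBytes(); // reads to the end of this entry
            if (recurse && e.getName().endsWith(".zip")) {
                walk(new ByteArrayInputStream(content), true, handler);
            } else {
                handler.handle(e.getName(), content);
            }
        }
    }

    /** Builds an in-memory ZIP archive from parallel name/content arrays. */
    public static byte[] zipOf(String[] names, byte[][] contents) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < names.length; i++) {
                zip.putNextEntry(new ZipEntry(names[i]));
                zip.write(contents[i]);
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Walks a sample archive (inner.zip containing a.txt, plus b.txt). */
    public static List<String> names(boolean recurse) throws IOException {
        byte[] inner = zipOf(new String[] {"a.txt"},
                new byte[][] {"hello".getBytes(StandardCharsets.UTF_8)});
        byte[] outer = zipOf(new String[] {"inner.zip", "b.txt"},
                new byte[][] {inner, "world".getBytes(StandardCharsets.UTF_8)});
        List<String> seen = new ArrayList<>();
        walk(new ByteArrayInputStream(outer), recurse, (name, content) -> seen.add(name));
        return seen;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(names(false)); // nested archive reported as a plain resource
        System.out.println(names(true));  // nested archive expanded inline
    }
}
```

Without recursion the handler sees the nested archive as a single resource named inner.zip. With recursion it sees the archive's own entries in its place, which mirrors the difference between omitting and supplying a recurseExtractor.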

        Specified by:
        extract in interface ContainerExtractor
        Parameters:
        stream - the document stream (input)
        recurseExtractor - the extractor to use on any embedded containers
        handler - handler for the embedded files (output)
        Throws:
        java.io.IOException - if the document stream could not be read
        TikaException - if the container could not be parsed