Package org.apache.tika.extractor
Interface ContainerExtractor
- All Superinterfaces:
java.io.Serializable
- All Known Implementing Classes:
ParserContainerExtractor
public interface ContainerExtractor extends java.io.Serializable
Tika container extractor interface. Container extractors provide access to the embedded resources within container formats such as .zip and .doc.
Method Summary
Modifier and Type    Method    Description
void    extract(TikaInputStream stream, ContainerExtractor recurseExtractor, EmbeddedResourceHandler handler)    Processes a container file, and extracts all the embedded resources from within it.
boolean    isSupported(TikaInputStream input)    Is this Container Extractor able to process the supplied container?
Method Detail
isSupported
boolean isSupported(TikaInputStream input) throws java.io.IOException
Is this Container Extractor able to process the supplied container?
- Throws:
java.io.IOException
- Since:
Apache Tika 0.8
extract
void extract(TikaInputStream stream, ContainerExtractor recurseExtractor, EmbeddedResourceHandler handler) throws java.io.IOException, TikaException
Processes a container file and extracts all the embedded resources from within it.
The EmbeddedResourceHandler you supply will be called for each embedded resource in the container. It is up to you whether you process the contents of the resource or not.
The given document stream is consumed but not closed by this method. The responsibility to close the stream remains with the caller.
If required, nested containers (such as a .docx within a .zip) can automatically be recursed into and processed inline. If no recurseExtractor is given, nested containers are treated like any other embedded resource.
- Parameters:
stream - the document stream (input)
recurseExtractor - the extractor to use on any embedded containers
handler - handler for the embedded files (output)
- Throws:
java.io.IOException - if the document stream could not be read
TikaException - if the container could not be parsed
- Since:
Apache Tika 0.8
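The contract described for extract (the handler is called for each embedded resource, the stream is left open for the caller to close, and nested containers are recursed into only when a recurseExtractor is given) can be illustrated without Tika on the classpath. The following is a minimal sketch built only on java.util.zip. SimpleExtractor, ResourceHandler, ZipExtractor, and the helper methods are hypothetical stand-ins invented for this illustration, not Tika types. Reporting nested entries under a prefixed name, without also reporting the nested container itself, is an assumption of the sketch and not documented Tika behavior.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ContainerRecursionSketch {

    /** Hypothetical stand-in for EmbeddedResourceHandler (not the Tika type). */
    interface ResourceHandler {
        void handle(String name, byte[] content);
    }

    /** Hypothetical stand-in that mirrors the two ContainerExtractor methods. */
    interface SimpleExtractor {
        boolean isSupported(byte[] data);

        void extract(InputStream stream, SimpleExtractor recurseExtractor,
                     ResourceHandler handler) throws IOException;
    }

    /** Toy extractor for ZIP data, built only on java.util.zip. */
    static class ZipExtractor implements SimpleExtractor {
        @Override
        public boolean isSupported(byte[] data) {
            // A non-empty ZIP archive starts with the local file header "PK\3\4".
            return data.length >= 4 && data[0] == 'P' && data[1] == 'K'
                    && data[2] == 3 && data[3] == 4;
        }

        @Override
        public void extract(InputStream stream, SimpleExtractor recurseExtractor,
                            ResourceHandler handler) throws IOException {
            // The wrapper is deliberately not closed: closing it would close the
            // caller's stream, which stays the caller's responsibility.
            ZipInputStream zip = new ZipInputStream(stream);
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                byte[] content = readCurrentEntry(zip);
                if (recurseExtractor != null && recurseExtractor.isSupported(content)) {
                    // Nested container: process its resources inline (sketch's choice).
                    String prefix = entry.getName() + "/";
                    recurseExtractor.extract(new ByteArrayInputStream(content),
                            recurseExtractor,
                            (name, bytes) -> handler.handle(prefix + name, bytes));
                } else {
                    // No recursion requested, or not a container: ordinary resource.
                    handler.handle(entry.getName(), content);
                }
            }
        }

        private static byte[] readCurrentEntry(ZipInputStream zip) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = zip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Builds an in-memory ZIP from parallel arrays of names and contents. */
    static byte[] zip(String[] names, byte[][] contents) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            for (int i = 0; i < names.length; i++) {
                zos.putNextEntry(new ZipEntry(names[i]));
                zos.write(contents[i]);
                zos.closeEntry();
            }
        }
        return out.toByteArray();
    }

    /** Lists the names reported to the handler, with or without recursion. */
    static List<String> listNames(byte[] container, boolean recurse) throws IOException {
        SimpleExtractor extractor = new ZipExtractor();
        List<String> names = new ArrayList<>();
        // The caller opens the stream and the caller closes it.
        try (InputStream in = new ByteArrayInputStream(container)) {
            extractor.extract(in, recurse ? extractor : null,
                    (name, bytes) -> names.add(name));
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] inner = zip(new String[] {"b.txt"}, new byte[][] {text});
        byte[] outer = zip(new String[] {"a.txt", "inner.zip"}, new byte[][] {text, inner});

        System.out.println(listNames(outer, true));   // nested entries reported inline
        System.out.println(listNames(outer, false));  // inner.zip reported as one resource
    }
}
```

Running main prints [a.txt, inner.zip/b.txt] and then [a.txt, inner.zip]. Passing the extractor as its own recurseExtractor lets the recursion continue to any depth, while passing null stops at the first level.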