Class ExtractorFactory


  • public final class ExtractorFactory
    extends java.lang.Object
    Figures out the correct POITextExtractor for your supplied document, and returns it.

    Note 1 - extraction will fail for many file formats if the POI Scratchpad jar is not present on the runtime classpath.

    Note 2 - in most cases you would be better off using Apache Tika rather than this class.

    • Field Detail

      • CORE_DOCUMENT_REL

        public static final java.lang.String CORE_DOCUMENT_REL
        See Also:
        Constant Field Values
    • Method Detail

      • getThreadPrefersEventExtractors

        public static boolean getThreadPrefersEventExtractors()
        Should this thread prefer event-based over usermodel-based extractors? (Usermodel extractors tend to be more accurate, but use more memory.) Default is false.
      • getAllThreadsPreferEventExtractors

        public static java.lang.Boolean getAllThreadsPreferEventExtractors()
        Should all threads prefer event-based over usermodel-based extractors? (Usermodel extractors tend to be more accurate, but use more memory.) Default is null, meaning the thread-level setting is used, which itself defaults to false.
      • setThreadPrefersEventExtractors

        public static void setThreadPrefersEventExtractors(boolean preferEventExtractors)
        Sets whether this thread should prefer event-based over usermodel-based extractors. This setting is only used if the all-threads setting is null.
      • setAllThreadsPreferEventExtractors

        public static void setAllThreadsPreferEventExtractors(java.lang.Boolean preferEventExtractors)
        Sets whether all threads should prefer event-based over usermodel-based extractors. If set (non-null), this takes precedence over the thread-level setting.
      • getPreferEventExtractor

        public static boolean getPreferEventExtractor()
        Should this thread use event-based extractors if available? Checks the all-threads setting first, then the thread-specific one.
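
The precedence between the all-threads and per-thread settings can be illustrated with a minimal, self-contained sketch. This is not the POI source: the class `EventPreferenceSketch` is hypothetical and only mirrors the documented behaviour, assuming the all-threads value is a nullable `Boolean` and the per-thread value is held in a `ThreadLocal` that defaults to false.

```java
// Hypothetical model of the documented precedence rule; not taken from POI itself.
final class EventPreferenceSketch {
    // Per-thread preference; documented default is false.
    private static final ThreadLocal<Boolean> threadPrefers =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // All-threads preference; null means "defer to the per-thread setting".
    private static volatile Boolean allThreadsPrefer = null;

    private EventPreferenceSketch() {}

    static void setThreadPrefersEventExtractors(boolean prefer) {
        threadPrefers.set(prefer);
    }

    static void setAllThreadsPreferEventExtractors(Boolean prefer) {
        allThreadsPrefer = prefer;
    }

    // Checks the all-threads setting first, then falls back to the per-thread one.
    static boolean getPreferEventExtractor() {
        Boolean all = allThreadsPrefer;
        return (all != null) ? all : threadPrefers.get();
    }
}
```

In this model, a non-null all-threads value always wins, so setting it to `Boolean.FALSE` disables event-based extraction even on a thread that asked for it; passing null restores the per-thread behaviour.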
      • createExtractor

        public static POITextExtractor createExtractor(OPCPackage pkg)
                                                throws java.io.IOException,
                                                       OpenXML4JException,
                                                       XmlException
        Tries to determine the actual type of file and produces a matching text-extractor for it.
        Parameters:
        pkg - An OPCPackage.
        Returns:
        A POIXMLTextExtractor for the given file.
        Throws:
        java.io.IOException - If an error occurs while reading the file
        OpenXML4JException - If an error is found while parsing the OpenXML file format.
        XmlException - If an XML parsing error occurs.
        java.lang.IllegalArgumentException - If no matching file type could be found.
      • getEmbededDocsTextExtractors

        @Deprecated
        @Removal(version="4.2")
        @NotImplemented
        public static POITextExtractor[] getEmbededDocsTextExtractors(POIXMLTextExtractor ext)
        Deprecated.
        Use getEmbeddedDocsTextExtractors(POIXMLTextExtractor) instead, which spells "embedded" correctly.
        Returns an array of text extractors, one for each of the embedded documents in the file (if there are any). If there are no embedded documents, you'll get back an empty array. Otherwise, you'll get one open POITextExtractor for each embedded file.
      • getEmbeddedDocsTextExtractors

        @NotImplemented
        public static POITextExtractor[] getEmbeddedDocsTextExtractors(POIXMLTextExtractor ext)
        Returns an array of text extractors, one for each of the embedded documents in the file (if there are any). If there are no embedded documents, you'll get back an empty array. Otherwise, you'll get one open POITextExtractor for each embedded file.