Class XMLReaderUtils

  • All Implemented Interfaces:
    java.io.Serializable

    public class XMLReaderUtils
    extends java.lang.Object
    implements java.io.Serializable
    Utility functions for reading XML. If you are doing SAX parsing, make sure to use the OfflineContentHandler to guard against XML External Entity attacks.
    See Also:
    Serialized Form
    • Constructor Summary

      Constructors 
      Constructor Description
      XMLReaderUtils()  
    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method Description
      static org.w3c.dom.Document buildDOM​(java.io.InputStream is)
      Builds a Document with a DocumentBuilder from the pool
      static org.w3c.dom.Document buildDOM​(java.io.InputStream is, ParseContext context)
      This checks context for a user specified DocumentBuilder.
      static org.w3c.dom.Document buildDOM​(java.lang.String uriString)
      Builds a Document with a DocumentBuilder from the pool
      static org.w3c.dom.Document buildDOM​(java.nio.file.Path path)
      Builds a Document with a DocumentBuilder from the pool
      static java.lang.String getAttrValue​(java.lang.String localName, org.xml.sax.Attributes atts)  
      static javax.xml.parsers.DocumentBuilder getDocumentBuilder()
      Returns the DOM builder specified in this parsing context.
      static javax.xml.parsers.DocumentBuilderFactory getDocumentBuilderFactory()
      Returns the DOM builder factory specified in this parsing context.
      static int getMaxEntityExpansions()  
      static int getPoolSize()  
      static javax.xml.parsers.SAXParser getSAXParser()
      Returns the SAX parser specified in this parsing context.
      static javax.xml.parsers.SAXParserFactory getSAXParserFactory()
      Returns the SAX parser factory specified in this parsing context.
      static javax.xml.transform.Transformer getTransformer()
      Returns a new transformer
      static javax.xml.stream.XMLInputFactory getXMLInputFactory()
      Returns the StAX input factory specified in this parsing context.
      static org.xml.sax.XMLReader getXMLReader()
      Returns the XMLReader specified in this parsing context.
      static void parseSAX​(java.io.InputStream is, org.xml.sax.helpers.DefaultHandler contentHandler, ParseContext context)
      This checks context for a user specified SAXParser.
      static void setMaxEntityExpansions​(int maxEntityExpansions)
      Set the maximum number of entity expansions allowable in SAX/DOM/StAX parsing.
      static void setPoolSize​(int poolSize)
      Set the pool size for cached XML parsers.
      • Methods inherited from class java.lang.Object

        equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • DEFAULT_POOL_SIZE

        public static final int DEFAULT_POOL_SIZE
        Default size for the pool of SAX Parsers and the pool of DOM builders
        See Also:
        Constant Field Values
      • DEFAULT_MAX_ENTITY_EXPANSIONS

        public static final int DEFAULT_MAX_ENTITY_EXPANSIONS
        See Also:
        Constant Field Values
    • Constructor Detail

      • XMLReaderUtils

        public XMLReaderUtils()
    • Method Detail

      • setMaxEntityExpansions

        public static void setMaxEntityExpansions​(int maxEntityExpansions)
        Set the maximum number of entity expansions allowable in SAX/DOM/StAX parsing. NOTE: A value less than or equal to zero indicates no limit. This overrides both the system property JAXP_ENTITY_EXPANSION_LIMIT_KEY and the DEFAULT_MAX_ENTITY_EXPANSIONS value for allowable entity expansions.

        NOTE: This setting reaches the pooled SAX and DOM parsers only when the pool is rebuilt. To trigger that rebuild, the client must call setPoolSize(int).

        Parameters:
        maxEntityExpansions - maximum number of allowable entity expansions
        Since:
        Apache Tika 1.19
      • getXMLReader

        public static org.xml.sax.XMLReader getXMLReader()
                                                  throws TikaException
        Returns the XMLReader specified in this parsing context. If a reader is not explicitly specified, then one is created using the specified or the default SAX parser.
        Returns:
        XMLReader
        Throws:
        TikaException
        Since:
        Apache Tika 1.13
        See Also:
        getSAXParser()
      • getSAXParser

        public static javax.xml.parsers.SAXParser getSAXParser()
                                                        throws TikaException
        Returns the SAX parser specified in this parsing context. If a parser is not explicitly specified, then one is created using the specified or the default SAX parser factory.

        Make sure to wrap your handler in the OfflineContentHandler to prevent XML External Entity attacks.

        If you call reset() on the parser, make sure to replace the SecurityManager which will be cleared by xerces2 on reset().

        Returns:
        SAX parser
        Throws:
        TikaException - if a SAX parser could not be created
        Since:
        Apache Tika 0.8
        See Also:
        getSAXParserFactory()
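
        To illustrate why wrapping the handler matters, here is a minimal sketch that uses only the JDK. OfflineTextHandler is a hypothetical stand-in, not Tika's OfflineContentHandler; it assumes the protection comes from resolving every external entity to empty content so that the parser never opens the referenced system id.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class OfflineHandlerSketch {

    /** Hypothetical handler: collects text and resolves external entities to nothing. */
    static class OfflineTextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Never open systemId; hand the parser an empty character stream instead.
            return new InputSource(new StringReader(""));
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses the XML string and returns its character content. */
    static String extractText(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        OfflineTextHandler handler = new OfflineTextHandler();
        // SAXParser.parse registers the handler as the entity resolver as well.
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // The file URI is a made-up example of an external entity reference.
        String xml = "<!DOCTYPE r [<!ENTITY x SYSTEM \"file:///hypothetical/secret.txt\">]>"
                + "<r>a&x;b</r>";
        System.out.println(extractText(xml));
    }
}
```

        Because the entity resolves to an empty stream, only the literal text around the reference reaches the handler.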
      • getSAXParserFactory

        public static javax.xml.parsers.SAXParserFactory getSAXParserFactory()
        Returns the SAX parser factory specified in this parsing context. If a factory is not explicitly specified, then a default factory instance is created and returned. The default factory instance is configured to be namespace-aware, not validating, and to use secure XML processing.

        Make sure to wrap your handler in the OfflineContentHandler to prevent XML External Entity attacks.

        Returns:
        SAX parser factory
        Since:
        Apache Tika 0.8
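
        The default configuration described above (namespace-aware, not validating, secure XML processing) can be approximated with plain JAXP calls. This is a sketch under that description only; the class and method names are hypothetical, and Tika's own factory may set further features.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;

public class SaxFactorySketch {

    /** Builds a factory that is namespace-aware, non-validating, and uses secure processing. */
    static SAXParserFactory newConfiguredFactory() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);
        // Standard JAXP switch for implementation-defined processing limits.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = newConfiguredFactory();
        System.out.println("namespace-aware: " + factory.isNamespaceAware());
        System.out.println("validating: " + factory.isValidating());
    }
}
```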
      • getDocumentBuilderFactory

        public static javax.xml.parsers.DocumentBuilderFactory getDocumentBuilderFactory()
        Returns the DOM builder factory specified in this parsing context. If a factory is not explicitly specified, then a default factory instance is created and returned. The default factory instance is configured to be namespace-aware and to apply reasonable security features.
        Returns:
        DOM parser factory
        Since:
        Apache Tika 1.13
      • getDocumentBuilder

        public static javax.xml.parsers.DocumentBuilder getDocumentBuilder()
                                                                    throws TikaException
        Returns the DOM builder specified in this parsing context. If a builder is not explicitly specified, then a builder instance is created and returned. The builder instance is configured to apply an IGNORING_SAX_ENTITY_RESOLVER, and it sets the ErrorHandler to null.
        Returns:
        DOM Builder
        Throws:
        TikaException
        Since:
        Apache Tika 1.13
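
        As a rough JDK-only illustration of a builder configured this way, the sketch below installs an entity resolver that returns empty content and sets the ErrorHandler to null. IGNORING_RESOLVER here is a hypothetical stand-in for IGNORING_SAX_ENTITY_RESOLVER, and enabling FEATURE_SECURE_PROCESSING is an assumption about what "reasonable security features" includes.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class DomBuilderSketch {

    /** Hypothetical resolver: every external entity resolves to an empty stream. */
    static final EntityResolver IGNORING_RESOLVER =
            (publicId, systemId) -> new InputSource(new StringReader(""));

    static DocumentBuilder newBuilder() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Assumption: secure processing is one of the security features applied.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(IGNORING_RESOLVER);
        // null makes the implementation fall back to its own default error handling.
        builder.setErrorHandler(null);
        return builder;
    }

    static Document parse(String xml) throws Exception {
        return newBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a:root xmlns:a=\"urn:example\"><a:child/></a:root>");
        System.out.println(doc.getDocumentElement().getLocalName());
    }
}
```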
      • getXMLInputFactory

        public static javax.xml.stream.XMLInputFactory getXMLInputFactory()
        Returns the StAX input factory specified in this parsing context. If a factory is not explicitly specified, then a default factory instance is created and returned. The default factory instance is configured to be namespace-aware and to apply reasonable security using the IGNORING_STAX_ENTITY_RESOLVER.
        Returns:
        StAX input factory
        Since:
        Apache Tika 1.13
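
        A comparable StAX setup can be sketched with the JDK alone. The resolver below is a hypothetical stand-in for IGNORING_STAX_ENTITY_RESOLVER; it assumes that "ignoring" means returning empty content for every external entity.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxFactorySketch {

    /** Hypothetical resolver: answers every entity lookup with an empty stream. */
    static final XMLResolver IGNORING_RESOLVER =
            (publicId, systemId, baseUri, namespace) -> new ByteArrayInputStream(new byte[0]);

    static XMLInputFactory newFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        factory.setXMLResolver(IGNORING_RESOLVER);
        return factory;
    }

    /** Counts START_ELEMENT events to show the factory in use. */
    static int countElements(String xml) throws Exception {
        XMLStreamReader reader = newFactory().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><c/></a>"));
    }
}
```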
      • getTransformer

        public static javax.xml.transform.Transformer getTransformer()
                                                              throws TikaException
        Returns a new transformer

        The transformer instance is configured to use secure XML processing.

        Returns:
        Transformer
        Throws:
        TikaException - when the transformer cannot be created
        Since:
        Apache Tika 1.17
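
        For reference, a transformer with secure XML processing can be obtained through standard JAXP as sketched below. The class and method names are hypothetical, and this shows only the secure-processing feature named above, not any other configuration Tika may apply.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformerSketch {

    /** Creates an identity transformer from a factory with secure processing enabled. */
    static Transformer newSecureTransformer() throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory.newTransformer();
    }

    /** Copies the input XML to a string through the identity transformer. */
    static String copy(String xml) throws Exception {
        Transformer transformer = newSecureTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(copy("<a>hi</a>"));
    }
}
```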
      • buildDOM

        public static org.w3c.dom.Document buildDOM​(java.io.InputStream is,
                                                    ParseContext context)
                                             throws TikaException,
                                                    java.io.IOException,
                                                    org.xml.sax.SAXException
        This checks context for a user specified DocumentBuilder. If one is not found, this reuses a DocumentBuilder from the pool.
        Parameters:
        is - InputStream to parse
        context - context to use
        Returns:
        a document
        Throws:
        TikaException
        java.io.IOException
        org.xml.sax.SAXException
        Since:
        Apache Tika 1.19
      • buildDOM

        public static org.w3c.dom.Document buildDOM​(java.nio.file.Path path)
                                             throws TikaException,
                                                    java.io.IOException,
                                                    org.xml.sax.SAXException
        Builds a Document with a DocumentBuilder from the pool
        Parameters:
        path - path to parse
        Returns:
        a document
        Throws:
        TikaException
        java.io.IOException
        org.xml.sax.SAXException
        Since:
        Apache Tika 1.19.1
      • buildDOM

        public static org.w3c.dom.Document buildDOM​(java.lang.String uriString)
                                             throws TikaException,
                                                    java.io.IOException,
                                                    org.xml.sax.SAXException
        Builds a Document with a DocumentBuilder from the pool
        Parameters:
        uriString - URI string of the document to parse
        Returns:
        a document
        Throws:
        TikaException
        java.io.IOException
        org.xml.sax.SAXException
        Since:
        Apache Tika 1.19.1
      • buildDOM

        public static org.w3c.dom.Document buildDOM​(java.io.InputStream is)
                                             throws TikaException,
                                                    java.io.IOException,
                                                    org.xml.sax.SAXException
        Builds a Document with a DocumentBuilder from the pool
        Returns:
        a document
        Throws:
        TikaException
        java.io.IOException
        org.xml.sax.SAXException
        Since:
        Apache Tika 1.19.1
      • parseSAX

        public static void parseSAX​(java.io.InputStream is,
                                    org.xml.sax.helpers.DefaultHandler contentHandler,
                                    ParseContext context)
                             throws TikaException,
                                    java.io.IOException,
                                    org.xml.sax.SAXException
        This checks context for a user specified SAXParser. If one is not found, this reuses a SAXParser from the pool.
        Parameters:
        is - InputStream to parse
        contentHandler - handler to use
        context - context to use
        Throws:
        TikaException
        java.io.IOException
        org.xml.sax.SAXException
        Since:
        Apache Tika 1.19
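
        The "context first, pool second" lookup described above can be pictured with the following sketch. It is not Tika's implementation: the map keyed by class is a hypothetical stand-in for ParseContext, and the pooled parser is passed in directly rather than borrowed from a real pool.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ContextFallbackSketch {

    /** Returns the parser registered in the context, or the pooled one if none is registered. */
    static SAXParser selectParser(Map<Class<?>, Object> context, SAXParser pooled) {
        Object custom = context.get(SAXParser.class);
        return custom instanceof SAXParser ? (SAXParser) custom : pooled;
    }

    /** Parses the stream with whichever parser selectParser picks. */
    static void parse(InputStream is, DefaultHandler handler,
                      Map<Class<?>, Object> context, SAXParser pooled) throws Exception {
        selectParser(context, pooled).parse(is, handler);
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser pooled = factory.newSAXParser();
        Map<Class<?>, Object> context = new HashMap<>(); // empty: falls back to the pooled parser
        InputStream is = new ByteArrayInputStream("<r/>".getBytes(StandardCharsets.UTF_8));
        parse(is, new DefaultHandler(), context, pooled);
        System.out.println("parsed with pooled parser: " + (selectParser(context, pooled) == pooled));
    }
}
```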
      • setPoolSize

        public static void setPoolSize​(int poolSize)
                                throws TikaException
        Set the pool size for cached XML parsers. As a side effect, this locks the pool and rebuilds it from scratch with the most recent settings, such as the maximum number of entity expansions.
        Parameters:
        poolSize - number of parsers to keep in the pool
        Throws:
        TikaException
        Since:
        Apache Tika 1.19
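
        The interaction between setMaxEntityExpansions and setPoolSize can be modelled with a small hypothetical pool. Everything in this sketch (class name, fields, the starting limit of 20) is an assumption made for illustration; it shows only the pattern in which a changed setting takes effect when the pool is rebuilt under a lock.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class ParserPoolSketch {
    private static final Object LOCK = new Object();
    private static volatile int maxEntityExpansions = 20; // hypothetical starting value
    private static int limitUsedForPool = -1;             // limit in effect at the last rebuild
    private static BlockingQueue<SAXParser> pool = new ArrayBlockingQueue<>(1);

    /** Records the new limit but leaves already pooled parsers untouched. */
    static void setMaxEntityExpansions(int max) {
        maxEntityExpansions = max;
    }

    /** Locks the pool and rebuilds it, picking up the most recent limit. */
    static void setPoolSize(int poolSize) throws Exception {
        synchronized (LOCK) {
            BlockingQueue<SAXParser> fresh = new ArrayBlockingQueue<>(poolSize);
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            for (int i = 0; i < poolSize; i++) {
                // A real pool would apply the entity expansion limit to each parser here.
                fresh.offer(factory.newSAXParser());
            }
            limitUsedForPool = maxEntityExpansions;
            pool = fresh;
        }
    }

    static int getPoolSize() {
        synchronized (LOCK) {
            return pool.size() + pool.remainingCapacity();
        }
    }

    static int limitUsedForPool() {
        synchronized (LOCK) {
            return limitUsedForPool;
        }
    }

    public static void main(String[] args) throws Exception {
        setPoolSize(3);
        setMaxEntityExpansions(5); // not yet visible to pooled parsers
        System.out.println("before rebuild: " + limitUsedForPool());
        setPoolSize(3);            // rebuild picks up the new limit
        System.out.println("after rebuild: " + limitUsedForPool());
    }
}
```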
      • getPoolSize

        public static int getPoolSize()
      • getMaxEntityExpansions

        public static int getMaxEntityExpansions()
      • getAttrValue

        public static java.lang.String getAttrValue​(java.lang.String localName,
                                                    org.xml.sax.Attributes atts)
        Parameters:
        localName - local name of the attribute to look up
        atts - attributes to search
        Returns:
        attribute value with that local name or null if not found
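
        A lookup with this contract could be written as the linear scan below. This is a sketch of the documented behavior (match on local name, null when absent), not the actual source of getAttrValue; the class and method names are hypothetical.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

public class AttrLookupSketch {

    /** Returns the value of the first attribute whose local name matches, or null. */
    static String findByLocalName(String localName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            if (localName.equals(atts.getLocalName(i))) {
                return atts.getValue(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        AttributesImpl atts = new AttributesImpl();
        // Arguments: namespace URI, local name, qualified name, type, value.
        atts.addAttribute("urn:example", "id", "ex:id", "CDATA", "42");
        System.out.println(findByLocalName("id", atts));
        System.out.println(findByLocalName("missing", atts));
    }
}
```

        Matching on the local name alone ignores the namespace URI, so two attributes with the same local name in different namespaces would be indistinguishable to this scan.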