Class OOXMLWordAndPowerPointTextHandler

  • All Implemented Interfaces:
    org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler

    public class OOXMLWordAndPowerPointTextHandler
    extends org.xml.sax.helpers.DefaultHandler
    This class is intended to handle anything that might contain IBodyElements: main document, headers, footers, notes, slides, etc.

    This class does not generally check for namespaces, and it can be applied to PPTX and DOCX for text extraction.

    This can be used to scrape content from charts. It currently ignores formula (<c:f/>) elements

    This does not work with .xlsx or .vsdx. TODO: move this into POI?

    • Method Detail

      • startDocument

        public void startDocument()
                           throws org.xml.sax.SAXException
        Specified by:
        startDocument in interface org.xml.sax.ContentHandler
        Overrides:
        startDocument in class org.xml.sax.helpers.DefaultHandler
        Throws:
        org.xml.sax.SAXException
      • endDocument

        public void endDocument()
                         throws org.xml.sax.SAXException
        Specified by:
        endDocument in interface org.xml.sax.ContentHandler
        Overrides:
        endDocument in class org.xml.sax.helpers.DefaultHandler
        Throws:
        org.xml.sax.SAXException
      • startPrefixMapping

        public void startPrefixMapping​(java.lang.String prefix,
                                       java.lang.String uri)
                                throws org.xml.sax.SAXException
        Specified by:
        startPrefixMapping in interface org.xml.sax.ContentHandler
        Overrides:
        startPrefixMapping in class org.xml.sax.helpers.DefaultHandler
        Throws:
        org.xml.sax.SAXException
      • endPrefixMapping

        public void endPrefixMapping​(java.lang.String prefix)
                              throws org.xml.sax.SAXException
        Specified by:
        endPrefixMapping in interface org.xml.sax.ContentHandler
        Overrides:
        endPrefixMapping in class org.xml.sax.helpers.DefaultHandler
        Throws:
        org.xml.sax.SAXException
      • startElement

        public void startElement​(java.lang.String uri,
                                 java.lang.String localName,
                                 java.lang.String qName,
                                 org.xml.sax.Attributes atts)
                          throws org.xml.sax.SAXException
        Specified by:
        startElement in interface org.xml.sax.ContentHandler
        Overrides:
        startElement in class org.xml.sax.helpers.DefaultHandler
        Throws:
        org.xml.sax.SAXException
      • endElement

        public void endElement​(java.lang.String uri,
                               java.lang.String localName,
                               java.lang.String qName)
                        throws org.xml.sax.SAXException
        Specified by:
        endElement in interface org.xml.sax.ContentHandler
        Overrides:
        endElement in class org.xml.sax.helpers.DefaultHandler
        Throws:
        org.xml.sax.SAXException
      • characters

        public void characters​(char[] ch,
                               int start,
                               int length)
                        throws org.xml.sax.SAXException
        Specified by:
        characters in interface org.xml.sax.ContentHandler
        Overrides:
        characters in class org.xml.sax.helpers.DefaultHandler
        Throws:
        org.xml.sax.SAXException
      • ignorableWhitespace

        public void ignorableWhitespace​(char[] ch,
                                        int start,
                                        int length)
                                 throws org.xml.sax.SAXException
        Specified by:
        ignorableWhitespace in interface org.xml.sax.ContentHandler
        Overrides:
        ignorableWhitespace in class org.xml.sax.helpers.DefaultHandler
        Throws:
        org.xml.sax.SAXException