Class PDFMarkedContent2XHTML


  • public class PDFMarkedContent2XHTML
    extends org.apache.pdfbox.text.PDFTextStripper

    Added in Tika 1.24 as an alpha version of a text extractor. It builds the text from the marked text tree and includes or normalizes some of the structural tags.

    Since:
    1.24
    • Method Summary

      Modifier and Type Method Description
      int getCurrentPageNo()
      Overridden because this class overrides processPages(PDPageTree).
      int getStartPage()
      static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, org.xml.sax.ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config)
      Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
      void processPage(org.apache.pdfbox.pdmodel.PDPage page)
      void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
      void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
      void setStartPage(int startPage)
      • Methods inherited from class org.apache.pdfbox.text.PDFTextStripper

        getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, writeText
      • Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine

        addOperator, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, processOperator, registerOperatorProcessor, restoreGraphicsState, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showForm, showTextString, showTextStrings, showTransparencyGroup, transformedPoint
      • Methods inherited from class java.lang.Object

        equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • XMP_DOCUMENT_CATALOG_LOCATION

        public static final java.lang.String XMP_DOCUMENT_CATALOG_LOCATION
        See Also:
        Constant Field Values
      • XMP_PAGE_LOCATION_PREFIX

        public static final java.lang.String XMP_PAGE_LOCATION_PREFIX
        See Also:
        Constant Field Values
    • Method Detail

      • process

        public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument,
                                   org.xml.sax.ContentHandler handler,
                                   ParseContext context,
                                   Metadata metadata,
                                   PDFParserConfig config)
                            throws org.xml.sax.SAXException,
                                   TikaException
        Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
        Parameters:
        pdDocument - PDF document
        handler - SAX content handler
        context - parse context
        metadata - PDF metadata
        config - PDF parser configuration
        Throws:
        org.xml.sax.SAXException - if the content handler fails to process SAX events
        TikaException - if an exception occurs outside of per-page processing
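
The handler argument is a plain org.xml.sax.ContentHandler, so any SAX handler can consume the XHTML events. The sketch below uses only JDK classes. TextCollector is an illustrative handler, and FakeXhtmlSource is a hypothetical stand-in that emits a few XHTML-style events so the handler can be exercised without a PDF. It does not reproduce the events Tika actually emits.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative handler: collects character data and counts <p> elements.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int paragraphs = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(localName)) {
            paragraphs++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName)) {
            text.append('\n'); // one line per paragraph
        }
    }

    String text() { return text.toString(); }

    int paragraphs() { return paragraphs; }
}

// Hypothetical stand-in for the event source (NOT a Tika class).
// With Tika on the classpath, the handler would instead be passed to
// PDFMarkedContent2XHTML.process(pdDocument, handler, context, metadata, config).
class FakeXhtmlSource {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    static void emit(ContentHandler handler, String... paragraphs) throws SAXException {
        handler.startDocument();
        handler.startElement(XHTML, "body", "body", new AttributesImpl());
        for (String p : paragraphs) {
            handler.startElement(XHTML, "p", "p", new AttributesImpl());
            char[] chars = p.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endDocument();
    }
}

class SaxEventSketch {
    public static void main(String[] args) throws SAXException {
        TextCollector collector = new TextCollector();
        FakeXhtmlSource.emit(collector, "First paragraph", "Second paragraph");
        System.out.print(collector.text());
    }
}
```

Because process declares org.xml.sax.SAXException, a handler that throws from one of its callbacks surfaces that failure to the caller.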
      • processPage

        public void processPage(org.apache.pdfbox.pdmodel.PDPage page)
                         throws java.io.IOException
        Overrides:
        processPage in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        java.io.IOException
      • getCurrentPageNo

        public int getCurrentPageNo()
        Overridden because this class overrides processPages(PDPageTree).
        Returns:
        the current page number
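
The note on getCurrentPageNo() ties the override to the overridden processPages(PDPageTree). One plausible reading, sketched below, is that a base class tracks the page number in state that only its own page loop updates, so a subclass that replaces the loop also has to replace the getter. All classes here are hypothetical stand-ins, not the PDFBox or Tika classes, and the int page count is a simplification of a page tree.

```java
// Hypothetical stand-ins: these are NOT the PDFBox or Tika classes.
class BaseStripper {
    // Private, so a subclass cannot update it directly.
    private int currentPageNo = 0;

    void processPages(int pageCount) {
        for (int i = 0; i < pageCount; i++) {
            currentPageNo++;
            processPage();
        }
    }

    void processPage() {
        // Page-level work would go here.
    }

    int getCurrentPageNo() {
        return currentPageNo;
    }
}

// Replaces the page loop but not the getter: the reported page number stays at 0.
class LoopOnlyStripper extends BaseStripper {
    @Override
    void processPages(int pageCount) {
        for (int i = 0; i < pageCount; i++) {
            processPage();
        }
    }
}

// Replaces the page loop and the getter together, keeping its own counter.
class LoopAndGetterStripper extends BaseStripper {
    private int pageNo = 0;

    @Override
    void processPages(int pageCount) {
        for (int i = 0; i < pageCount; i++) {
            pageNo++;
            processPage();
        }
    }

    @Override
    int getCurrentPageNo() {
        return pageNo;
    }
}

class PageCounterSketch {
    public static void main(String[] args) {
        BaseStripper stale = new LoopOnlyStripper();
        stale.processPages(3);
        BaseStripper fixed = new LoopAndGetterStripper();
        fixed.processPages(3);
        System.out.println(stale.getCurrentPageNo() + " vs " + fixed.getCurrentPageNo());
    }
}
```

In this sketch the subclass that overrides only the loop reports page 0 after three pages, while the one that overrides both reports page 3.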
      • setStartBookmark

        public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
        Overrides:
        setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper
      • setEndBookmark

        public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
        Overrides:
        setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper
      • setStartPage

        public void setStartPage(int startPage)
        Overrides:
        setStartPage in class org.apache.pdfbox.text.PDFTextStripper
      • getStartPage

        public int getStartPage()
        Overrides:
        getStartPage in class org.apache.pdfbox.text.PDFTextStripper