Package org.apache.tika.parser.pdf
Class PDFMarkedContent2XHTML
- java.lang.Object
  - org.apache.pdfbox.contentstream.PDFStreamEngine
    - org.apache.pdfbox.text.PDFTextStripper
      - org.apache.tika.parser.pdf.PDFMarkedContent2XHTML
public class PDFMarkedContent2XHTML extends org.apache.pdfbox.text.PDFTextStripper
This was added in Tika 1.24 as an alpha version of a text extractor that builds the text from the marked text tree and includes/normalizes some of the structural tags.
- Since:
- 1.24
Field Summary
Fields
- static java.lang.String XMP_DOCUMENT_CATALOG_LOCATION
- static java.lang.String XMP_PAGE_LOCATION_PREFIX
-
Method Summary
Methods
- int getCurrentPageNo(): We need to override this because we are overriding processPages(PDPageTree).
- int getStartPage()
- static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, org.xml.sax.ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config): Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- void processPage(org.apache.pdfbox.pdmodel.PDPage page)
- void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- void setStartPage(int startPage)
-
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, writeText
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, processOperator, registerOperatorProcessor, restoreGraphicsState, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showForm, showTextString, showTextStrings, showTransparencyGroup, transformedPoint
-
Field Detail
-
XMP_DOCUMENT_CATALOG_LOCATION
public static final java.lang.String XMP_DOCUMENT_CATALOG_LOCATION
- See Also:
- Constant Field Values
-
XMP_PAGE_LOCATION_PREFIX
public static final java.lang.String XMP_PAGE_LOCATION_PREFIX
- See Also:
- Constant Field Values
-
-
Method Detail
-
process
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, org.xml.sax.ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws org.xml.sax.SAXException, TikaException
Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- Parameters:
- pdDocument - PDF document
- handler - SAX content handler
- metadata - PDF metadata
- Throws:
- org.xml.sax.SAXException - if the content handler fails to process SAX events
- TikaException - if there was an exception outside of per-page processing
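Because process reports its output as SAX events rather than returning a string, the caller supplies the org.xml.sax.ContentHandler that decides what to keep. The following is a minimal sketch of such a handler, using only JDK classes. The class name ParagraphCollector is hypothetical, and the assumption that paragraph text arrives inside XHTML p elements is made for illustration only; the main method feeds the handler hand-written events, not events produced by Tika.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): collects the text of each <p> element.
public class ParagraphCollector extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <p> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(localName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text outside of <p> is ignored in this sketch.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName) && current != null) {
            paragraphs.add(current.toString());
            current = null;
        }
    }

    public List<String> getParagraphs() {
        return paragraphs;
    }

    public static void main(String[] args) throws SAXException {
        // Simulated events; in real use the handler would be passed to process(...).
        String xhtml = "http://www.w3.org/1999/xhtml";
        ParagraphCollector handler = new ParagraphCollector();
        handler.startDocument();
        handler.startElement(xhtml, "p", "p", new AttributesImpl());
        char[] text = "Hello".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement(xhtml, "p", "p");
        handler.endDocument();
        System.out.println(handler.getParagraphs());
    }
}
```

A handler like this could be passed as the handler argument of process; which elements actually appear in the event stream depends on the document's structure tags and on the parser configuration, so the element name checked here may need adjusting.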
-
processPage
public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws java.io.IOException
- Overrides:
- processPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
- java.io.IOException
-
getCurrentPageNo
public int getCurrentPageNo()
We need to override this because we are overriding processPages(PDPageTree).
-
setStartBookmark
public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- Overrides:
- setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper
-
setEndBookmark
public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- Overrides:
- setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper
-
setStartPage
public void setStartPage(int startPage)
- Overrides:
- setStartPage in class org.apache.pdfbox.text.PDFTextStripper
-
getStartPage
public int getStartPage()
- Overrides:
- getStartPage in class org.apache.pdfbox.text.PDFTextStripper
-
-