public class PDFMarkedContent2XHTML
extends org.apache.pdfbox.text.PDFTextStripper
Added in Tika 1.24 as an alpha version of a text extractor. It builds the text from the marked content tree and includes (and normalizes) some of the structural tags.
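The static process(...) method documented on this page delivers its output as SAX events to a caller-supplied org.xml.sax.ContentHandler. The sketch below shows one way such a handler might collect element names and text. XhtmlEventCollector, XhtmlEventDemo, and emitSampleEvents are hypothetical names, not part of Tika. The event stream is hand-written so the example runs with only the JDK, and it makes no claim about which tags the extractor actually emits.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): records the element names and the
// character data that a SAX producer sends to it.
class XhtmlEventCollector extends DefaultHandler {
    private final List<String> elements = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements.add(localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    List<String> getElements() { return elements; }

    String getText() { return text.toString(); }
}

class XhtmlEventDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Stand-in for the extractor: emits a tiny, hand-written event stream.
    // Assumption: with Tika and PDFBox on the classpath, this call would be
    // replaced by
    // PDFMarkedContent2XHTML.process(pdDocument, handler, context, metadata, config).
    static void emitSampleEvents(ContentHandler handler) throws SAXException {
        Attributes none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        handler.startElement(XHTML, "p", "p", none);
        char[] chars = "Hello, tagged PDF".toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML, "p", "p");
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    public static void main(String[] args) throws SAXException {
        XhtmlEventCollector collector = new XhtmlEventCollector();
        emitSampleEvents(collector);
        System.out.println(collector.getElements());
        System.out.println(collector.getText());
    }
}
```

Because the collector only implements the standard SAX callbacks, the same handler could be passed to any producer of SAX events. It is a convenience for inspection, not a required part of the API.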
Modifier and Type | Field and Description |
---|---|
static java.lang.String | XMP_DOCUMENT_CATALOG_LOCATION |
static java.lang.String | XMP_PAGE_LOCATION_PREFIX |
Modifier and Type | Method and Description |
---|---|
int | getCurrentPageNo() We need to override this because we are overriding processPages(PDPageTree). |
int | getStartPage() |
static void | process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, org.xml.sax.ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler. |
void | processPage(org.apache.pdfbox.pdmodel.PDPage page) |
void | setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) |
void | setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) |
void | setStartPage(int startPage) |
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper:
getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, writeText
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine:
addOperator, beginText, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getResources, getTextLineMatrix, getTextMatrix, processOperator, registerOperatorProcessor, restoreGraphicsState, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showForm, showTextString, showTextStrings, showTransparencyGroup, transformedPoint
public static final java.lang.String XMP_DOCUMENT_CATALOG_LOCATION
public static final java.lang.String XMP_PAGE_LOCATION_PREFIX
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, org.xml.sax.ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws org.xml.sax.SAXException, TikaException
Parameters:
pdDocument - PDF document
handler - SAX content handler
metadata - PDF metadata
Throws:
org.xml.sax.SAXException - if the content handler fails to process SAX events
TikaException - if there was an exception outside of per page processing
public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws java.io.IOException
Overrides:
processPage in class org.apache.pdfbox.text.PDFTextStripper
Throws:
java.io.IOException
public int getCurrentPageNo()
We need to override this because we are overriding processPages(PDPageTree).
Overrides:
getCurrentPageNo in class org.apache.pdfbox.text.PDFTextStripper
public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
Overrides:
setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper
public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
Overrides:
setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper
public void setStartPage(int startPage)
Overrides:
setStartPage in class org.apache.pdfbox.text.PDFTextStripper
public int getStartPage()
Overrides:
getStartPage in class org.apache.pdfbox.text.PDFTextStripper
Copyright © 2010 - 2020 Adobe. All Rights Reserved