Package org.apache.tika.parser.pdf
Class PDFMarkedContent2XHTML
java.lang.Object
  org.apache.pdfbox.contentstream.PDFStreamEngine
    org.apache.pdfbox.text.PDFTextStripper
      org.apache.tika.parser.pdf.PDFMarkedContent2XHTML
public class PDFMarkedContent2XHTML extends org.apache.pdfbox.text.PDFTextStripper
Added in Tika 1.24 as an alpha version of a text extractor. It builds the text from the PDF's marked content tree and includes, in normalized form, some of the structural tags.
- Since:
- 1.24
-
-
Field Summary
Modifier and Type, Field:
static java.lang.String XMP_DOCUMENT_CATALOG_LOCATION
static java.lang.String XMP_PAGE_LOCATION_PREFIX
-
Method Summary
Modifier and Type, Method, Description:
int getCurrentPageNo()
We need to override this because we are overriding processPages(PDPageTree).
int getStartPage()
static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, org.xml.sax.ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config)
Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
void processPage(org.apache.pdfbox.pdmodel.PDPage page)
void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
void setStartPage(int startPage)
-
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, writeText
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, processOperator, registerOperatorProcessor, restoreGraphicsState, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showForm, showTextString, showTextStrings, showTransparencyGroup, transformedPoint
-
-
-
-
Field Detail
-
XMP_DOCUMENT_CATALOG_LOCATION
public static final java.lang.String XMP_DOCUMENT_CATALOG_LOCATION
- See Also:
- Constant Field Values
-
XMP_PAGE_LOCATION_PREFIX
public static final java.lang.String XMP_PAGE_LOCATION_PREFIX
- See Also:
- Constant Field Values
-
-
Method Detail
-
process
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, org.xml.sax.ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws org.xml.sax.SAXException, TikaException
Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- Parameters:
pdDocument - PDF document
handler - SAX content handler
metadata - PDF metadata
- Throws:
org.xml.sax.SAXException - if the content handler fails to process SAX events
TikaException - if there was an exception outside of per-page processing
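As a rough illustration of what "a stream of XHTML SAX events sent to the given content handler" means for a caller, the sketch below defines a hypothetical handler, TagRecordingHandler (not part of Tika or PDFBox), that records element names and text. The events fired in main are hand-made stand-ins so the example runs with the JDK alone; they are an assumption and do not show which elements this class actually emits.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

import java.util.ArrayList;
import java.util.List;

/** Hypothetical handler: records element names and character data from SAX events. */
class TagRecordingHandler extends DefaultHandler {
    private final List<String> elements = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    // Prefer the local name; fall back to the qualified name when it is missing.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements.add(name(localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Separate paragraphs with a newline so the collected text stays readable.
        if ("p".equals(name(localName, qName))) {
            text.append('\n');
        }
    }

    public List<String> getElements() {
        return elements;
    }

    public String getText() {
        return text.toString();
    }

    public static void main(String[] args) {
        TagRecordingHandler handler = new TagRecordingHandler();
        Attributes none = new AttributesImpl();
        String xhtml = "http://www.w3.org/1999/xhtml";

        // Hand-made events standing in for extractor output (assumption, not real Tika output).
        handler.startElement(xhtml, "p", "p", none);
        char[] chars = "Hello".toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(xhtml, "p", "p");

        System.out.println(handler.getElements());
        System.out.print(handler.getText());
    }
}
```

Because the class extends DefaultHandler, an instance of it is an org.xml.sax.ContentHandler and could be supplied as the handler argument of process(...); the other arguments are not covered by this sketch.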
-
processPage
public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws java.io.IOException
- Overrides:
processPage
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
java.io.IOException
-
getCurrentPageNo
public int getCurrentPageNo()
We need to override this because we are overriding processPages(PDPageTree).
-
setStartBookmark
public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- Overrides:
setStartBookmark
in class org.apache.pdfbox.text.PDFTextStripper
-
setEndBookmark
public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
- Overrides:
setEndBookmark
in class org.apache.pdfbox.text.PDFTextStripper
-
setStartPage
public void setStartPage(int startPage)
- Overrides:
setStartPage
in class org.apache.pdfbox.text.PDFTextStripper
-
getStartPage
public int getStartPage()
- Overrides:
getStartPage
in class org.apache.pdfbox.text.PDFTextStripper
-
-