Package org.apache.poi.hwpf.extractor
Class WordExtractor
- java.lang.Object
  - org.apache.poi.extractor.POITextExtractor
    - org.apache.poi.extractor.POIOLE2TextExtractor
      - org.apache.poi.hwpf.extractor.WordExtractor
- All Implemented Interfaces: java.io.Closeable, java.lang.AutoCloseable
public final class WordExtractor extends POIOLE2TextExtractor
Class to extract the text from a Word Document. You should use either getParagraphText() or getText() unless you have a strong reason otherwise.

Constructor Summary
- WordExtractor(java.io.InputStream is): Create a new Word Extractor
- WordExtractor(HWPFDocument doc): Create a new Word Extractor
- WordExtractor(DirectoryNode dir)
- WordExtractor(POIFSFileSystem fs): Create a new Word Extractor

Method Summary
- java.lang.String[] getCommentsText()
- java.lang.String[] getEndnoteText()
- java.lang.String getFooterText(): Deprecated. 3.8 beta 4
- java.lang.String[] getFootnoteText()
- java.lang.String getHeaderText(): Deprecated. 3.8 beta 4
- java.lang.String[] getMainTextboxText()
- java.lang.String[] getParagraphText(): Get the text from the word file, as an array with one String per paragraph
- java.lang.String getText(): Grab the text, based on the WordToTextConverter.
- java.lang.String getTextFromPieces(): Grab the text out of the text pieces.
- static void main(java.lang.String[] args): Command line extractor, so people will stop moaning that they can't just run this.
- static java.lang.String stripFields(java.lang.String text): Removes any fields (eg macros, page markers etc) from the string.

Methods inherited from class org.apache.poi.extractor.POIOLE2TextExtractor
getDocSummaryInformation, getDocument, getMetadataTextExtractor, getRoot, getSummaryInformation

Methods inherited from class org.apache.poi.extractor.POITextExtractor
close, setFilesystem

Constructor Detail

WordExtractor
public WordExtractor(java.io.InputStream is) throws java.io.IOException
Create a new Word Extractor
- Parameters: is - InputStream containing the word file
- Throws: java.io.IOException

WordExtractor
public WordExtractor(POIFSFileSystem fs) throws java.io.IOException
Create a new Word Extractor
- Parameters: fs - POIFSFileSystem containing the word file
- Throws: java.io.IOException

WordExtractor
public WordExtractor(DirectoryNode dir) throws java.io.IOException
- Throws: java.io.IOException

WordExtractor
public WordExtractor(HWPFDocument doc)
Create a new Word Extractor
- Parameters: doc - The HWPFDocument to extract from

Method Detail

main
public static void main(java.lang.String[] args) throws java.io.IOException
Command line extractor, so people will stop moaning that they can't just run this.
- Throws: java.io.IOException

getParagraphText
public java.lang.String[] getParagraphText()
Get the text from the word file, as an array with one String per paragraph
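
Because the result holds one String per paragraph, callers often tidy the array before using it. The sketch below is a hypothetical helper (ParagraphCleaner and cleanParagraphs are illustrative names, not part of POI). It assumes the array came from getParagraphText() and that entries may end in carriage-return or line-feed characters, which it strips before dropping empty paragraphs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of POI.
final class ParagraphCleaner {

    // Assumption: 'paragraphs' is the array returned by
    // WordExtractor.getParagraphText(), for example:
    //   String[] paragraphs = extractor.getParagraphText();
    static List<String> cleanParagraphs(String[] paragraphs) {
        List<String> cleaned = new ArrayList<>();
        for (String paragraph : paragraphs) {
            if (paragraph == null) {
                continue;
            }
            // Strip trailing CR/LF characters only; inner text is left untouched.
            int end = paragraph.length();
            while (end > 0
                    && (paragraph.charAt(end - 1) == '\r' || paragraph.charAt(end - 1) == '\n')) {
                end--;
            }
            String text = paragraph.substring(0, end);
            if (!text.isEmpty()) {
                cleaned.add(text);
            }
        }
        return cleaned;
    }
}
```

Returning a List rather than an array is a design choice of this sketch: it lets empty paragraphs be dropped without a second pass to size the result.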

getFootnoteText
public java.lang.String[] getFootnoteText()

getMainTextboxText
public java.lang.String[] getMainTextboxText()

getEndnoteText
public java.lang.String[] getEndnoteText()

getCommentsText
public java.lang.String[] getCommentsText()

getHeaderText
@Deprecated public java.lang.String getHeaderText()
Deprecated. 3.8 beta 4
Grab the text from the headers

getFooterText
@Deprecated public java.lang.String getFooterText()
Deprecated. 3.8 beta 4
Grab the text from the footers

getTextFromPieces
public java.lang.String getTextFromPieces()
Grab the text out of the text pieces. Might also include various bits of crud, but will work in cases where the text piece -> paragraph mapping is broken. Fast too.

getText
public java.lang.String getText()
Grab the text, based on the WordToTextConverter. Shouldn't include any crud, but slower than getTextFromPieces().
- Specified by: getText in class POITextExtractor
- Returns: All the text from the document
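
The trade-off between getText() and getTextFromPieces() suggests a fallback pattern: try the cleaner extraction first and use the piece-based one only if that fails. The sketch below is a hypothetical helper (ExtractionFallback and textWithFallback are illustrative names, not part of POI). It assumes a failed clean extraction surfaces as a RuntimeException or as null or empty text; the documentation above does not specify the failure mode.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not part of POI.
final class ExtractionFallback {

    // 'clean' would typically be extractor::getText and
    // 'raw' would typically be extractor::getTextFromPieces.
    // Assumption: a failed clean extraction throws a RuntimeException
    // or yields null/empty text.
    static String textWithFallback(Supplier<String> clean, Supplier<String> raw) {
        try {
            String text = clean.get();
            if (text != null && !text.isEmpty()) {
                return text;
            }
        } catch (RuntimeException e) {
            // Fall through to the raw extraction below.
        }
        return raw.get();
    }
}
```

With an existing extractor instance, a call could look like textWithFallback(extractor::getText, extractor::getTextFromPieces). Text obtained through the fallback may contain extra content, so it may need further cleaning.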

stripFields
public static java.lang.String stripFields(java.lang.String text)
Removes any fields (e.g. macros, page markers, etc.) from the string.
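
To make "fields" concrete: in the binary Word format a field is delimited by the control characters 0x13 (field begin), 0x14 (field separator) and 0x15 (field end), with the field instruction between begin and separator and the displayed result between separator and end. The standalone sketch below illustrates one way to strip such markers. It is not POI's implementation of stripFields, and FieldStripSketch and stripFieldCodes are hypothetical names. This sketch keeps the displayed result of each field and drops the instruction; the real method may treat nested or malformed fields differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Standalone illustration; NOT POI's implementation of stripFields.
final class FieldStripSketch {
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while still inside its instruction part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        // Number of open fields that are still in their instruction part.
        int insideInstructions = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                insideInstructions++;
            } else if (c == FIELD_SEPARATOR && !openFields.isEmpty()) {
                // The innermost field switches from instruction to result.
                if (openFields.pop()) {
                    insideInstructions--;
                }
                openFields.push(Boolean.FALSE);
            } else if (c == FIELD_END && !openFields.isEmpty()) {
                if (openFields.pop()) {
                    insideInstructions--;
                }
            } else if (insideInstructions == 0) {
                // Ordinary text, or field result text outside any instruction.
                // Unmatched separator/end markers are passed through unchanged.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For the input "Page", a PAGE field whose result is "1", then " of 3", this sketch yields "Page 1 of 3". A field with no separator has no result text and is removed entirely.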