Package org.apache.poi.hwpf.extractor
Class WordExtractor
- java.lang.Object
  - org.apache.poi.extractor.POITextExtractor
    - org.apache.poi.extractor.POIOLE2TextExtractor
      - org.apache.poi.hwpf.extractor.WordExtractor
- All Implemented Interfaces:
 java.io.Closeable, java.lang.AutoCloseable
public final class WordExtractor extends POIOLE2TextExtractor
Extracts the text from a Word document. Use either getParagraphText() or getText() unless you have a strong reason to do otherwise.
Constructor Summary
- WordExtractor(java.io.InputStream is): Create a new Word Extractor
- WordExtractor(HWPFDocument doc): Create a new Word Extractor
- WordExtractor(DirectoryNode dir)
- WordExtractor(POIFSFileSystem fs): Create a new Word Extractor
Method Summary
- java.lang.String[] getCommentsText()
- java.lang.String[] getEndnoteText()
- java.lang.String getFooterText(): Deprecated. 3.8 beta 4
- java.lang.String[] getFootnoteText()
- java.lang.String getHeaderText(): Deprecated. 3.8 beta 4
- java.lang.String[] getMainTextboxText()
- java.lang.String[] getParagraphText(): Get the text from the word file, as an array with one String per paragraph
- java.lang.String getText(): Grab the text, based on the WordToTextConverter.
- java.lang.String getTextFromPieces(): Grab the text out of the text pieces.
- static void main(java.lang.String[] args): Command line extractor, so people will stop moaning that they can't just run this.
- static java.lang.String stripFields(java.lang.String text): Removes any fields (eg macros, page markers etc) from the string.
Methods inherited from class org.apache.poi.extractor.POIOLE2TextExtractor
getDocSummaryInformation, getDocument, getMetadataTextExtractor, getRoot, getSummaryInformation 
Methods inherited from class org.apache.poi.extractor.POITextExtractor
close, setFilesystem 
Constructor Detail
WordExtractor
public WordExtractor(java.io.InputStream is) throws java.io.IOException
Create a new Word Extractor
- Parameters:
 is - InputStream containing the word file
- Throws:
 java.io.IOException

WordExtractor
public WordExtractor(POIFSFileSystem fs) throws java.io.IOException
Create a new Word Extractor
- Parameters:
 fs - POIFSFileSystem containing the word file
- Throws:
 java.io.IOException

WordExtractor
public WordExtractor(DirectoryNode dir) throws java.io.IOException
- Throws:
 java.io.IOException

WordExtractor
public WordExtractor(HWPFDocument doc)
Create a new Word Extractor
- Parameters:
 doc - The HWPFDocument to extract from
Method Detail
main
public static void main(java.lang.String[] args) throws java.io.IOException
Command line extractor, so people will stop moaning that they can't just run this.
- Throws:
 java.io.IOException

getParagraphText
public java.lang.String[] getParagraphText()
Get the text from the word file, as an array with one String per paragraph
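The array returned by getParagraphText() may need light post-processing before display. The sketch below assumes each paragraph string can end with carriage-return or line-feed characters and that paragraphs left empty after trimming are unwanted. ParagraphCleanup and clean are hypothetical names, not part of POI, and the sample array stands in for real extractor output.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphCleanup {

    /**
     * Strips trailing CR/LF characters from each paragraph and drops
     * paragraphs that are empty afterwards. Hypothetical helper, not part of POI.
     */
    public static List<String> clean(String[] paragraphs) {
        List<String> cleaned = new ArrayList<>();
        for (String p : paragraphs) {
            int end = p.length();
            // Walk back over any trailing line terminators.
            while (end > 0 && (p.charAt(end - 1) == '\r' || p.charAt(end - 1) == '\n')) {
                end--;
            }
            if (end > 0) {
                cleaned.add(p.substring(0, end));
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Stand-in for extractor.getParagraphText(); the values are made up.
        String[] sample = {"First paragraph\r\n", "\r\n", "Second paragraph\r\n"};
        for (String p : clean(sample)) {
            System.out.println(p);
        }
    }
}
```

With a real extractor, the input would be the result of getParagraphText() rather than the sample array.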
getFootnoteText
public java.lang.String[] getFootnoteText()

getMainTextboxText
public java.lang.String[] getMainTextboxText()

getEndnoteText
public java.lang.String[] getEndnoteText()

getCommentsText
public java.lang.String[] getCommentsText()

getHeaderText
@Deprecated public java.lang.String getHeaderText()
Deprecated. 3.8 beta 4
Grab the text from the headers

getFooterText
@Deprecated public java.lang.String getFooterText()
Deprecated. 3.8 beta 4
Grab the text from the footers

getTextFromPieces
public java.lang.String getTextFromPieces()
Grab the text out of the text pieces. Might also include various bits of crud, but will work in cases where the text piece -> paragraph mapping is broken. Fast too.

getText
public java.lang.String getText()
Grab the text, based on the WordToTextConverter. Shouldn't include any crud, but slower than getTextFromPieces().
- Specified by:
 getText in class POITextExtractor
- Returns:
 All the text from the document
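Since getText() gives cleaner output while getTextFromPieces() tolerates a broken text piece -> paragraph mapping, one possible pattern is to try the first and fall back to the second. This is a minimal sketch, assuming a failed extraction surfaces as an unchecked exception (this page does not say how failures are reported). ExtractionFallback and extract are hypothetical names, and the suppliers stand in for the two extractor methods.

```java
import java.util.function.Supplier;

public class ExtractionFallback {

    /**
     * Returns the primary result, or the fallback result if the primary
     * supplier throws an unchecked exception. Hypothetical helper, not part of POI.
     */
    public static String extract(Supplier<String> primary, Supplier<String> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // Assumption: a broken document makes the primary extraction throw.
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        // Simulated extractors; real code would pass method references instead.
        Supplier<String> failing = () -> {
            throw new IllegalStateException("simulated broken paragraph mapping");
        };
        Supplier<String> raw = () -> "text recovered from pieces";
        System.out.println(extract(failing, raw)); // prints "text recovered from pieces"
    }
}
```

With a real extractor this could be called as ExtractionFallback.extract(extractor::getText, extractor::getTextFromPieces).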
 
 
stripFields
public static java.lang.String stripFields(java.lang.String text)
Removes any fields (eg macros, page markers etc) from the string. 
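To illustrate what removing fields means: in the binary Word format, a field is delimited by the control characters 0x13 (field begin), 0x14 (separator between the field code and its displayed result) and 0x15 (field end). The sketch below is a simplified stand-in, not POI's implementation. It assumes fields are not nested, drops the field code and the marker characters, and keeps the displayed result. FieldStripSketch and stripSimpleFields are hypothetical names.

```java
public class FieldStripSketch {

    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    /**
     * Removes field codes and field marker characters from the text,
     * keeping any displayed field result. Assumes fields are not nested.
     */
    public static String stripSimpleFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inFieldCode = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inFieldCode = true;   // start skipping the field code
            } else if (c == FIELD_SEPARATOR || c == FIELD_END) {
                inFieldCode = false;  // result text (if any) is kept
            } else if (!inFieldCode) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A made-up page-number field embedded in ordinary text.
        String raw = "Page \u0013 PAGE \u00141\u0015 of 3";
        System.out.println(stripSimpleFields(raw)); // prints "Page 1 of 3"
    }
}
```

For real documents, WordExtractor.stripFields(text) is the documented entry point. The sketch only shows the kind of marker-delimited content involved.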