OfficeParserConfig (The Adobe AEM Quickstart and Web Application.)

java.lang.Object
- org.apache.tika.parser.microsoft.OfficeParserConfig

All Implemented Interfaces: java.io.Serializable

public class OfficeParserConfig
extends java.lang.Object
implements java.io.Serializable

See Also: Serialized Form

Constructor Summary

Constructors
Constructor and Description

OfficeParserConfig()

Method Summary

Modifier and Type	Method and Description
`boolean`	`getConcatenatePhoneticRuns()`
`java.lang.String`	`getDateFormatOverride()`
`boolean`	`getExtractAllAlternativesFromMSG()`
`boolean`	`getExtractMacros()`
`boolean`	`getIncludeDeletedContent()`
`boolean`	`getIncludeHeadersAndFooters()`
`boolean`	`getIncludeMissingRows()`
`boolean`	`getIncludeMoveFromContent()`
`boolean`	`getIncludeShapeBasedContent()`
`boolean`	`getIncludeSlideMasterContent()`
`boolean`	`getIncludeSlideNotes()`
`boolean`	`getUseSAXDocxExtractor()`
`boolean`	`getUseSAXPptxExtractor()`
`void`	`setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)` Microsoft Excel files can sometimes contain phonetic (furigana) strings.
`void`	`setDateOverrideFormat(java.lang.String format)` A user may wish to override the date formats in xls and xlsx files.
`void`	`setExtractAllAlternativesFromMSG(boolean extractAllAlternativesFromMSG)` Some .msg files can contain body content in html, rtf and/or text.
`void`	`setExtractMacros(boolean extractMacros)` Sets whether or not MSOffice parsers should extract macros.
`void`	`setIncludeDeletedContent(boolean includeDeletedContent)` Sets whether or not the parser should include deleted content.
`void`	`setIncludeHeadersAndFooters(boolean includeHeadersAndFooters)` Whether or not to include headers and footers.
`void`	`setIncludeMissingRows(boolean includeMissingRows)` For table-like formats, and tables within other formats, should missing rows in sparse tables be output where detected? The default is to output only rows defined within the file, which avoids lots of blank lines but means the layout isn't preserved.
`void`	`setIncludeMoveFromContent(boolean includeMoveFromContent)` With track changes on, when a section is moved, the content is stored in both the "moveFrom" section and in the "moveTo" section.
`void`	`setIncludeShapeBasedContent(boolean includeShapeBasedContent)` In Excel and Word, there can be text stored within drawing shapes.
`void`	`setIncludeSlideMasterContent(boolean includeSlideMasterContent)` Whether or not to include contents from any of the three types of masters -- slide, notes, handout -- in a .ppt or ppt[xm] file.
`void`	`setIncludeSlideNotes(boolean includeSlideNotes)` Whether or not to process slide notes content.
`void`	`setUseSAXDocxExtractor(boolean useSAXDocxExtractor)` Use the experimental SAX-based streaming DOCX parser? If set to `false`, the classic parser will be used; if `true`, the new experimental parser will be used.
`void`	`setUseSAXPptxExtractor(boolean useSAXPptxExtractor)` Use the experimental SAX-based streaming PPTX parser? If set to `false`, the classic parser will be used; if `true`, the new experimental parser will be used.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
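The setters summarized above are typically applied by handing a configured instance to a parser through Tika's `ParseContext`. The following is a minimal sketch of that pattern, not a definitive recipe: `ParseContext` and its `set`/`get` methods come from Tika core rather than from this page, the class name `OfficeParserConfigSketch` is hypothetical, and the chosen flag values are purely illustrative. It assumes the Tika core and parser jars are on the classpath.

```java
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParserConfig;

// Hypothetical helper class, used only for illustration.
final class OfficeParserConfigSketch {

    // Builds a ParseContext that carries an OfficeParserConfig
    // with a few non-default settings.
    static ParseContext buildContext() {
        OfficeParserConfig config = new OfficeParserConfig();
        config.setExtractMacros(true);              // documented default is false
        config.setIncludeHeadersAndFooters(false);  // documented default is true
        config.setUseSAXDocxExtractor(true);        // switch to the streaming DOCX parser
        config.setIncludeDeletedContent(true);      // only honored by the streaming DOCX parser

        // Assumption: parsers look the config up by class in the ParseContext.
        ParseContext context = new ParseContext();
        context.set(OfficeParserConfig.class, config);
        return context;
    }

    public static void main(String[] args) {
        ParseContext context = buildContext();
        OfficeParserConfig config = context.get(OfficeParserConfig.class);
        System.out.println("extractMacros=" + config.getExtractMacros());
        System.out.println("includeHeadersAndFooters=" + config.getIncludeHeadersAndFooters());
        System.out.println("useSAXDocxExtractor=" + config.getUseSAXDocxExtractor());
    }
}
```

The resulting context would then be passed as the `ParseContext` argument of a parser's `parse` call; that step is omitted here to keep the sketch free of file handling.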

- Constructor Detail
  - OfficeParserConfig
```
public OfficeParserConfig()
```
- Method Detail
  - setExtractMacros
```
public void setExtractMacros(boolean extractMacros)
```
    Sets whether or not MSOffice parsers should extract macros. As of Tika 1.15, the default is false.
    
    Parameters:
    
    extractMacros -
  - getExtractMacros
```
public boolean getExtractMacros()
```
    Returns:
    
    whether or not to extract macros
  - setIncludeDeletedContent
```
public void setIncludeDeletedContent(boolean includeDeletedContent)
```
    Sets whether or not the parser should include deleted content.
    This has so far only been implemented in the streaming docx parser (SXWPFWordExtractorDecorator).
    
    Parameters:
    
    includeDeletedContent -
  - getIncludeDeletedContent
```
public boolean getIncludeDeletedContent()
```
  - setIncludeMoveFromContent
```
public void setIncludeMoveFromContent(boolean includeMoveFromContent)
```
    With track changes on, when a section is moved, the content is stored in both the "moveFrom" section and in the "moveTo" section.
    If you'd like to include the section both in its original location (moveFrom) and in its new location (moveTo), set this to true.
    Default: false
    This has so far only been implemented in the streaming docx parser (SXWPFWordExtractorDecorator).
    
    Parameters:
    
    includeMoveFromContent -
  - getIncludeMoveFromContent
```
public boolean getIncludeMoveFromContent()
```
  - setIncludeShapeBasedContent
```
public void setIncludeShapeBasedContent(boolean includeShapeBasedContent)
```
    In Excel and Word, there can be text stored within drawing shapes. (In PowerPoint, everything is in a Shape.)
    If you'd like to skip processing these shapes when looking for text, set this to false.
    Default: true
    
    Parameters:
    
    includeShapeBasedContent -
  - getIncludeShapeBasedContent
```
public boolean getIncludeShapeBasedContent()
```
  - setIncludeHeadersAndFooters
```
public void setIncludeHeadersAndFooters(boolean includeHeadersAndFooters)
```
    Whether or not to include headers and footers.
    This only operates on headers and footers in Word and Excel, not on master slide content in PowerPoint.
    Default: true
    
    Parameters:
    
    includeHeadersAndFooters -
  - getIncludeHeadersAndFooters
```
public boolean getIncludeHeadersAndFooters()
```
  - getUseSAXDocxExtractor
```
public boolean getUseSAXDocxExtractor()
```
  - setUseSAXDocxExtractor
```
public void setUseSAXDocxExtractor(boolean useSAXDocxExtractor)
```
    Use the experimental SAX-based streaming DOCX parser? If set to false, the classic parser will be used; if true, the new experimental parser will be used.
    Default: false (classic DOM parser)
    
    Parameters:
    
    useSAXDocxExtractor -
  - setUseSAXPptxExtractor
```
public void setUseSAXPptxExtractor(boolean useSAXPptxExtractor)
```
    Use the experimental SAX-based streaming PPTX parser? If set to false, the classic parser will be used; if true, the new experimental parser will be used.
    Default: false (classic DOM parser)
    
    Parameters:
    
    useSAXPptxExtractor -
  - getUseSAXPptxExtractor
```
public boolean getUseSAXPptxExtractor()
```
  - getConcatenatePhoneticRuns
```
public boolean getConcatenatePhoneticRuns()
```
  - setConcatenatePhoneticRuns
```
public void setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)
```
    Microsoft Excel files can sometimes contain phonetic (furigana) strings. See PHONETIC. This sets whether or not the parser will concatenate the phonetic runs to the original text.
    This is currently only supported by the xls and xlsx parsers (not the xlsb parser), and the default is true.
    
    Parameters:
    
    concatenatePhoneticRuns -
  - setExtractAllAlternativesFromMSG
```
public void setExtractAllAlternativesFromMSG(boolean extractAllAlternativesFromMSG)
```
    Some .msg files can contain body content in html, rtf and/or text. The default behavior is to pick the first non-null value and include only that. If you'd like to extract all non-null body content, which is likely duplicative, set this value to true.
    
    Parameters:
    
    extractAllAlternativesFromMSG - whether or not to extract all alternative parts
    
    Since:
    
    1.17
  - getExtractAllAlternativesFromMSG
```
public boolean getExtractAllAlternativesFromMSG()
```
  - setIncludeMissingRows
```
public void setIncludeMissingRows(boolean includeMissingRows)
```
    For table-like formats, and tables within other formats, should missing rows in sparse tables be output where detected? The default is to output only rows defined within the file, which avoids lots of blank lines but means the layout isn't preserved.
  - getIncludeMissingRows
```
public boolean getIncludeMissingRows()
```
  - getIncludeSlideNotes
```
public boolean getIncludeSlideNotes()
```
  - setIncludeSlideNotes
```
public void setIncludeSlideNotes(boolean includeSlideNotes)
```
    Whether or not to process slide notes content. If set to false, the parser will skip the text content and all embedded objects from the slide notes in ppt and ppt[xm]. The default is true.
    
    Parameters:
    
    includeSlideNotes - whether or not to process slide notes
    
    Since:
    
    1.19.1
  - getIncludeSlideMasterContent
```
public boolean getIncludeSlideMasterContent()
```
    Returns:
    
    whether or not to process content in slide masters
    
    Since:
    
    1.19.1
  - setIncludeSlideMasterContent
```
public void setIncludeSlideMasterContent(boolean includeSlideMasterContent)
```
    Whether or not to include contents from any of the three types of masters -- slide, notes, handout -- in a .ppt or ppt[xm] file. If set to false, the parser will not extract text or embedded objects from any of the masters.
    
    Parameters:
    
    includeSlideMasterContent -
    
    Since:
    
    1.19.1
  - getDateFormatOverride
```
public java.lang.String getDateFormatOverride()
```
  - setDateOverrideFormat
```
public void setDateOverrideFormat(java.lang.String format)
```
    A user may wish to override the date formats in xls and xlsx files. For example, a user might prefer 'yyyy-mm-dd' to 'mm/dd/yy'. Note: these are Excel format strings, not Java SimpleDateFormat patterns.
    
    Parameters:
    
    format -
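The note on `setDateOverrideFormat` matters because the same string means different things in the two syntaxes. As a small illustration using only the JDK (no Tika calls), the sketch below shows what Java's `SimpleDateFormat` does with the Excel-style string 'yyyy-mm-dd': lowercase `mm` is minutes in Java, so the month is silently lost. The class and method names here are hypothetical.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical demo class: contrasts an Excel-style format string
// with how Java's SimpleDateFormat interprets the same characters.
final class DatePatternPitfall {

    // Formats a date with a SimpleDateFormat pattern in the default time zone.
    static String formatWithJava(String pattern, Date date) {
        return new SimpleDateFormat(pattern, Locale.ROOT).format(date);
    }

    // 5 March 2021, 14:07:00 in the JVM default time zone.
    static Date sampleDate() {
        return new GregorianCalendar(2021, Calendar.MARCH, 5, 14, 7, 0).getTime();
    }

    public static void main(String[] args) {
        Date date = sampleDate();
        // In SimpleDateFormat, "mm" is minutes, so this prints 2021-07-05.
        System.out.println(formatWithJava("yyyy-mm-dd", date));
        // The Java spelling of year-month-day uses "MM"; prints 2021-03-05.
        System.out.println(formatWithJava("yyyy-MM-dd", date));
    }
}
```

When calling `setDateOverrideFormat`, the string should therefore be written the way it would appear in Excel (as in the 'yyyy-mm-dd' example above), not translated into a Java pattern.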
