Class StandardAnalyzer
- java.lang.Object
  - org.apache.lucene.analysis.Analyzer
    - org.apache.lucene.analysis.util.StopwordAnalyzerBase
      - org.apache.lucene.analysis.standard.StandardAnalyzer

All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable
public final class StandardAnalyzer extends StopwordAnalyzerBase
Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

You must specify the required Version compatibility when creating StandardAnalyzer:
- As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you use a previous version number, you get the exact broken behavior for backwards compatibility.
- As of 3.1, StandardTokenizer implements Unicode text segmentation, and StopFilter correctly handles Unicode 4.0 supplementary characters in stopwords. ClassicTokenizer and ClassicAnalyzer are the pre-3.1 implementations of StandardTokenizer and StandardAnalyzer.
- As of 2.9, StopFilter preserves position increments.
- As of 2.4, Tokens incorrectly identified as acronyms are corrected (see LUCENE-1068).
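The filter chain described above (tokenize, lowercase, drop stop words) can be pictured with a small plain-Java model. This is only a sketch and does not use the Lucene API: the class name `AnalyzerChainSketch`, the method `analyze`, and the contents of `SAMPLE_STOP_WORDS` are all assumptions made for illustration, and the regex split is a crude stand-in for the Unicode text segmentation that StandardTokenizer performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java model of a tokenizer -> lowercase -> stop-filter chain.
// NOT Lucene code: all names here are hypothetical.
final class AnalyzerChainSketch {

    // Illustrative stop list only; not the actual contents of STOP_WORDS_SET.
    static final Set<String> SAMPLE_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of"));

    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        // Stage 1: crude tokenization on runs of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty leading element
            }
            // Stage 2: lowercase the token.
            String lower = raw.toLowerCase(Locale.ROOT);
            // Stage 3: drop the token if it is a stop word.
            if (!stopWords.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, lazy, dog]
        System.out.println(analyze("The Quick Brown Fox and the Lazy Dog", SAMPLE_STOP_WORDS));
    }
}
```

The order of the stages matters in this model: lowercasing before the stop-word lookup is what lets "The" match the lowercase entry "the".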
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.analysis.Analyzer
Analyzer.GlobalReuseStrategy, Analyzer.PerFieldReuseStrategy, Analyzer.ReuseStrategy, Analyzer.TokenStreamComponents
-
-
Field Summary
Fields
static int DEFAULT_MAX_TOKEN_LENGTH
Default maximum allowed token length.
static CharArraySet STOP_WORDS_SET
An unmodifiable set containing some common English words that are usually not useful for searching.
-
Fields inherited from class org.apache.lucene.analysis.Analyzer
GLOBAL_REUSE_STRATEGY, PER_FIELD_REUSE_STRATEGY
-
-
Constructor Summary
Constructors
StandardAnalyzer(Version matchVersion)
Builds an analyzer with the default stop words (STOP_WORDS_SET).
StandardAnalyzer(Version matchVersion, java.io.Reader stopwords)
Builds an analyzer with the stop words from the given reader.
StandardAnalyzer(Version matchVersion, CharArraySet stopWords)
Builds an analyzer with the given stop words.
-
Method Summary
Methods
int getMaxTokenLength()
void setMaxTokenLength(int length)
Set maximum allowed token length.
-
Methods inherited from class org.apache.lucene.analysis.util.StopwordAnalyzerBase
getStopwordSet
-
Methods inherited from class org.apache.lucene.analysis.Analyzer
close, getOffsetGap, getPositionIncrementGap, getReuseStrategy, tokenStream, tokenStream
-
-
-
-
Field Detail
-
DEFAULT_MAX_TOKEN_LENGTH
public static final int DEFAULT_MAX_TOKEN_LENGTH
Default maximum allowed token length.
See Also:
Constant Field Values
-
STOP_WORDS_SET
public static final CharArraySet STOP_WORDS_SET
An unmodifiable set containing some common English words that are usually not useful for searching.
-
-
Constructor Detail
-
StandardAnalyzer
public StandardAnalyzer(Version matchVersion, CharArraySet stopWords)
Builds an analyzer with the given stop words.
Parameters:
matchVersion - Lucene version to match; see the Version compatibility notes in the class description above
stopWords - stop words
-
StandardAnalyzer
public StandardAnalyzer(Version matchVersion)
Builds an analyzer with the default stop words (STOP_WORDS_SET).
Parameters:
matchVersion - Lucene version to match; see the Version compatibility notes in the class description above
-
StandardAnalyzer
public StandardAnalyzer(Version matchVersion, java.io.Reader stopwords) throws java.io.IOException
Builds an analyzer with the stop words from the given reader.
Parameters:
matchVersion - Lucene version to match; see the Version compatibility notes in the class description above
stopwords - Reader to read stop words from
Throws:
java.io.IOException
See Also:
WordlistLoader.getWordSet(Reader, Version)
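As a rough picture of what reading stop words from a Reader involves, the following plain-Java sketch collects one word per line into a set. It does not call Lucene: the class `StopwordReaderSketch` and the method `readStopWords` are hypothetical, and the one-word-per-line format with trimming and blank-line skipping is an assumption of this sketch, not something stated in this section. The actual loading is done by WordlistLoader.getWordSet(Reader, Version), whose documentation is authoritative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Plain-Java model of loading a stop-word list from a Reader.
// NOT Lucene code: names and the assumed file format are hypothetical.
final class StopwordReaderSketch {

    // Assumes one stop word per line; surrounding whitespace is trimmed
    // and blank lines are skipped (choices made by this sketch).
    static Set<String> readStopWords(Reader reader) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader buffered = new BufferedReader(reader)) {
            String line;
            while ((line = buffered.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // prints [the, and, of]
        System.out.println(readStopWords(new StringReader("the\n  and \n\nof\n")));
    }
}
```

Like the constructor it models, the sketch propagates `java.io.IOException` to the caller instead of swallowing read failures.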
-
-
Method Detail
-
setMaxTokenLength
public void setMaxTokenLength(int length)
Set maximum allowed token length. If a token is seen that exceeds this length then it is discarded. This setting only takes effect the next time tokenStream is called.
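The two behaviors described here (overlong tokens are discarded, and a new limit applies only to the next tokenStream call) can be modeled in plain Java as follows. This is a sketch under stated assumptions, not Lucene's implementation: the class `MaxTokenLengthSketch` is hypothetical, its `tokenStream` returns a simple list and splits on whitespace, and the limits used in `main` are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a maximum-token-length setting.
// NOT Lucene code: the class and its whitespace tokenization are hypothetical.
final class MaxTokenLengthSketch {

    private int maxTokenLength;

    MaxTokenLengthSketch(int maxTokenLength) {
        this.maxTokenLength = maxTokenLength;
    }

    void setMaxTokenLength(int length) {
        this.maxTokenLength = length;
    }

    int getMaxTokenLength() {
        return maxTokenLength;
    }

    // The limit is read once, when the "stream" is created, so a later
    // setMaxTokenLength call affects only subsequent tokenStream calls.
    List<String> tokenStream(String text) {
        final int limit = maxTokenLength;
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // Tokens longer than the limit are discarded, not truncated.
            if (!token.isEmpty() && token.length() <= limit) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        MaxTokenLengthSketch sketch = new MaxTokenLengthSketch(5);
        System.out.println(sketch.tokenStream("tiny enormous ok")); // prints [tiny, ok]
        sketch.setMaxTokenLength(10);
        System.out.println(sketch.tokenStream("tiny enormous ok")); // prints [tiny, enormous, ok]
    }
}
```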
-
getMaxTokenLength
public int getMaxTokenLength()
See Also:
setMaxTokenLength(int)
-
-