Class PatternAnalyzer
- java.lang.Object
  - org.apache.lucene.analysis.Analyzer
    - org.apache.lucene.analysis.miscellaneous.PatternAnalyzer
- All Implemented Interfaces: java.io.Closeable, java.lang.AutoCloseable
@Deprecated public final class PatternAnalyzer extends Analyzer
Deprecated. (4.0) Use the pattern-based analysis in the analysis/pattern package instead.
Efficient Lucene analyzer/tokenizer that preferably operates on a String rather than a Reader, that can flexibly separate text into terms via a regular expression Pattern (with behaviour identical to String.split(String)), and that combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, and StopFilter into a single efficient multi-purpose class.
If you are unsure what exactly a regular expression should look like, consider prototyping by simply trying various expressions on some test texts via String.split(String). Once you are satisfied, give that regex to PatternAnalyzer. Also see Java Regular Expression Tutorial.
This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene TokenFilter chain, as in this stemming example:
PatternAnalyzer pat = ...
TokenStream tokenStream = new SnowballFilter(
    pat.tokenStream("content", "James is running round in the woods"),
    "English");
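The prototyping advice above can be tried with nothing but the JDK. The following is a minimal sketch; the class name SplitPrototype, the method name tokens, and the sample texts are illustrative assumptions and are not part of Lucene.

```java
import java.util.Arrays;

// Hypothetical helper for prototyping a delimiter regex before handing it to PatternAnalyzer.
class SplitPrototype {

    // Splits the text with String.split(String), the behaviour the class description refers to.
    static String[] tokens(String regex, String text) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        String text = "James is running round in the woods";
        // Try a candidate expression and inspect the resulting terms.
        System.out.println(Arrays.toString(tokens("\\s+", text)));
        // A second candidate that also treats commas and semicolons as delimiters.
        System.out.println(Arrays.toString(tokens("[\\s,;]+", "a, b;c  d")));
    }
}
```

Once the printed terms look right for representative inputs, the same regex string can be compiled with Pattern.compile and passed to the constructor.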
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.analysis.Analyzer
Analyzer.GlobalReuseStrategy, Analyzer.PerFieldReuseStrategy, Analyzer.ReuseStrategy, Analyzer.TokenStreamComponents
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static PatternAnalyzer DEFAULT_ANALYZER: Deprecated. A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader.
- static PatternAnalyzer EXTENDED_ANALYZER: Deprecated. A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader.
- static java.util.regex.Pattern NON_WORD_PATTERN: Deprecated. "\\W+"; divides text at non-letters (NOT Character.isLetter(c)).
- static java.util.regex.Pattern WHITESPACE_PATTERN: Deprecated. "\\s+"; divides text at whitespace (Character.isWhitespace(c)).
Fields inherited from class org.apache.lucene.analysis.Analyzer
GLOBAL_REUSE_STRATEGY, PER_FIELD_REUSE_STRATEGY
-
-
Constructor Summary
Constructors (Constructor, Description):
- PatternAnalyzer(Version matchVersion, java.util.regex.Pattern pattern, boolean toLowerCase, CharArraySet stopWords): Deprecated. Constructs a new instance with the given parameters.
-
Method Summary
Methods (Modifier and Type, Method, Description):
- Analyzer.TokenStreamComponents createComponents(java.lang.String fieldName, java.io.Reader reader): Deprecated. Creates a token stream that tokenizes all the text in the given Reader; this implementation forwards to tokenStream(String, Reader, String) and is less efficient than tokenStream(String, Reader, String).
- Analyzer.TokenStreamComponents createComponents(java.lang.String fieldName, java.io.Reader reader, java.lang.String text): Deprecated. Creates a token stream that tokenizes the given string into token terms (aka words).
- boolean equals(java.lang.Object other): Deprecated. Indicates whether some other object is "equal to" this one.
- int hashCode(): Deprecated. Returns a hash code value for the object.
Methods inherited from class org.apache.lucene.analysis.Analyzer
close, getOffsetGap, getPositionIncrementGap, getReuseStrategy, tokenStream, tokenStream
-
-
-
-
Field Detail
-
NON_WORD_PATTERN
public static final java.util.regex.Pattern NON_WORD_PATTERN
Deprecated. "\\W+"; divides text at non-letters (NOT Character.isLetter(c)).
-
WHITESPACE_PATTERN
public static final java.util.regex.Pattern WHITESPACE_PATTERN
Deprecated. "\\s+"; divides text at whitespace (Character.isWhitespace(c)).
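To see how the two delimiter expressions differ in practice, the sketch below applies both regex strings to the same text using only java.util.regex. The class name DelimiterComparison and the sample text are assumptions made for illustration; the local patterns merely mirror the documented regex strings and are not the Lucene constants themselves. Note that in java.util.regex, \W by default treats ASCII digits and the underscore as word characters, so its result does not coincide exactly with a Character.isLetter(c) test.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative comparison; these local patterns mirror the documented regex strings.
class DelimiterComparison {
    static final Pattern NON_WORD = Pattern.compile("\\W+");
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Splits the text around matches of the given delimiter pattern.
    static String[] split(Pattern delimiter, String text) {
        return delimiter.split(text);
    }

    public static void main(String[] args) {
        String text = "Hello, world! foo_bar 42";
        // Punctuation is a delimiter here; "foo_bar" and "42" stay intact.
        System.out.println(Arrays.toString(split(NON_WORD, text)));
        // Only whitespace is a delimiter here; punctuation stays attached to the terms.
        System.out.println(Arrays.toString(split(WHITESPACE, text)));
    }
}
```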
-
DEFAULT_ANALYZER
public static final PatternAnalyzer DEFAULT_ANALYZER
Deprecated. A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader.
-
EXTENDED_ANALYZER
public static final PatternAnalyzer EXTENDED_ANALYZER
Deprecated. A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. The stop words are borrowed from http://thomas.loc.gov/home/stopwords.html; see http://thomas.loc.gov/home/all.about.inquery.html
-
-
Constructor Detail
-
PatternAnalyzer
public PatternAnalyzer(Version matchVersion, java.util.regex.Pattern pattern, boolean toLowerCase, CharArraySet stopWords)
Deprecated. Constructs a new instance with the given parameters.
- Parameters:
  matchVersion - currently does nothing
  pattern - a regular expression delimiting tokens
  toLowerCase - if true, returns tokens after applying String.toLowerCase()
  stopWords - if non-null, ignores all tokens that are contained in the given stop set (after previously having applied toLowerCase() if applicable). For example, created via StopFilter.makeStopSet(Version, String[]) and/or WordlistLoader, as in WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt")), or other stop word lists.
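The interplay of the pattern, toLowerCase, and stopWords parameters can be sketched in plain Java. This is a rough illustration of the documented semantics only, not the Lucene implementation: the class PatternSketch and its analyze method are hypothetical, a java.util.Set stands in for CharArraySet, empty tokens are skipped as an assumption of the sketch, and Locale.ROOT is used for lower-casing as a choice of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Plain-JDK sketch of the documented parameter semantics; NOT the Lucene implementation.
class PatternSketch {

    static List<String> analyze(Pattern pattern, boolean toLowerCase,
                                Set<String> stopWords, String text) {
        List<String> terms = new ArrayList<>();
        for (String token : pattern.split(text)) {
            if (token.isEmpty()) {
                continue; // assumption of this sketch: empty tokens are skipped
            }
            String term = toLowerCase ? token.toLowerCase(Locale.ROOT) : token;
            // Stop words are checked after lower-casing, as the stopWords parameter describes.
            if (stopWords != null && stopWords.contains(term)) {
                continue;
            }
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("is", "in", "the"));
        // Prints [james, running, round, woods]
        System.out.println(analyze(Pattern.compile("\\W+"), true, stop,
                "James is running round in the woods"));
    }
}
```

Passing null for the stop set in this sketch disables stop-word filtering, mirroring the "if non-null" wording of the stopWords parameter.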
-
-
Method Detail
-
createComponents
public Analyzer.TokenStreamComponents createComponents(java.lang.String fieldName, java.io.Reader reader, java.lang.String text)
Deprecated. Creates a token stream that tokenizes the given string into token terms (aka words).
- Parameters:
  fieldName - the name of the field to tokenize (currently ignored).
  reader - reader (e.g. charfilter) of the original text; can be null.
  text - the string to tokenize
- Returns:
  a new token stream
-
createComponents
public Analyzer.TokenStreamComponents createComponents(java.lang.String fieldName, java.io.Reader reader)
Deprecated. Creates a token stream that tokenizes all the text in the given Reader; this implementation forwards to tokenStream(String, Reader, String) and is less efficient than tokenStream(String, Reader, String).
- Parameters:
  fieldName - the name of the field to tokenize (currently ignored).
  reader - the reader delivering the text
- Returns:
  a new token stream
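One plausible reason the Reader-based variant is described as less efficient is that a String-oriented tokenizer first has to buffer the whole input. The sketch below shows that extra step with the JDK only; the class ReaderText and its readAll method are hypothetical helpers for illustration and are not claimed to match what Lucene does internally.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper: drains a Reader into a String so a String-based tokenizer can run on it.
class ReaderText {

    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        // Copy chunk by chunk until the reader signals end of input with -1.
        while ((n = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        try (Reader reader = new StringReader("James is running round in the woods")) {
            String text = readAll(reader);
            // Once the text is a String, it can be split like any other String.
            System.out.println(text.split("\\s+").length);
        }
    }
}
```

When the text is already available as a String, this copy is avoided, which fits the class description's preference for String input.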
-
equals
public boolean equals(java.lang.Object other)
Deprecated. Indicates whether some other object is "equal to" this one.
- Overrides:
  equals in class java.lang.Object
- Parameters:
  other - the reference object with which to compare.
- Returns:
  true if equal, false otherwise
-
hashCode
public int hashCode()
Deprecated. Returns a hash code value for the object.
- Overrides:
  hashCode in class java.lang.Object
- Returns:
  the hash code.
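For a class configured by a regular expression, a consistent equals/hashCode pair needs some care, because java.util.regex.Pattern does not override equals and two separately compiled patterns compare by identity. The sketch below shows one way to handle that by comparing the regex string and flags. PatternConfig is a hypothetical value class invented for this illustration; it is not claimed to reproduce how PatternAnalyzer implements these methods.

```java
import java.util.Objects;
import java.util.regex.Pattern;

// Hypothetical value class showing a consistent equals/hashCode pair for a pattern-based configuration.
final class PatternConfig {
    private final Pattern pattern;
    private final boolean toLowerCase;

    PatternConfig(Pattern pattern, boolean toLowerCase) {
        this.pattern = Objects.requireNonNull(pattern);
        this.toLowerCase = toLowerCase;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof PatternConfig)) return false;
        PatternConfig that = (PatternConfig) other;
        // Pattern uses identity equality, so compare the regex string and the flags instead.
        return toLowerCase == that.toLowerCase
                && pattern.flags() == that.pattern.flags()
                && pattern.pattern().equals(that.pattern.pattern());
    }

    @Override
    public int hashCode() {
        // Hash exactly the fields that equals compares, so equal objects share a hash code.
        return Objects.hash(pattern.pattern(), pattern.flags(), toLowerCase);
    }
}
```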
-
-