Class PatternAnalyzer

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable

    @Deprecated
    public final class PatternAnalyzer
    extends Analyzer
    Deprecated.
    (4.0) use the pattern-based analysis in the analysis/pattern package instead.
    An efficient Lucene analyzer/tokenizer that preferably operates on a String rather than a Reader. It separates text into terms via a regular expression Pattern (with behaviour identical to String.split(String)) and combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, and StopFilter in a single multi-purpose class.

    If you are unsure what a regular expression should look like, consider prototyping by trying various expressions on some test texts via String.split(String). Once you are satisfied, pass that regex to PatternAnalyzer. Also see the Java Regular Expression Tutorial.
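The prototyping step described above needs nothing beyond the JDK. The following is a minimal sketch; the class name SplitPrototype, the helper tryRegex, and the candidate regexes are illustrative assumptions and not part of Lucene.

```java
import java.util.Arrays;
import java.util.List;

class SplitPrototype {
    // Splits the sample text with a candidate regex, exactly as String.split(String) does.
    static List<String> tryRegex(String text, String regex) {
        return Arrays.asList(text.split(regex));
    }

    public static void main(String[] args) {
        String sample = "James is running round in the woods";
        // Try a few candidate delimiters and inspect the resulting terms by eye.
        System.out.println(tryRegex(sample, "\\s+"));
        System.out.println(tryRegex(sample, "[^a-zA-Z]+"));
    }
}
```

Once a candidate produces the terms you expect, the same regex string can be compiled with Pattern.compile and handed to the constructor.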

    This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene TokenFilter chain, as in this stemming example:

     PatternAnalyzer pat = ...
     TokenStream tokenStream = new SnowballFilter(
         pat.tokenStream("content", "James is running round in the woods"), 
         "English");
     
    • Field Detail

      • NON_WORD_PATTERN

        public static final java.util.regex.Pattern NON_WORD_PATTERN
        Deprecated.
        "\\W+"; Divides text at non-letters (NOT Character.isLetter(c))
      • WHITESPACE_PATTERN

        public static final java.util.regex.Pattern WHITESPACE_PATTERN
        Deprecated.
        "\\s+"; Divides text at whitespace (Character.isWhitespace(c))
      • DEFAULT_ANALYZER

        public static final PatternAnalyzer DEFAULT_ANALYZER
        Deprecated.
        A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader.
      • EXTENDED_ANALYZER

        public static final PatternAnalyzer EXTENDED_ANALYZER
        Deprecated.
        A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. The stop words are borrowed from http://thomas.loc.gov/home/stopwords.html, see http://thomas.loc.gov/home/all.about.inquery.html
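To see how the two pattern strings above differ in practice, the sketch below applies them to the same text with plain java.util.regex. It does not call PatternAnalyzer, whose own handling may differ in edge cases; the class name PatternComparison and the sample sentence are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PatternComparison {
    // Same pattern strings as the documented constants, compiled locally
    // so that the example has no Lucene dependency.
    static final Pattern NON_WORD = Pattern.compile("\\W+");
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static List<String> split(Pattern pattern, String text) {
        return Arrays.asList(pattern.split(text));
    }

    public static void main(String[] args) {
        String text = "Don't panic, it's fine";
        System.out.println(split(NON_WORD, text));   // [Don, t, panic, it, s, fine]
        System.out.println(split(WHITESPACE, text)); // [Don't, panic,, it's, fine]
    }
}
```

Splitting at non-word characters breaks contractions apart and strips punctuation, while splitting at whitespace keeps punctuation attached to the neighbouring term.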
    • Constructor Detail

      • PatternAnalyzer

        public PatternAnalyzer(Version matchVersion,
                               java.util.regex.Pattern pattern,
                               boolean toLowerCase,
                               CharArraySet stopWords)
        Deprecated.
        Constructs a new instance with the given parameters.
        Parameters:
        matchVersion - currently does nothing
        pattern - a regular expression delimiting tokens
        toLowerCase - if true returns tokens after applying String.toLowerCase()
        stopWords - if non-null, ignores all tokens that are contained in the given stop set (after previously having applied toLowerCase() if applicable). For example, created via StopFilter.makeStopSet(Version, String[]) and/or WordlistLoader, as in WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt")), or other stop word lists.
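The interaction of the pattern, toLowerCase, and stopWords parameters can be sketched in plain Java. This is not the Lucene implementation: PatternPipelineSketch and analyze are hypothetical names, a java.util.Set stands in for CharArraySet, and Locale.ROOT is used so the result does not depend on the default locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class PatternPipelineSketch {
    // Split at the delimiter pattern, optionally lower-case, then drop stop words.
    static List<String> analyze(String text, Pattern pattern,
                                boolean toLowerCase, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : pattern.split(text)) {
            if (token.isEmpty()) {
                continue; // a leading delimiter yields an empty first element
            }
            // Lower-casing happens before the stop-word lookup, as the parameter docs describe.
            String term = toLowerCase ? token.toLowerCase(Locale.ROOT) : token;
            if (stopWords != null && stopWords.contains(term)) {
                continue;
            }
            terms.add(term);
        }
        return terms;
    }
}
```

Because the stop set is consulted after lower-casing, a stop list containing "the" removes "The" only when toLowerCase is true.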
    • Method Detail

      • createComponents

        public Analyzer.TokenStreamComponents createComponents(java.lang.String fieldName,
                                                               java.io.Reader reader,
                                                               java.lang.String text)
        Deprecated.
        Creates a token stream that tokenizes the given string into token terms (aka words).
        Parameters:
        fieldName - the name of the field to tokenize (currently ignored).
        reader - reader (e.g. a CharFilter) of the original text; can be null.
        text - the string to tokenize
        Returns:
        a new token stream
      • createComponents

        public Analyzer.TokenStreamComponents createComponents(java.lang.String fieldName,
                                                               java.io.Reader reader)
        Deprecated.
        Creates a token stream that tokenizes all the text in the given Reader. This implementation forwards to the String-based variant, createComponents(String, Reader, String), and is less efficient than calling that variant directly with a String.
        Parameters:
        fieldName - the name of the field to tokenize (currently ignored).
        reader - the reader delivering the text
        Returns:
        a new token stream
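A regex split needs the whole text in memory, so one plausible reason the Reader overload costs more is that the Reader must first be drained into a String. The sketch below shows that extra step under this assumption; ReaderDrainSketch and readFully are hypothetical names, not Lucene API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

class ReaderDrainSketch {
    // Reads every character from the Reader into a String before any tokenizing can start.
    static String readFully(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

When the text is already available as a String, passing it directly avoids this copy.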
      • equals

        public boolean equals(java.lang.Object other)
        Deprecated.
        Indicates whether some other object is "equal to" this one.
        Overrides:
        equals in class java.lang.Object
        Parameters:
        other - the reference object with which to compare.
        Returns:
        true if equal, false otherwise
      • hashCode

        public int hashCode()
        Deprecated.
        Returns a hash code value for the object.
        Overrides:
        hashCode in class java.lang.Object
        Returns:
        the hash code.