Class DictionaryCompoundWordTokenFilter

  • All Implemented Interfaces: java.lang.AutoCloseable

    public class DictionaryCompoundWordTokenFilter
    extends CompoundWordTokenFilterBase

    A TokenFilter that decomposes compound words found in many Germanic languages.

    For example, "Donaudampfschiff" becomes Donau, dampf, schiff, so that a search for "schiff" alone still finds "Donaudampfschiff". The filter uses a brute-force algorithm to achieve this: it checks substrings of each token against the dictionary.

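    The brute-force idea described above can be sketched in plain Java. This is a minimal illustration, not Lucene's implementation: the class name BruteForceDecompounder and its decompose method are hypothetical, it works on a String and a java.util.Set rather than a TokenStream and a CharArraySet, and it assumes lowercase dictionary entries. It does not model token positions or offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
public class BruteForceDecompounder {

    // Tries every (start, end) substring of the word and keeps each one
    // that appears in the dictionary. Matching is done in lowercase.
    static List<String> decompose(String word, Set<String> dictionary) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> subwords = new ArrayList<>();
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + 1; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (dictionary.contains(candidate)) {
                    subwords.add(candidate);
                }
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("donau", "dampf", "schiff");
        // prints [donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary));
    }
}
```

    The nested loops inspect a number of substrings that grows quadratically with the word length, which is why size limits on words and subwords are useful in practice.
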
    You must specify the required Version compatibility when creating CompoundWordTokenFilterBase:

    • As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.
    • Constructor Detail

      • DictionaryCompoundWordTokenFilter

        public DictionaryCompoundWordTokenFilter​(Version matchVersion,
                                                 TokenStream input,
                                                 CharArraySet dictionary,
                                                 int minWordSize,
                                                 int minSubwordSize,
                                                 int maxSubwordSize,
                                                 boolean onlyLongestMatch)
        Parameters:
        matchVersion - the Lucene version; values later than 3.0 enable correct Unicode 4.0 behavior in the dictionaries. See CompoundWordTokenFilterBase for details.
        input - the TokenStream to process
        dictionary - the word dictionary to match against
        minWordSize - only words longer than this get processed
        minSubwordSize - only subwords longer than this get to the output stream
        maxSubwordSize - only subwords shorter than this get to the output stream
        onlyLongestMatch - add only the longest matching subword to the stream
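
        To show how the size limits and onlyLongestMatch might interact, here is a plain-Java sketch. It is not Lucene code: the class BoundedDecompounder is hypothetical and uses String and java.util.Set instead of TokenStream and CharArraySet. Two assumptions are made for the illustration: the size bounds are treated as inclusive, and "longest match" is taken to mean the longest dictionary hit at each start position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
public class BoundedDecompounder {

    static List<String> decompose(String word, Set<String> dictionary,
                                  int minWordSize, int minSubwordSize,
                                  int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> subwords = new ArrayList<>();
        String lower = word.toLowerCase(Locale.ROOT);
        // Assumption: words shorter than minWordSize are left untouched.
        if (lower.length() < minWordSize) {
            return subwords;
        }
        for (int start = 0; start + minSubwordSize <= lower.length(); start++) {
            String longest = null;
            // Assumption: both subword bounds are inclusive.
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= lower.length();
                 len++) {
                String candidate = lower.substring(start, start + len);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatch) {
                    longest = candidate; // len only grows, so the last hit is the longest
                } else {
                    subwords.add(candidate);
                }
            }
            if (longest != null) {
                subwords.add(longest);
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("donau", "dampf", "schiff", "dampfschiff");
        // prints [donau, dampf, dampfschiff, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary, 5, 2, 15, false));
        // prints [donau, dampfschiff, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary, 5, 2, 15, true));
    }
}
```

        With onlyLongestMatch set to true, the shorter hit "dampf" is dropped in favor of "dampfschiff" at the same start position, which reduces the number of tokens added to the stream.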