Interface Tokenizer
All Known Implementing Classes:
 SimpleTokenizer, TokenizerME, WhitespaceTokenizer
public interface Tokenizer
The interface for tokenizers, which segment a string into its tokens.
Tokenization is a necessary step before more complex NLP tasks can be applied, since these usually process text at the token level. The quality of tokenization matters because it influences the performance of the high-level tasks applied to it.
In segmented languages like English, most words are separated by whitespace, except for punctuation and similar characters, which are attached directly to a word without intervening whitespace. It is not possible to simply split at every punctuation character, because in abbreviations the dots are part of the token itself. A tokenizer is responsible for splitting these tokens correctly.
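The abbreviation problem can be illustrated with a minimal sketch. It is not how SimpleTokenizer or TokenizerME work; the class name AbbreviationSketch and its small abbreviation list are hypothetical, chosen only for this example. The sketch splits on whitespace and detaches one trailing punctuation character unless the chunk is a listed abbreviation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class AbbreviationSketch {

    // Hypothetical abbreviation list, for illustration only.
    private static final Set<String> ABBREVIATIONS = Set.of("Dr.", "Mr.", "e.g.", "etc.");

    // Splits on whitespace, then detaches a single trailing punctuation
    // character unless the whole chunk is a known abbreviation.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : s.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // empty or all-whitespace input
            }
            char last = chunk.charAt(chunk.length() - 1);
            boolean trailingPunctuation = ".,;:!?".indexOf(last) >= 0;
            if (trailingPunctuation && chunk.length() > 1 && !ABBREVIATIONS.contains(chunk)) {
                tokens.add(chunk.substring(0, chunk.length() - 1));
                tokens.add(String.valueOf(last));
            } else {
                tokens.add(chunk);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Dr. Smith arrived, finally."));
    }
}
```

For the input "Dr. Smith arrived, finally." the sketch keeps "Dr." intact while separating the comma and the final dot. A sentence that ends in a listed abbreviation keeps its final dot attached, which shows the limits of a fixed list.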
In non-segmented languages like Chinese, tokenization is more difficult because words are not separated by whitespace.
Tokenizers can also be used to segment already identified tokens further into more atomic parts to get a deeper understanding. This approach helps more complex tasks gain insight into tokens that do not represent words, such as numbers, units, or tokens that are part of a special notation.
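As a rough illustration of this second pass, the sketch below splits an already identified token wherever a digit meets a letter, so that a number and its unit become separate parts. The class SubTokenSketch and its split method are hypothetical names for this example and are not part of the library.

```java
final class SubTokenSketch {

    // Splits at every boundary between a digit and a letter (in either order).
    // The lookarounds match zero-width positions, so no characters are lost.
    static String[] split(String token) {
        return token.split("(?<=\\d)(?=\\p{L})|(?<=\\p{L})(?=\\d)");
    }

    public static void main(String[] args) {
        for (String part : split("10kg")) {
            System.out.println(part);
        }
    }
}
```

Tokens without such a boundary, for example a plain word, come back unchanged as a single-element array.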
For most downstream tasks, it is desirable to over-tokenize rather than under-tokenize.
 
Method Summary
Modifier and Type     Method                             Description
java.lang.String[]    tokenize(java.lang.String s)       Splits a string into its atomic parts.
Span[]                tokenizePos(java.lang.String s)    Finds the boundaries of atomic parts in a string.
 
Method Detail
tokenize
java.lang.String[] tokenize(java.lang.String s)
Splits a string into its atomic parts.
Parameters:
 s - The string to be tokenized.
Returns:
 The String[] with the individual tokens as the array elements.
 
tokenizePos
Span[] tokenizePos(java.lang.String s)
Finds the boundaries of atomic parts in a string.
Parameters:
 s - The string to be tokenized.
Returns:
 The Span[] with the spans (offsets into s) for each token as the individual array elements.
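The relationship between the two methods can be sketched with plain Java. The example below is a simplified whitespace-only stand-in, not the library's implementation: the class SpanSketch is hypothetical, and each span is modeled as an int pair {start, end} with an exclusive end offset, which is an assumption of this sketch. Taking the substring of s for each span recovers the strings that tokenize would return.

```java
import java.util.ArrayList;
import java.util.List;

final class SpanSketch {

    // Returns {start, end} offset pairs into s, one per whitespace-delimited
    // token. The end offset is exclusive (an assumption of this sketch).
    static List<int[]> tokenizePos(String s) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                if (start >= 0) {
                    spans.add(new int[] {start, i});
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, s.length()}); // token runs to end of input
        }
        return spans;
    }

    // Derives the token strings from the spans.
    static String[] tokenize(String s) {
        List<int[]> spans = tokenizePos(s);
        String[] tokens = new String[spans.size()];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = s.substring(spans.get(i)[0], spans.get(i)[1]);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "  Hello  world ";
        for (int[] span : tokenizePos(s)) {
            System.out.println(span[0] + ".." + span[1] + " " + s.substring(span[0], span[1]));
        }
    }
}
```

Offsets are useful when a later step has to map tokens back to positions in the original text, for instance to highlight them, since the token strings alone lose that information.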
 