Class TokenizerME

  • All Implemented Interfaces:
    Tokenizer

    public class TokenizerME
    extends java.lang.Object
    A Tokenizer that converts raw text into separate tokens. It uses a maximum entropy model to make its decisions. The features are loosely based on Jeff Reynar's UPenn thesis "Topic Segmentation: Algorithms and Applications", which is available from his homepage: http://www.cis.upenn.edu/~jcreynar.

    This tokenizer needs a statistical model to tokenize text; it reproduces the tokenization observed in the training data used to create the model. The TokenizerModel class encapsulates the model and provides methods to create it from its binary representation.

    A tokenizer instance is not thread-safe. Each thread must instantiate its own tokenizer; all of them can share one TokenizerModel instance to save memory.

    To train a new model, use the train(ObjectStream, TokenizerFactory, TrainingParameters) method.

    Sample usage:

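    InputStream modelIn;
    ...
    // Load the model from its binary representation.
    TokenizerModel model = new TokenizerModel(modelIn);
    // Create one tokenizer per thread; the model itself can be shared.
    Tokenizer tokenizer = new TokenizerME(model);
    String[] tokens = tokenizer.tokenize("A sentence to be tokenized.");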
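
    The thread-safety note above can be sketched with a ThreadLocal. This is a minimal illustration, not part of the OpenNLP API: the class name PerThreadTokenizer and its type parameters M and T are hypothetical, chosen so the sketch compiles without OpenNLP on the classpath.

```java
import java.util.function.Function;

/**
 * Hands out one tokenizer per thread, each built from the same shared model.
 * M and T are placeholders (hypothetical helper, not an OpenNLP class);
 * with OpenNLP they would be TokenizerModel and TokenizerME.
 */
final class PerThreadTokenizer<M, T> {
    private final ThreadLocal<T> local;

    PerThreadTokenizer(M sharedModel, Function<M, T> factory) {
        // The factory runs on a thread's first get() call, on that thread.
        // Every instance receives the same model reference, so the model
        // is held in memory only once.
        this.local = ThreadLocal.withInitial(() -> factory.apply(sharedModel));
    }

    /** Returns the calling thread's own tokenizer instance. */
    T get() {
        return local.get();
    }
}
```

    With OpenNLP available, such a helper would presumably be created as new PerThreadTokenizer<>(model, m -> new TokenizerME(m)), and each worker thread would then call get().tokenize(text).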

    See Also:
    Tokenizer, TokenizerModel, TokenSample
    • Field Detail

      • SPLIT

        public static final java.lang.String SPLIT
        Constant indicating a token split.
        See Also:
        Constant Field Values
      • NO_SPLIT

        public static final java.lang.String NO_SPLIT
        Constant indicating no token split.
        See Also:
        Constant Field Values
      • alphaNumeric

        @Deprecated
        public static final java.util.regex.Pattern alphaNumeric
        Deprecated.
        As of release 1.5.2, replaced by Factory.getAlphanumeric(String)
        Alpha-Numeric Pattern
    • Method Detail

      • getTokenProbabilities

        public double[] getTokenProbabilities()
        Returns the probabilities associated with the most recent call to Tokenizer.tokenize(String) or tokenizePos(String).
        Returns:
        The probability of each token returned by the most recent call to tokenize or tokenizePos. If not applicable, an empty array is returned.
      • tokenizePos

        public Span[] tokenizePos(java.lang.String d)
        Tokenizes the string.
        Parameters:
        d - The string to be tokenized.
        Returns:
        A span array containing individual tokens as elements.
      • useAlphaNumericOptimization

        public boolean useAlphaNumericOptimization()
        Returns the value of the alpha-numeric optimization flag.
        Returns:
        true if the tokenizer should use alpha-numeric optimization, false otherwise.
      • tokenize

        public java.lang.String[] tokenize(java.lang.String s)
        Description copied from interface: Tokenizer
        Splits a string into its atomic parts.
        Specified by:
        tokenize in interface Tokenizer
        Parameters:
        s - The string to be tokenized.
        Returns:
        The String[] with the individual tokens as the array elements.
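
    The relationship between the methods above can be sketched without the library. This is an illustration under stated assumptions: Offsets is a hypothetical stand-in for Span (assumed here to hold a start offset, inclusive, and an end offset, exclusive, into the input string), and TokenViews is a hypothetical helper. It shows how token strings can be recovered from spans, and how a probability array that is index-aligned with the tokens might be used to flag uncertain splits.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Span: [start, end) character offsets. */
final class Offsets {
    final int start; // inclusive
    final int end;   // exclusive

    Offsets(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

/** Hypothetical helper; not part of OpenNLP. */
final class TokenViews {
    /** Recovers the token strings that the given offsets cover in text. */
    static String[] coveredText(String text, Offsets[] spans) {
        String[] tokens = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            tokens[i] = text.substring(spans[i].start, spans[i].end);
        }
        return tokens;
    }

    /** Returns the tokens whose probability is below threshold. */
    static List<String> lowConfidence(String[] tokens, double[] probs, double threshold) {
        if (tokens.length != probs.length) {
            throw new IllegalArgumentException("tokens and probs must be index-aligned");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (probs[i] < threshold) {
                out.add(tokens[i]);
            }
        }
        return out;
    }
}
```

    In real use, the offsets would come from tokenizePos(String) and the probabilities from getTokenProbabilities(), called right after tokenizing and before the next tokenize call replaces them.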