public class TokenizerME
extends java.lang.Object
This tokenizer uses a statistical model to tokenize text, reproducing the tokenization observed in the training data used to create the model.
The TokenizerModel class encapsulates the model and provides methods to create it from its binary representation.
A tokenizer instance is not thread safe. Each thread must instantiate its own tokenizer; these instances can share one TokenizerModel instance to save memory.
To train a new model, the train(String, ObjectStream, boolean, TrainingParameters) method can be used. Note that this overload is deprecated in favor of train(ObjectStream, TokenizerFactory, TrainingParameters).
Sample usage:
InputStream modelIn;
...
TokenizerModel model = new TokenizerModel(modelIn);
Tokenizer tokenizer = new TokenizerME(model);
String tokens[] = tokenizer.tokenize("A sentence to be tokenized.");
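Because a tokenizer instance is not thread safe while the model can be shared, one common way to follow the rule is a ThreadLocal that lazily creates one tokenizer per thread around a single model. The following is only a minimal sketch of that pattern using the JDK alone: SharedModel and StatefulTokenizer are hypothetical stand-ins for TokenizerModel and TokenizerME (they split on whitespace and are not OpenNLP classes), and the pool size of 4 is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerThreadTokenizerSketch {

    /** Hypothetical stand-in for an immutable model that is safe to share. */
    static final class SharedModel {
        final String delimiterRegex;
        SharedModel(String delimiterRegex) { this.delimiterRegex = delimiterRegex; }
    }

    /** Hypothetical stand-in for a tokenizer with per-call state; NOT thread safe. */
    static final class StatefulTokenizer {
        private final SharedModel model;
        private int lastTokenCount; // mutable state left behind by the most recent call

        StatefulTokenizer(SharedModel model) { this.model = model; }

        String[] tokenize(String s) {
            String[] tokens = s.trim().split(model.delimiterRegex);
            lastTokenCount = tokens.length;
            return tokens;
        }

        int lastTokenCount() { return lastTokenCount; }
    }

    // One model instance for the whole application ...
    static final SharedModel MODEL = new SharedModel("\\s+");

    // ... and one tokenizer per thread, created on first use in that thread.
    static final ThreadLocal<StatefulTokenizer> TOKENIZER =
            ThreadLocal.withInitial(() -> new StatefulTokenizer(MODEL));

    /** Tokenizes with the calling thread's own tokenizer and reads its state. */
    static int countTokens(String sentence) {
        StatefulTokenizer tokenizer = TOKENIZER.get();
        tokenizer.tokenize(sentence);
        return tokenizer.lastTokenCount();
    }

    /** Counts tokens for each sentence on a small thread pool, preserving input order. */
    static List<Integer> countInParallel(List<String> sentences) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String s : sentences) {
                Callable<Integer> task = () -> countTokens(s);
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> f : futures) {
                counts.add(f.get());
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("tokenization task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countInParallel(
                Arrays.asList("A sentence to be tokenized.", "Two words")));
    }
}
```

With the real classes, the same shape would wrap new TokenizerME(model) in the ThreadLocal initializer, so the per-call state (such as the token probabilities) is never shared between threads.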
See Also: Tokenizer, TokenizerModel, TokenSample
Modifier and Type | Field and Description |
---|---|
static java.util.regex.Pattern | alphaNumeric: Deprecated. As of release 1.5.2, replaced by Factory.getAlphanumeric(String) |
static java.lang.String | NO_SPLIT: Constant indicates no token split. |
static java.lang.String | SPLIT: Constant indicates a token split. |
Constructor and Description |
---|
TokenizerME(TokenizerModel model) |
TokenizerME(TokenizerModel model, Factory factory): Deprecated. Use TokenizerFactory to extend the Tokenizer functionality. |
Modifier and Type | Method and Description |
---|---|
double[] | getTokenProbabilities(): Returns the probabilities associated with the most recent calls to AbstractTokenizer.tokenize(String) or tokenizePos(String). |
java.lang.String[] | tokenize(java.lang.String s): Splits a string into its atomic parts. |
Span[] | tokenizePos(java.lang.String d): Tokenizes the string. |
static TokenizerModel | train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams): Trains a model for the TokenizerME. |
static TokenizerModel | train(java.lang.String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization): Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory. |
static TokenizerModel | train(java.lang.String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization, TrainingParameters mlParams): Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory. |
static TokenizerModel | train(java.lang.String languageCode, ObjectStream<TokenSample> samples, Dictionary abbreviations, boolean useAlphaNumericOptimization, TrainingParameters mlParams): Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory. |
boolean | useAlphaNumericOptimization(): Returns the value of the alpha-numeric optimization flag. |
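public static final java.lang.String SPLIT
Constant indicates a token split.
public static final java.lang.String NO_SPLIT
Constant indicates no token split.
@Deprecated public static final java.util.regex.Pattern alphaNumeric
Deprecated. As of release 1.5.2, replaced by Factory.getAlphanumeric(String)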
public TokenizerME(TokenizerModel model)
public TokenizerME(TokenizerModel model, Factory factory)
Deprecated. Use TokenizerFactory to extend the Tokenizer functionality.
public double[] getTokenProbabilities()
Returns the probabilities associated with the most recent calls to AbstractTokenizer.tokenize(String) or tokenizePos(String).
public Span[] tokenizePos(java.lang.String d)
Tokenizes the string.
Parameters:
d - The string to be tokenized.
public static TokenizerModel train(ObjectStream<TokenSample> samples, TokenizerFactory factory, TrainingParameters mlParams) throws java.io.IOException
Trains a model for the TokenizerME.
Parameters:
samples - the samples used for the training.
factory - a TokenizerFactory to get resources from
mlParams - the machine learning train parameters
Returns: the trained TokenizerModel
Throws:
java.io.IOException - if an IOException is thrown during IO operations on a temp file which is created during training, or if reading from the ObjectStream fails.
public static TokenizerModel train(java.lang.String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization, TrainingParameters mlParams) throws java.io.IOException
Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory.
Trains a model for the TokenizerME.
Parameters:
languageCode - the language of the natural text
samples - the samples used for the training.
useAlphaNumericOptimization - if true, alphanumerics are skipped
mlParams - the machine learning train parameters
Returns: the trained TokenizerModel
Throws:
java.io.IOException - if an IOException is thrown during IO operations on a temp file which is created during training, or if reading from the ObjectStream fails.
public static TokenizerModel train(java.lang.String languageCode, ObjectStream<TokenSample> samples, Dictionary abbreviations, boolean useAlphaNumericOptimization, TrainingParameters mlParams) throws java.io.IOException
Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory.
Trains a model for the TokenizerME.
Parameters:
languageCode - the language of the natural text
samples - the samples used for the training.
abbreviations - an abbreviations dictionary
useAlphaNumericOptimization - if true, alphanumerics are skipped
mlParams - the machine learning train parameters
Returns: the trained TokenizerModel
Throws:
java.io.IOException - if an IOException is thrown during IO operations on a temp file which is created during training, or if reading from the ObjectStream fails.
public static TokenizerModel train(java.lang.String languageCode, ObjectStream<TokenSample> samples, boolean useAlphaNumericOptimization) throws java.io.IOException, java.io.ObjectStreamException
Deprecated. Use train(ObjectStream, TokenizerFactory, TrainingParameters) and pass in a TokenizerFactory.
Trains a model for the TokenizerME with a default cutoff of 5 and 100 iterations.
Parameters:
languageCode - the language of the natural text
samples - the samples used for the training.
useAlphaNumericOptimization - if true, alphanumerics are skipped
Returns: the trained TokenizerModel
Throws:
java.io.IOException - if an IOException is thrown during IO operations on a temp file which is created during training.
java.io.ObjectStreamException - if reading from the ObjectStream fails
public boolean useAlphaNumericOptimization()
Returns the value of the alpha-numeric optimization flag.
public java.lang.String[] tokenize(java.lang.String s)
Splits a string into its atomic parts.
Specified by: tokenize in interface Tokenizer
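The two tokenizing methods describe the same split in two forms: tokenize(String) returns the token strings, while tokenizePos(String) returns character spans into the input. The sketch below illustrates how one form can be derived from the other. It is not the TokenizerME implementation: the Span class here is a hypothetical stand-in (start inclusive, end exclusive, which is an assumption of this sketch), and the stand-in tokenizePos splits on whitespace only instead of consulting a statistical model.

```java
import java.util.ArrayList;
import java.util.List;

final class SpanOffsetsSketch {

    /** Hypothetical stand-in for a span: start is inclusive, end is exclusive. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** Returns the part of text that this span covers. */
        String coveredText(String text) {
            return text.substring(start, end);
        }
    }

    /** Stand-in for tokenizePos: finds maximal runs of non-whitespace characters. */
    static Span[] tokenizePos(String d) {
        List<Span> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= d.length(); i++) {
            boolean boundary = i == d.length() || Character.isWhitespace(d.charAt(i));
            if (!boundary && start < 0) {
                start = i;                     // a token begins here
            } else if (boundary && start >= 0) {
                spans.add(new Span(start, i)); // the token ended just before i
                start = -1;
            }
        }
        return spans.toArray(new Span[0]);
    }

    /** Derives the token strings from the spans, one token per span. */
    static String[] tokenize(String s) {
        Span[] spans = tokenizePos(s);
        String[] tokens = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            tokens[i] = spans[i].coveredText(s);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "A sentence to be tokenized.";
        for (Span span : tokenizePos(text)) {
            System.out.println(span.start + ".." + span.end + " " + span.coveredText(text));
        }
    }
}
```

Spans are the more useful form when the original text must be annotated or highlighted in place, since the offsets survive any amount of intervening whitespace.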
Copyright © 2010 - 2020 Adobe. All Rights Reserved