Class NGramUtils


  • public class NGramUtils
    extends java.lang.Object
    Utility class for ngrams. Some methods apply only to specific values of n, e.g. trigrams, bigrams, or unigrams.
    • Constructor Summary

      Constructors 
      Constructor Description
      NGramUtils()  
    • Method Summary

      Modifier and Type Method Description
      static double calculateBigramMLProbability(java.lang.String x0, java.lang.String x1, java.util.Collection<StringList> set)
      Calculate the probability of a bigram in a vocabulary using maximum likelihood estimation.
      static double calculateBigramPriorSmoothingProbability(java.lang.String x0, java.lang.String x1, java.util.Collection<StringList> set, java.lang.Double k)
      Calculate the probability of a bigram in a vocabulary using prior Laplace smoothing.
      static double calculateLaplaceSmoothingProbability(StringList ngram, java.lang.Iterable<StringList> set, java.lang.Double k)
      Calculate the probability of an ngram in a vocabulary using Laplace smoothing.
      static double calculateMissingNgramProbabilityMass(StringList ngram, java.lang.Double discount, java.lang.Iterable<StringList> set)
      Calculate the probability of an ngram in a vocabulary using the missing probability mass algorithm.
      static double calculateNgramMLProbability(StringList ngram, java.lang.Iterable<StringList> set)
      Calculate the probability of an ngram in a vocabulary using maximum likelihood estimation.
      static double calculateTrigramLinearInterpolationProbability(java.lang.String x0, java.lang.String x1, java.lang.String x2, java.util.Collection<StringList> set, java.lang.Double lambda1, java.lang.Double lambda2, java.lang.Double lambda3)
      Calculate the probability of a trigram in a vocabulary using linear interpolation.
      static double calculateTrigramMLProbability(java.lang.String x0, java.lang.String x1, java.lang.String x2, java.lang.Iterable<StringList> set)
      Calculate the probability of a trigram in a vocabulary using maximum likelihood estimation.
      static double calculateUnigramMLProbability(java.lang.String word, java.util.Collection<StringList> set)
      Calculate the probability of a unigram in a vocabulary using maximum likelihood estimation.
      static java.util.Collection<java.lang.String[]> getNGrams(java.lang.String[] sequence, int size)
      Get the ngrams of the given size from an input sequence of tokens.
      static java.util.Collection<StringList> getNGrams(StringList sequence, int size)
      Get the ngrams of the given size from an input sequence of tokens.
      static StringList getNMinusOneTokenFirst(StringList ngram)
      Get the (n-1)-gram of a given ngram, that is, the same ngram without its last word.
      static StringList getNMinusOneTokenLast(StringList ngram)
      Get the (n-1)-gram of a given ngram, that is, the same ngram without its first word.
      • Methods inherited from class java.lang.Object

        equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • NGramUtils

        public NGramUtils()
    • Method Detail

      • calculateLaplaceSmoothingProbability

        public static double calculateLaplaceSmoothingProbability(StringList ngram,
                                                                  java.lang.Iterable<StringList> set,
                                                                  java.lang.Double k)
        Calculate the probability of an ngram in a vocabulary using Laplace smoothing.
        Parameters:
        ngram - the ngram to get the probability for
        set - the vocabulary
        k - the smoothing factor
        Returns:
        the Laplace smoothing probability
        See Also:
        Additive Smoothing
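        The additive smoothing formula referenced above can be sketched in plain Java. This is an illustration of the textbook formula, not the library's implementation: the class LaplaceSmoothingSketch, its methods, and the use of List<String> in place of StringList are all hypothetical, and the sketch assumes the ngram has at least two tokens.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; names are illustrative and not part of NGramUtils.
class LaplaceSmoothingSketch {

  // Counts how often `ngram` occurs as a contiguous run of tokens in the sentences.
  static int count(List<String> ngram, List<List<String>> sentences) {
    int hits = 0;
    for (List<String> sentence : sentences) {
      for (int i = 0; i + ngram.size() <= sentence.size(); i++) {
        if (sentence.subList(i, i + ngram.size()).equals(ngram)) {
          hits++;
        }
      }
    }
    return hits;
  }

  // Additive smoothing: (count(ngram) + k) / (count(history) + k * V),
  // where history is the ngram without its last token and
  // V is the number of distinct tokens in the sentences.
  // Assumes ngram.size() >= 2.
  static double laplace(List<String> ngram, List<List<String>> sentences, double k) {
    Set<String> vocabulary = new HashSet<>();
    for (List<String> sentence : sentences) {
      vocabulary.addAll(sentence);
    }
    List<String> history = ngram.subList(0, ngram.size() - 1);
    return (count(ngram, sentences) + k)
        / (count(history, sentences) + k * vocabulary.size());
  }
}
```

        For the two sentences "the cat sat" and "the dog sat" with k = 1, the bigram "the cat" scores (1 + 1) / (2 + 4) = 1/3, and the unseen bigram "cat dog" still receives (0 + 1) / (1 + 4) = 0.2.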
      • calculateUnigramMLProbability

        public static double calculateUnigramMLProbability(java.lang.String word,
                                                           java.util.Collection<StringList> set)
        Calculate the probability of a unigram in a vocabulary using maximum likelihood estimation.
        Parameters:
        word - the only word in the unigram
        set - the vocabulary
        Returns:
        the maximum likelihood probability
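        As a rough illustration of a unigram maximum likelihood estimate, the sketch below divides the occurrences of a word by the total number of tokens. The class UnigramMlSketch and its method are hypothetical, and List<String> stands in for StringList; the library may count differently.

```java
import java.util.List;

// Hypothetical helper class; names are illustrative and not part of NGramUtils.
class UnigramMlSketch {

  // Maximum likelihood estimate for a single word:
  // occurrences of the word divided by the total number of tokens.
  static double unigramMl(String word, List<List<String>> sentences) {
    int occurrences = 0;
    int totalTokens = 0;
    for (List<String> sentence : sentences) {
      for (String token : sentence) {
        if (token.equals(word)) {
          occurrences++;
        }
        totalTokens++;
      }
    }
    // An empty corpus has no evidence for any word.
    return totalTokens == 0 ? 0.0 : (double) occurrences / totalTokens;
  }
}
```

        For the sentences "the cat sat" and "the dog sat", the word "the" gets 2 / 6 = 1/3, and a word that never occurs gets 0.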
      • calculateBigramMLProbability

        public static double calculateBigramMLProbability(java.lang.String x0,
                                                          java.lang.String x1,
                                                          java.util.Collection<StringList> set)
        Calculate the probability of a bigram in a vocabulary using maximum likelihood estimation.
        Parameters:
        x0 - first word in the bigram
        x1 - second word in the bigram
        set - the vocabulary
        Returns:
        the maximum likelihood probability
      • calculateTrigramMLProbability

        public static double calculateTrigramMLProbability(java.lang.String x0,
                                                           java.lang.String x1,
                                                           java.lang.String x2,
                                                           java.lang.Iterable<StringList> set)
        Calculate the probability of a trigram in a vocabulary using maximum likelihood estimation.
        Parameters:
        x0 - first word in the trigram
        x1 - second word in the trigram
        x2 - third word in the trigram
        set - the vocabulary
        Returns:
        the maximum likelihood probability
      • calculateNgramMLProbability

        public static double calculateNgramMLProbability(StringList ngram,
                                                         java.lang.Iterable<StringList> set)
        Calculate the probability of an ngram in a vocabulary using maximum likelihood estimation.
        Parameters:
        ngram - an ngram
        set - the vocabulary
        Returns:
        the maximum likelihood probability
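        The usual maximum likelihood estimate for an ngram divides its count by the count of its history (the ngram without its last token); the bigram and trigram cases are instances of the same ratio. The sketch below shows that ratio under stated assumptions: NgramMlSketch and its methods are hypothetical, List<String> stands in for StringList, and the ngram is assumed to have at least two tokens.

```java
import java.util.List;

// Hypothetical helper class; names are illustrative and not part of NGramUtils.
class NgramMlSketch {

  // Counts how often `ngram` occurs as a contiguous run of tokens in the sentences.
  static int count(List<String> ngram, List<List<String>> sentences) {
    int hits = 0;
    for (List<String> sentence : sentences) {
      for (int i = 0; i + ngram.size() <= sentence.size(); i++) {
        if (sentence.subList(i, i + ngram.size()).equals(ngram)) {
          hits++;
        }
      }
    }
    return hits;
  }

  // P(w_n | w_1..w_{n-1}) = count(w_1..w_n) / count(w_1..w_{n-1}).
  // Assumes ngram.size() >= 2; returns 0 when the history was never seen.
  static double ngramMl(List<String> ngram, List<List<String>> sentences) {
    List<String> history = ngram.subList(0, ngram.size() - 1);
    int historyCount = count(history, sentences);
    return historyCount == 0 ? 0.0 : (double) count(ngram, sentences) / historyCount;
  }
}
```

        For the sentences "the cat sat" and "the dog sat", the bigram "the cat" gets 1 / 2 = 0.5 and the trigram "the cat sat" gets 1 / 1 = 1.0. Returning 0 for an unseen history is a choice made for this sketch only.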
      • calculateBigramPriorSmoothingProbability

        public static double calculateBigramPriorSmoothingProbability(java.lang.String x0,
                                                                      java.lang.String x1,
                                                                      java.util.Collection<StringList> set,
                                                                      java.lang.Double k)
        Calculate the probability of a bigram in a vocabulary using prior Laplace smoothing.
        Parameters:
        x0 - the first word in the bigram
        x1 - the second word in the bigram
        set - the vocabulary
        k - the smoothing factor
        Returns:
        the prior Laplace smoothing probability
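        One common textbook form of smoothing with a unigram prior is (count(x0 x1) + k * P(x1)) / (count(x0) + k), where P(x1) is the unigram maximum likelihood estimate. The sketch below implements that form only; whether the library uses exactly this normalisation is an assumption, and BigramPriorSketch, its methods, and the List<String> representation are hypothetical. It assumes k > 0.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; names are illustrative and not part of NGramUtils.
class BigramPriorSketch {

  // Counts how often `ngram` occurs as a contiguous run of tokens in the sentences.
  static int count(List<String> ngram, List<List<String>> sentences) {
    int hits = 0;
    for (List<String> sentence : sentences) {
      for (int i = 0; i + ngram.size() <= sentence.size(); i++) {
        if (sentence.subList(i, i + ngram.size()).equals(ngram)) {
          hits++;
        }
      }
    }
    return hits;
  }

  // Textbook unigram-prior smoothing (assumed form, k > 0):
  // (count(x0 x1) + k * P(x1)) / (count(x0) + k)
  static double bigramPrior(String x0, String x1, List<List<String>> sentences, double k) {
    int totalTokens = 0;
    for (List<String> sentence : sentences) {
      totalTokens += sentence.size();
    }
    double unigramPrior = totalTokens == 0
        ? 0.0
        : (double) count(Arrays.asList(x1), sentences) / totalTokens;
    return (count(Arrays.asList(x0, x1), sentences) + k * unigramPrior)
        / (count(Arrays.asList(x0), sentences) + k);
  }
}
```

        For the sentences "the cat sat" and "the dog sat" with k = 1, the bigram "the cat" gets (1 + 1/6) / (2 + 1) = 7/18.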
      • calculateTrigramLinearInterpolationProbability

        public static double calculateTrigramLinearInterpolationProbability(java.lang.String x0,
                                                                            java.lang.String x1,
                                                                            java.lang.String x2,
                                                                            java.util.Collection<StringList> set,
                                                                            java.lang.Double lambda1,
                                                                            java.lang.Double lambda2,
                                                                            java.lang.Double lambda3)
        Calculate the probability of a trigram in a vocabulary using linear interpolation.
        Parameters:
        x0 - the first word in the trigram
        x1 - the second word in the trigram
        x2 - the third word in the trigram
        set - the vocabulary
        lambda1 - trigram interpolation factor
        lambda2 - bigram interpolation factor
        lambda3 - unigram interpolation factor
        Returns:
        the linear interpolation probability
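        Linear interpolation mixes the trigram, bigram, and unigram maximum likelihood estimates, weighted by lambda1, lambda2, and lambda3 as listed above. The sketch below shows the standard weighted sum; InterpolationSketch and its methods are hypothetical, List<String> stands in for StringList, and the weights are assumed to be non-negative and to sum to 1.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; names are illustrative and not part of NGramUtils.
class InterpolationSketch {

  // Counts how often `ngram` occurs as a contiguous run of tokens in the sentences.
  static int count(List<String> ngram, List<List<String>> sentences) {
    int hits = 0;
    for (List<String> sentence : sentences) {
      for (int i = 0; i + ngram.size() <= sentence.size(); i++) {
        if (sentence.subList(i, i + ngram.size()).equals(ngram)) {
          hits++;
        }
      }
    }
    return hits;
  }

  // Safe division: an unseen denominator contributes probability 0.
  static double ratio(int numerator, int denominator) {
    return denominator == 0 ? 0.0 : (double) numerator / denominator;
  }

  // lambda1 * P(x2 | x0 x1) + lambda2 * P(x2 | x1) + lambda3 * P(x2)
  static double interpolate(String x0, String x1, String x2, List<List<String>> sentences,
                            double lambda1, double lambda2, double lambda3) {
    int totalTokens = 0;
    for (List<String> sentence : sentences) {
      totalTokens += sentence.size();
    }
    double trigram = ratio(count(Arrays.asList(x0, x1, x2), sentences),
                           count(Arrays.asList(x0, x1), sentences));
    double bigram = ratio(count(Arrays.asList(x1, x2), sentences),
                          count(Arrays.asList(x1), sentences));
    double unigram = ratio(count(Arrays.asList(x2), sentences), totalTokens);
    return lambda1 * trigram + lambda2 * bigram + lambda3 * unigram;
  }
}
```

        For the sentences "the cat sat" and "the dog sat" with weights 0.5, 0.3, and 0.2, the trigram "the cat sat" gets 0.5 * 1 + 0.3 * 1 + 0.2 * (1/3). Because the unigram term is rarely zero, an unseen trigram can still receive a non-zero probability.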
      • calculateMissingNgramProbabilityMass

        public static double calculateMissingNgramProbabilityMass(StringList ngram,
                                                                  java.lang.Double discount,
                                                                  java.lang.Iterable<StringList> set)
        Calculate the probability of an ngram in a vocabulary using the missing probability mass algorithm.
        Parameters:
        ngram - the ngram
        discount - discount factor
        set - the vocabulary
        Returns:
        the probability
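        In discounting schemes, the missing probability mass of a context is commonly defined as 1 minus the sum, over every word w seen after the context, of (count(context w) - discount) / count(context). The sketch below follows that common definition and may differ in detail from the library; MissingMassSketch, its methods, and the List<String> representation are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; names are illustrative and not part of NGramUtils.
class MissingMassSketch {

  // Counts how often `ngram` occurs as a contiguous run of tokens in the sentences.
  static int count(List<String> ngram, List<List<String>> sentences) {
    int hits = 0;
    for (List<String> sentence : sentences) {
      for (int i = 0; i + ngram.size() <= sentence.size(); i++) {
        if (sentence.subList(i, i + ngram.size()).equals(ngram)) {
          hits++;
        }
      }
    }
    return hits;
  }

  // 1 - sum over seen continuations w of (count(context w) - discount) / count(context).
  // An unseen context keeps no mass, so all of it (1.0) is reported as missing.
  static double missingMass(List<String> context, double discount, List<List<String>> sentences) {
    int contextCount = count(context, sentences);
    if (contextCount == 0) {
      return 1.0;
    }
    Set<String> vocabulary = new HashSet<>();
    for (List<String> sentence : sentences) {
      vocabulary.addAll(sentence);
    }
    double keptMass = 0.0;
    for (String word : vocabulary) {
      List<String> extended = new ArrayList<>(context);
      extended.add(word);
      int extendedCount = count(extended, sentences);
      if (extendedCount > 0) {
        keptMass += (extendedCount - discount) / contextCount;
      }
    }
    return 1.0 - keptMass;
  }
}
```

        For the sentences "the cat sat" and "the dog sat" with a discount of 0.5, the context "the" is followed by "cat" and "dog" once each, so the kept mass is 0.25 + 0.25 and the missing mass is 0.5.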
      • getNMinusOneTokenFirst

        public static StringList getNMinusOneTokenFirst(StringList ngram)
        Get the (n-1)-gram of a given ngram, that is, the same ngram without its last word.
        Parameters:
        ngram - an ngram
        Returns:
        the ngram without its last word
      • getNMinusOneTokenLast

        public static StringList getNMinusOneTokenLast(StringList ngram)
        Get the (n-1)-gram of a given ngram, that is, the same ngram without its first word.
        Parameters:
        ngram - an ngram
        Returns:
        the ngram without its first word
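        The difference between the two getNMinusOneToken methods above is which end of the ngram is dropped. The sketch below shows both operations on a plain List<String>; NMinusOneSketch and its method names are hypothetical, and a non-empty ngram is assumed.

```java
import java.util.List;

// Hypothetical helper class; names are illustrative and not part of NGramUtils.
class NMinusOneSketch {

  // Keeps the first n-1 tokens, i.e. drops the last word.
  static List<String> dropLast(List<String> ngram) {
    return ngram.subList(0, ngram.size() - 1);
  }

  // Keeps the last n-1 tokens, i.e. drops the first word.
  static List<String> dropFirst(List<String> ngram) {
    return ngram.subList(1, ngram.size());
  }
}
```

        For the trigram "a b c", dropLast gives "a b" and dropFirst gives "b c".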
      • getNGrams

        public static java.util.Collection<StringList> getNGrams(StringList sequence,
                                                                 int size)
        Get the ngrams of the given size from an input sequence of tokens.
        Parameters:
        sequence - a sequence of tokens
        size - the size of the resulting ngrams
        Returns:
        all the possible ngrams of the given size derivable from the input sequence
      • getNGrams

        public static java.util.Collection<java.lang.String[]> getNGrams(java.lang.String[] sequence,
                                                                         int size)
        Get the ngrams of the given size from an input sequence of tokens.
        Parameters:
        sequence - a sequence of tokens
        size - the size of the resulting ngrams
        Returns:
        all the possible ngrams of the given size derivable from the input sequence
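        Extracting every ngram of a given size is typically done with a sliding window over the tokens. The sketch below illustrates that idea on a String[]; SlidingWindowSketch and its method are hypothetical. How the library treats a size larger than the sequence is not stated here, so returning an empty list in that case is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; names are illustrative and not part of NGramUtils.
class SlidingWindowSketch {

  // Slides a window of `size` tokens over the sequence, one position at a time.
  // Returns an empty list when size is not positive or exceeds the sequence length.
  static List<String[]> ngrams(String[] tokens, int size) {
    List<String[]> result = new ArrayList<>();
    if (size <= 0) {
      return result;
    }
    for (int start = 0; start + size <= tokens.length; start++) {
      result.add(Arrays.copyOfRange(tokens, start, start + size));
    }
    return result;
  }
}
```

        A sequence of four tokens "a b c d" with size 2 yields three bigrams: "a b", "b c", and "c d". In general a sequence of m tokens yields m - size + 1 ngrams when size is between 1 and m.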