Package opennlp.tools.util
Class StringUtil
java.lang.Object
    opennlp.tools.util.StringUtil
-
public class StringUtil extends java.lang.Object
-
Constructor Summary
Constructor | Description
StringUtil() |
-
Method Summary
Modifier and Type | Method | Description
static void | computeShortestEditScript(java.lang.String wordForm, java.lang.String lemma, int[][] distance, java.lang.StringBuffer permutations) | Computes the Shortest Edit Script (SES) to convert a word into its lemma.
static java.lang.String | decodeShortestEditScript(java.lang.String wordForm, java.lang.String permutations) | Reads the SES predicted by the lemmatizer model and applies the permutations to obtain the lemma from the wordForm.
static java.lang.String | getShortestEditScript(java.lang.String wordForm, java.lang.String lemma) | Gets the SES required to go from a word to a lemma.
static boolean | isEmpty(java.lang.CharSequence theString) | Returns true if the CharSequence is null or its CharSequence.length() is 0.
static boolean | isWhitespace(char charCode) | Determines whether the specified character is whitespace.
static boolean | isWhitespace(int charCode) | Determines whether the specified character is whitespace.
static int[][] | levenshteinDistance(java.lang.String wordForm, java.lang.String lemma) | Computes the Levenshtein distance of two strings in a matrix.
static java.lang.String | toLowerCase(java.lang.CharSequence string) | Converts to lower case independent of the current locale via Character.toLowerCase(int), which uses mapping information from the UnicodeData file.
static java.lang.String | toUpperCase(java.lang.CharSequence string) | Converts to upper case independent of the current locale via Character.toUpperCase(char), which uses mapping information from the UnicodeData file.
-
Method Detail
-
isWhitespace
public static boolean isWhitespace(char charCode)
Determines whether the specified character is whitespace. A character is considered whitespace when one of the following conditions is met:
- It is whitespace according to Character.isWhitespace(int).
- It is part of the Unicode Zs category (Character.SPACE_SEPARATOR).
Character.isWhitespace(int) does not include no-break spaces. In OpenNLP, no-break spaces are also considered whitespace.
Parameters:
charCode - the character to test
Returns:
true if the character is whitespace, otherwise false
-
isWhitespace
public static boolean isWhitespace(int charCode)
Determines whether the specified character is whitespace. A character is considered whitespace when one of the following conditions is met:
- It is whitespace according to Character.isWhitespace(int).
- It is part of the Unicode Zs category (Character.SPACE_SEPARATOR).
Character.isWhitespace(int) does not include no-break spaces. In OpenNLP, no-break spaces are also considered whitespace.
Parameters:
charCode - the character to test
Returns:
true if the character is whitespace, otherwise false
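The two conditions listed for both isWhitespace overloads can be illustrated with plain JDK calls. The following is a minimal sketch, not the OpenNLP source: the class name WhitespaceSketch and its helper are hypothetical, and the combination of the two checks is an assumption based only on the description above.

```java
// Hypothetical sketch; it mirrors the documented conditions, not the actual OpenNLP code.
final class WhitespaceSketch {

    // Assumed combination of the two documented conditions:
    // JDK whitespace OR membership in the Unicode Zs category.
    static boolean isWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint)
                || Character.getType(codePoint) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        int noBreakSpace = 0x00A0; // U+00A0 NO-BREAK SPACE, category Zs

        // The JDK check alone excludes no-break spaces.
        System.out.println(Character.isWhitespace(noBreakSpace)); // prints false

        // The combined check accepts them.
        System.out.println(isWhitespace(noBreakSpace)); // prints true
    }
}
```

The second condition is what brings U+00A0 and the other non-breaking Zs characters into scope, since Character.isWhitespace(int) explicitly excludes them.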
-
toLowerCase
public static java.lang.String toLowerCase(java.lang.CharSequence string)
Converts to lower case independent of the current locale via Character.toLowerCase(int), which uses mapping information from the UnicodeData file.
Parameters:
string - the string to convert
Returns:
the lower-cased String
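To show why locale independence matters, the sketch below contrasts a per-character conversion with the locale-sensitive String.toLowerCase(Locale). It is an illustration under stated assumptions: the class CaseSketch and its helper are hypothetical, and the helper works on char values only, so it does not handle supplementary code points the way the real method may.

```java
import java.util.Locale;

// Hypothetical sketch; not the OpenNLP implementation.
final class CaseSketch {

    // Lower-cases each char with the locale-independent Unicode mapping.
    // Assumption: input contains no supplementary (surrogate-pair) characters.
    static String toLowerCase(CharSequence text) {
        StringBuilder result = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            result.append(Character.toLowerCase(text.charAt(i)));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String word = "TITLE";
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale-independent mapping: 'I' becomes 'i'.
        System.out.println(toLowerCase(word));

        // Turkish rules map 'I' to dotless 'ı' (U+0131).
        System.out.println(word.toLowerCase(turkish));
    }
}
```

A conversion that ignores the locale gives the same result on every machine, which is usually what is wanted when normalizing tokens for a model.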
-
toUpperCase
public static java.lang.String toUpperCase(java.lang.CharSequence string)
Converts to upper case independent of the current locale via Character.toUpperCase(char), which uses mapping information from the UnicodeData file.
Parameters:
string - the string to convert
Returns:
the upper-cased String
-
isEmpty
public static boolean isEmpty(java.lang.CharSequence theString)
Returns true if the CharSequence is null or its CharSequence.length() is 0.
Returns:
true if the CharSequence is null or its CharSequence.length() is 0, otherwise false
Since:
1.5.1
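The contract described for isEmpty can be sketched in a few lines. The class IsEmptySketch below is hypothetical and only restates the documented behaviour; note that under this contract a string consisting solely of whitespace is not empty.

```java
// Hypothetical sketch of the documented contract; not the OpenNLP source.
final class IsEmptySketch {

    // True for null and for zero-length sequences, false otherwise.
    static boolean isEmpty(CharSequence text) {
        return text == null || text.length() == 0;
    }

    public static void main(String[] args) {
        System.out.println(isEmpty(null)); // prints true
        System.out.println(isEmpty(""));   // prints true
        System.out.println(isEmpty(" "));  // prints false: whitespace still has length 1
    }
}
```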
-
levenshteinDistance
public static int[][] levenshteinDistance(java.lang.String wordForm, java.lang.String lemma)
Computes the Levenshtein distance of two strings in a matrix. Based on the pseudo-code provided at https://en.wikipedia.org/wiki/Levenshtein_distance#Computing_Levenshtein_distance, which in turn is based on the paper: Wagner, Robert A.; Fischer, Michael J. (1974), "The String-to-String Correction Problem", Journal of the ACM 21 (1): 168-173.
Parameters:
wordForm - the form
lemma - the lemma
Returns:
the distance matrix
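The Wagner-Fischer dynamic program referenced above fills a matrix in which cell [i][j] holds the edit distance between the first i characters of one string and the first j characters of the other. The following is a minimal sketch of that textbook algorithm; the class LevenshteinSketch and its method names are hypothetical, and the orientation and dimensions of the matrix returned by the OpenNLP method may differ.

```java
// Hypothetical sketch of the Wagner-Fischer algorithm; not the OpenNLP source.
final class LevenshteinSketch {

    // Returns a (source.length()+1) x (target.length()+1) matrix of prefix distances.
    static int[][] distanceMatrix(String source, String target) {
        int[][] d = new int[source.length() + 1][target.length() + 1];

        // Turning a prefix into the empty string costs one deletion per character.
        for (int i = 0; i <= source.length(); i++) {
            d[i][0] = i;
        }
        // Building a prefix from the empty string costs one insertion per character.
        for (int j = 0; j <= target.length(); j++) {
            d[0][j] = j;
        }

        for (int i = 1; i <= source.length(); i++) {
            for (int j = 1; j <= target.length(); j++) {
                int substitution = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int replace = d[i - 1][j - 1] + substitution;
                d[i][j] = Math.min(Math.min(deletion, insertion), replace);
            }
        }
        return d;
    }

    // The overall distance is the bottom-right cell of the matrix.
    static int distance(String source, String target) {
        return distanceMatrix(source, target)[source.length()][target.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("walked", "walk"));    // prints 2
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

Keeping the whole matrix, rather than only the final distance, is what allows an edit script to be recovered later by tracing back from the bottom-right cell.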
-
computeShortestEditScript
public static void computeShortestEditScript(java.lang.String wordForm, java.lang.String lemma, int[][] distance, java.lang.StringBuffer permutations)
Computes the Shortest Edit Script (SES) to convert a word into its lemma. This is based on Chrupala's PhD thesis (2008).
Parameters:
wordForm - the token
lemma - the target lemma
distance - the Levenshtein distance
permutations - the number of permutations
-
decodeShortestEditScript
public static java.lang.String decodeShortestEditScript(java.lang.String wordForm, java.lang.String permutations)
Reads the SES predicted by the lemmatizer model and applies the permutations to obtain the lemma from the wordForm.
Parameters:
wordForm - the wordForm
permutations - the permutations predicted by the lemmatizer model
Returns:
the lemma
-
getShortestEditScript
public static java.lang.String getShortestEditScript(java.lang.String wordForm, java.lang.String lemma)
Gets the SES required to go from a word to a lemma.
Parameters:
wordForm - the word
lemma - the lemma
Returns:
the shortest edit script
-