Package org.apache.lucene.analysis.util
Class CharacterUtils
java.lang.Object
    org.apache.lucene.analysis.util.CharacterUtils
public abstract class CharacterUtils extends java.lang.Object
CharacterUtils provides a unified interface to Character-related operations to implement backwards compatible character operations based on a Version instance.
-
-
Nested Class Summary
Modifier and Type, Class, Description
static class CharacterUtils.CharacterBuffer
A simple IO buffer to use with fill(CharacterBuffer, Reader).
-
Constructor Summary
Constructor, Description
CharacterUtils()
-
Method Summary
Modifier and Type, Method, Description
abstract int codePointAt(char[] chars, int offset, int limit)
Returns the code point at the given index of the char array where only elements with index less than the limit are used.
abstract int codePointAt(java.lang.CharSequence seq, int offset)
Returns the code point at the given index of the CharSequence.
abstract int codePointCount(java.lang.CharSequence seq)
Returns the number of code points in seq.
boolean fill(CharacterUtils.CharacterBuffer buffer, java.io.Reader reader)
Convenience method which calls fill(buffer, reader, buffer.buffer.length).
abstract boolean fill(CharacterUtils.CharacterBuffer buffer, java.io.Reader reader, int numChars)
Fills the CharacterUtils.CharacterBuffer with characters read from the given Reader.
static CharacterUtils getInstance(Version matchVersion)
Returns a CharacterUtils implementation according to the given Version instance.
static CharacterUtils getJava4Instance()
Returns a CharacterUtils instance compatible with Java 1.4.
static CharacterUtils.CharacterBuffer newCharacterBuffer(int bufferSize)
Creates a new CharacterUtils.CharacterBuffer and allocates a char[] of the given bufferSize.
abstract int offsetByCodePoints(char[] buf, int start, int count, int index, int offset)
Returns the index within buf[start:start+count] that is offset code points away from index.
int toChars(int[] src, int srcOff, int srcLen, char[] dest, int destOff)
Converts a sequence of Unicode code points to a sequence of Java characters.
int toCodePoints(char[] src, int srcOff, int srcLen, int[] dest, int destOff)
Converts a sequence of Java characters to a sequence of Unicode code points.
void toLowerCase(char[] buffer, int offset, int limit)
Converts each Unicode code point to lowercase via Character.toLowerCase(int), starting at the given offset.
void toUpperCase(char[] buffer, int offset, int limit)
Converts each Unicode code point to uppercase via Character.toUpperCase(int), starting at the given offset.
-
-
-
Method Detail
-
getInstance
public static CharacterUtils getInstance(Version matchVersion)
Returns a CharacterUtils implementation according to the given Version instance.
Parameters:
matchVersion - a version instance
Returns:
a CharacterUtils implementation according to the given Version instance.
-
getJava4Instance
public static CharacterUtils getJava4Instance()
Returns a CharacterUtils instance compatible with Java 1.4.
-
codePointAt
public abstract int codePointAt(java.lang.CharSequence seq, int offset)
Returns the code point at the given index of the CharSequence. Depending on the Version passed to getInstance(Version), this method mimics the behavior of Character.codePointAt(char[], int) as it would have been available on a Java 1.4 JVM or on a later virtual machine version.
Parameters:
seq - a character sequence
offset - the offset of the char value in the sequence to be converted
Returns:
the Unicode code point at the given index
Throws:
java.lang.NullPointerException - if the sequence is null.
java.lang.IndexOutOfBoundsException - if the value offset is negative or not less than the length of the character sequence.
-
codePointAt
public abstract int codePointAt(char[] chars, int offset, int limit)
Returns the code point at the given index of the char array where only elements with index less than the limit are used. Depending on the Version passed to getInstance(Version), this method mimics the behavior of Character.codePointAt(char[], int) as it would have been available on a Java 1.4 JVM or on a later virtual machine version.
Parameters:
chars - a character array
offset - the offset to the char values in the chars array to be converted
limit - the index after the last element that should be used to calculate the code point.
Returns:
the Unicode code point at the given index
Throws:
java.lang.NullPointerException - if the array is null.
java.lang.IndexOutOfBoundsException - if the value offset is negative or not less than the length of the char array.
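The difference between the two behaviors described above can be illustrated with a minimal sketch that uses java.lang.Character directly rather than CharacterUtils. The class and method names are hypothetical, and the Java 1.4 style variant is an assumption: it simply treats every char as its own code point.

```java
final class CodePointAtSketch {
    // Surrogate-aware lookup: delegates to the JDK method that honors a limit.
    static int codePointAtAware(char[] chars, int offset, int limit) {
        return Character.codePointAt(chars, offset, limit);
    }

    // Java 1.4 style lookup (assumed behavior): every char is its own code point.
    static int codePointAtLegacy(char[] chars, int offset, int limit) {
        if (offset < 0 || offset >= limit) {
            throw new IndexOutOfBoundsException("offset must be >= 0 and less than limit");
        }
        return chars[offset];
    }

    public static void main(String[] args) {
        // U+1D11E is encoded as the surrogate pair D834 DD1E.
        char[] chars = "a\uD834\uDD1E".toCharArray();
        System.out.println(Integer.toHexString(codePointAtAware(chars, 1, chars.length)));  // 1d11e
        System.out.println(Integer.toHexString(codePointAtAware(chars, 1, 2)));             // d834
        System.out.println(Integer.toHexString(codePointAtLegacy(chars, 1, chars.length))); // d834
    }
}
```

The second call shows the effect of the limit: the low surrogate at index 2 lies outside the usable range, so only the high surrogate is returned.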
-
codePointCount
public abstract int codePointCount(java.lang.CharSequence seq)
Returns the number of code points in seq.
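As a rough illustration of why the count depends on the instance, the sketch below compares a surrogate-aware count with a char-based count using only the JDK. The names are hypothetical, and the char-based variant is an assumption about how a Java 1.4 compatible instance would count.

```java
final class CodePointCountSketch {
    // Surrogate-aware count: a valid surrogate pair counts once.
    static int countAware(CharSequence seq) {
        return Character.codePointCount(seq, 0, seq.length());
    }

    // Java 1.4 style count (assumed behavior): every char counts as one code point.
    static int countLegacy(CharSequence seq) {
        return seq.length();
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 'a', U+1D11E, 'b'
        System.out.println(countAware(s));  // 3
        System.out.println(countLegacy(s)); // 4
    }
}
```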
-
newCharacterBuffer
public static CharacterUtils.CharacterBuffer newCharacterBuffer(int bufferSize)
Creates a new CharacterUtils.CharacterBuffer and allocates a char[] of the given bufferSize.
Parameters:
bufferSize - the internal char buffer size, must be >= 2
Returns:
a new CharacterUtils.CharacterBuffer instance.
-
toLowerCase
public final void toLowerCase(char[] buffer, int offset, int limit)
Converts each Unicode code point to lowercase via Character.toLowerCase(int), starting at the given offset.
Parameters:
buffer - the char buffer to lowercase
offset - the offset to start at
limit - the end index (exclusive) of the chars in the buffer to lowercase
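A code point wise case conversion over a char[] can be sketched as follows. This is not the library's implementation; it uses java.lang.Character directly, the class name is hypothetical, and it assumes that limit is exclusive and that the lowercase mapping occupies the same number of chars as the original code point.

```java
final class LowerCaseSketch {
    // Lowercases buffer[offset, limit) in place, one code point at a time.
    static void toLowerCase(char[] buffer, int offset, int limit) {
        for (int i = offset; i < limit; ) {
            int cp = Character.codePointAt(buffer, i, limit);
            // Character.toChars writes the mapped code point and returns 1 or 2.
            i += Character.toChars(Character.toLowerCase(cp), buffer, i);
        }
    }

    public static void main(String[] args) {
        // U+10400 (pair D801 DC00) lowercases to U+10428 (pair D801 DC28).
        char[] buf = "AB\uD801\uDC00".toCharArray();
        toLowerCase(buf, 1, buf.length);            // index 0 is left untouched
        System.out.println(buf[0]);                 // A
        System.out.println(buf[1]);                 // b
        System.out.println(Integer.toHexString(Character.codePointAt(buf, 2))); // 10428
    }
}
```

An uppercase variant would follow the same loop with Character.toUpperCase(int).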
-
toUpperCase
public final void toUpperCase(char[] buffer, int offset, int limit)
Converts each Unicode code point to uppercase via Character.toUpperCase(int), starting at the given offset.
Parameters:
buffer - the char buffer to uppercase
offset - the offset to start at
limit - the end index (exclusive) of the chars in the buffer to uppercase
-
toCodePoints
public final int toCodePoints(char[] src, int srcOff, int srcLen, int[] dest, int destOff)
Converts a sequence of Java characters to a sequence of Unicode code points.
Returns:
the number of code points written to the destination buffer
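The conversion can be sketched with plain JDK calls as below. This is an illustrative approximation, not the library's code; the class name is hypothetical and the lookup is always surrogate-aware, whereas the real method depends on the instance in use.

```java
final class ToCodePointsSketch {
    // Decodes src[srcOff, srcOff + srcLen) into dest, starting at destOff.
    static int toCodePoints(char[] src, int srcOff, int srcLen, int[] dest, int destOff) {
        if (srcLen < 0) {
            throw new IllegalArgumentException("srcLen must be >= 0");
        }
        int written = 0;
        int end = srcOff + srcLen;
        for (int i = srcOff; i < end; ) {
            int cp = Character.codePointAt(src, i, end);
            dest[destOff + written++] = cp;
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return written;
    }

    public static void main(String[] args) {
        char[] src = "a\uD834\uDD1Eb".toCharArray();
        int[] dest = new int[src.length]; // never more code points than chars
        int n = toCodePoints(src, 0, src.length, dest, 0);
        System.out.println(n);                            // 3
        System.out.println(Integer.toHexString(dest[1])); // 1d11e
    }
}
```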
-
toChars
public final int toChars(int[] src, int srcOff, int srcLen, char[] dest, int destOff)
Converts a sequence of Unicode code points to a sequence of Java characters.
Returns:
the number of chars written to the destination buffer
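The reverse direction can be sketched the same way. The class name is hypothetical, and the example sizes the destination at two chars per code point as a safe upper bound; that sizing is an assumption of this sketch rather than a documented requirement.

```java
final class ToCharsSketch {
    // Encodes src[srcOff, srcOff + srcLen) as UTF-16 chars into dest, starting at destOff.
    static int toChars(int[] src, int srcOff, int srcLen, char[] dest, int destOff) {
        if (srcLen < 0) {
            throw new IllegalArgumentException("srcLen must be >= 0");
        }
        int written = 0;
        for (int i = 0; i < srcLen; i++) {
            // Character.toChars returns how many chars (1 or 2) it wrote.
            written += Character.toChars(src[srcOff + i], dest, destOff + written);
        }
        return written;
    }

    public static void main(String[] args) {
        int[] src = {'a', 0x1D11E, 'b'};
        char[] dest = new char[src.length * 2];
        int n = toChars(src, 0, src.length, dest, 0);
        System.out.println(n); // 4
        System.out.println(new String(dest, 0, n).equals("a\uD834\uDD1Eb")); // true
    }
}
```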
-
fill
public abstract boolean fill(CharacterUtils.CharacterBuffer buffer, java.io.Reader reader, int numChars) throws java.io.IOException
Fills the CharacterUtils.CharacterBuffer with characters read from the given Reader. This method tries to read numChars characters into the CharacterUtils.CharacterBuffer; each call to fill starts filling the buffer from offset 0 up to numChars. When code points can span two Java characters, this method may fill only numChars - 1 characters in order not to split a surrogate pair, even if there are remaining characters in the Reader.
Depending on the Version passed to getInstance(Version), this method implements supplementary character awareness when filling the given buffer. For all Version > 3.0, fill(CharacterBuffer, Reader, int) guarantees that the given CharacterUtils.CharacterBuffer will never contain a high surrogate character as the last element in the buffer unless it is the last available character in the reader. In other words, high and low surrogate pairs will always be preserved across buffer borders.
A return value of false means that this method call exhausted the reader, but some chars may still have been read, which can be verified by checking whether buffer.getLength() > 0.
Parameters:
buffer - the buffer to fill.
reader - the reader to read characters from.
numChars - the number of chars to read
Returns:
false if and only if reader.read returned -1 while trying to fill the buffer
Throws:
java.io.IOException - if the reader throws an IOException.
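The surrogate-preserving behavior described above can be approximated with the following sketch. It is not the library's implementation: Buf is a hypothetical stand-in for CharacterUtils.CharacterBuffer, and holding back a trailing high surrogate in a pendingHigh field is an assumption made for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class FillSketch {
    // Hypothetical stand-in for CharacterUtils.CharacterBuffer.
    static final class Buf {
        final char[] chars;
        int length;
        char pendingHigh; // high surrogate held back from the previous fill, or 0

        Buf(int size) { chars = new char[size]; }
    }

    // Returns false once the reader is exhausted; buf.length may still be > 0.
    static boolean fill(Buf buf, Reader reader, int numChars) throws IOException {
        if (numChars < 2 || numChars > buf.chars.length) {
            throw new IllegalArgumentException("numChars must be >= 2 and <= the buffer size");
        }
        int pos = 0;
        if (buf.pendingHigh != 0) { // re-install the held-back high surrogate
            buf.chars[pos++] = buf.pendingHigh;
            buf.pendingHigh = 0;
        }
        boolean exhausted = false;
        while (pos < numChars) { // Reader.read may return fewer chars than requested
            int n = reader.read(buf.chars, pos, numChars - pos);
            if (n == -1) { exhausted = true; break; }
            pos += n;
        }
        buf.length = pos;
        if (!exhausted && Character.isHighSurrogate(buf.chars[pos - 1])) {
            buf.pendingHigh = buf.chars[--buf.length]; // keep the pair together for the next call
        }
        return !exhausted;
    }

    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader("ab\uD834\uDD1Ec");
        Buf buf = new Buf(3);
        System.out.println(fill(buf, reader, 3) + " " + buf.length); // true 2
        System.out.println(fill(buf, reader, 3) + " " + buf.length); // true 3
        System.out.println(fill(buf, reader, 3) + " " + buf.length); // false 0
    }
}
```

The first call reads three chars but reports a length of 2, because the third char is a high surrogate whose partner has not been read yet; the pair then arrives intact in the second call.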
-
fill
public final boolean fill(CharacterUtils.CharacterBuffer buffer, java.io.Reader reader) throws java.io.IOException
Convenience method which calls fill(buffer, reader, buffer.buffer.length).
Throws:
java.io.IOException
offsetByCodePoints
public abstract int offsetByCodePoints(char[] buf, int start, int count, int index, int offset)
Returns the index within buf[start:start+count] that is offset code points away from index.
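A small sketch, using the JDK method with the same parameter order, shows how the result differs once surrogate pairs are involved. The names are hypothetical, and the Java 1.4 style variant is an assumption in which one char equals one code point.

```java
final class OffsetSketch {
    // Surrogate-aware: steps over a surrogate pair as a single code point.
    static int offsetAware(char[] buf, int start, int count, int index, int offset) {
        return Character.offsetByCodePoints(buf, start, count, index, offset);
    }

    // Java 1.4 style (assumed behavior): one char per code point.
    static int offsetLegacy(char[] buf, int start, int count, int index, int offset) {
        int result = index + offset;
        if (result < start || result > start + count) {
            throw new IndexOutOfBoundsException("result outside buf[start:start+count]");
        }
        return result;
    }

    public static void main(String[] args) {
        char[] buf = "a\uD834\uDD1Eb".toCharArray(); // 'a', U+1D11E, 'b'
        System.out.println(offsetAware(buf, 0, buf.length, 0, 2));  // 3
        System.out.println(offsetLegacy(buf, 0, buf.length, 0, 2)); // 2
    }
}
```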
-
-