Class UCharacter
- java.lang.Object
  - com.adobe.agl.lang.UCharacter
- All Implemented Interfaces:
UCharacterEnums.ECharacterCategory, UCharacterEnums.ECharacterDirection
public final class UCharacter extends java.lang.Object implements UCharacterEnums.ECharacterCategory, UCharacterEnums.ECharacterDirection
The UCharacter class provides extensions to the java.lang.Character class. These extensions provide support for more Unicode properties and, together with the UTF16 class, for supplementary characters (those with code points above U+FFFF). Each ICU release supports the latest version of Unicode available at that time.
Code points are represented in this API as ints. While it would be more convenient in Java to have a separate primitive datatype for them, ints suffice in the meantime.
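The int representation of code points can be illustrated with a minimal sketch. It uses only standard JDK methods (String.codePointAt, Character.charCount, Character.toChars) rather than UCharacter or UTF16, and the class name CodePointDemo is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointDemo {
    /** Collects the code points of s, reading each surrogate pair as one int. */
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // full code point, even above U+FFFF
            result.add(cp);
            i += Character.charCount(cp);   // advance by 1 or 2 UTF-16 code units
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D11E, a supplementary character
        String s = "a" + new String(Character.toChars(0x1D11E));
        System.out.println("UTF-16 code units: " + s.length());
        for (int cp : codePoints(s)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

The string holds three UTF-16 code units but only two code points, which is why a char cannot stand in for a code point above U+FFFF.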
To use this class, add the jar file icu4j.jar to the class path, since it contains the data files that supply the information used by this class.
For example, on Windows:
set CLASSPATH=%CLASSPATH%;$JAR_FILE_PATH/ucharacter.jar
Alternatively, copy the files uprops.dat and unames.icu from the icu4j source subdirectory $ICU4J_SRC/src/com.adobe.agl.impl.data to your class directory $ICU4J_CLASS/com.adobe.agl.impl.data.
Aside from the additions for UTF-16 support and the updated Unicode properties, the main differences between UCharacter and Character are:
- UCharacter is not designed to be a char wrapper and does not have APIs that involve management of a single char. These include:
  - char charValue(),
  - int compareTo(java.lang.Character, java.lang.Character), etc.
- UCharacter does not include Character APIs that are deprecated, nor does it include the Java-specific character information, such as boolean isJavaIdentifierPart(char ch).
- Character maps characters 'A' - 'Z' and 'a' - 'z' to the numeric values '10' - '35'. UCharacter also does this in digit and getNumericValue, to adhere to the Java semantics of these methods. The new methods unicodeDigit and getUnicodeNumericValue do not treat the above code points as having numeric values. This is a semantic change from ICU4J 1.3.1.
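The Java letter-to-number mapping described in the last item can be observed directly with java.lang.Character. The sketch below calls only JDK methods. It does not call unicodeDigit or getUnicodeNumericValue, and the class name LetterValueDemo is hypothetical.

```java
final class LetterValueDemo {
    /** Returns {Character.digit(ch, 36), Character.getNumericValue(ch)}. */
    static int[] letterValues(char ch) {
        return new int[] { Character.digit(ch, 36), Character.getNumericValue(ch) };
    }

    public static void main(String[] args) {
        for (char ch : new char[] { 'A', 'z', '7' }) {
            int[] v = letterValues(ch);
            System.out.printf("'%c': digit(radix 36)=%d, getNumericValue=%d%n", ch, v[0], v[1]);
        }
    }
}
```

Under Java semantics the Latin letters carry the values 10 through 35 in both methods, which is the behavior the doc says unicodeDigit and getUnicodeNumericValue deliberately omit.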
Further detailed differences can be determined using the program com.adobe.agl.dev.test.lang.UCharacterCompare.
In addition to Java compatibility functions, which calculate derived properties, this API provides low-level access to the Unicode Character Database.
Unicode assigns each code point (not just assigned characters) values for many properties. Most of them are simple boolean flags, or constants from a small enumerated list. For some properties, values are strings or other more complex types.
For more information see "About the Unicode Character Database" (http://www.unicode.org/ucd/) and the ICU User Guide chapter on Properties (http://www.icu-project.org/userguide/properties.html).
There are also functions that provide easy migration from C/POSIX functions like isblank(). Their use is generally discouraged because the C/POSIX standards do not define their semantics beyond the ASCII range, which means that different implementations exhibit very different behavior. Instead, Unicode properties should be used directly.
There are also only a few, broad C/POSIX character classes, and they tend to be used for conflicting purposes. For example, the "isalpha()" class is sometimes used to determine word boundaries, while a more sophisticated approach would at least distinguish initial letters from continuation characters (the latter including combining marks). (In ICU, BreakIterator is the most sophisticated API for word boundaries.) Another example: There is no "istitle()" class for titlecase characters.
ICU 3.4 and later provides API access for all twelve C/POSIX character classes. ICU implements them according to the Standard Recommendations in Annex C: Compatibility Properties of UTS #18 Unicode Regular Expressions (http://www.unicode.org/reports/tr18/#Compatibility_Properties).
API access for C/POSIX character classes is as follows:
- alpha: isUAlphabetic(c) or hasBinaryProperty(c, UProperty.ALPHABETIC)
- lower: isULowercase(c) or hasBinaryProperty(c, UProperty.LOWERCASE)
- upper: isUUppercase(c) or hasBinaryProperty(c, UProperty.UPPERCASE)
- punct: ((1<
The C/POSIX character classes are also available in UnicodeSet patterns, using patterns like [:graph:] or \p{graph}.
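The difference between ASCII-only POSIX classes and the UTS #18 Annex C definitions can be seen without ICU. The sketch below uses java.util.regex, whose Pattern.UNICODE_CHARACTER_CLASS flag is documented to follow the same annex. It is an analogy built on the JDK, not the UnicodeSet API, and the class name PosixClassDemo is hypothetical.

```java
import java.util.regex.Pattern;

final class PosixClassDemo {
    // Default mode: \p{Alpha} covers ASCII letters only.
    private static final Pattern ASCII_ALPHA = Pattern.compile("\\p{Alpha}");
    // Unicode mode: \p{Alpha} follows the Alphabetic property (UTS #18 Annex C).
    private static final Pattern UNICODE_ALPHA =
            Pattern.compile("\\p{Alpha}", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean isAlphaAscii(String s) {
        return ASCII_ALPHA.matcher(s).matches();
    }

    static boolean isAlphaUnicode(String s) {
        return UNICODE_ALPHA.matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println("ASCII mode:   " + isAlphaAscii(eAcute));
        System.out.println("Unicode mode: " + isAlphaUnicode(eAcute));
    }
}
```

The same class name gives different answers outside ASCII depending on which definition is in force, which is the portability problem noted above for the C/POSIX functions.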
Note: There are several ICU (and Java) whitespace functions. Comparison:
- isUWhiteSpace=UCHAR_WHITE_SPACE: Unicode White_Space property; most of general categories "Z" (separators) + most whitespace ISO controls (including no-break spaces, but excluding IS1..IS4 and ZWSP)
- isWhitespace: Java isWhitespace; Z + whitespace ISO controls but excluding no-break spaces
- isSpaceChar: just Z (including no-break spaces)
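The two Java-compatible definitions in this comparison can be checked against java.lang.Character, which defines the semantics they follow. This sketch uses only JDK methods and does not cover isUWhiteSpace. The class name WhitespaceDemo is hypothetical.

```java
final class WhitespaceDemo {
    /** Returns {Character.isWhitespace(cp), Character.isSpaceChar(cp)}. */
    static boolean[] classify(int cp) {
        return new boolean[] { Character.isWhitespace(cp), Character.isSpaceChar(cp) };
    }

    public static void main(String[] args) {
        // SPACE, NO-BREAK SPACE, TAB, IS1 (UNIT SEPARATOR), LINE SEPARATOR
        int[] samples = { 0x0020, 0x00A0, 0x0009, 0x001F, 0x2028 };
        for (int cp : samples) {
            boolean[] r = classify(cp);
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n", cp, r[0], r[1]);
        }
    }
}
```

NO-BREAK SPACE is a space character but not Java whitespace, while TAB and the IS1 control are Java whitespace but not space characters.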
This class is not subclassable.
- See Also:
UCharacterEnums
Field Summary
Fields
- static int MAX_VALUE: The highest Unicode code point value (scalar value) according to the Unicode Standard.
- static int MIN_VALUE: The lowest Unicode code point value.
Fields inherited from interface com.adobe.agl.lang.UCharacterEnums.ECharacterCategory
CHAR_CATEGORY_COUNT, COMBINING_SPACING_MARK, CONNECTOR_PUNCTUATION, CONTROL, CURRENCY_SYMBOL, DASH_PUNCTUATION, DECIMAL_DIGIT_NUMBER, ENCLOSING_MARK, END_PUNCTUATION, FINAL_PUNCTUATION, FINAL_QUOTE_PUNCTUATION, FORMAT, GENERAL_OTHER_TYPES, INITIAL_PUNCTUATION, INITIAL_QUOTE_PUNCTUATION, LETTER_NUMBER, LINE_SEPARATOR, LOWERCASE_LETTER, MATH_SYMBOL, MODIFIER_LETTER, MODIFIER_SYMBOL, NON_SPACING_MARK, OTHER_LETTER, OTHER_NUMBER, OTHER_PUNCTUATION, OTHER_SYMBOL, PARAGRAPH_SEPARATOR, PRIVATE_USE, SPACE_SEPARATOR, START_PUNCTUATION, SURROGATE, TITLECASE_LETTER, UNASSIGNED, UPPERCASE_LETTER
-
Fields inherited from interface com.adobe.agl.lang.UCharacterEnums.ECharacterDirection
ARABIC_NUMBER, BLOCK_SEPARATOR, BOUNDARY_NEUTRAL, CHAR_DIRECTION_COUNT, COMMON_NUMBER_SEPARATOR, DIR_NON_SPACING_MARK, DIRECTIONALITY_ARABIC_NUMBER, DIRECTIONALITY_BOUNDARY_NEUTRAL, DIRECTIONALITY_COMMON_NUMBER_SEPARATOR, DIRECTIONALITY_EUROPEAN_NUMBER, DIRECTIONALITY_EUROPEAN_NUMBER_SEPARATOR, DIRECTIONALITY_EUROPEAN_NUMBER_TERMINATOR, DIRECTIONALITY_LEFT_TO_RIGHT, DIRECTIONALITY_LEFT_TO_RIGHT_EMBEDDING, DIRECTIONALITY_LEFT_TO_RIGHT_OVERRIDE, DIRECTIONALITY_NONSPACING_MARK, DIRECTIONALITY_OTHER_NEUTRALS, DIRECTIONALITY_PARAGRAPH_SEPARATOR, DIRECTIONALITY_POP_DIRECTIONAL_FORMAT, DIRECTIONALITY_RIGHT_TO_LEFT, DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC, DIRECTIONALITY_RIGHT_TO_LEFT_EMBEDDING, DIRECTIONALITY_RIGHT_TO_LEFT_OVERRIDE, DIRECTIONALITY_SEGMENT_SEPARATOR, DIRECTIONALITY_UNDEFINED, DIRECTIONALITY_WHITESPACE, EUROPEAN_NUMBER, EUROPEAN_NUMBER_SEPARATOR, EUROPEAN_NUMBER_TERMINATOR, LEFT_TO_RIGHT, LEFT_TO_RIGHT_EMBEDDING, LEFT_TO_RIGHT_OVERRIDE, OTHER_NEUTRAL, POP_DIRECTIONAL_FORMAT, RIGHT_TO_LEFT, RIGHT_TO_LEFT_ARABIC, RIGHT_TO_LEFT_EMBEDDING, RIGHT_TO_LEFT_OVERRIDE, SEGMENT_SEPARATOR, WHITE_SPACE_NEUTRAL
Constructor Summary
Constructors
- UCharacter()
-
Method Summary
Methods
- static int getMirror(int ch)
Field Detail
-
MIN_VALUE
public static final int MIN_VALUE
The lowest Unicode code point value.
- See Also:
Constant Field Values
-
MAX_VALUE
public static final int MAX_VALUE
The highest Unicode code point value (scalar value) according to the Unicode Standard. This is a 21-bit value (21 bits, rounded up).
Up-to-date Unicode implementation of java.lang.Character.MAX_VALUE.
- See Also:
Constant Field Values
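The 21-bit figure can be verified with a short sketch. It assumes that MAX_VALUE equals the Unicode Standard's highest code point, U+10FFFF, and uses the JDK constant Character.MAX_CODE_POINT in place of UCharacter.MAX_VALUE so that it runs without the library. The class name MaxCodePointDemo is hypothetical.

```java
final class MaxCodePointDemo {
    /** Number of bits needed to represent a non-negative value. */
    static int bitsNeeded(int value) {
        return 32 - Integer.numberOfLeadingZeros(value);
    }

    public static void main(String[] args) {
        // Assumption: UCharacter.MAX_VALUE matches the JDK's Character.MAX_CODE_POINT.
        int max = Character.MAX_CODE_POINT;
        System.out.printf("U+%X needs %d bits%n", max, bitsNeeded(max));
        System.out.println("Valid beyond max? " + Character.isValidCodePoint(max + 1));
    }
}
```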