Class CodePageUtil
- java.lang.Object
  - org.apache.poi.util.CodePageUtil
public class CodePageUtil extends java.lang.Object
Utilities for working with Microsoft CodePages. Provides constants for understanding numeric codepages, along with utilities to translate these into Java Character Sets.
-
-
Field Summary
Modifier and Type | Field | Description
static int | CP_037 | Codepage 037, a special case
static int | CP_EUC_JP | Codepage for EUC-JP
static int | CP_EUC_KR | Codepage for EUC-KR
static int | CP_GB18030 | Codepage for GB18030
static int | CP_GB2312 | Codepage for GB2312
static int | CP_GBK | Codepage for GBK, aka MS936
static int | CP_ISO_2022_JP1 | Codepage for ISO-2022-JP
static int | CP_ISO_2022_JP2 | Another codepage for ISO-2022-JP
static int | CP_ISO_2022_JP3 | Yet another codepage for ISO-2022-JP
static int | CP_ISO_2022_KR | Codepage for ISO-2022-KR
static int | CP_ISO_8859_1 | Codepage for ISO-8859-1
static int | CP_ISO_8859_2 | Codepage for ISO-8859-2
static int | CP_ISO_8859_3 | Codepage for ISO-8859-3
static int | CP_ISO_8859_4 | Codepage for ISO-8859-4
static int | CP_ISO_8859_5 | Codepage for ISO-8859-5
static int | CP_ISO_8859_6 | Codepage for ISO-8859-6
static int | CP_ISO_8859_7 | Codepage for ISO-8859-7
static int | CP_ISO_8859_8 | Codepage for ISO-8859-8
static int | CP_ISO_8859_9 | Codepage for ISO-8859-9
static int | CP_JOHAB | Codepage for Johab
static int | CP_KOI8_R | Codepage for KOI8-R
static int | CP_MAC_ARABIC | Codepage for Macintosh Arabic (Java: MacArabic)
static int | CP_MAC_CENTRAL_EUROPE | Codepage for Macintosh Central Europe (Latin-2) (Java: MacCentralEurope)
static int | CP_MAC_CHINESE_SIMPLE | Codepage for Macintosh Chinese Simplified (Java: unknown - use EUC_CN, ISO2022_CN_GB, MS936 or cp935)
static int | CP_MAC_CHINESE_TRADITIONAL | Codepage for Macintosh Chinese Traditional (Java: unknown - use Big5, MS950, or cp937)
static int | CP_MAC_CROATIAN | Codepage for Macintosh Croatian (Java: MacCroatian)
static int | CP_MAC_CYRILLIC | Codepage for Macintosh Cyrillic (Java: MacCyrillic)
static int | CP_MAC_GREEK | Codepage for Macintosh Greek (Java: MacGreek)
static int | CP_MAC_HEBREW | Codepage for Macintosh Hebrew (Java: MacHebrew)
static int | CP_MAC_ICELAND | Codepage for Macintosh Iceland (Java: MacIceland)
static int | CP_MAC_JAPAN | Codepage for Macintosh Japan (Java: unknown - use SJIS, cp942 or cp943)
static int | CP_MAC_KOREAN | Codepage for Macintosh Korean (Java: unknown - use EUC_KR or cp949)
static int | CP_MAC_ROMAN | Codepage for Macintosh Roman (Java: MacRoman)
static int | CP_MAC_ROMAN_BIFF23 |
static int | CP_MAC_ROMANIA | Codepage for Macintosh Romanian (Java: MacRomania)
static int | CP_MAC_THAI | Codepage for Macintosh Thai (Java: MacThai)
static int | CP_MAC_TURKISH | Codepage for Macintosh Turkish (Java: MacTurkish)
static int | CP_MAC_UKRAINE | Codepage for Macintosh Ukrainian (Java: MacUkraine)
static int | CP_MS949 | Codepage for MS949
static int | CP_SJIS | Codepage for SJIS
static int | CP_UNICODE | Codepage for Unicode
static int | CP_US_ACSII | Codepage for US-ASCII
static int | CP_US_ASCII2 | Another codepage for US-ASCII
static int | CP_UTF16 | Codepage for UTF-16
static int | CP_UTF16_BE | Codepage for UTF-16 big-endian
static int | CP_UTF8 | Codepage for UTF-8
static int | CP_WINDOWS_1250 | Codepage for Windows 1250
static int | CP_WINDOWS_1251 | Codepage for Windows 1251
static int | CP_WINDOWS_1252 | Codepage for Windows 1252
static int | CP_WINDOWS_1252_BIFF23 |
static int | CP_WINDOWS_1253 | Codepage for Windows 1253
static int | CP_WINDOWS_1254 | Codepage for Windows 1254
static int | CP_WINDOWS_1255 | Codepage for Windows 1255
static int | CP_WINDOWS_1256 | Codepage for Windows 1256
static int | CP_WINDOWS_1257 | Codepage for Windows 1257
static int | CP_WINDOWS_1258 | Codepage for Windows 1258
static java.util.Set<java.nio.charset.Charset> | DOUBLE_BYTE_CHARSETS |
-
Constructor Summary
Constructor | Description
CodePageUtil() |
-
Method Summary
Modifier and Type | Method | Description
static java.lang.String | codepageToEncoding(int codepage) | Turns a codepage number into the equivalent character encoding's name (in Java NIO canonical naming format).
static java.lang.String | codepageToEncoding(int codepage, boolean javaLangFormat) | Turns a codepage number into the equivalent character encoding's name, in either Java NIO or Java Lang canonical naming.
static java.lang.String | cp950ToString(byte[] data, int offset, int lengthInBytes) | This tries to convert a little-endian (LE) byte array in cp950 (Microsoft's dialect of Big5) to a String.
static byte[] | getBytesInCodePage(java.lang.String string, int codepage) | Converts a string into bytes, in the equivalent character encoding to the supplied codepage number.
static java.lang.String | getStringFromCodePage(byte[] string, int codepage) | Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.
static java.lang.String | getStringFromCodePage(byte[] string, int offset, int length, int codepage) | Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.
-
-
-
Field Detail
-
DOUBLE_BYTE_CHARSETS
public static final java.util.Set<java.nio.charset.Charset> DOUBLE_BYTE_CHARSETS
-
CP_037
public static final int CP_037
Codepage 037, a special case
- See Also:
- Constant Field Values
-
CP_SJIS
public static final int CP_SJIS
Codepage for SJIS
- See Also:
- Constant Field Values
-
CP_GBK
public static final int CP_GBK
Codepage for GBK, aka MS936
- See Also:
- Constant Field Values
-
CP_MS949
public static final int CP_MS949
Codepage for MS949
- See Also:
- Constant Field Values
-
CP_UTF16
public static final int CP_UTF16
Codepage for UTF-16
- See Also:
- Constant Field Values
-
CP_UTF16_BE
public static final int CP_UTF16_BE
Codepage for UTF-16 big-endian
- See Also:
- Constant Field Values
-
CP_WINDOWS_1250
public static final int CP_WINDOWS_1250
Codepage for Windows 1250
- See Also:
- Constant Field Values
-
CP_WINDOWS_1251
public static final int CP_WINDOWS_1251
Codepage for Windows 1251
- See Also:
- Constant Field Values
-
CP_WINDOWS_1252
public static final int CP_WINDOWS_1252
Codepage for Windows 1252
- See Also:
- Constant Field Values
-
CP_WINDOWS_1252_BIFF23
public static final int CP_WINDOWS_1252_BIFF23
- See Also:
- Constant Field Values
-
CP_WINDOWS_1253
public static final int CP_WINDOWS_1253
Codepage for Windows 1253
- See Also:
- Constant Field Values
-
CP_WINDOWS_1254
public static final int CP_WINDOWS_1254
Codepage for Windows 1254
- See Also:
- Constant Field Values
-
CP_WINDOWS_1255
public static final int CP_WINDOWS_1255
Codepage for Windows 1255
- See Also:
- Constant Field Values
-
CP_WINDOWS_1256
public static final int CP_WINDOWS_1256
Codepage for Windows 1256
- See Also:
- Constant Field Values
-
CP_WINDOWS_1257
public static final int CP_WINDOWS_1257
Codepage for Windows 1257
- See Also:
- Constant Field Values
-
CP_WINDOWS_1258
public static final int CP_WINDOWS_1258
Codepage for Windows 1258
- See Also:
- Constant Field Values
-
CP_JOHAB
public static final int CP_JOHAB
Codepage for Johab
- See Also:
- Constant Field Values
-
CP_MAC_ROMAN
public static final int CP_MAC_ROMAN
Codepage for Macintosh Roman (Java: MacRoman)
- See Also:
- Constant Field Values
-
CP_MAC_ROMAN_BIFF23
public static final int CP_MAC_ROMAN_BIFF23
- See Also:
- Constant Field Values
-
CP_MAC_JAPAN
public static final int CP_MAC_JAPAN
Codepage for Macintosh Japan (Java: unknown - use SJIS, cp942 or cp943)
- See Also:
- Constant Field Values
-
CP_MAC_CHINESE_TRADITIONAL
public static final int CP_MAC_CHINESE_TRADITIONAL
Codepage for Macintosh Chinese Traditional (Java: unknown - use Big5, MS950, or cp937)
- See Also:
- Constant Field Values
-
CP_MAC_KOREAN
public static final int CP_MAC_KOREAN
Codepage for Macintosh Korean (Java: unknown - use EUC_KR or cp949)
- See Also:
- Constant Field Values
-
CP_MAC_ARABIC
public static final int CP_MAC_ARABIC
Codepage for Macintosh Arabic (Java: MacArabic)
- See Also:
- Constant Field Values
-
CP_MAC_HEBREW
public static final int CP_MAC_HEBREW
Codepage for Macintosh Hebrew (Java: MacHebrew)
- See Also:
- Constant Field Values
-
CP_MAC_GREEK
public static final int CP_MAC_GREEK
Codepage for Macintosh Greek (Java: MacGreek)
- See Also:
- Constant Field Values
-
CP_MAC_CYRILLIC
public static final int CP_MAC_CYRILLIC
Codepage for Macintosh Cyrillic (Java: MacCyrillic)
- See Also:
- Constant Field Values
-
CP_MAC_CHINESE_SIMPLE
public static final int CP_MAC_CHINESE_SIMPLE
Codepage for Macintosh Chinese Simplified (Java: unknown - use EUC_CN, ISO2022_CN_GB, MS936 or cp935)
- See Also:
- Constant Field Values
-
CP_MAC_ROMANIA
public static final int CP_MAC_ROMANIA
Codepage for Macintosh Romanian (Java: MacRomania)
- See Also:
- Constant Field Values
-
CP_MAC_UKRAINE
public static final int CP_MAC_UKRAINE
Codepage for Macintosh Ukrainian (Java: MacUkraine)
- See Also:
- Constant Field Values
-
CP_MAC_THAI
public static final int CP_MAC_THAI
Codepage for Macintosh Thai (Java: MacThai)
- See Also:
- Constant Field Values
-
CP_MAC_CENTRAL_EUROPE
public static final int CP_MAC_CENTRAL_EUROPE
Codepage for Macintosh Central Europe (Latin-2) (Java: MacCentralEurope)
- See Also:
- Constant Field Values
-
CP_MAC_ICELAND
public static final int CP_MAC_ICELAND
Codepage for Macintosh Iceland (Java: MacIceland)
- See Also:
- Constant Field Values
-
CP_MAC_TURKISH
public static final int CP_MAC_TURKISH
Codepage for Macintosh Turkish (Java: MacTurkish)
- See Also:
- Constant Field Values
-
CP_MAC_CROATIAN
public static final int CP_MAC_CROATIAN
Codepage for Macintosh Croatian (Java: MacCroatian)
- See Also:
- Constant Field Values
-
CP_US_ACSII
public static final int CP_US_ACSII
Codepage for US-ASCII
- See Also:
- Constant Field Values
-
CP_KOI8_R
public static final int CP_KOI8_R
Codepage for KOI8-R
- See Also:
- Constant Field Values
-
CP_ISO_8859_1
public static final int CP_ISO_8859_1
Codepage for ISO-8859-1
- See Also:
- Constant Field Values
-
CP_ISO_8859_2
public static final int CP_ISO_8859_2
Codepage for ISO-8859-2
- See Also:
- Constant Field Values
-
CP_ISO_8859_3
public static final int CP_ISO_8859_3
Codepage for ISO-8859-3
- See Also:
- Constant Field Values
-
CP_ISO_8859_4
public static final int CP_ISO_8859_4
Codepage for ISO-8859-4
- See Also:
- Constant Field Values
-
CP_ISO_8859_5
public static final int CP_ISO_8859_5
Codepage for ISO-8859-5
- See Also:
- Constant Field Values
-
CP_ISO_8859_6
public static final int CP_ISO_8859_6
Codepage for ISO-8859-6
- See Also:
- Constant Field Values
-
CP_ISO_8859_7
public static final int CP_ISO_8859_7
Codepage for ISO-8859-7
- See Also:
- Constant Field Values
-
CP_ISO_8859_8
public static final int CP_ISO_8859_8
Codepage for ISO-8859-8
- See Also:
- Constant Field Values
-
CP_ISO_8859_9
public static final int CP_ISO_8859_9
Codepage for ISO-8859-9
- See Also:
- Constant Field Values
-
CP_ISO_2022_JP1
public static final int CP_ISO_2022_JP1
Codepage for ISO-2022-JP
- See Also:
- Constant Field Values
-
CP_ISO_2022_JP2
public static final int CP_ISO_2022_JP2
Another codepage for ISO-2022-JP
- See Also:
- Constant Field Values
-
CP_ISO_2022_JP3
public static final int CP_ISO_2022_JP3
Yet another codepage for ISO-2022-JP
- See Also:
- Constant Field Values
-
CP_ISO_2022_KR
public static final int CP_ISO_2022_KR
Codepage for ISO-2022-KR
- See Also:
- Constant Field Values
-
CP_EUC_JP
public static final int CP_EUC_JP
Codepage for EUC-JP
- See Also:
- Constant Field Values
-
CP_EUC_KR
public static final int CP_EUC_KR
Codepage for EUC-KR
- See Also:
- Constant Field Values
-
CP_GB2312
public static final int CP_GB2312
Codepage for GB2312
- See Also:
- Constant Field Values
-
CP_GB18030
public static final int CP_GB18030
Codepage for GB18030
- See Also:
- Constant Field Values
-
CP_US_ASCII2
public static final int CP_US_ASCII2
Another codepage for US-ASCII
- See Also:
- Constant Field Values
-
CP_UTF8
public static final int CP_UTF8
Codepage for UTF-8
- See Also:
- Constant Field Values
-
CP_UNICODE
public static final int CP_UNICODE
Codepage for Unicode
- See Also:
- Constant Field Values
-
-
Method Detail
-
getBytesInCodePage
public static byte[] getBytesInCodePage(java.lang.String string, int codepage) throws java.io.UnsupportedEncodingException
Converts a string into bytes, in the equivalent character encoding to the supplied codepage number.
- Parameters:
string
- The string to convert
codepage
- The codepage number
- Throws:
java.io.UnsupportedEncodingException
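To illustrate why the codepage matters for this conversion, here is a minimal sketch that uses only the JDK and does not call CodePageUtil. The class `EncodeSketch` and its `encode` helper are hypothetical names introduced for illustration; the sketch assumes that the conversion amounts to encoding the string with the charset that corresponds to the codepage.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: plain JDK only, not part of Apache POI.
class EncodeSketch {
    // Encodes the text with the given charset, the step a codepage lookup would feed into.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // U+00E9, e with acute accent
        // The same character yields different bytes depending on the encoding.
        System.out.println(encode(text, StandardCharsets.UTF_8).length);      // 2 (0xC3 0xA9)
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length); // 1 (0xE9)
    }
}
```

Passing the wrong codepage therefore changes both the byte values and the byte count, which is why the codepage number has to travel with the data.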
-
getStringFromCodePage
public static java.lang.String getStringFromCodePage(byte[] string, int codepage) throws java.io.UnsupportedEncodingException
Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.
- Parameters:
string
- The bytes of the string to convert
codepage
- The codepage number
- Throws:
java.io.UnsupportedEncodingException
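The reverse direction can be sketched the same way. The example below uses only the JDK, not CodePageUtil, and the class `DecodeSketch` with its `decode` helper is a hypothetical name. It shows, under the assumption that decoding reduces to a charset lookup, how the same two bytes produce different text under different encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: plain JDK only, not part of Apache POI.
class DecodeSketch {
    // Decodes the whole byte array with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };
        // As UTF-8 the two bytes form one character (U+00E9).
        System.out.println(decode(bytes, StandardCharsets.UTF_8).length());      // 1
        // As ISO-8859-1 each byte is its own character (U+00C3, U+00A9).
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```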
-
getStringFromCodePage
public static java.lang.String getStringFromCodePage(byte[] string, int offset, int length, int codepage) throws java.io.UnsupportedEncodingException
Converts the bytes into a String, based on the equivalent character encoding to the supplied codepage number.
- Parameters:
string
- The bytes of the string to convert
offset
- The index of the first byte to convert
length
- The number of bytes to convert
codepage
- The codepage number
- Throws:
java.io.UnsupportedEncodingException
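A point worth illustrating for this overload is that the length is counted in bytes, not characters. The sketch below uses only the JDK's `String(byte[], int, int, Charset)` constructor and assumes UTF-8 for the demonstration; `SliceDecodeSketch` and `decodeSlice` are hypothetical names and nothing here calls CodePageUtil.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: plain JDK only, not part of Apache POI.
class SliceDecodeSketch {
    // Decodes `length` bytes of `buffer`, starting at `offset`, as UTF-8.
    static String decodeSlice(byte[] buffer, int offset, int length) {
        return new String(buffer, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 9 bytes: '#', '#', 'c', 'a', 'f', 0xC3, 0xA9, '#', '#'
        byte[] buffer = "##caf\u00e9##".getBytes(StandardCharsets.UTF_8);
        // Five bytes are consumed, but the accented letter takes two of them.
        String word = decodeSlice(buffer, 2, 5);
        System.out.println(word.length()); // 4
    }
}
```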
-
codepageToEncoding
public static java.lang.String codepageToEncoding(int codepage) throws java.io.UnsupportedEncodingException
Turns a codepage number into the equivalent character encoding's name (in Java NIO canonical naming format).
- Parameters:
codepage
- The codepage number
- Returns:
- The character encoding's name. If the codepage number is 65001, the encoding name is "UTF-8". All other positive numbers are mapped to their Java NIO names, normally either "windows-" followed by the number, e.g. "windows-1251", or "cp" followed by the number, e.g. if the codepage number is 1252 the returned character encoding name will be "cp1252".
- Throws:
java.io.UnsupportedEncodingException
- if the specified codepage is less than zero.
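The contract described above can be sketched as follows. This is not the library's implementation: `CodepageNameSketch` and `nameFor` are hypothetical names, and the sketch covers only the cases spelled out in the description (65001, the generic "cp" fallback, and the exception for negative numbers). The per-codepage table that yields names such as "windows-1251" is deliberately left out.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class: a reduced model of the documented contract, not Apache POI code.
class CodepageNameSketch {
    static String nameFor(int codepage) throws UnsupportedEncodingException {
        if (codepage < 0) {
            // Documented failure case: negative codepage numbers are rejected.
            throw new UnsupportedEncodingException("Unsupported codepage: " + codepage);
        }
        if (codepage == 65001) {
            return "UTF-8";
        }
        // Assumption: every other number falls through to the generic "cp" form;
        // the real lookup table (e.g. the "windows-" names) is omitted here.
        return "cp" + codepage;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(nameFor(65001)); // UTF-8
        System.out.println(nameFor(1252));  // cp1252
    }
}
```

Because the exception is checked, callers have to either declare it or catch it, even when they pass one of the class's own constants.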
-
codepageToEncoding
public static java.lang.String codepageToEncoding(int codepage, boolean javaLangFormat) throws java.io.UnsupportedEncodingException
Turns a codepage number into the equivalent character encoding's name, in either Java NIO or Java Lang canonical naming.
- Parameters:
codepage
- The codepage number
javaLangFormat
- Whether Java Lang naming (true) or Java NIO naming (false) should be used
- Returns:
- The character encoding's name, in either Java Lang format (e.g. Cp1251, ISO8859_5) or Java NIO format (e.g. windows-1252, ISO-8859-9)
- Throws:
java.io.UnsupportedEncodingException
- if the specified codepage is less than zero.
- See Also:
- Supported Encodings
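The difference between the two naming formats can be seen with the JDK alone. The sketch below does not call CodePageUtil; `NamingFormatsSketch` and `nioName` are hypothetical names. It relies on `Charset.forName` accepting the older Java Lang style names as aliases and on `Charset.name()` returning the NIO canonical name. ISO-8859-1 and UTF-8 are used because every Java platform is required to support them.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: plain JDK only, not part of Apache POI.
class NamingFormatsSketch {
    // Resolves any accepted name or alias to the Java NIO canonical name.
    static String nioName(String anyName) {
        return Charset.forName(anyName).name();
    }

    public static void main(String[] args) {
        // Java Lang style names on the left, NIO canonical names printed.
        System.out.println(nioName("ISO8859_1")); // ISO-8859-1
        System.out.println(nioName("UTF8"));      // UTF-8
    }
}
```

Either format can usually be handed to the JDK, so the choice mainly matters when the name is compared as a string or written out for other tools.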
-
cp950ToString
public static java.lang.String cp950ToString(byte[] data, int offset, int lengthInBytes)
This tries to convert a little-endian (LE) byte array in cp950 (Microsoft's dialect of Big5) to a String. Microsoft is known to zero-pad ASCII characters, and those padding bytes are dropped. There may be areas for improvement in this.
- Parameters:
data
offset
lengthInBytes
- Returns:
- Decoded String
-
-