Package org.apache.lucene.util
Class UnicodeUtil
- java.lang.Object
-
- org.apache.lucene.util.UnicodeUtil
-
public final class UnicodeUtil extends Object
Class to encode Java's UTF16 char[] into UTF8 byte[] without always allocating a new byte[], as String.getBytes(StandardCharsets.UTF_8) does.
NOTE: This API is for internal purposes only and might change in incompatible ways in the next release.
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
- static class UnicodeUtil.UTF8CodePoint: Holds a codepoint along with the number of bytes required to represent it in UTF8.
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static BytesRef BIG_TERM: A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms (e.g. collation keys) one would normally encounter.
- static int MAX_UTF8_BYTES_PER_CHAR: Maximum number of UTF8 bytes per UTF16 character.
- static int UNI_REPLACEMENT_CHAR
- static int UNI_SUR_HIGH_END
- static int UNI_SUR_HIGH_START
- static int UNI_SUR_LOW_END
- static int UNI_SUR_LOW_START
-
Method Summary
All Methods, Static Methods, Concrete Methods (Modifier and Type, Method, Description):
- static int calcUTF16toUTF8Length(CharSequence s, int offset, int len): Calculates the number of UTF8 bytes necessary to write a UTF16 string.
- static UnicodeUtil.UTF8CodePoint codePointAt(byte[] utf8, int pos, UnicodeUtil.UTF8CodePoint reuse): Computes the codepoint and codepoint length (in bytes) of the specified offset in the provided utf8 byte array, assuming UTF8 encoding.
- static int codePointCount(BytesRef utf8): Returns the number of code points in this UTF8 sequence.
- static int maxUTF8Length(int utf16Length): Returns the maximum number of utf8 bytes required to encode a utf16 sequence of the given length (e.g., java char[], String).
- static String newString(int[] codePoints, int offset, int count): Cover JDK 1.5 API.
- static String toHexString(String s)
- static int UTF16toUTF8(char[] source, int offset, int length, byte[] out): Encode characters from a char[] source, starting at offset for length chars.
- static int UTF16toUTF8(CharSequence s, int offset, int length, byte[] out): Encode characters from the given CharSequence, starting at offset for length characters.
- static int UTF16toUTF8(CharSequence s, int offset, int length, byte[] out, int outOffset): Encode characters from the given CharSequence, starting at offset for length characters.
- static int UTF8toUTF16(byte[] utf8, int offset, int length, char[] out): Interprets the given byte array as UTF-8 and converts to UTF-16.
- static int UTF8toUTF16(BytesRef bytesRef, char[] chars): Utility method for UTF8toUTF16(byte[], int, int, char[])
- static int UTF8toUTF32(BytesRef utf8, int[] ints): This method assumes valid UTF8 input.
- static boolean validUTF16String(char[] s, int size)
- static boolean validUTF16String(CharSequence s)
-
-
-
Field Detail
-
BIG_TERM
public static final BytesRef BIG_TERM
A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms (e.g. collation keys) one would normally encounter, and definitely bigger than any UTF-8 terms.
WARNING: This is not a valid UTF8 Term.
-
UNI_SUR_HIGH_START
public static final int UNI_SUR_HIGH_START
- See Also:
- Constant Field Values
-
UNI_SUR_HIGH_END
public static final int UNI_SUR_HIGH_END
- See Also:
- Constant Field Values
-
UNI_SUR_LOW_START
public static final int UNI_SUR_LOW_START
- See Also:
- Constant Field Values
-
UNI_SUR_LOW_END
public static final int UNI_SUR_LOW_END
- See Also:
- Constant Field Values
-
UNI_REPLACEMENT_CHAR
public static final int UNI_REPLACEMENT_CHAR
- See Also:
- Constant Field Values
-
MAX_UTF8_BYTES_PER_CHAR
public static final int MAX_UTF8_BYTES_PER_CHAR
Maximum number of UTF8 bytes per UTF16 character.
- See Also:
- Constant Field Values
-
-
Method Detail
-
UTF16toUTF8
public static int UTF16toUTF8(char[] source, int offset, int length, byte[] out)
Encode characters from a char[] source, starting at offset for length chars. It is the responsibility of the caller to make sure that the destination array is large enough.
-
UTF16toUTF8
public static int UTF16toUTF8(CharSequence s, int offset, int length, byte[] out)
Encode characters from the given CharSequence, starting at offset for length characters. It is the responsibility of the caller to make sure that the destination array is large enough.
-
UTF16toUTF8
public static int UTF16toUTF8(CharSequence s, int offset, int length, byte[] out, int outOffset)
Encode characters from the given CharSequence, starting at offset for length characters. Output to the destination array will begin at outOffset. It is the responsibility of the caller to make sure that the destination array is large enough.
Note: this method returns the final output offset (outOffset + number of bytes written).
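The outOffset contract above can be illustrated without Lucene on the classpath. The following is a minimal stand-alone sketch, not the UnicodeUtil implementation: the class Utf8EncodeSketch and its encode method are hypothetical names, and replacing an unpaired surrogate with U+FFFD is an assumption of this sketch. It shows a caller-sized buffer being filled from outOffset and the final output offset being returned.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in; mirrors the parameter shape of the five-argument UTF16toUTF8.
final class Utf8EncodeSketch {

  /** Writes UTF-8 for s[offset, offset + length) into out from outOffset; returns the final output offset. */
  static int encode(CharSequence s, int offset, int length, byte[] out, int outOffset) {
    int upto = outOffset;
    int end = offset + length;
    for (int i = offset; i < end; i++) {
      int c = s.charAt(i);
      if (c < 0x80) {                       // 1 byte: ASCII
        out[upto++] = (byte) c;
      } else if (c < 0x800) {               // 2 bytes
        out[upto++] = (byte) (0xC0 | (c >> 6));
        out[upto++] = (byte) (0x80 | (c & 0x3F));
      } else if (c < 0xD800 || c > 0xDFFF) { // 3 bytes: rest of the BMP
        out[upto++] = (byte) (0xE0 | (c >> 12));
        out[upto++] = (byte) (0x80 | ((c >> 6) & 0x3F));
        out[upto++] = (byte) (0x80 | (c & 0x3F));
      } else if (c <= 0xDBFF && i + 1 < end && Character.isLowSurrogate(s.charAt(i + 1))) {
        // 4 bytes: valid surrogate pair, consumes two chars
        int cp = Character.toCodePoint((char) c, s.charAt(++i));
        out[upto++] = (byte) (0xF0 | (cp >> 18));
        out[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
        out[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        out[upto++] = (byte) (0x80 | (cp & 0x3F));
      } else {
        // Assumption of this sketch: an unpaired surrogate becomes U+FFFD (EF BF BD).
        out[upto++] = (byte) 0xEF;
        out[upto++] = (byte) 0xBF;
        out[upto++] = (byte) 0xBD;
      }
    }
    return upto;
  }

  public static void main(String[] args) {
    String s = "a\u00e9\u20ac";
    // Two caller-owned prefix bytes, then the worst case of 3 bytes per char.
    byte[] buf = new byte[2 + 3 * s.length()];
    int end = encode(s, 0, s.length(), buf, 2);
    byte[] expected = s.getBytes(StandardCharsets.UTF_8);
    System.out.println(Arrays.equals(Arrays.copyOfRange(buf, 2, end), expected));
  }
}
```

Because the return value is an offset rather than a byte count, the number of bytes written is end - outOffset, and the same buffer can be reused for the next call.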
-
calcUTF16toUTF8Length
public static int calcUTF16toUTF8Length(CharSequence s, int offset, int len)
Calculates the number of UTF8 bytes necessary to write a UTF16 string.
- Returns:
- the number of bytes that would be written
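A pre-sizing calculation of this kind can be sketched in plain Java. This is an illustrative stand-in, not the library code: Utf8LengthSketch and utf8Length are hypothetical names, and counting an unpaired surrogate as 3 bytes (the size of U+FFFD) is an assumption of the sketch.

```java
// Hypothetical stand-in for a UTF-16 to UTF-8 length calculation.
final class Utf8LengthSketch {

  /** Counts the UTF-8 bytes needed for s[offset, offset + len) without writing anything. */
  static int utf8Length(CharSequence s, int offset, int len) {
    int bytes = 0;
    int end = offset + len;
    for (int i = offset; i < end; i++) {
      char c = s.charAt(i);
      if (c < 0x80) {
        bytes += 1;
      } else if (c < 0x800) {
        bytes += 2;
      } else if (Character.isHighSurrogate(c) && i + 1 < end
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        bytes += 4; // one supplementary code point, two chars
        i++;
      } else {
        bytes += 3; // BMP char, or (assumption) an unpaired surrogate counted as U+FFFD
      }
    }
    return bytes;
  }

  public static void main(String[] args) {
    String s = "a\u00e9\u20ac\uD83D\uDE00";
    System.out.println(utf8Length(s, 0, s.length()));
  }
}
```

Calling such a method first allows the destination array to be allocated at the exact size instead of the worst-case size.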
-
validUTF16String
public static boolean validUTF16String(CharSequence s)
-
validUTF16String
public static boolean validUTF16String(char[] s, int size)
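No description is given for the two validUTF16String overloads. As an assumption, a validity check for UTF-16 would require every high surrogate to be followed immediately by a low surrogate, and no low surrogate to appear on its own. The sketch below illustrates that rule only; Utf16ValiditySketch and isWellFormed are hypothetical names and this is not claimed to be what the library checks.

```java
// Hypothetical illustration of a UTF-16 well-formedness rule.
final class Utf16ValiditySketch {

  /** Returns true if every surrogate in s is part of a high/low pair. */
  static boolean isWellFormed(CharSequence s) {
    int n = s.length();
    for (int i = 0; i < n; i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)) {
        if (i + 1 < n && Character.isLowSurrogate(s.charAt(i + 1))) {
          i++; // skip the low half of a valid pair
        } else {
          return false; // high surrogate without a following low surrogate
        }
      } else if (Character.isLowSurrogate(c)) {
        return false; // low surrogate without a preceding high surrogate
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isWellFormed("abc"));
    System.out.println(isWellFormed("\uD83D"));
  }
}
```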
-
codePointCount
public static int codePointCount(BytesRef utf8)
Returns the number of code points in this UTF8 sequence.
This method assumes valid UTF8 input. It does not perform full UTF8 validation; it checks only the first byte of each codepoint (for multi-byte sequences, any bytes after the head are skipped).
- Throws:
IllegalArgumentException - If an invalid codepoint header byte occurs or the content is prematurely truncated.
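Header-byte-only counting, as described above, can be sketched as follows. This is a stand-alone approximation working on a raw byte[] range rather than a BytesRef; Utf8CountSketch is a hypothetical name and the exact conditions under which the library throws are not confirmed by this page.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in: counts code points by inspecting only the first byte of each sequence.
final class Utf8CountSketch {

  static int codePointCount(byte[] utf8, int offset, int length) {
    int pos = offset;
    int limit = offset + length;
    int count = 0;
    while (pos < limit) {
      int b = utf8[pos] & 0xFF;
      if (b < 0x80) {
        pos += 1;               // 0xxxxxxx
      } else if (b >= 0xC0 && b < 0xE0) {
        pos += 2;               // 110xxxxx, continuation byte skipped unchecked
      } else if (b >= 0xE0 && b < 0xF0) {
        pos += 3;               // 1110xxxx
      } else if (b >= 0xF0 && b < 0xF8) {
        pos += 4;               // 11110xxx
      } else {
        throw new IllegalArgumentException("invalid header byte at position " + pos);
      }
      count++;
    }
    if (pos > limit) {
      throw new IllegalArgumentException("content is prematurely truncated");
    }
    return count;
  }

  public static void main(String[] args) {
    byte[] bytes = "a\u00e9\u20ac".getBytes(StandardCharsets.UTF_8);
    System.out.println(codePointCount(bytes, 0, bytes.length));
  }
}
```

Since continuation bytes are never examined, a sequence such as a valid header followed by arbitrary bytes is still counted as one code point, which is why valid input must be assumed.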
-
UTF8toUTF32
public static int UTF8toUTF32(BytesRef utf8, int[] ints)
This method assumes valid UTF8 input. It does not perform full UTF8 validation; it checks only the first byte of each codepoint (for multi-byte sequences, any bytes after the head are skipped). It is the responsibility of the caller to make sure that the destination array is large enough.
- Throws:
IllegalArgumentException - If an invalid codepoint header byte occurs or the content is prematurely truncated.
-
codePointAt
public static UnicodeUtil.UTF8CodePoint codePointAt(byte[] utf8, int pos, UnicodeUtil.UTF8CodePoint reuse)
Computes the codepoint and codepoint length (in bytes) of the specified offset in the provided utf8 byte array, assuming UTF8 encoding. As with other related methods in this class, this assumes valid UTF8 input and does not perform full UTF8 validation. Passing invalid UTF8 or a position that is not a valid header byte position may result in undefined behavior. This makes no attempt to synchronize or validate.
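The reuse parameter suggests a holder object that is refilled on each call to avoid allocation. A minimal sketch of that pattern, with no validation, might look like the following. The names Utf8DecodeSketch and CodePoint and the fields codePoint and numBytes are hypothetical, and accepting null for reuse is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for decoding one code point at a header-byte position.
final class Utf8DecodeSketch {

  /** Hypothetical holder for a code point and its encoded length in bytes. */
  static final class CodePoint {
    int codePoint;
    int numBytes;
  }

  /** Decodes the code point starting at pos; assumes valid UTF-8 and a header byte at pos. */
  static CodePoint codePointAt(byte[] utf8, int pos, CodePoint reuse) {
    CodePoint cp = (reuse == null) ? new CodePoint() : reuse;
    int b0 = utf8[pos] & 0xFF;
    if (b0 < 0x80) {
      cp.numBytes = 1;
      cp.codePoint = b0;
    } else if (b0 < 0xE0) {
      cp.numBytes = 2;
      cp.codePoint = b0 & 0x1F;
    } else if (b0 < 0xF0) {
      cp.numBytes = 3;
      cp.codePoint = b0 & 0x0F;
    } else {
      cp.numBytes = 4;
      cp.codePoint = b0 & 0x07;
    }
    // Fold in 6 payload bits from each continuation byte, unchecked.
    for (int i = 1; i < cp.numBytes; i++) {
      cp.codePoint = (cp.codePoint << 6) | (utf8[pos + i] & 0x3F);
    }
    return cp;
  }

  public static void main(String[] args) {
    byte[] bytes = "a\u20ac".getBytes(StandardCharsets.UTF_8);
    CodePoint cp = new CodePoint();
    for (int pos = 0; pos < bytes.length; pos += cp.numBytes) {
      cp = codePointAt(bytes, pos, cp);
      System.out.println(Integer.toHexString(cp.codePoint) + " (" + cp.numBytes + " bytes)");
    }
  }
}
```

Advancing by numBytes after each call keeps pos on a header byte, which is the precondition the description above places on the caller.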
-
newString
public static String newString(int[] codePoints, int offset, int count)
Cover JDK 1.5 API. Create a String from an array of codePoints.
- Parameters:
codePoints - The code array
offset - The start of the text in the code point array
count - The number of code points
- Returns:
- a String representing the count code points starting at offset
- Throws:
IllegalArgumentException - If an invalid code point is encountered
IndexOutOfBoundsException - If the offset or count are out of bounds.
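The JDK constructor String(int[] codePoints, int offset, int count) has the same parameter shape and the same documented exceptions, so it can serve as a stand-in to illustrate the offset and count semantics. This page does not state whether newString delegates to it; NewStringSketch and fromCodePoints are hypothetical names.

```java
// Hypothetical stand-in using the JDK's code point constructor.
final class NewStringSketch {

  /** Builds a String from count code points starting at offset. */
  static String fromCodePoints(int[] codePoints, int offset, int count) {
    return new String(codePoints, offset, count);
  }

  public static void main(String[] args) {
    int[] cps = {0x48, 0x1F600, 0x69};
    String s = fromCodePoints(cps, 1, 2); // skips the first code point
    // One supplementary code point occupies two chars, so length and code point count differ.
    System.out.println(s.length() + " chars, " + s.codePointCount(0, s.length()) + " code points");
  }
}
```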
-
UTF8toUTF16
public static int UTF8toUTF16(byte[] utf8, int offset, int length, char[] out)
Interprets the given byte array as UTF-8 and converts to UTF-16. It is the responsibility of the caller to make sure that the destination array is large enough.
NOTE: Full characters are read, even if this reads past the length passed (and can result in an ArrayIndexOutOfBoundsException if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed.
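The NOTE about reading past the given length follows naturally from an unchecked decoder. The sketch below is a stand-alone illustration under that assumption, not the library code; Utf8ToUtf16Sketch and decode are hypothetical names. Continuation bytes are read without comparing against the limit, so a truncated trailing sequence reads beyond offset + length.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for an unchecked UTF-8 to UTF-16 conversion.
final class Utf8ToUtf16Sketch {

  /** Decodes utf8[offset, offset + length) into out; returns the number of chars written. */
  static int decode(byte[] utf8, int offset, int length, char[] out) {
    int outUpto = 0;
    int pos = offset;
    int limit = offset + length;
    while (pos < limit) {
      int b = utf8[pos++] & 0xFF;
      if (b < 0x80) {
        out[outUpto++] = (char) b;
      } else if (b < 0xE0) {
        out[outUpto++] = (char) (((b & 0x1F) << 6) | (utf8[pos++] & 0x3F));
      } else if (b < 0xF0) {
        out[outUpto++] = (char) (((b & 0x0F) << 12)
            | ((utf8[pos] & 0x3F) << 6)
            | (utf8[pos + 1] & 0x3F));
        pos += 2;
      } else {
        int cp = ((b & 0x07) << 18)
            | ((utf8[pos] & 0x3F) << 12)
            | ((utf8[pos + 1] & 0x3F) << 6)
            | (utf8[pos + 2] & 0x3F);
        pos += 3;
        // A supplementary code point becomes a surrogate pair (two chars).
        outUpto += Character.toChars(cp, out, outUpto);
      }
    }
    return outUpto;
  }

  public static void main(String[] args) {
    byte[] bytes = "a\u00e9\u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
    // Each UTF-8 sequence yields no more chars than it has bytes, so bytes.length chars suffice.
    char[] out = new char[bytes.length];
    int n = decode(bytes, 0, bytes.length, out);
    System.out.println(new String(out, 0, n).equals("a\u00e9\u20ac\uD83D\uDE00"));
  }
}
```

Sizing the char[] to the byte length is a safe upper bound for valid input, since even the 4-byte form produces only two chars.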
-
maxUTF8Length
public static int maxUTF8Length(int utf16Length)
Returns the maximum number of utf8 bytes required to encode a utf16 sequence of the given length (e.g., java char[], String).
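The worst case can be derived from the encoding rules: a single UTF-16 char outside the surrogate range needs at most 3 UTF-8 bytes, and a surrogate pair needs 4 bytes for 2 chars, so 3 bytes per char is an upper bound. The sketch below applies that bound; the factor 3 comes from this derivation and is only assumed to match MAX_UTF8_BYTES_PER_CHAR, whose value is not shown on this page. MaxUtf8LengthSketch and its members are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a worst-case UTF-8 size bound.
final class MaxUtf8LengthSketch {

  // Assumption: 3 bytes per UTF-16 char, derived from the UTF-8 encoding rules.
  static final int BYTES_PER_UTF16_CHAR = 3;

  /** Upper bound on UTF-8 bytes for utf16Length chars; throws ArithmeticException on int overflow. */
  static int maxUtf8Length(int utf16Length) {
    return Math.multiplyExact(utf16Length, BYTES_PER_UTF16_CHAR);
  }

  public static void main(String[] args) {
    String s = "a\u00e9\u20ac\uD83D\uDE00";
    int bound = maxUtf8Length(s.length());
    int actual = s.getBytes(StandardCharsets.UTF_8).length;
    System.out.println("bound=" + bound + " actual=" + actual);
  }
}
```

A bound like this trades some unused space for skipping the exact-length pass, which suits a buffer that is reused across many calls.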
-
UTF8toUTF16
public static int UTF8toUTF16(BytesRef bytesRef, char[] chars)
Utility method for UTF8toUTF16(byte[], int, int, char[])
- See Also:
UTF8toUTF16(byte[], int, int, char[])
-
-