Tuesday, January 1, 2013

JNI and modified UTF8

In Modified UTF-8, the null character (U+0000) is encoded as 0xC0,0x80; this is not valid UTF-8 because it is not the shortest possible representation. Modified UTF-8 strings never contain any actual null bytes but can contain all Unicode code points including U+0000, which allows such strings (with a null byte appended) to be processed by traditional null-terminated string functions. (Wikipedia)
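To see what that means inside JNI, here is a minimal sketch (assuming it runs inside a native method with a valid JNIEnv* pEnv; the variable names are mine): it builds a string with an embedded U+0000 and looks at the bytes GetStringUTFChars hands back.

const jchar chars[] = { 0x0041, 0x0000, 0x0042 };               /* 'A', U+0000, 'B' */
jstring lStr = (*pEnv)->NewString(pEnv, chars, 3);
const char* mutf8 = (*pEnv)->GetStringUTFChars(pEnv, lStr, NULL);
/* mutf8 now holds the bytes 0x41, 0xC0, 0x80, 0x42, 0x00:
   the embedded U+0000 became the two-byte sequence 0xC0 0x80,
   so strlen(mutf8) is 4 and the C string is not cut short. */
(*pEnv)->ReleaseStringUTFChars(pEnv, lStr, mutf8);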
A lot of NDK-JNI samples use GetStringUTFLength to allocate memory and copy out a new native string; even the otherwise great book by Sylvain Ratabouil does. It works fine with plain English words. I used the same approach on a small Estonian word list, and of course it failed on accented letters, where the byte count and the character count of a string are no longer equal.
So, to make it clear once more, according to the JNI functions reference, GetStringUTFLength returns the length in bytes of the modified UTF-8 representation of a string, while GetStringLength returns the count of Unicode characters in it. For the string 'täht', GetStringUTFLength returns 5 and GetStringLength returns 4. The byte count (5, plus one for the terminating '\0') is what we need when we malloc memory for the native string; the character count (4) is what we pass to functions like GetStringUTFRegion, whose length argument is in characters, not bytes.
 
const jsize unicode_length = (*pEnv)->GetStringLength(pEnv, lString);   /* characters: 4 for "täht" */
const jsize utf8_length = (*pEnv)->GetStringUTFLength(pEnv, lString);   /* modified UTF-8 bytes: 5 for "täht" */
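Putting the two lengths together, here is a minimal sketch of a conversion helper (the function name, error handling and comments are mine, not code from the book): utf8_length sizes the buffer, and unicode_length is the character count passed to GetStringUTFRegion.

#include <jni.h>
#include <stdlib.h>

/* Copies a jstring into a newly malloc'ed, null-terminated modified UTF-8 buffer.
   Returns NULL on allocation failure or if a JNI exception is pending. */
static char* jstring_to_native(JNIEnv* pEnv, jstring lString)
{
    const jsize unicode_length = (*pEnv)->GetStringLength(pEnv, lString);   /* characters */
    const jsize utf8_length = (*pEnv)->GetStringUTFLength(pEnv, lString);   /* bytes */

    char* buffer = malloc(utf8_length + 1);          /* +1 for the terminating '\0' */
    if (buffer == NULL) {
        return NULL;
    }

    /* The length argument of GetStringUTFRegion is a number of characters,
       not bytes, so unicode_length goes here, while the buffer itself must
       be able to hold utf8_length bytes. */
    (*pEnv)->GetStringUTFRegion(pEnv, lString, 0, unicode_length, buffer);
    if ((*pEnv)->ExceptionCheck(pEnv)) {
        free(buffer);
        return NULL;
    }

    buffer[utf8_length] = '\0';
    return buffer;
}

For 'täht' this allocates 6 bytes and copies 4 characters, which expand to the 5 bytes that actually land in the buffer.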


Happy New Year, btw :)
