Return Styles: Pseud0ch, Terminal, Valhalla, NES, Geocities, Blue Moon. Entire thread

UTF-8LE - a more efficient, saner version of UTF-8

Name: Cudder !MhMRSATORI 2014-08-29 12:03

I just realised that UTF-8 is stupidly defined as big-endian!

U+20AC = 0010 000010 101100
In UTF-8 -> 11100010 10000010 10101100

...Meaning that to convert a codepoint into a series of bytes you have to shift the value before anding/adding the offset, viz.

b0 = (n >> 12) + 0xe0;
b1 = ((n >> 6) & 63) + 0x80;
b2 = (n & 63) + 0x80;


Just looking at the expression it doesn't seem so bad, but shifting right before means throwing away perfectly good bits in a register! The worst thing is, the bits thrown away are exactly the ones needed in the upcoming computations, so you have to needlessly waste storage to preserve the entire value of the codepoint throughout the computation. Observe:

push eax
shr eax, 12
add al, 224
stosb
pop eax
push eax
shr eax, 6
and al, 63
add al, 128
stosb
pop eax
and al, 63
add al, 128
stosb


14 instructions, 23 bytes. Not so bad, but what if we stored the pieces the other way around, i.e. "UTF-8LE"?

U+20AC = 001000 001010 1100
In UTF-8LE -> 11101100 10001010 10001000

b0 = (n&15) + 224;
b1 = ((n>>6)&63) + 128;
b2 = (n>>12) + 128;


Observe that each time bits are picked off n, the next step's shift removes them, so there is no need to keep around another copy of n (including bits that wouldn't be used anymore).

shl eax, 4
shr al, 4
add al, 224
stosb
mov al, ah
and al, 63
add al, 128
stosb
shr eax, 14
add al, 128
stosb


11 instructions, 22 bytes. Clearly superior.

The UTF-8 BOM is EF BB BF; the UTF-8LE BOM similarly will be EF AF BF.

Name: Cudder !cXCudderUE 2017-01-11 11:27

>>49
"left" and "right" are arbitrary physical directions, numbers are not. Roughly 90% of the "explanations" of endianness out there are misleading or incorrect. Any explanation that uses "left", "right", "first", "last", "up", "down", or any other physical direction fits into that category. Little endian bit ordering means bit address n has value 2n. Little endian byte ordering means byte address n has value 256n. Endianness is a property of address-to-value mappings, not physical directions. It doesn't matter one bloody bit (pun intended) what it looks like to you since that's just a physical manifestation of the data.

http://www.verilab.com/files/VL_Whitepaper_Endian_July2008.pdf

IHBT.

Newer Posts
Don't change these.
Name: Email:
Entire Thread Thread List