Return Styles: Pseud0ch, Terminal, Valhalla, NES, Geocities, Blue Moon. Entire thread

UTF-8LE - a more efficient, saner version of UTF-8

Name: Cudder !MhMRSATORI 2014-08-29 12:03

I just realised that UTF-8 is stupidly defined as big-endian!

U+20AC = 0010 000010 101100
In UTF-8 -> 11100010 10000010 10101100

...Meaning that to convert a codepoint into a series of bytes you have to shift the value before anding/adding the offset, viz.

b0 = (n >> 12) + 0xe0;
b1 = ((n >> 6) & 63) + 0x80;
b2 = (n & 63) + 0x80;


Just looking at the expression it doesn't seem so bad, but shifting right before means throwing away perfectly good bits in a register! The worst thing is, the bits thrown away are exactly the ones needed in the upcoming computations, so you have to needlessly waste storage to preserve the entire value of the codepoint throughout the computation. Observe:

push eax
shr eax, 12
add al, 224
stosb
pop eax
push eax
shr eax, 6
and al, 63
add al, 128
stosb
pop eax
and al, 63
add al, 128
stosb


14 instructions, 23 bytes. Not so bad, but what if we stored the pieces the other way around, i.e. "UTF-8LE"?

U+20AC = 001000 001010 1100
In UTF-8LE -> 11101100 10001010 10001000

b0 = (n&15) + 224;
b1 = ((n>>6)&63) + 128;
b2 = (n>>12) + 128;


Observe that each time bits are picked off n, the next step's shift removes them, so there is no need to keep around another copy of n (including bits that wouldn't be used anymore).

shl eax, 4
shr al, 4
add al, 224
stosb
mov al, ah
and al, 63
add al, 128
stosb
shr eax, 14
add al, 128
stosb


11 instructions, 22 bytes. Clearly superior.

The UTF-8 BOM is EF BB BF; the UTF-8LE BOM similarly will be EF AF BF.

Name: Anonymous 2014-08-29 19:13

>>14
8086 is commonly used for this job

Newer Posts
Don't change these.
Name: Email:
Entire Thread Thread List