/prog/ - UTF-8LE - a more efficient, saner version of UTF-8

Name: Cudder !MhMRSATORI 2014-08-29 12:03

I just realised that UTF-8 is stupidly defined as big-endian!

U+20AC = 0010 000010 101100
In UTF-8 -> 11100010 10000010 10101100

...Meaning that to convert a codepoint into a series of bytes you have to shift the value before anding/adding the offset, viz.

 b0 = (n >> 12) + 0xe0;
 b1 = ((n >> 6) & 63) + 0x80;
 b2 = (n & 63) + 0x80;
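As a minimal C sketch, the three expressions above can be wrapped in a function. The name `utf8_encode3` and the `out` buffer are made up for illustration; it only covers the three-byte range U+0800..U+FFFF and does not reject surrogates.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from any library): encode a code point in
   U+0800..U+FFFF as three standard UTF-8 bytes. Surrogates are not
   rejected in this sketch. */
static void utf8_encode3(uint32_t n, uint8_t out[3])
{
    assert(n >= 0x800 && n <= 0xFFFF);
    out[0] = (uint8_t)((n >> 12) + 0xE0);        /* top 4 bits, lead byte  */
    out[1] = (uint8_t)(((n >> 6) & 63) + 0x80);  /* middle 6 bits          */
    out[2] = (uint8_t)((n & 63) + 0x80);         /* low 6 bits             */
}
```

Note that every line needs the original n, which is why the assembly has to keep a copy of it. For U+20AC this gives E2 82 AC, the bit pattern shown above.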

Just looking at the expressions it doesn't seem so bad, but shifting right first means throwing away perfectly good bits in a register! The worst thing is, the bits thrown away are exactly the ones needed in the upcoming computations, so you have to waste storage to preserve the entire value of the codepoint throughout the computation. Observe:

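push eax            ; save n
shr eax, 12         ; top 4 bits
add al, 224         ; lead byte 1110xxxx
stosb
pop eax             ; restore n
push eax            ; and save it again
shr eax, 6
and al, 63          ; middle 6 bits
add al, 128         ; continuation byte 10xxxxxx
stosb
pop eax             ; restore n
and al, 63          ; low 6 bits
add al, 128
stosb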

14 instructions, 23 bytes. Not so bad, but what if we stored the pieces the other way around, i.e. "UTF-8LE"?

U+20AC = 001000 001010 1100
In UTF-8LE -> 11101100 10001010 10001000

 b0 = (n&15) + 224;
 b1 = ((n>>4)&63) + 128;
 b2 = (n>>10) + 128;

Observe that each time bits are picked off n, the next step's shift removes them, so there is no need to keep around another copy of n (including bits that wouldn't be used anymore).

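shl eax, 4          ; move bits 4 and up of n into ah and above
shr al, 4           ; al = low 4 bits of n
add al, 224         ; lead byte 1110xxxx
stosb
mov al, ah          ; al = bits 4..11 of n
and al, 63          ; keep bits 4..9
add al, 128         ; continuation byte 10xxxxxx
stosb
shr eax, 14         ; al = n >> 10 (net of the earlier shl 4)
add al, 128
stosb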

11 instructions, 22 bytes. Clearly superior.
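The point about each shift discarding bits that are already used can be sketched in C as follows. This is only an illustration of the proposed layout: `utf8le_encode3` is a made-up name, the function handles just the three-byte range, and it mirrors the assembly above, where the net shifts are 4 and 10 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper for the proposed "UTF-8LE" layout, three-byte
   range only (U+0800..U+FFFF). n is consumed as bits are emitted, so
   no second copy of it is needed. */
static void utf8le_encode3(uint32_t n, uint8_t out[3])
{
    assert(n >= 0x800 && n <= 0xFFFF);
    out[0] = (uint8_t)((n & 15) + 0xE0);  /* low 4 bits go in the lead byte */
    n >>= 4;                              /* those bits are done with       */
    out[1] = (uint8_t)((n & 63) + 0x80);  /* next 6 bits                    */
    n >>= 6;
    out[2] = (uint8_t)(n + 0x80);         /* remaining 6 bits               */
}
```

For U+20AC this gives EC 8A 88, the same bytes as the UTF-8LE bit pattern above.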

The UTF-8 BOM is EF BB BF; the UTF-8LE BOM similarly will be EF AF BF.
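Both BOM byte sequences can be checked by decoding them back to U+FEFF. The sketch below assumes three-byte sequences only; `utf8_decode3` and `utf8le_decode3` are illustrative names, and only the lead byte is sanity-checked.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoders for a three-byte sequence. Only the lead byte
   pattern 1110xxxx is checked; continuation bytes are trusted. */
static uint32_t utf8_decode3(const uint8_t b[3])
{
    assert((b[0] & 0xF0) == 0xE0);
    return ((uint32_t)(b[0] & 15) << 12)   /* lead byte holds the top 4 bits */
         | ((uint32_t)(b[1] & 63) << 6)
         |  (uint32_t)(b[2] & 63);
}

static uint32_t utf8le_decode3(const uint8_t b[3])
{
    assert((b[0] & 0xF0) == 0xE0);
    return  (uint32_t)(b[0] & 15)          /* lead byte holds the low 4 bits */
         | ((uint32_t)(b[1] & 63) << 4)
         | ((uint32_t)(b[2] & 63) << 10);
}
```

Decoding EF BB BF with `utf8_decode3` and EF AF BF with `utf8le_decode3` both yield 0xFEFF.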

Name: Anonymous 2017-01-10 17:21

Little endian is a right-to-left scam.

More significant (higher value) bits are on the left, because the people who designed bit shifting wrote in big endian.

Cudder doesn't even understand that even on his Intel chips, you ``shift left'' to make bits more significant (increase their value). Left and right are big-endian notions, even though his scheme is little-endian.

Shift to finish:
00000000 00000000 aaaabbbb bbcccccc

Rotate to finish:
00000000 00000000 ccccccbb bbbbaaaa

The c, b, and a are backwards in his version, but forwards in our version. The alphabet goes ABC, not CBA.

In decoding, he is using the upper part of the register as a temporary, disguised by partial register writes to the lower byte.

He also wrote his bytes in big endian to disguise the abomination of little endian. If he wasn't scamming about little-endian, it would look like this:

Rotate to finish:
bbbbaaaa ccccccbb 00000000 00000000

That is how they would be stored in memory on a little endian CPU.
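The memory layout described above can be illustrated with a small C sketch that writes a 32-bit value lowest byte first, independent of the host CPU. `store_le32` is a hypothetical helper name used only for this example.

```c
#include <stdint.h>

/* Hypothetical helper: write v to out[] in little-endian byte order,
   i.e. the least significant byte lands at the lowest address. */
static void store_le32(uint32_t v, uint8_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(v >> (8 * i));
}
```

A register value of 0x0000ABCD ends up in memory as CD AB 00 00, which is the same reordering as in the diagram: the low byte comes first and the zero bytes come last.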
