/prog/ - UTF-8LE - a more efficient, saner version of UTF-8

Name: Cudder !MhMRSATORI 2014-08-29 12:03

I just realised that UTF-8 is stupidly defined as big-endian!

U+20AC = 0010 000010 101100
In UTF-8 -> 11100010 10000010 10101100

...Meaning that to convert a codepoint into a series of bytes you have to shift the value before anding/adding the offset, viz.

 b0 = (n >> 12) + 0xe0;
 b1 = ((n >> 6) & 63) + 0x80;
 b2 = (n & 63) + 0x80;

Just looking at the expression it doesn't seem so bad, but shifting right before means throwing away perfectly good bits in a register! The worst thing is, the bits thrown away are exactly the ones needed in the upcoming computations, so you have to needlessly waste storage to preserve the entire value of the codepoint throughout the computation. Observe:

push eax
shr eax, 12
add al, 224
stosb
pop eax
push eax
shr eax, 6
and al, 63
add al, 128
stosb
pop eax
and al, 63
add al, 128
stosb

14 instructions, 23 bytes. Not so bad, but what if we stored the pieces the other way around, i.e. "UTF-8LE"?

U+20AC = 001000 001010 1100
In UTF-8LE -> 11101100 10001010 10001000

 b0 = (n&15) + 224;
 b1 = ((n>>4)&63) + 128;
 b2 = (n>>10) + 128;

Observe that each time bits are picked off n, the next step's shift removes them, so there is no need to keep around another copy of n (including bits that wouldn't be used anymore).

shl eax, 4
shr al, 4
add al, 224
stosb
mov al, ah
and al, 63
add al, 128
stosb
shr eax, 14
add al, 128
stosb

11 instructions, 22 bytes. Clearly superior.

The UTF-8 BOM is EF BB BF; the UTF-8LE BOM similarly will be EF AF BF.
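
To check the byte values quoted in this post, here is a minimal C sketch of both 3-byte encoders. The function names are made up for illustration, and the UTF-8LE variant assumes the bit split used in the assembly above (low 4 bits, then the next 6, then the top 6). It only covers U+0800..U+FFFF and does no validation.

```c
#include <stdint.h>

/* Standard UTF-8, 3-byte form: most significant bits go out first,
   so n has to stay intact until the last byte is produced. */
void utf8_encode3(uint32_t n, uint8_t out[3])
{
    out[0] = (uint8_t)((n >> 12) + 0xE0);
    out[1] = (uint8_t)(((n >> 6) & 63) + 0x80);
    out[2] = (uint8_t)((n & 63) + 0x80);
}

/* Proposed "UTF-8LE", 3-byte form: least significant bits go out first.
   Each step consumes the low bits of n, so no second copy is needed. */
void utf8le_encode3(uint32_t n, uint8_t out[3])
{
    out[0] = (uint8_t)((n & 15) + 0xE0);
    n >>= 4;
    out[1] = (uint8_t)((n & 63) + 0x80);
    n >>= 6;
    out[2] = (uint8_t)(n + 0x80);
}
```

Feeding U+20AC through both gives E2 82 AC and EC 8A 88, and U+FEFF gives EF BB BF and EF AF BF, matching the figures above.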

Name: Anonymous 2017-01-09 9:30

1. Lexicographic sorting
2. memcmp()
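
A small C sketch of what these two points seem to be getting at: with regular UTF-8, memcmp() on the encoded bytes orders strings by code point, while the low-bits-first layout proposed in this thread does not. The helper names are hypothetical and only the 3-byte form is handled.

```c
#include <stdint.h>
#include <string.h>

/* 3-byte UTF-8: high bits first. */
void enc3_utf8(uint32_t n, uint8_t out[3])
{
    out[0] = (uint8_t)(0xE0 | (n >> 12));
    out[1] = (uint8_t)(0x80 | ((n >> 6) & 63));
    out[2] = (uint8_t)(0x80 | (n & 63));
}

/* 3-byte "UTF-8LE" as proposed in this thread: low bits first. */
void enc3_utf8le(uint32_t n, uint8_t out[3])
{
    out[0] = (uint8_t)(0xE0 | (n & 15));
    out[1] = (uint8_t)(0x80 | ((n >> 4) & 63));
    out[2] = (uint8_t)(0x80 | (n >> 10));
}

/* Compare two code points by their encoded bytes, memcmp-style. */
int cmp_utf8(uint32_t a, uint32_t b)
{
    uint8_t x[3], y[3];
    enc3_utf8(a, x);
    enc3_utf8(b, y);
    return memcmp(x, y, 3);
}

int cmp_utf8le(uint32_t a, uint32_t b)
{
    uint8_t x[3], y[3];
    enc3_utf8le(a, x);
    enc3_utf8le(b, y);
    return memcmp(x, y, 3);
}
```

For U+0801 and U+0810, cmp_utf8 returns a negative value (E0 A0 81 sorts before E0 A0 90), but cmp_utf8le returns a positive one because the first bytes are E1 and E0.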

Name: Cudder !cXCudderUE 2017-01-10 11:15

>>41
1. Sorting by Unicode code point really only makes sense for the ASCII subset, and only with a binary collation. For everything else there's http://www.unicode.org/reports/tr10/

2. UTF-8LE has all the restrictions of regular UTF-8, like no overlong forms etc., and can be treated like UTF-8 strings for purposes of hash tables and other data structures.

Name: Anonymous 2017-01-10 11:28

stop being a namefag

Name: Anonymous 2017-01-10 12:29

Check dubs

Name: Anonymous 2017-01-10 13:06

>>43
stop being an imageredditor

Name: Anonymous 2017-01-10 14:01

>>45

defending namefags

fuck off

Name: Anonymous 2017-01-10 14:30

>>46

defending imageredditors

Name: Anonymous 2017-01-10 15:42

>>46
ebin

le implyin/g/

meme /g/ro!!!!!

Name: Anonymous 2017-01-10 17:21

Little endian is a kike right-to-left scam.

More significant (higher value) bits are on the left, because the goyim who designed bit shifting wrote in big endian.

Cudder doesn't even understand that even on his kiketel chips, you ``shift left'' to make bits more significant (increase their value). Left and right are goy, even though his scheme is kike.

Shift to finish:
00000000 00000000 aaaabbbb bbcccccc

Rotate to finish:
00000000 00000000 ccccccbb bbbbaaaa

The c, b, and a are backwards in his version, but forwards in our version, proof that it's a kike scam. The alphabet goes ABC, not CBA.

In decoding, he is using the upper part of the register as a temporary, disguised by partial register writes to the lower byte.

He also wrote his bytes in big endian to disguise the abomination of little endian. If he wasn't scamming about little-endian, it would look like this:

Rotate to finish:
bbbbaaaa ccccccbb 00000000 00000000

That is how they would be stored in memory on a little endian CPU.

Name: Anonymous 2017-01-11 4:26

>>49
easy to see

'goy': first = god/nature = greater
'kike': last = man/science = greater

Name: Anonymous 2017-01-11 4:29

(always been a big endian man myself :)

Name: Anonymous 2017-01-11 5:17

>>49
Little-endian makes more sense, because if you have a value small enough to fit in an unsigned byte, the address of that variable is the same regardless of whether its storage is specified as an unsigned byte or an unsigned dword.
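
A rough C illustration of that point, using explicit byte arrays so the result does not depend on the machine it runs on. The function names are hypothetical.

```c
#include <stdint.h>

/* Store v in little-endian order: buf[i] holds bits 8i..8i+7. */
void store_le32(uint8_t buf[4], uint32_t v)
{
    for (int i = 0; i < 4; i++)
        buf[i] = (uint8_t)(v >> (8 * i));
}

/* Store v in big-endian order: buf[0] holds the top 8 bits. */
void store_be32(uint8_t buf[4], uint32_t v)
{
    for (int i = 0; i < 4; i++)
        buf[i] = (uint8_t)(v >> (8 * (3 - i)));
}
```

After store_le32(buf, 200), the byte at buf[0] is 200, so reading the same address as a byte or as a dword gives the same value. After store_be32(buf, 200), the value sits at buf[3] and buf[0] is 0.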

Name: Cudder !cXCudderUE 2017-01-11 11:27

>>49
"left" and "right" are arbitrary physical directions, numbers are not. Roughly 90% of the "explanations" of endianness out there are misleading or incorrect. Any explanation that uses "left", "right", "first", "last", "up", "down", or any other physical direction fits into that category. Little endian bit ordering means bit address n has value 2ⁿ. Little endian byte ordering means byte address n has value 256ⁿ. Endianness is a property of address-to-value mappings, not physical directions. It doesn't matter one bloody bit (pun intended) what it looks like to you since that's just a physical manifestation of the data.

http://www.verilab.com/files/VL_Whitepaper_Endian_July2008.pdf

IHBT.
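
The address-to-value mapping described in this post can be written down directly. This is only a sketch with made-up function names, assuming len is at most 8 so the result fits in 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Little endian: the byte at address i contributes buf[i] * 256^i. */
uint64_t load_le(const uint8_t *buf, size_t len)
{
    uint64_t v = 0;
    for (size_t i = 0; i < len; i++)
        v |= (uint64_t)buf[i] << (8 * i);
    return v;
}

/* Big endian: the byte at address i contributes buf[i] * 256^(len-1-i). */
uint64_t load_be(const uint8_t *buf, size_t len)
{
    uint64_t v = 0;
    for (size_t i = 0; i < len; i++)
        v = (v << 8) | buf[i];
    return v;
}
```

For the bytes 78 56 34 12 at addresses 0..3, load_le returns 0x12345678 and load_be returns 0x78563412. Neither function mentions left or right anywhere.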

Name: Anonymous 2017-01-11 20:01

big endian is pretty convenient because reading numbers in left-to-right languages doesn't confuse you about what the data layout actually is.

people always say oh well you can just have a disassembler (or whatever) show you the bytes in readable order, but that's just stupid and bound to cause confusion

Name: Anonymous 2017-01-11 20:37

>>54
That rarely comes up these days, because flat binary files are so rarely used, so trying to read a binary without a disassembler is just making a lot of extra work for yourself, because you have to make sense of things like distinguishing the file header from the actual instructions. It makes more sense to focus on what makes things easier when working with asm/C and upwards.

Name: Anonymous 2017-01-11 21:02

>>55

>because flat binary files are so rarely used

Everyone uses XML and JSON in your world, right?

Name: Anonymous 2017-01-12 2:09

>>56
I'm talking about things like the .EXE format.

Name: Cudder !cXCudderUE 2017-01-12 12:14

>>54,55
LE doesn't confuse you either if you have a working brain. Little endian is the Logical choice. Big endian is Backwards.

Name: Anonymous 2020-08-05 11:35

>>1
This is only because x86 is shiiiiit

>>53
Big endian: there are 8000000000 people
little endian: there are 0000000008 people

Name: Anonymous 2020-08-05 12:44

>>58
You are a faggot, and a fucking retard.

Name: Astolfo 2020-08-06 14:18

I don't like that, too much space and not enough information.

Name: Anonymous 2020-08-07 8:27

1 	U+0000 	U+007F 	0xxx xxxx
2 	U+0080 	U+07FF 	110x xxxx 	10xx xxxx
3 	U+0800 	U+FFFF 	1110 xxxx 	10xx xxxx 	10xx xxxx
4 	U+10000 	U+10FFFF 	1111 0xxx 	10xx xxxx 	10xx xxxx 	10xx xxxx

It does look a bit clunky
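
The same table as a C function, in case it helps. The name is made up, the upper bound of the 4-byte row is taken to be U+10FFFF (the Unicode limit), and surrogates are not treated specially here.

```c
#include <stdint.h>

/* Number of bytes UTF-8 uses for code point cp, or 0 if cp is out of range. */
int utf8_length(uint32_t cp)
{
    if (cp <= 0x7F)     return 1;  /* 0xxx xxxx */
    if (cp <= 0x7FF)    return 2;  /* 110x xxxx + 1 continuation byte */
    if (cp <= 0xFFFF)   return 3;  /* 1110 xxxx + 2 continuation bytes */
    if (cp <= 0x10FFFF) return 4;  /* 1111 0xxx + 3 continuation bytes */
    return 0;
}
```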

Name: Anonymous 2020-08-07 8:38

The only good thing about padding every byte like that is you can check if any byte is a first byte?
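
That check is a single mask and compare. A minimal C sketch, with hypothetical function names, of testing for a continuation byte and stepping back to the start of a character from an arbitrary offset:

```c
#include <stddef.h>
#include <stdint.h>

/* Continuation bytes are exactly the ones of the form 10xx xxxx. */
int is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/* Step back from index i to the first byte of the character containing it. */
size_t char_start(const uint8_t *s, size_t i)
{
    while (i > 0 && is_continuation(s[i]))
        i--;
    return i;
}
```

On the bytes 41 E2 82 AC 42, char_start at index 2 or 3 lands on index 1, the E2 lead byte.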

Name: Anonymous 2020-08-07 8:44

So the first nibble is mostly redundant if you are already checking for 10xx to detect continuation bytes

eg 110x xxxx 	10xx xxxx   10xx xxxx 	10xx xxxx

Name: Anonymous 2020-08-07 9:12

0xxx xxxx 10xx xxxx 10xx xxxx 10xx xxxx is the largest

Name: Anonymous 2020-08-07 9:22

11zz could be a two bit space

00xx and 01xx can use 10xx extensions

Name: Anonymous 2020-08-07 9:36

1 	U+0000 	U+003F 	00zz xxxx
2 	U+0040 	U+007F 	01zz xxxx
3 	U+0080 	U+00BF 	10zz xxxx 	
4 	U+0080 	U+07FF 	00zz xxxx 	11xx xxxx
5 	U+0080 	U+07FF 	01zz xxxx 	11xx xxxx
6 	U+0080 	U+07FF 	10zz xxxx 	11xx xxxx

costs half a bit?

Name: Anonymous 2020-08-07 12:27

plus the two per byte of extension

1-1 bit / 0.5 bit
2-5 bit / 2.5
3-8 bit / 4.5
4-11 bit / 6.5

50% less control bits

Name: Anonymous 2020-08-23 20:56

(◞‸ლ)
You're still presuming people are still using x86 sweetie.
assemble in any other *ISC and your LE observation goes out the window.

when's your CISC?

Name: Anonymous 2020-08-24 0:30

You can have

0nnnnnnn -> ascii
0nnnnnnn 1nnnnnnn -> unicode between 16#80 to 16#407f
0nnnnnnn 1nnnnnnn 1nnnnnnn -> unicode between 16#4080 to 16#20407f

and the alternative is

0nnnnnnn -> ascii
10nnnnnn nnnnnnnn -> unicode between 16#80 to 16#407f
110nnnnn nnnnnnnn nnnnnnnn -> unicode between 16#4080 to 16#20407f

It is impressive how much they messed up something as simple as character encoding.
That way utf8 would need up to 3 bytes rather than 4.

>>69
RISC-V, ARM, etc are all LE.

Name: Anonymous 2020-08-24 2:18

I meant this for the first one:

0nnnnnnn -> ascii
1nnnnnnn 0nnnnnnn -> unicode between 16#80 to 16#407f
1nnnnnnn 1nnnnnnn 0nnnnnnn -> unicode between 16#4080 to 16#20407f
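
A C sketch of an encoder for this scheme. Two things are assumed because the post does not spell them out: the 7-bit groups are written most significant first, and each length class subtracts its starting code point (0x80 or 0x4080), which is what the stated ranges imply. The function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the number of bytes written (1..3), or 0 if cp > 0x20407F.
   The high bit of each byte is 1 when another byte follows. */
size_t varint_encode(uint32_t cp, uint8_t out[3])
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x4080) {
        uint32_t v = cp - 0x80;              /* 14 payload bits */
        out[0] = (uint8_t)(0x80 | (v >> 7));
        out[1] = (uint8_t)(v & 0x7F);
        return 2;
    }
    if (cp <= 0x20407F) {
        uint32_t v = cp - 0x4080;            /* 21 payload bits */
        out[0] = (uint8_t)(0x80 | (v >> 14));
        out[1] = (uint8_t)(0x80 | ((v >> 7) & 0x7F));
        out[2] = (uint8_t)(v & 0x7F);
        return 3;
    }
    return 0;
}
```

Under these assumptions U+0080 encodes as 80 00, U+407F as FF 7F, and U+4080 as 80 80 00, and U+10FFFF fits in three bytes.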

Name: Anonymous 2020-08-24 3:20

>>71
branch me seL4 with your changes and you got yourself a deal.

Name: Anonymous 2020-08-24 4:16

>>72
You do not need kernel support.

Name: Anonymous 2020-08-24 20:58

>>73
I need an entire distro's support.
If it can operate nicely on a daily driver I can give to my niece when she plays パネルでポン and I submit my taxes, then it's good.

Name: Cudder !cXCudderUE 2024-10-27 5:46

Another UTF-8 stupidity: overlong encodings. Instead of actually starting the 2-byte encodings at U+0080, they started it at U+0000 but just declared the first 128 codepoints encoded in 2 bytes to be invalid, introducing overlong forms and causing tons of extra edge-case complexity in every decoder.
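
To make the edge case concrete, here is a C sketch of a 2-byte decoder with the overlong check that standard UTF-8 requires, next to a hypothetical offset variant where the 2-byte range starts at U+0080 and no such check exists. Both function names are made up, and the offset variant is not part of any standard.

```c
#include <stdint.h>

/* Standard UTF-8, 2-byte form. Returns the code point, or -1 if the
   bytes are not a 2-byte sequence or encode a value below U+0080. */
int32_t utf8_decode2(uint8_t b0, uint8_t b1)
{
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return -1;
    int32_t cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
    if (cp < 0x80)
        return -1;                 /* overlong: fits in one byte */
    return cp;
}

/* Hypothetical offset variant: the payload is biased by 0x80, so every
   well-formed bit pattern maps to a distinct code point. */
int32_t offset_decode2(uint8_t b0, uint8_t b1)
{
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return -1;
    return (((b0 & 0x1F) << 6) | (b1 & 0x3F)) + 0x80;
}
```

With the standard decoder, C0 80 and C1 BF have to be rejected. With the offset variant, C0 80 is simply U+0080 and the 2-byte range extends to U+087F.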

Others have noticed this too:

https://stackoverflow.com/questions/69318921/why-does-the-utf-8-encoding-have-the-concept-of-overlongs

https://stackoverflow.com/questions/57431095/whats-the-rationale-for-utf-8-to-store-the-code-point-directly

Name: Anonymous 2024-11-05 22:53

>>6

Why is this level of optimization so important for Unicode? It's no longer 1980, we can actually afford multi-GHz CPUs and many gigabytes of RAM.

>It's no longer 1980,
Soon it will be.
We're in WW3

Name: Anonymous 2024-11-25 9:29

ur a dumb cunt cudder

Name: Anonymous 2024-12-13 19:05

you shouldnt use that kind of language with a lady

Name: Anonymous 2025-01-25 9:36

This thing about Unicode being an inefficient mess: it doesn't matter if you have the hardware, it's still a completely disgusting design that requires tons of edge-case handling and clever workarounds.
The obsession with adding useless emojis and sponsored symbols, the whole idea of a malleable, updatable translation layer, is something you'd expect if text were controlled by a corporation and treated as a product instead of a concrete standard. Normies will not understand until it's 90% emoji and trademark symbols, but by then it will be something cringe like the PDF format (with typesetting incorporated as modifier symbols) that requires a specialized library to view.

Name: Anonymous 2025-01-28 23:13

💩
