/prog/ - Why browsers are bloated [Part 2]

1 Name: Anonymous 2016-04-23 22:49

Cudder is all talk and no action!

135 Name: Cudder !cXCudderUE 2017-01-16 12:30

I've discovered that all of the HTML5 character entity references (&xxxx;) are at least as long as the codepoints they represent in UTF-8, meaning that I can do the conversion in-place or on an allocation the exact same size as the original string...

...except for these two bastards, whose UTF-8 encoding is one byte longer than the reference itself:

&nGt; U+226B U+20D2 ; E2 89 AB E2 83 92
&nLt; U+226A U+20D2 ; E2 89 AA E2 83 92

Two out of 2000+! I'm not going to support those, because it's absolutely idiotic to have to redo and complexify the whole buffer allocation logic just to handle the 0.1% of them that will probably almost never occur in real pages anyway.
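The length claim is easy to check mechanically. A minimal sketch, assuming Python's standard html.entities.html5 table matches the spec's entity list; the helper name expanding_entities is made up for illustration:

```python
from html.entities import html5  # maps name (without the '&') to its expansion

def expanding_entities():
    """Return references whose UTF-8 expansion is longer than the reference."""
    longer = []
    for name, expansion in html5.items():
        ref_len = len("&" + name)                 # names are ASCII: chars == bytes
        out_len = len(expansion.encode("utf-8"))  # size after conversion
        if out_len > ref_len:
            longer.append(("&" + name, ref_len, out_len))
    return sorted(longer)

# Print every reference that would not fit in place.
for ref, ref_len, out_len in expanding_entities():
    print(ref, ref_len, "->", out_len)
```

Anything this prints cannot be converted in a buffer the size of the original string.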

What's even more idiotic is that if someone had paid some attention to implementation and made the names just a single byte longer, e.g. &nGGt; or &nLLt;, they would've fit in perfectly with the rest of them. But that might be too much to expect of the W3C.
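For anyone wondering what the in-place conversion looks like, here is a rough sketch, not Cudder's actual code. Assumptions: the buffer is a Python bytearray, only semicolon-terminated names are handled, the lookup table is Python's html.entities.html5, and decode_in_place is a made-up name. References whose expansion would not fit are simply left alone.

```python
from html.entities import html5

MAX_NAME = max(len(n) for n in html5)  # longest name, counting the ';'

def decode_in_place(buf: bytearray) -> None:
    """Decode &name; references inside buf without ever growing it."""
    r = w = 0  # read and write cursors; w never overtakes r
    n = len(buf)
    while r < n:
        if buf[r] == 0x26:  # '&'
            end = buf.find(b";", r + 1, r + 1 + MAX_NAME)
            if end != -1:
                name = buf[r + 1:end + 1].decode("ascii", errors="replace")
                out = html5.get(name, "").encode("utf-8")
                ref_len = end + 1 - r
                # Only substitute when the expansion fits in the bytes
                # already consumed; this skips the two oversized ones.
                if out and len(out) <= ref_len:
                    buf[w:w + len(out)] = out  # same-length slice write
                    w += len(out)
                    r = end + 1
                    continue
        buf[w] = buf[r]  # ordinary byte, or a reference we leave untouched
        w += 1
        r += 1
    del buf[w:]  # drop the unused tail

buf = bytearray(b"1 &lt; 2 &amp;&amp; x &nGt; y")
decode_in_place(buf)
print(buf)  # &nGt; is left as written because its expansion would not fit
```

Since the write cursor only ever trails the read cursor, no second allocation is needed, which is the whole point of the "at least as long" observation.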

"Plan Ahead"

138 Name: Anonymous 2017-01-16 16:36

>>137
Wtf man. Of course not.
http://www.w3schools.com/html/html_entities.asp
It's just that ≪ and ≫ won't be supported, all other 2000(+) entities will be.

HIBT? Can't you google simple shit like this yourself?

159 Name: Cudder !cXCudderUE 2017-02-02 11:39

Upon closer examination, there's even more retardedness in the entities list --- some are defined more than once! What sort of fucked-up design-by-committee led to this idiocy?

We start with some "only slightly retarded" duplication...

ast;    U+0002A
midast; U+0002A

lbrack; U+0005B
lsqb;   U+0005B

...move on to WTF-inducing "you're an idiot if you think this is even the slightest bit useful"...

lowbar;   U+0005F
UnderBar; U+0005F

grave;            U+00060
DiacriticalGrave; U+00060

nbsp;             U+000A0
NonBreakingSpace; U+000A0

...and finish with "ARE YOU FUCKING INSANE!?!?"

die;       U+000A8
Dot;       U+000A8
DoubleDot; U+000A8
uml;       U+000A8

ap;          U+02248
approx;      U+02248
asymp;       U+02248
thickapprox; U+02248
thkap;       U+02248
TildeTilde;  U+02248

Bonus level:

NegativeMediumSpace;   U+0200B
NegativeThickSpace;    U+0200B
NegativeThinSpace;     U+0200B
NegativeVeryThinSpace; U+0200B
ZeroWidthSpace;        U+0200B

Completely different names, yet the exact same codepoint. :quintuple-facepalm:

See it yourself at https://www.w3.org/TR/html5/syntax.html (scroll to bottom, extract table, sort by codepoint.)
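The "extract table, sort by codepoint" step can be reproduced without scraping the spec page. A sketch, assuming Python's html.entities.html5 table is a faithful copy of that table; names_by_expansion is a hypothetical helper:

```python
from collections import defaultdict
from html.entities import html5

def names_by_expansion():
    """Group semicolon-terminated names by the text they expand to."""
    groups = defaultdict(list)
    for name, expansion in html5.items():
        if name.endswith(";"):  # ignore the legacy semicolon-less spellings
            groups[expansion].append(name)
    # Keep only expansions that have more than one name.
    return {exp: sorted(names) for exp, names in groups.items() if len(names) > 1}

dupes = names_by_expansion()
for exp in sorted(dupes, key=lambda e: [ord(c) for c in e]):
    codepoints = " ".join(f"U+{ord(c):05X}" for c in exp)
    print(codepoints, " ".join(dupes[exp]))
```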

Now I know why there are 2K+ entities. Around half of them are duplicates that differ only by the ';' at the end (easily handled by the parsing code, but the brainless turds that wrote the spec did not even think...), another 1/4 are useless duplicates, and what's left is possibly, maybe sometimes, actually useful. But supposedly to be "HTML5 compliant" you would need to parse them all, regardless of whether anyone will actually use them outside of demo pages and the like (probably not). Fuck that bullshit.
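The proportions above are rough guesses, and anyone who wants exact figures can count them. A sketch under the same assumption that Python's html.entities.html5 mirrors the spec table; tally and the dictionary keys are invented for illustration:

```python
from html.entities import html5

def tally():
    """Split the entity table into the categories discussed above."""
    no_semicolon = [n for n in html5 if not n.endswith(";")]
    # Legacy spellings that also exist with ';' and expand to the same text.
    legacy_dupes = [n for n in no_semicolon if html5.get(n + ";") == html5[n]]
    with_semicolon = {n: e for n, e in html5.items() if n.endswith(";")}
    distinct = len(set(with_semicolon.values()))
    return {
        "total": len(html5),
        "without_semicolon": len(no_semicolon),
        "legacy_dupes": len(legacy_dupes),
        "extra_aliases": len(with_semicolon) - distinct,  # same text, other name
        "distinct_expansions": distinct,
    }

for key, value in tally().items():
    print(f"{key:20} {value}")
```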

"Why browsers are bloated".

160 Name: Anonymous 2017-02-02 11:49

>>159
Horrible!
I hope in the end you report all this bullshit so they update their spec.

162 Name: Anonymous 2017-02-02 14:54

>>161
So you think it's an intelligent design to have duplications like those?
Get real.

174 Name: Anonymous 2017-02-03 14:05

>>162
The problem is an identity problem, not bloat. The committee overlooked this specific part of the standard, and it is good to point it out to them so that they can assign unique identities (codepoints) to the characters. It would only be insane if they responded by saying this is "not a bug, it's a feature".

175 Name: Anonymous 2017-02-03 14:49

I looked at HTML 5.1 spec.
Not fixed there either.
https://www.w3.org/TR/html51/syntax.html

This is where you go to file an issue.
https://github.com/w3c/html/issues

177 Name: Anonymous 2017-02-03 14:55

I looked at the WHATWG spec.
Not fixed there either.
https://html.spec.whatwg.org/multipage/syntax.html

This is where you go to file an issue.
https://github.com/whatwg/html/issues

179 Name: Cudder !cXCudderUE 2017-02-05 2:29

>>174
No, they need to remove them completely because they're essentially useless. The original purpose of named entities, besides escaping (e.g. &gt;), was so you could use Unicode/ISO 10646 characters in a file with a non-Unicode encoding. With the proliferation and recommendation of UTF-8, the need for that has decreased significantly.
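To make that original use case concrete, here is a small sketch that squeezes Unicode text into pure ASCII using named references. It relies on html.entities.codepoint2name from Python's standard library, which covers only the HTML 4 names; to_ascii_html is a made-up helper, and the numeric fallback is one choice among several.

```python
from html.entities import codepoint2name  # HTML 4 names only, e.g. 233 -> 'eacute'

def to_ascii_html(text: str) -> str:
    """Rewrite text so it survives in an ASCII-only file."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in "&<>":
            out.append(f"&{codepoint2name[cp]};")  # the escaping use
        elif cp < 128:
            out.append(ch)                         # plain ASCII passes through
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # the non-Unicode-file use
        else:
            out.append(f"&#{cp};")                 # no HTML 4 name: numeric reference
    return "".join(out)

print(to_ascii_html("café > thé"))
```

With a UTF-8 file only the first branch is still needed, which is the point being made above.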

>>175-177
I'm not going to get into the whole W3C vs WHATWG debate, but if anyone wants to tell them about this bloat, they should tell both of them.

But the fact that absolutely no one on either committee discovered or pointed it out is an epic fail. These are people whose main job is to read and discuss the spec, and yet apparently none of them saw it or decided to say something...

Here's a cleaned-up entity list: fewer than 1.5K entities, down from the original 2K+. In other words, 25% of the spec was bloatshit:

http://pastebin.com/2gth3uPv
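The rule used to build the pastebin list isn't stated, so the following is only a guess at how a similar list could be generated: drop the semicolon-less spellings and keep one name per expansion. Picking the shortest name (ties broken alphabetically) is my own assumption, as is the helper name one_name_per_expansion; the table again comes from Python's html.entities.html5.

```python
from html.entities import html5

def one_name_per_expansion():
    """Return a reduced table with exactly one name for each expansion."""
    best = {}  # expansion -> chosen name
    for name, expansion in html5.items():
        if not name.endswith(";"):
            continue  # the parser can accept a missing ';' on its own
        current = best.get(expansion)
        # Assumed rule: prefer the shortest name, then alphabetical order.
        if current is None or (len(name), name) < (len(current), current):
            best[expansion] = name
    return {name: expansion for expansion, name in best.items()}

cleaned = one_name_per_expansion()
print(len(html5), "->", len(cleaned))
```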
