
RLE Blitting

Name: Anonymous 2019-08-17 15:20

Ok. Transitioning to RLE blitting hasn't improved the performance that much - just a 20% speedup, while code complexity greatly increased. One thing I noticed while measuring performance (for both the RLE and non-RLE code) was that at times my static code completed two times faster, which should be impossible because the test used all static data (a sprite blitted a million times in a loop), the only variable being branch prediction; CPU load is at 0%, and no syscalls are made inside the measured code. What does that even mean? Branch misprediction does affect performance, but not by a factor of two in the long run, because the predictor would quickly retrain by the thousandth iteration.

Broken scheduling, or OSX intentionally slowing down the code? Or maybe the Intel CPU itself does that? My MacBook is relatively old, so if it had any time bomb, it would have been activated by now. Or maybe that is the infamous Meltdown fix slowing my code down two times? How does one disable the Meltdown patch? For Linux there is https://make-linux-fast-again.com/, but what about OSX? I don't care about security - it is overrated.
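
Roughly, the inner loop I mean is shaped like this (a stripped-down sketch, not my actual code - the RleSpan layout and names are just for illustration):

```c
#include <stdint.h>

/* One RLE span: skip `skip` transparent pixels, then copy `count` opaque pixels.
   A span with count == 0 terminates the line. */
typedef struct {
    uint16_t skip;
    uint16_t count;
} RleSpan;

/* Blit one RLE-encoded scanline into dst. No clipping, no blending. */
static void blit_rle_line(uint32_t *dst, const RleSpan *spans, const uint32_t *pixels)
{
    for (;;) {
        dst += spans->skip;                 /* jump over the transparent run */
        uint16_t n = spans->count;
        if (n == 0)
            break;                          /* end-of-line marker */
        for (uint16_t i = 0; i < n; i++)    /* copy the opaque run */
            dst[i] = pixels[i];
        dst += n;
        pixels += n;
        spans++;
    }
}
```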

Name: Anonymous 2019-08-17 15:24

Name: Anonymous 2019-08-17 15:43

Name: Anonymous 2019-08-17 16:13

Also, for the blitter, GCC-compiled code is 30% faster than the Clang-compiled one. If compilers produce slower code as the years pass, then what is that if not planned obsolescence?

Name: Anonymous 2019-08-17 16:49

In GCC, __builtin_expect makes the code faster, while in Clang it makes it slower. As if Clang intentionally uses that hint to incur mispredictions.

Name: Anonymous 2019-08-17 20:46

>>1
Are you using SIMD intrinsics?

Name: Anonymous 2019-08-17 21:20

>>6
Nope. It is 100% pure C integer code, doing RLE pixel skips or just copying pixels.

Regarding __builtin_expect, nemequ at stackoverflow explained that this is what one should expect from it, because it actually does some unexpected black magic:
https://stackoverflow.com/questions/57538301/clang-mishandles-builtin-expect?noredirect=1#comment101543316_57538301

https://pastebin.com/S8Y8tqZy
`__builtin_expect` can alter lots of different optimizations with different trade-offs…

Your assumption about compiler writers optimizing for their own architecture is probably invalid. You can control exactly which architecture the code is tuned for (see the `-mtune` option). There may still be a bit of bias in instruction selection, but for the most part the instructions are chosen automatically.

It also doesn't help that until recently (GCC 9, IIRC) there was no set probability for what `__builtin_expect` meant. Sometimes you would see a slowdown if it failed more than around 1% of the time, other times it's more like 10%. GCC recently added a `__builtin_expect_with_probability` and defined the probability for `__builtin_expect` to be 90%, I'd suggest taking a look at using that. Unfortunately clang hasn't (yet?) picked it up, but in the meantime you can use a macro like [`HEDLEY_PREDICT`](https://nemequ.github.io/hedley/api-reference.html#HEDLEY_PREDICT), which has a few possible definitions depending on the availability of `__builtin_expect_with_probability` and `__builtin_expect`:

```c
/* if __builtin_expect_with_probability is available: */
# define HEDLEY_PREDICT(expr, expected, probability) \
    __builtin_expect_with_probability(expr, expected, probability)
/* otherwise, if __builtin_expect is available: */
# define HEDLEY_PREDICT(expr, expected, probability) \
    (((probability) >= 0.9) ? __builtin_expect(!!(expr), (expected)) : (((void) (expected)), !!(expr)))
/* otherwise, no hint at all: */
# define HEDLEY_PREDICT(expr, expected, probability) (((void) (expected)), !!(expr))
```
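
For usage it is just a drop-in where you'd otherwise put __builtin_expect. A toy example (mine, not from the answer; the 0.95 is an arbitrary probability):

```c
#include "hedley.h"  /* single-header library providing HEDLEY_PREDICT */

/* Hint that x is almost always non-negative; degrades to plain
   __builtin_expect, or to no hint at all, on older compilers. */
static int clamp_positive(int x)
{
    if (HEDLEY_PREDICT(x >= 0, 1, 0.95))
        return x;
    return 0;
}
```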

Name: Anonymous 2019-08-17 21:48

Clang taught me one thing: const int var = 123; is slower than just int var = 123;, which is slower than #define VAR 123. Why? I've no frigging idea, but clock() from time.h says the code using #define is the fastest. And I always believed that nonsense about const being purely decorative. Yet when it comes to compiler optimizations, const actually influences the size of the optimization tree the compiler searches, while a define is just preprocessor substitution.
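
Measured with something along these lines (a stripped-down stand-in, not the real benchmark - the real loop was the blitter, and the result obviously depends entirely on what the optimizer does with it):

```c
#include <stdio.h>
#include <time.h>

#define VAR 123   /* replace with `const int var = 123;` or `int var = 123;` to compare */

int main(void)
{
    volatile unsigned sink = 0;   /* stops the loop from being folded away entirely */

    clock_t t0 = clock();
    for (long i = 0; i < 100000000L; i++)
        sink += VAR;
    clock_t t1 = clock();

    printf("%.3f s (sink=%u)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, sink);
    return 0;
}
```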

Name: Anonymous 2019-08-18 0:23

>>7
stackoverflow is delete

Name: Anonymous 2019-08-18 5:02

>>8
A define is a literal; it's just inserted as-is into the code.
A const int reserves a separate read-only memory location, which is slower when there are very few variables - and unless the compiler knows it isn't needed in other units, its CSE fails to eliminate the load/store.
int var is a normal variable on the stack, usually optimized out.
Mystery solved.

Name: Anonymous 2019-08-18 7:26

>>10
References to const int var = 123; should just be replaced by its value. In Clang's case it failed to eliminate const vars, and branches on them, even inside a single large routine. Although I have to note that the code that failed had something like a thousand routines, so Clang probably limited the analysis depth to avoid compilation hanging altogether. It is easy to analyze a program with just a few vars, but when there are a thousand vars it is near impossible. IIRC, source code analysis cost grows as 2^N * V, where N is the number of branches involved and V is the number of variables.

So yeah, my advice: don't overwhelm the compiler, always split your code into small chunks, and use #ifdef instead of if (const_var) or if (literal) branches - see the sketch below. There is no and never will be a "sufficiently smart compiler". Edited on 18/08/2019 07:28.
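
I.e. something like this (toy sketch) - decide the branch at preprocessing time instead of hoping the optimizer proves it dead:

```c
#define USE_RLE 1          /* decided before the compiler ever sees the code */

static void blit(void)
{
#if USE_RLE
    /* ... RLE path ... */
#else
    /* ... plain path ... */
#endif
}

/* versus the version that hopes the optimizer folds the branch away: */
static const int use_rle = 1;

static void blit_hoping(void)
{
    if (use_rle) {
        /* ... RLE path ... */
    } else {
        /* ... plain path ... */
    }
}
```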

Name: Anonymous 2019-08-18 9:17

Ok. With all the fiddling I got a 40 FPS boost - that is 130 FPS, up from the previous 90 FPS. Basically it involved moving stuff around randomly and checking the resulting speed. Edited on 18/08/2019 09:18.

Name: Anonymous 2019-08-18 10:22

>references to const int var = 123; should be just replaced by its value
That is called constant propagation. It's not automatic: it requires the compiler to know the variable isn't accessed from outside the unit and isn't aliased by some other variable (which could swap/switch values, e.g. dynamic array pointers).
Basically it reserves a pointer to memory, unless it can prove that is completely unnecessary.

Name: Anonymous 2019-08-18 11:55

>>13
>it requires the compiler to know the variable isn't accessed from outside the unit

C/C++ is inherently flawed because all variables are exported by default, instead of being local to the translation unit.

Name: Anonymous 2019-08-18 13:03

Another strange thing... in some cases
x = b;
if (Condition) x = a;

is faster than
if (Condition) x = a;
else x = b;


Guess the compiler manages to insert that x = b; into some branch delay slot, so it becomes free. It is funny how such a little, seemingly unimportant detail can mean the difference between 15 and 30 FPS in a game.

Name: Anonymous 2019-08-18 13:17

>>15
What about x=(Condition)?a:b;

Name: Anonymous 2019-08-18 13:43

>>16
Depends on the surrounding code, where it gets inlined. Benchmark it.

Name: Anonymous 2019-08-18 14:21

Another finding: switch(index) can be slower than table[index](state_pointer).

Probably because the compiler has a smaller search space inside a localized function body than inside a single large function with a huge switch.
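
Roughly this shape (sketch - the real handlers do the actual blitting, these are stubs):

```c
#include <stdint.h>

typedef struct {
    uint32_t *dst;
    const uint32_t *src;
    int count;
} State;

static void op_skip(State *s)
{
    s->dst += s->count;
}

static void op_copy(State *s)
{
    for (int i = 0; i < s->count; i++)
        s->dst[i] = s->src[i];
    s->dst += s->count;
    s->src += s->count;
}

/* Each opcode gets its own small function body instead of a case
   inside one giant switch. */
static void (*const table[])(State *) = { op_skip, op_copy };

static void run(State *s, const uint8_t *ops, int n)
{
    for (int i = 0; i < n; i++)
        table[ops[i]](s);   /* table[index](state_pointer) */
}
```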

Name: Anonymous 2019-08-18 17:03

>haven't improved the performance that much - just 20% speedup
that sounds like a pretty good improvement to me?

Name: Anonymous 2019-08-18 17:52

>>19
>that sounds like a pretty good improvement to me?
I don't know.

Name: Anonymous 2019-08-18 19:26

>>19
Good improvements are about orders of magnitude. 20% is not noticeable by the user.

Name: Anonymous 2019-08-18 19:41

>>21
20% means I can spend it on additional effects, like an HDR lighting system. Or the game will run on smartphones, instead of being PC-only.

Name: Anonymous 2019-08-18 20:10

GCC vs Clang appears to be a highly political subject:
https://news.ycombinator.com/item?id=17617043
https://lwn.net/Articles/582697/

no wonder people get angry when I complain that Clang produces slow code.

Name: Anonymous 2019-08-18 22:36

>>23
people want to prepare themselves for the inevitable: gcc going the way of Emacs (extinct) and clang becoming the default. if they convince themselves that clang is better now, they don't have to deal with switching in the future.

Name: Anonymous 2019-08-18 23:06

>>24
Clang = SJWs
GCC = alt-rights
RMS = Hitler/Stalin


prove me wrong.

Name: Anonymous 2019-08-19 21:40

>>9
please help

Name: Anonymous 2019-08-20 19:43

Changing the pixel size from uint32_t to uint64_t slows the code down 1.65 times. That is the cost of 64-bit data vs 32-bit. I.e. instead of 30 FPS in your game you will get 18. Given that, it still makes sense to use 32-bit or even 16-bit data where possible.

Name: Anonymous 2019-08-20 19:46

>>27
Although x86 has pipelining and the prefetch instruction, which can preload data if the computation takes longer than the memory load. So yeah, large data sizes favor more complex computations.

Name: Anonymous 2019-08-21 21:40

>>27
You do not need 64 bits per pixel. You don't need 32 bits per pixel either. 21 bits per pixel (7 bits per color, so 3 pixels per 64 bits) should be enough, but it will ruin any chances for SIMD.
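
Packing and unpacking would just be shifts and masks (sketch; the layout here is one arbitrary choice):

```c
#include <stdint.h>

/* Pack three 21-bit pixels (7 bits per channel) into one 64-bit word.
   Pixel 0 goes into the low bits; the top bit of the word stays unused. */
static uint64_t pack3(uint32_t p0, uint32_t p1, uint32_t p2)
{
    return ((uint64_t)(p0 & 0x1FFFFF))
         | ((uint64_t)(p1 & 0x1FFFFF) << 21)
         | ((uint64_t)(p2 & 0x1FFFFF) << 42);
}

static uint32_t unpack3(uint64_t w, int i)   /* i = 0, 1 or 2 */
{
    return (uint32_t)((w >> (21 * i)) & 0x1FFFFF);
}
```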

Name: Anonymous 2019-08-21 21:41

Funny thing is that the whole of Unicode fits in 21 bits. Maybe modern processors should add SIMD instructions specifically for 21-bit values.

Name: Anonymous 2019-08-22 3:49

>>15
x = b
if (cond) x = a

compiles to
mov x b
bnc cond out
mov x a
out:


while
if (cond) x = a
else x = b

compiles to
bnc cond else
mov x a
jmp out
else:
mov x b
out:

Which is two jumps. Jumps are expensive. More expensive than movs at least.

(assuming that things like arm thumb are not used)

Name: Anonymous 2019-08-22 3:55

>>16
MIPS and RISC-V do not have instructions with 4 registers. In general they missed a lot of opportunities, such as 21-bit immediate loads (6-bit opcode, 5-bit register, 21-bit immediate), which would let you set a 64-bit register in 3 instructions.

Name: Anonymous 2019-08-22 7:06

blit my dubs

Name: Anonymous 2019-08-22 10:35

>>29
The problem is that you do need them, if you want HDR lighting.

Anyway, if you just need gamma-correct output, then I found it is cheaper to use 32-bit RGB and do the gamma packing and unpacking on the fly than to use 64-bit unpacked RGB. Same with HDR: it is cheaper to repack RGB into a smaller gamma than to use a plain 64-bit-per-pixel format, unless you use SIMD, in which case a plain half-float format will be a bit faster. So much for my hope that pipelining would solve the memory bandwidth problem. I've also introduced my own 32-bit HDR format, called LUV, but the catch is that it needs an additional 8 bits for the alpha channel, because L is 16-bit and U,V are 8-bit. Making L 8-bit would greatly reduce quality. Although one could steal 2 bits from the alpha channel and 2 bits from UV to get a 12-bit L, which would actually make it a passable alternative to RGB.

TLDR: either use the GPU or limit your graphics to 32 bits per pixel.
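
For reference, the LUV packing itself is trivial (sketch; the bit layout is one arbitrary choice, and the RGB<->LUV conversion math is omitted):

```c
#include <stdint.h>

/* 32-bit LUV pixel: 16-bit L (luminance), 8-bit U, 8-bit V.
   Alpha would need its own 8 bits on top of this, or bits stolen
   from L/U/V as described above. */
static uint32_t luv_pack(uint16_t l, uint8_t u, uint8_t v)
{
    return (uint32_t)l | ((uint32_t)u << 16) | ((uint32_t)v << 24);
}

static void luv_unpack(uint32_t p, uint16_t *l, uint8_t *u, uint8_t *v)
{
    *l = (uint16_t)(p & 0xFFFFu);
    *u = (uint8_t)((p >> 16) & 0xFFu);
    *v = (uint8_t)((p >> 24) & 0xFFu);
}
```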

Name: Anonymous 2019-08-22 14:20

The difference between no -march flag and -march=native with Clang is an increase from 60 to 65 FPS.

Name: Anonymous 2019-08-23 2:04

>>30
Unicode is meant to be ever expanding as languages evolve; "21 bits" won't be enough for the next century.
It's also a shit standard that refuses properly documented languages, however colloquial they may be.
All it needs is a bitstream op for clean 8 bits. I don't comprehend why ANSI picked 7 bits for their charset.
>>35
Citations?

Name: Anonymous 2019-08-23 9:34

>>36
>Citations?
Personal experience.

Name: Anonymous 2019-08-23 10:16

>>37
so your're are citing your're are anus

Name: Anonymous 2019-08-24 1:51

Name: Anonymous 2019-08-28 7:16

>>38
fuck off back to academia, retard

Name: Anonymous 2019-08-28 7:28

>>40
hax my anus

Name: Anonymous 2019-08-28 13:08

fuck my anus

Name: Anonymous 2019-08-29 20:58

>>42
You wish

Name: Anonymous 2019-09-01 2:35

Pork my pignus

Name: Anonymous 2019-09-02 13:23

>>44
That's all, Folks!

Name: Anonymous 2019-09-02 21:07

I was looking for some efficient alternative to mutexes, one which doesn't put the thread to sleep, but found none; generally people recommend nonsense like refactoring and decoupling the code (wasting a lot of time for nothing) instead of inserting multithreading as a cheap speedup into the existing codebase. When I suggested the superior alternative of just doing while(!signaled); I immediately got downvoted:
https://stackoverflow.com/questions/6460542/performance-of-pthread-mutex-lock-unlock/57749968#57749968

Typically every good answer gets heavily downvoted for offering some simple solution which is not "good practice". I.e. you propose using a key-value database as an alternative to SQL, but instead of good argumentation against KV DBs you will hear autistic screeching about how SQL was the product of much experience and several PhD papers, and therefore everyone should follow SQL teaching like it is some holy Quran. Well, you know what? Haskell also grew out of the experience of how bad side effects are, and also helped with making PhDs, but you won't be using Haskell in any non-toy project.

Generally there are two kinds of programmers:
1. Programmers who write actual code, which solves the problem.
2. The retards who, instead of code, write unit tests with getters/setters the whole time, because that is the "good practice" recommended by some iconic bible by some deranged lunatic like Bjarne Stroustrup. These same programming Nazis will scold you for doing "#define PI" instead of "const double pi", for using an indentation style they dislike (I find it useful to put `;` and `,` before statements), or for not prefixing member variable names with "m_".

Ideally there should be some IQ test, so all such autistic retards could be identified and sent to a country designed for people with special needs. Like they have villages for blind people. Edited on 02/09/2019 21:30.

Name: Anonymous 2019-09-02 23:04

>>46
6/10; nice to read. Highlights are the edited line at the bottom and the criticism of getters and setters, which is indeed correct: accessors/mutators are just retarded, or meant for stupid languages with not enough features to provide for that if ever necessary. But the ; and , before statements is just too obvious.

Name: Anonymous 2019-09-02 23:17

>Ideally there should be some IQ test, so all such autistic retards could be identified and sent to a country designed for people with special needs. Like they have villages for blind people.
Is this your goal for Russia once your plans to destroy it succeed?

Name: Anonymous 2019-09-03 13:25

Sending all special needs people to special needs cities would make life simpler though.

Name: Anonymous 2019-09-04 7:01

>>46
busy waiting is slow as fuck compared to waiting on a mutex though. your're are hogging a core with constant checks of the volatile variable, negating most of the things you gain by spawning a thread. an alternative that would be faster and avoid most of the mutex-related overhead (but would be fairly risky when it comes to race conditions/deadlocks/livelocks) would be a busy wait coupled with a usleep().

Name: Anonymous 2019-09-04 7:54

It's safer to just do something like:

while (owned):
usleep(100)
owned = 1
...
owned = 0


There. No PhDs or anything.
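
In actual C you'd want the flag to be atomic so the check and the set can't race. Still just a spinlock with a nap (a minimal C11 sketch):

```c
#include <stdatomic.h>
#include <unistd.h>

static atomic_flag owned = ATOMIC_FLAG_INIT;

static void lock_cheap(void)
{
    while (atomic_flag_test_and_set_explicit(&owned, memory_order_acquire))
        usleep(100);   /* take a nap instead of burning the core */
}

static void unlock_cheap(void)
{
    atomic_flag_clear_explicit(&owned, memory_order_release);
}
```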

Name: Anonymous 2019-09-04 7:55

>>50
>busy waiting is slow as fuck
Unless the threads are tightly coupled and each core is already used at 100%. I.e. one core writes something like 100 samples into a buffer and another core immediately applies some effect to them. Using a mutex there would add a huge slowdown, so it would make no sense to use two cores at all.

Just make sure you have enough free cores for that, possibly nicing your process to -10. That is enormously bad practice, but it works.

Name: Anonymous 2019-09-04 7:57

>>51
In my case you need nanosleep, not usleep.

Name: Anonymous 2019-09-04 8:01

>>53
I.e. I have 1000000 little sprites, like pixel-sized particles: one thread asks to draw a sprite, another immediately starts drawing, so any amount of sleep would kill the performance, because one sprite can be drawn in a microsecond. The "good practice" would be refactoring the code to buffer requests in a bigger queue, but refactoring is one of those autistic practices which doesn't add any new feature, besides making the code more bloated and complex.

Name: Anonymous 2019-09-04 9:10

>>52
>Just make sure you have enough free cores for that
download moar coars

Name: Anonymous 2019-09-04 9:11

nanosleep doesn't work as intended unless the kernel is real-time. Coarse-grained multithreading means dealing with random delays of several milliseconds at least.

Name: Anonymous 2019-09-04 9:16

Just run graphics as a single separate thread (this approach also works with audio, input, etc. for seamless low-latency feedback). If your game needs multi-threaded graphics, it's overengineered BS that should be replaced with a game engine.

Name: Anonymous 2019-09-04 9:22

>>57
his game is a turn-based strategy game with sprite-based graphics; it can work without noticeable hiccups on a single thread. he's just bikeshedding performance to avoid doing any real work, and to have something to complain about

Name: Anonymous 2019-09-04 10:13

>>58
I'm using the game as an opportunity to learn various threading nuances. And the more FPS I have, the more effects I can do without using OpenGL. Ideally I want software trilinear sampling to scale the view, but that is a lot of work.

Name: Anonymous 2019-09-04 10:18

>>57
A decoupled thread still needs to communicate with the part giving the draw orders. So there is a choice between refactoring to establish a large queue of draw requests with normal mutexes for communication, or keeping the existing codebase and just doing the while(locked); busy loop. Obviously smartphone users won't thank me for doing that and discharging their battery, but who cares about users today?

Name: Anonymous 2019-09-04 10:21

>>56
Well. One could spin for 1000000 iterations, and if there is still no request, start sleeping for increasingly larger periods. But that is overengineering for a small indie game, which doesn't have to be nice to the rest of the software running inside the OS.
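
Something like this, if one bothered (sketch; `requested` stands in for whatever flag the draw order sets):

```c
#include <stdatomic.h>
#include <unistd.h>

static void wait_for_request(atomic_int *requested)
{
    /* spin hard for a while - a request usually arrives within microseconds */
    for (long i = 0; i < 1000000L; i++)
        if (atomic_load_explicit(requested, memory_order_acquire))
            return;

    /* still nothing: back off with growing sleeps, capped at 1 ms */
    useconds_t us = 1;
    while (!atomic_load_explicit(requested, memory_order_acquire)) {
        usleep(us);
        if (us < 1000)
            us *= 2;
    }
}
```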
