I've been tweaking my fibs implementation and it still seems slow in some cases.
How could I make it faster? Any ideas? Any obvious optimizations I'm missing here?
fibs:
        cmp  rdi, 1
        ja   .l1
        mov  rax, rdi
        ret
.l1:
        dec  rdi
        push rdi
        call fibs
        pop  rdi
        push rax
        dec  rdi
        call fibs
        pop  rcx
        add  rax, rcx
        ret
Name: Anonymous 2015-12-18 18:42
Nice topic, OP. Your example is too slow to run any meaningful benchmark, sorry, but let's look at some other entries to see why it's tricky to write fast asm by hand these days:
// semantically close to the above
static inline long fib_slow(long n)
{
    long a, b, t;
    a = 0; b = 1;
    do {
        t = a + b;
        a = b;
        b = t;
    } while (--n > 0);
    return a;
}
// optimized, select left or right branch
static inline long fib_fast(long n)
{
    long a, b, t;
    a = 0; b = 1;
    while ((n -= 2) >= 0) {
        t = a + b; a = b; b = t;
        t = a + b; a = b; b = t;
    }
    return (n & 1) ? b : a;
}
int main()
{
    int i;
    long r = 0;
    for (i = 0; i < 100000000; i++)
#ifdef ASM
        r += fib_cudder(i & 47);
#elif defined(CSLOW)
        r += fib_slow(i & 47);
#elif defined(CFAST)
        r += fib_fast(i & 47);
#endif
    return r;
}
Now let's benchmark!
$ gcc -DASM -DCUDDER fib.c; time ./a.out
^C
real    14m55.977s
user    14m55.940s
sys     0m0.000s
Sadly, cudder's original submission gets stuck in a loop for n=0, so the throughput of his original implementation works out to something like 2^59 operations per fib on average, and it most likely gives an incorrect result for n=0 anyway.
OK, let's try a slightly modified version where n=0 is checked for.
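Something along these lines (just my sketch of the idea, not cudder's actual patch; the entry point name is the one the benchmark uses):

/* hypothetical wrapper, not cudder's code: handle n=0 before entering the asm,
   which is where the original loops forever */
long fib_checked(long n)
{
    extern long fib_cudder(long n);   /* the asm routine behind -DASM */
    return n > 0 ? fib_cudder(n) : 0;
}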
$ gcc -DASM -O6 fib.c; time ./a.out
real    0m4.713s
user    0m4.690s
sys     0m0.010s
Not too shabby, but an iterative fib should still be about an order of magnitude faster.
Now try a literal transcription of cudder's iterative version into C:
$ gcc -DCSLOW -O6 fib.c; time ./a.out
real    0m1.195s
user    0m1.190s
sys     0m0.000s
Oh, it's 4x faster already? But can we get that order of magnitude with a textbook iterative fib?
$ gcc -DCFAST -O6 fib.c; time ./a.out
real    0m0.746s
user    0m0.740s
sys     0m0.000s
6.5x, close enough. I'm sure it could be 8x or slightly more in properly written asm (read: compiler-generated, then manually tidied up), especially if you manage more pipelining than the current 3 adds.
But why is C almost an order of magnitude faster than naive 386-era hand-written assembly? Because humans suck at tracking data dependencies, at reasoning about the internal SSA-like form the CPU JITs your x86 opcodes into, and at avoiding pipeline stalls. A compiler can do a much, much better job. Instruction count is a very poor metric for code speed, yet this cargo cult from the 386 era prevails :/
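If you want to see what "tracking data dependencies" means in practice, here's a toy sketch (mine, not part of the benchmark above): two fib chains advanced in lockstep. Neither chain's adds depend on the other's, so an out-of-order CPU can overlap them instead of stalling on one serial chain of adds.

/* sketch only: compute fib(n0) and fib(n1) together */
static void fib_two(long n0, long n1, long *r0, long *r1)
{
    long a0 = 0, b0 = 1, t0;
    long a1 = 0, b1 = 1, t1;
    long lo = n0 < n1 ? n0 : n1, i;

    for (i = 0; i < lo; i++) {
        t0 = a0 + b0; a0 = b0; b0 = t0;   /* chain 0 */
        t1 = a1 + b1; a1 = b1; b1 = t1;   /* chain 1, independent of chain 0 */
    }
    for (; i < n0; i++) { t0 = a0 + b0; a0 = b0; b0 = t0; }
    for (; i < n1; i++) { t1 = a1 + b1; a1 = b1; b1 = t1; }
    *r0 = a0;   /* after i iterations, a holds fib(i) */
    *r1 = a1;
}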
Note that you don't want to actually compute fibs if you need them in the real world. Only about 1.5 * nbits Fibonacci numbers fit in an nbits-wide integer (fib(93) is the last one that fits in 64 bits), so you can just do table lookups.
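For the 64-bit case that's a 94-entry table, something like this (sketch, names are mine):

#include <stdint.h>

/* every Fibonacci number that fits in an unsigned 64-bit integer: fib(0)..fib(93) */
static uint64_t fib_tab[94];

static void fib_init(void)
{
    int i;
    fib_tab[0] = 0;
    fib_tab[1] = 1;
    for (i = 2; i < 94; i++)
        fib_tab[i] = fib_tab[i - 1] + fib_tab[i - 2];
}

static uint64_t fib_lookup(unsigned n)
{
    return fib_tab[n];   /* caller guarantees n <= 93 */
}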