Return Styles: Pseud0ch, Terminal, Valhalla, NES, Geocities, Blue Moon. Entire thread

Can you optimize my fibs?

Name: Anonymous 2015-12-16 17:07

Hi /frog/,

I've been tweaking my fibs implementation and it still seems slow in some cases.

How could I make it faster? Any ideas? Any obvious optimilizations I'm missing here?

fibs:
cmp rdi, 1
ja .l1
mov rax, rdi
ret
.l1:
dec rdi
push rdi
call fibs
pop rdi
push rax
dec rdi
call fibs
pop rcx
add rax, rcx
ret

Name: Anonymous 2015-12-18 18:42

THE RACE REALISM BEHIND C SUPREMACY

Nice topic, OP. Your example is too slow to run any meaningful benchmark, sorry, but let's see about some other entries why it's tricky to write asm these days:

static inline long fib_cudder(long n)
{
long ret;
__asm__ (
#ifndef CUDDER
"jrcxz finito;"
#endif
"xor %%eax, %%eax;"
"inc %%eax;"
"cqo;"
"fibloop:;"
"xchg %%eax, %%edx;"
"add %%edx, %%eax;"
"loop fibloop;"
"finito:;"
: "=d" (ret) : "c" (n) : "rax", "cc"
);
return ret;
}

// semantically close to the above
static inline long fib_slow(long n)
{
long a,b,t;
a = 0;
b = 1;
do {
t = a+b;
a = b;
b = t;
} while (--n > 0);
return a;
}

// optimized, select left or right branch
static inline long fib_fast(long n)
{
long a,b,t;
a = 0;
b = 1;
while ((n -= 2) >= 0) {
t = a+b;
a = b;
b = t;
t = a+b;
a = b;
b = t;
}
return (n&1)?b:a;
}

int main()
{
int i;
long r = 0;
for (i = 0; i < 100000000; i++)
#ifdef ASM
r += fib_cudder(i & 47);
#elif defined(CSLOW)
r += fib_slow(i & 47);
#elif defined(CFAST)
r += fib_fast(i & 47);
#endif
return r;
}


Now let's benchmark!

$ gcc -DASM -DCUDDER fib.c; time ./a.out
^C
real 14m55.977s
user 14m55.940s
sys 0m0.000s


Sadly the original cudder's submission gets stuck in a loop for n=0. Meaning the throughput of his original implementation is about O(2^59) per one fib on average and it most likely gives incorrect result for n=0.

Ok, let's try it with slightly modified version, where n=0 is checked for.

$ gcc -DASM -O6 fib.c; time ./a.out

real 0m4.713s
user 0m4.690s
sys 0m0.010s


Not too shabby, but still should be about magnitude faster for iterative fib.

Now try literal transcription of cudder's iterative version into C:
$ gcc -DCSLOW -O6 fib.c; time ./a.out

real 0m1.195s
user 0m1.190s
sys 0m0.000s


Oh, it's 4x faster? But what about magnitude using a textbook iterative fib?

$ gcc -DCFAST -O6 fib.c; time ./a.out

real 0m0.746s
user 0m0.740s
sys 0m0.000s


6.5x, close enough. I'm sure it could be 8x or slightly more in properly written asm (read compiler generated, and manually tidied up). Especially if you manage more pipelining than current 3 adds.

But why is C almost a magnitude faster than naive 386-era hand written assembly? Because humans suck at tracking data dependencies, the internal SSA CPU jits your x86 opcodes into and avoiding pipeline stalls. Compiler can do much, much better job. The amount of instructions you use is very poor metric for speed of code, yet this cargo cult from 386 era prevails :/

Note that you don't want to actually compute fibs if you need em in the real world. There are roughly nbits*1.5 numbers in the sequence for nbit n, so you can just do lookups.

Newer Posts
Don't change these.
Name: Email:
Entire Thread Thread List