I've been tweaking my fibs implementation and it still seems slow in some cases.
How could I make it faster? Any ideas? Any obvious optimizations I'm missing here?
fibs:
        cmp  rdi, 1
        ja   .l1
        mov  rax, rdi
        ret
.l1:
        dec  rdi
        push rdi
        call fibs
        pop  rdi
        push rax
        dec  rdi
        call fibs
        pop  rcx
        add  rax, rcx
        ret
Name: Anonymous 2015-12-18 18:42
Nice topic, OP. Your example is too slow to run any meaningful benchmark, sorry, but let's look at some other entries to see why it's tricky to write fast asm by hand these days:
// semantically close to the above
static inline long fib_slow(long n)
{
    long a, b, t;
    a = 0; b = 1;
    do {
        t = a + b;
        a = b;
        b = t;
    } while (--n > 0);
    return a;
}
// optimized, select left or right branch
static inline long fib_fast(long n)
{
    long a, b, t;
    a = 0; b = 1;
    while ((n -= 2) >= 0) {
        t = a + b; a = b; b = t;
        t = a + b; a = b; b = t;
    }
    return (n & 1) ? b : a;
}
int main()
{
    int i;
    long r = 0;
    for (i = 0; i < 100000000; i++)
#ifdef ASM
        r += fib_cudder(i & 47);
#elif defined(CSLOW)
        r += fib_slow(i & 47);
#elif defined(CFAST)
        r += fib_fast(i & 47);
#endif
    return r;
}
Now let's benchmark!
$ gcc -DASM -DCUDDER fib.c; time ./a.out
^C
real    14m55.977s
user    14m55.940s
sys     0m0.000s
Sadly, cudder's original submission gets stuck in a loop for n=0, so the throughput of his original implementation works out to something like 2^59 operations per fib on average, and it most likely gives an incorrect result for n=0 anyway.
OK, let's try a slightly modified version where n=0 is checked for.
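Something along these lines (just my sketch of the idea, not cudder's actual patch; the entry point name is the one the benchmark uses):

/* hypothetical wrapper, not cudder's code: handle n=0 before entering the asm,
   which is where the original loops forever */
long fib_checked(long n)
{
    extern long fib_cudder(long n);   /* the asm routine behind -DASM */
    return n > 0 ? fib_cudder(n) : 0;
}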
$ gcc -DASM -O6 fib.c; time ./a.out
real    0m4.713s
user    0m4.690s
sys     0m0.010s
Not too shabby, but an iterative fib should still be about an order of magnitude faster.
Now try a literal transcription of cudder's iterative version into C:
$ gcc -DCSLOW -O6 fib.c; time ./a.out
real    0m1.195s
user    0m1.190s
sys     0m0.000s
Oh, it's 4x faster already? But can we get that order of magnitude with a textbook iterative fib?
$ gcc -DCFAST -O6 fib.c; time ./a.out
real    0m0.746s
user    0m0.740s
sys     0m0.000s
6.5x, close enough. I'm sure it could be 8x or slightly more in properly written asm (read: compiler-generated, then manually tidied up), especially if you manage more pipelining than the current 3 adds.
But why is C almost an order of magnitude faster than naive 386-era hand-written assembly? Because humans suck at tracking data dependencies, at reasoning about the internal SSA-like form the CPU JITs your x86 opcodes into, and at avoiding pipeline stalls. A compiler can do a much, much better job. Instruction count is a very poor metric for code speed, yet this cargo cult from the 386 era prevails :/
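If you want to see what "tracking data dependencies" means in practice, here's a toy sketch (mine, not part of the benchmark above): two fib chains advanced in lockstep. Neither chain's adds depend on the other's, so an out-of-order CPU can overlap them instead of stalling on one serial chain of adds.

/* sketch only: compute fib(n0) and fib(n1) together */
static void fib_two(long n0, long n1, long *r0, long *r1)
{
    long a0 = 0, b0 = 1, t0;
    long a1 = 0, b1 = 1, t1;
    long lo = n0 < n1 ? n0 : n1, i;

    for (i = 0; i < lo; i++) {
        t0 = a0 + b0; a0 = b0; b0 = t0;   /* chain 0 */
        t1 = a1 + b1; a1 = b1; b1 = t1;   /* chain 1, independent of chain 0 */
    }
    for (; i < n0; i++) { t0 = a0 + b0; a0 = b0; b0 = t0; }
    for (; i < n1; i++) { t1 = a1 + b1; a1 = b1; b1 = t1; }
    *r0 = a0;   /* after i iterations, a holds fib(i) */
    *r1 = a1;
}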
Note that you don't want to actually compute fibs if you need them in the real world. Only about 1.5 * nbits Fibonacci numbers fit in an nbits-wide integer (fib(93) is the last one that fits in 64 bits), so you can just do table lookups.
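For the 64-bit case that's a 94-entry table, something like this (sketch, names are mine):

#include <stdint.h>

/* every Fibonacci number that fits in an unsigned 64-bit integer: fib(0)..fib(93) */
static uint64_t fib_tab[94];

static void fib_init(void)
{
    int i;
    fib_tab[0] = 0;
    fib_tab[1] = 1;
    for (i = 2; i < 94; i++)
        fib_tab[i] = fib_tab[i - 1] + fib_tab[i - 2];
}

static uint64_t fib_lookup(unsigned n)
{
    return fib_tab[n];   /* caller guarantees n <= 93 */
}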