
Can you optimize my fibs?

Name: Anonymous 2015-12-16 17:07

Hi /frog/,

I've been tweaking my fibs implementation and it still seems slow in some cases.

How could I make it faster? Any ideas? Any obvious optimizations I'm missing here?

fibs:
cmp rdi, 1
ja .l1
mov rax, rdi
ret
.l1:
dec rdi
push rdi
call fibs
pop rdi
push rax
dec rdi
call fibs
pop rcx
add rax, rcx
ret

Name: http://jewtube.com/ 2015-12-16 17:27

are you an idiot? Use binary exponentiation

Name: http://jewtube.com/ 2015-12-16 17:28

recursion
in assembly

pushing every fucking thing on the stack

Just use C.. what the hell is this.

Name: Anonymous 2015-12-16 17:42

>>3
It would appear as though you have mistakenly quoted text that had not been previously typed. Please correct these mistakes in any and all future posts.

Name: Anonymous 2015-12-16 18:00

>>4
who are you le quoteeing XD ???

Name: Anonymous 2015-12-16 18:02

>>4
Are you frustrated?

Name: Anonymous 2015-12-16 18:05

How to spot someone who has not read his SICP: His fibs are not logarithmic.

Name: Anonymous 2015-12-16 18:14

>>3,6
Who are you quoting?

Name: Anonymous 2015-12-16 20:49

>>3
The reason I don't use C.. It's too slow. Have you checked what code GCC and clang generate? Terrible. Just terrible. It would make my function even slower.

Name: Chad, from Marketing 2015-12-16 21:42

babe, you know I don't use tail call optimization

Name: Anonymous 2015-12-17 0:46

>>8
Whom

Name: Anonymous 2015-12-17 1:01

>>1
Use an iterative algorithm instead of a recursive one for your fibs. Recursion is slower than the iterative method here.

Name: Cudder !cXCudderUE 2015-12-17 3:45

:fib xor eax eax | inc eax | cdq
:fibloop xchg eax edx | add eax edx | loop .fibloop

Name: Anonymous 2015-12-17 7:41

>>13
bretty good

Name: Anonymous 2015-12-17 11:54

>>13
I remember `loop` being a very slow instruction on some older x86 processors. Is that fixed in modern ones?

Name: Anonymous 2015-12-17 18:25

>>15
modern cpus can fuse a conditional jump with previous arithmetic (cmp/sub/dec + jnz for example) into one uop and execute in one cycle (assuming branch is predicted) whereas loop is like 7 uops and around 5 cycles, so it's significantly slower.

Name: Anonymous 2015-12-17 19:28

>>13
Doesn't work with 64-bit numbers, so the solution by >>1 is much better.

Name: Cudder !cXCudderUE 2015-12-18 5:20

>>15
It's slower on Intel, but exactly the same speed as dec/jnz on AMD CPUs. There's no real reason why it must be slower, but I suppose if others start using it more that could force Intel to fix it. In any case it's still far faster (and smaller) than OP's solution.

>>17
s/e/r/g
s/cdq/cqo/

Name: Anonymous 2015-12-18 10:26

>still far faster (and smaller) than OP's solution
Yeah, but it works only for 32-bit numbers and doesn't follow any calling convention, so you can't actually call it. So it's useless, while OP's version actually has some value.

Name: Anonymous 2015-12-18 13:12

>>19
>and doesn't follow any calling convention
as cudder said in the past
``I make my own conventions''

Name: Anonymous 2015-12-18 17:36

>>8
It's "whom" in this case, fucktard. If you're going to whine about quality posting on /prog/ at least use proper grammar.

Name: Anonymous 2015-12-18 18:42

THE RACE REALISM BEHIND C SUPREMACY

Nice topic, OP. Your example is too slow to run any meaningful benchmark, sorry, but let's look at some other entries to see why it's tricky to write asm these days:

static inline long fib_cudder(long n)
{
    long ret;
    __asm__ (
#ifndef CUDDER
        "jrcxz finito;"
#endif
        "xor %%eax, %%eax;"
        "inc %%eax;"
        "cqo;"
        "fibloop:;"
        "xchg %%eax, %%edx;"
        "add %%edx, %%eax;"
        "loop fibloop;"
        "finito:;"
        : "=d" (ret) : "c" (n) : "rax", "cc"
    );
    return ret;
}

// semantically close to the above
static inline long fib_slow(long n)
{
    long a, b, t;
    a = 0;
    b = 1;
    do {
        t = a + b;
        a = b;
        b = t;
    } while (--n > 0);
    return a;
}

// optimized, select left or right branch
static inline long fib_fast(long n)
{
    long a, b, t;
    a = 0;
    b = 1;
    while ((n -= 2) >= 0) {
        t = a + b;
        a = b;
        b = t;
        t = a + b;
        a = b;
        b = t;
    }
    return (n & 1) ? b : a;
}

int main()
{
    int i;
    long r = 0;
    for (i = 0; i < 100000000; i++)
#ifdef ASM
        r += fib_cudder(i & 47);
#elif defined(CSLOW)
        r += fib_slow(i & 47);
#elif defined(CFAST)
        r += fib_fast(i & 47);
#endif
    return r;
}


Now let's benchmark!

$ gcc -DASM -DCUDDER fib.c; time ./a.out
^C
real 14m55.977s
user 14m55.940s
sys 0m0.000s


Sadly, cudder's original submission gets stuck in a loop for n=0. Meaning the throughput of his original implementation is about O(2^59) per one fib on average, and it most likely gives an incorrect result for n=0.

Ok, let's try it with a slightly modified version, where n=0 is checked for.

$ gcc -DASM -O6 fib.c; time ./a.out

real 0m4.713s
user 0m4.690s
sys 0m0.010s


Not too shabby, but it should still be about a magnitude faster for an iterative fib.

Now try literal transcription of cudder's iterative version into C:
$ gcc -DCSLOW -O6 fib.c; time ./a.out

real 0m1.195s
user 0m1.190s
sys 0m0.000s


Oh, it's 4x faster? But can we get that magnitude back with a textbook iterative fib?

$ gcc -DCFAST -O6 fib.c; time ./a.out

real 0m0.746s
user 0m0.740s
sys 0m0.000s


6.5x, close enough. I'm sure it could be 8x or slightly more in properly written asm (read: compiler generated, and manually tidied up). Especially if you manage more pipelining than the current 3 adds.

But why is C almost a magnitude faster than naive 386-era hand-written assembly? Because humans suck at tracking data dependencies and pipeline stalls in the internal SSA form the CPU JITs your x86 opcodes into. The compiler can do a much, much better job. Instruction count is a very poor metric for code speed, yet this cargo cult from the 386 era prevails :/

Note that you don't want to actually compute fibs if you need them in the real world. There are only roughly nbits*1.5 numbers in the sequence that fit in an nbit word, so you can just do lookups.

Name: Anonymous 2015-12-18 18:47

>>22
>-O6
HOLY COW I'M TOTALLY GOING SO FAST OH FUCK

Name: Anonymous 2015-12-18 18:55

>>22
>I'm sure it could be 8x or slightly more in properly written asm (read compiler generated, and manually tidied up).

This isn't always the case, if you look in the bignum thread, suigin dispelled the myth that the compiler knows best. He's outperforming the compiler by a factor of 4x for bignum multiplication.

Name: Anonymous 2015-12-18 19:01

>>23
I KNOW, FUNROLL ALL THE THINGS

>>24
I'm the same C-supremacist from that thread. Unfortunately he didn't implement a semantically exact C equivalent, no side-by-side comparison, no scientific method. Wishful thinking, as expected from anime degenerates.

Name: Anonymous 2015-12-18 19:02

Have you considered a look-up table?

Name: Anonymous 2015-12-18 19:19

>>22
>fibs ... in the real world

LoLoLoLoLoLoLotflcopter!

Name: Anonymous 2015-12-18 19:34

>>27
There are rare use cases, like when you need successive numbers with unique prime factors in fancy groups. But usually with numbers much larger than word size. Fib is faster than trial factoring for that.

Name: Anonymous 2015-12-18 19:57

>>22

Could you do some benchmarks on suckless C compiler? http://git.suckless.org/scc/

I think aiming for a very simple compiler (code-wise) could lead to far-from-optimal output asm.

If their compiler is worse than gcc/clang, I don't give a damn about how it's "cleaner" and "simpler".

Name: Anonymous 2015-12-19 0:54

>>29
It has no usable codegen yet. I presume it would be like tcc - ok if you need to compile C code quickly with low memory usage, but the resulting code would be 100x slower than gcc/clang in hot loops.

Speaking of clang, pleasant surprise:

$ clang -DCSLOW -O3 fib.c; time ./a.out

real 0m0.494s
user 0m0.490s
sys 0m0.000s


The CSLOW version gets expanded into a 4-step unroll, which makes it slightly faster than CFAST (CFAST is 2-at-a-time, but it confuses the optimizer a bit):

addq %rcx, %rdx
addq %rdx, %rcx
addq %rcx, %rdx
addq %rdx, %rcx
addq $-4, %rdi
testq %rdi, %rdi
jg .LBB0_12


Which is pretty neat, as you don't need to do that by hand. Moral of the story: clang is a bit "dumber" when emitting code (see that superfluous testq), but it applies strong algorithmic optimizations instead which can be a huge win in the end.

Note that removing that 'testq' didn't change the result at all, because the CPU's internal IR has to fetch the flags one way or another; the whole impact of the more verbose code is icache.

Name: suigin 2015-12-19 3:35

>>25
If you can get the compiler to generate code that beats my hand-written assembly, I'll gladly accept patches.

Name: Cudder !cXCudderUE 2015-12-19 3:37

>>22
Like I said, it is several times slower on Intel CPUs. Try AMD instead. (Then everyone using an Intel one can go complain to them about how AMD is better, maybe they'll fix loop in the next model.)

n = 0 is not supposed to work. I define F(1) = 1. By introducing an extra instruction you've added over 16% more instructions to my solution.

You've also made everything a constant, meaning the compiler could optimise it into a single return (n) in the C case.

Now do a real efficiency comparison (size/time, smaller is better) and you'll find your bloated C compiler generated output is far behind...

Name: suigin 2015-12-19 4:22

>>32
It's a shame we have to wait until Q4 2016 for Zen. My 6-year-old Core i7 860 still outperforms the latest AMD processors, although by a small margin.

I very much don't want to use Intel's compromised hardware anymore.

Name: Anonymous 2015-12-19 9:11

Can you breed my negs?

Name: Anonymous 2015-12-19 10:03

>>33
What did you expect when most applications depend on single-core performance?

Name: Chad, from Marketing 2015-12-20 1:15

>>29
Suckless people are eventually going to realize that complicated problems often require complicated solutions. You won't get far if you sacrifice the things that get you something concrete, like how fast the generated code is, or how small, or what features it supports, for things that, while still important, don't necessarily affect the final product. Extremes of practicality and idealism are likely to lead to long- or short-term failure, and both are required in healthy amounts to succeed.

Name: Anonymous 2015-12-20 1:27

Which is a bigger shitpile, cat-v or suckless?

Name: Anonymous 2015-12-20 1:31

cat-v is absolute shit.

at least suckless can program

Name: Anonymous 2015-12-20 1:53

>>31
My apologies, it was much earlier that I saw your code; you've made a lot of progress in the last few weeks.

To make the C fast, you need to generate semantically the same code you wrote in asm. Change hebi_uword to __int128 and all its dependencies. Do 2 mults at a time like you do in asm, and you'll see:
49 for (k = 0; k < HEBI_PACKET_HALFWORDS; k++) {
50 prod = (hebi_uword)bhw[j+k] * multiplier + rhw[i+j+k] + overflow;
...
generic: 0.64530003 seconds
x86-64: 0.49140062 seconds


perf says the reason for the slowness is the particular reordering of the add/adc chain clang generates, and an unnecessary shr per mult. Your asm does pretty much the right thing (unlike cudder's crap).

Sadly it's difficult to make this into a pull req as it gives incorrect results in routines I didn't bother to fix and needs a significant rewrite. You have to account for the following cases:

u16*u16=u32 (arm)
u32*u32=u64 (modern arm, fastmult)
u64*u64=u128 (x86-64, aa64)

The C code can be made universal for all those cases, just header stuff.

You can also first multiply the row in first step, and add carries later in second step. Which should be about 3x faster (both asm and c), but complicates the code significantly.

>>32
I don't have an AMD system anywhere nearby, but I highly doubt it can do the recurrence expansion for pipelining by itself. Just compile and run it and see for yourself.

>n = 0 is not supposed to work. I define F(1) = 1. By introducing an extra instruction you've added over 16% more instructions to my solution.

First, your code wasn't according to op's spec.
Second, the branch miss happens 1/32 of times with this particular benchmark, and is practically unmeasurable.

>You've also made everything a constant, meaning the compiler could optimise it into a single return (n) in the C case.

I wish it would, but I checked the asm output.

Just accept the reality that CPUs have very deep pipelines and you'd better pay attention to them. Not this 90s 256-byte .com demo crap. CPUs JIT x86 into SSA IR on the spot and apply only peephole optimizations (i.e. removing redundant instructions). They can't "reason" about code, unlike compilers.
