31
My apologies, it was much earlier I saw your code, you made a lot of progress since few weeks ago.
To make the C fast, you need to generate semantically same code you wrote in asm. Change hebi_uword __int128 and all its dependencies. Do do 2 mults at a time like you do asm, and you'll see:
49 for (k = 0; k < HEBI_PACKET_HALFWORDS; k++) {
50 prod = (hebi_uword)bhw[j+k] * multiplier + rhw[i+j+k] + overflow;
...
generic: 0.64530003 seconds
x86-64: 0.49140062 seconds
perf says the reason for the slowness is particular reordering of the add/adc chain clang generates, and unnecesary
shr
per each mult. Your asm does pretty much the right thing (unlike cudder's crap).
Sadly it's difficult to make this into pull req as it gives incorrect results in routines i didnt bother to fix and needs significant rewrite. You have to account for the following cases:
u16*u16=u32 (arm)
u32*u32=u64 (modern arm, fastmult)
u64*u64=u128 (x86-64, aa64)
The C code can be made universal for all those cases, just header stuff.
You can also first multiply the row in first step, and add carries later in second step. Which should be about 3x faster (both asm and c), but complicates the code significantly.
32
I don't have amd system anywhere nearby, but I highly doubt it can do the recurrence expansion for pipelining by itself. Just compile and run it and see for yourself.
n = 0 is not supposed to work. I define F(1) = 1. By introducing an extra instruction you've added over 16% more instructions to my solution.
First, your code wasn't according to op's spec.
Second, the branch miss happens 1/32 of times with this particular benchmark, and is practically unmeasurable.
You've also made everything a constant, meaning the compiler could optimise it into a single return (n) in the C case.
I wish it would, but I checked the asm output.
Just accept the reality that CPUs have very deep pipelines and you better pay attention to it. Not this 90s 256 .com demo crap. CPU JITs compile x86 to SSA IR on the spot and apply only peephole optimizations (ie removing redundant instructions). They can't "reason" about code, unlike compilers.