
designing a suckless bignum library

Name: Anonymous 2015-11-16 22:11

Let's design a suckless bignum library. (I'm not part of suckless though, just curious about replacing GMP).

I researched a bit into algorithms and the rundown is this:
* long multiplication: O(n^2)
* Karatsuba: O(n^1.585)
* Toom-Cook and Fourier-transform-based methods: even faster, but only used for numbers 10k+ digits long. Much more complex.

So we should probably use Karatsuba for all multiplications. Squaring can sometimes be done a bit faster than multiplying two different numbers.
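To make the Karatsuba idea concrete, here's a sketch of a single step in C: one 64x64->128 multiply built from three 32x32->64 multiplies instead of four (`unsigned __int128` is a GCC/Clang extension, used here only to hold the wide intermediates):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* One Karatsuba step: split each operand into 32-bit halves and trade
 * the fourth multiply for a few additions and subtractions. Applying
 * the same split recursively is what yields the O(n^1.585) bound. */
static u128 karatsuba64(uint64_t x, uint64_t y)
{
    uint64_t x0 = (uint32_t)x, x1 = x >> 32;
    uint64_t y0 = (uint32_t)y, y1 = y >> 32;

    uint64_t z0 = x0 * y0;                     /* low  halves */
    uint64_t z2 = x1 * y1;                     /* high halves */
    /* (x0+x1) and (y0+y1) are up to 33 bits each, so the middle
     * product needs up to 66 bits; widen before multiplying.   */
    u128 z1 = (u128)(x0 + x1) * (y0 + y1) - z0 - z2;

    return ((u128)z2 << 64) + (z1 << 32) + z0;
}
```

A real library would do this on limb arrays with scratch buffers, and fall back to long multiplication below some threshold, since Karatsuba's bookkeeping costs more than it saves on small inputs.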

Now I suggest programming it in assembly, since that gives you access to the carry bit (C doesn't give you that). Of course we will use libc and the normal C calling conventions so that it's a regular C library.

What to do about memory management? e.g. if you want to add two numbers do we need to allocate a new 'number' as long as the largest to write the result into or do it destructively "x <- x + y"? Maybe the library should support both - then a calculator program would figure out the best primitives to use for a given computation.
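As a sketch of how a single primitive could serve both styles (all names here are made up for illustration, not from any real library): the caller supplies the destination, which is allowed to alias either input, so "x <- x + y" and "z <- x + y" are the same routine.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb;   /* 32-bit limbs, little-endian order */

/* dst must have room for max(xn, yn) + 1 limbs; returns the result
 * length. dst == x or dst == y is fine, because limb i of the output
 * is written only after limb i of both inputs has been read. */
static size_t bn_add(limb *dst, const limb *x, size_t xn,
                     const limb *y, size_t yn)
{
    uint64_t carry = 0;
    size_t n = xn > yn ? xn : yn;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = carry;
        if (i < xn) s += x[i];
        if (i < yn) s += y[i];
        dst[i] = (limb)s;
        carry = s >> 32;
    }
    if (carry)
        dst[n++] = (limb)carry;
    return n;
}
```

A calculator front end could then pick destructive or allocating usage per operation without the library caring either way.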

It might be nice to also support things like (big modulus) modular arithmetic and polynomials. stuff like exponentiation and modular inverses have interesting algorithms.

What other integer operations would we want? I don't really want to do anything with arb. prec. real numbers - arithmetic with rationals could be done though.

Name: Anonymous 2015-12-06 4:22

>>80
Send that to the mailing list.

Name: Anonymous 2015-12-06 4:24

I'm surprised that with all the instruction set bloat accumulated over the years, Intel didn't add some bignum instructions somewhere along the way. Something like MOVSB that manages edi and esi on its own and just needs to be invoked in a loop.

Name: Anonymous 2015-12-06 4:55

>>80
You also post on /g/ and/or /tech/, am I wrong?

Name: Cudder !cXCudderUE 2015-12-06 13:31

CPUID dispatching for the assembly functions
Stupid idea. The CPU is not going to suddenly change at runtime. Just do it compile-time instead.

>>82
I've always wanted a REP ADCS and REP SBBS, perhaps REPNZ to make it stop early would be good too. No need for different-size versions since you just specify the size in bytes and it should figure out how much to work on at a time (like REP MOVS can copy entire cachelines in one cycle).

Name: Anonymous 2015-12-06 21:27

>>84
Stupid idea. The CPU is not going to suddenly change at runtime. Just do it compile-time instead.

Oh, good, so you're finally advocating JIT. I agree that that's the best way to bake the runtime code to be as fast as it should, wherever it's run.

Name: Anonymous 2015-12-07 2:00

that's the best way to bake the runtime code to be as fast as it should
You are stupid

Name: suigin 2015-12-07 6:19

>>82
They added the ADX instructions, but they're more useful when optimizing multiplication and division algorithms.

>>83
No. I haven't posted on 4chan since world4ch was shut down, and the last time I visited /tech/ was over a month ago.

>>84
Yes. I thought about just selecting the best version at build time, or allowing the user to configure it somehow, and if everyone who would be interested in using it in their software expected their own users to be compiling from source, this wouldn't be an issue. But that's not the world we live in today. If one is interested in displacing the competition when it comes to multi-precision arithmetic, that means implementing some form of runtime dispatching to handle all use-cases.

However, at this point, I don't think it would be much more effort to add in build time selection of architecture-specific kernels as an option. Just need to use some assembler directives and include a `config.inc' file that can be customized by the user which specifies what their processor supports. Runtime dispatching would still be the default.

From a performance perspective, once a dynamically dispatched function has been run at least once, there's pretty much no overhead as long as the memory containing the function address remains in the L1 cache, at least on x86-64. The additional mov instruction to get the function address in a register usually gets pipelined away.
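A minimal sketch of that dispatch pattern in C (the kernel names and the avx2 check are placeholders, and `__builtin_cpu_supports` is a GCC/Clang extension; with other compilers you'd read CPUID yourself):

```c
#include <stdint.h>

static uint64_t add_generic(uint64_t a, uint64_t b) { return a + b; }
static uint64_t add_avx2(uint64_t a, uint64_t b)    { return a + b; }

static uint64_t dispatch_add(uint64_t a, uint64_t b);

/* The pointer starts out aimed at the resolver; the first call patches
 * it to the best kernel, so every later call is a single indirect call
 * through a pointer that stays hot in L1. */
static uint64_t (*add_impl)(uint64_t, uint64_t) = dispatch_add;

static uint64_t dispatch_add(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    add_impl = __builtin_cpu_supports("avx2") ? add_avx2 : add_generic;
#else
    add_impl = add_generic;
#endif
    return add_impl(a, b);
}
```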

Name: Anonymous 2015-12-07 8:31

>>84
The CPU is not going to suddenly change at runtime. Just do it compile-time instead.

Not a great idea in the wild and woolly world of x86. Over 20 years of binary compatibility means you had better check what generation you are running on if you want to be sure you're executing anything near optimal. People are not going to put up with separate binary distributions for every supported architecture, especially when it just doesn't matter for 80% of their code. Just ship them all, pick at runtime and be happy.

Name: Anonymous 2015-12-07 9:59

>>88
Excellent Hitler dubs!

Name: Anonymous 2015-12-07 14:25

The era of x86 is OVER.

Name: Anonymous 2015-12-07 15:34

>>88
Stop shipping x86 binaries and only ship them for x64 then.

Name: suigin 2015-12-07 17:08

>>91
There isn't a single optimal solution for x86-64 either. SSE2 is the only guaranteed instruction set extension. Since then, SSE3, SSSE3, SSE4.1, SSE4.2, AESNI, FP16C, CLMUL, AVX, FMA, AVX2, BMI1, BMI2, ERMSB, ADX, SHA, and AVX512 have been introduced, among others. While not all are relevant to arbitrary-precision arithmetic, many are.

Sure, you could cut out all CPUs older than 5 years and go with AVX as a baseline, but many people still use 7+ year old hardware. One of my i7 820 desktop systems is over 6 years old now, and supports nothing newer than SSE 4.2.

Name: Anonymous 2015-12-07 18:01

>>88,91,92
That's why you use source-based distributions and boycott SEPPLES

Name: Anonymous 2015-12-07 19:23

>>91
Not good enough. >>92 has it right; even if you limit yourself to AMD64 compatible implementations (which is a bad idea in any case - 32 bit mode is ideal for a great many applications), you still have to differentiate K8/Netburst/Core/Bulldozer, determine whether you have support for SSE3/SSE4/AVX...

The performance profile of arithmetic heavy code changes with every architectural iteration (2-3 years, on the outside). This is why Intel caught flak in the past for not bothering to implement anything other than a generic slow path for AMD hardware in code emitted by their C compilers. Just passively failing to optimize more aggressively for specific AMD implementations resulted in a measurable performance difference.

>>93
Compiling from source isn't always worthwhile (if you consider the full breadth of users, it almost never is). If the critical path is too large to be inlined, static linking alone may be enough.

For maximum portability, the ideal would be to keep performance-critical code in a separate shared object for each target, and load the right one at runtime. This does require application developers to stay aware of whether their own code is on the critical path or not. For larger shared codebases (e.g., mplayer) it's more common to just give up and statically link if it shows a performance improvement.

Name: Anonymous 2015-12-07 20:43

-round and -notround resource qualifiers

API 23 makes it easier to build apps for both round and square Android Wear watches. We listened to your feedback and added new resource qualifiers for -round and -notround, so you can use the resource system to load the appropriate images, layouts, and strings based on the type of watch you are working with. You can also combine this with existing resource qualifiers -hdpi, -tvdpi, -280dpi, and -360dpi for the various Android Wear watches that are currently available. All of the existing classes in the wearable UI library, such as WatchViewStub, BoxInsetLayout, and WearableFrameLayout will continue to work as well, so you do not need to change your code. The -round and -notround resource qualifiers will not work on API 22 devices, so you cannot assume they will be available until all devices are on API 23.

Name: Newprogrammer 2015-12-07 20:46

This whole subject is one big advertisement for JIT. Don't generate and distribute a billion different precompiled combinations, tell the system how to generate an "optimal" version for whatever it's running on.

You're not talking about hand-optimized asm anyway, given the dozens of variations just listed here.

Name: Anonymous 2015-12-08 3:44

>>96
JIT for static code? B-but the 90s are over. AOT compilers (of portable code) still blow JITs out of the water. Even Android 5 is AOT, because dalvik sucked.

Name: suigin 2015-12-08 3:47

>>96
Right.

So consider an optimizing JIT capable of targeting numerous, modern CPU architecture families and variants, with the ability to auto-vectorize code and generate machine instructions approaching equivalent hand-optimized code, and without requiring a garbage collector or a huge managed-code runtime.

There's only one existing project that I know of that meets that requirement: LLVM. But LLVM is quite a huge dependency, and not everyone likes polluting their system with something that's written in C++. Embedding LLVM as a JIT compiler is going to add 30MB or so to your executable size.

One thing I was thinking of, however, was writing reference implementations of the bignum kernels in LLVM IR or SPIR-V, then using the LLVM compiler to generate GNU assembly files for the different kernel variants. LLVM IR/SPIR-V have arbitrary-width integer and SIMD instructions, so this would get us perhaps 90% of the way there. A little sed/awk magic to glue it all together, and maybe a final pass of hand-optimizing the target assembly code where LLVM is lacking to take things toward perfection.

The final GNU assembler files would then be all that is required to build the library. The LLVM IR/SPIR-V is just there to aid in adding support for a new architecture, and LLVM isn't required as a dependency. You would still ship all of the kernel variants together in the library and do runtime dispatching based on what your CPU supports.

Name: Anonymous 2015-12-08 3:49

dubs

Name: suigin 2015-12-08 3:53

>>98
The other added benefit of writing it in SPIR-V is that it would serve as a starting point for GPU/FPGA-accelerated bignum computation with OpenCL and/or Vulkan, although you would likely want to transition over to highly parallel algorithms.

Name: Anonymous 2015-12-08 5:31

>>98
LLVM-IR
Does not buy you anything compared to C. The IR is generally useful only if you generate it from some higher level language, not manually writing it. And yes, C has portable SIMD intrinsics (mapping very closely to LLVM because guess why).

Additionally, SIMD does not buy you that much in bignum most of the time, except in some very specific cases (FFT).

>>100
SPIR-V
With SPIR-V, it's the same exact thing. Just write your damn kernels in OpenCL.

As for FPGA - fuck no, you really want to write VHDL for that shit.

Name: Anonymous 2015-12-08 8:53

>>82
How's that going to work with multiplication, smart guy?

Name: Cudder !cXCudderUE 2015-12-08 10:12

From a performance perspective, once a dynamically dispatched function has been run at least once, there's pretty much no overhead as long as the memory containing the function address remains in the L1 cache, at least on x86-64. The additional mov instruction to get the function address in a register usually gets pipelined away.

If you really want to include all the variants -- of which no one will ever use more than 1 of (FUCK YOU FOR MAKING ME DOWNLOAD THIS USELESS BLOAT) --- at least copy the code directly into place so you don't even need to do that unnecessary work on every call.

It's as ridiculous as apps which somehow include 10MB of strings in over a dozen different languages. No one is ever going to use it in more than 1 language (probably English), you're just wasting everyone's disk space and bandwidth (including that of your server).

Name: Anonymous 2015-12-08 11:25

>>101
Yes, you can do that. The idea of using LLVM IR or SPIR-V is that the multi-precision arithmetic instructions, vector types, and loop unrolling/vectorization attributes are part of the spec. Doing the equivalent in C would mean using compiler-specific extensions. I'd only bother with supporting GCC and Clang, which isn't a problem for most people, but might be for some.

>>102
I have multiplication and division working where one operand is a single word-sized value. Can do round-trip conversions between strings and bignums now.

https://github.com/suiginsoft/hebimath/blob/master/src/packet/x86_64/mulp_u.S
https://github.com/suiginsoft/hebimath/blob/master/test/unit/get_set.c

I haven't benchmarked anything; there's a lot of room for optimization in how the conversions are done. For example, using shifting for power-of-two bases, and fixed-point multiplication for the division.
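For instance, the power-of-two-base case needs no division at all, since each base-16 digit is just 4 bits (the limb layout here is an assumption for the example, not hebimath's actual representation):

```c
#include <stddef.h>
#include <stdint.h>

/* Write a little-endian array of 64-bit limbs as lowercase hex.
 * Pure shifts and masks; buf needs room for 16*n + 1 bytes. */
static size_t to_hex(char *buf, const uint64_t *limbs, size_t n)
{
    static const char digits[] = "0123456789abcdef";
    size_t len = 0;
    int started = 0;
    for (size_t i = n; i-- > 0; )
        for (int shift = 60; shift >= 0; shift -= 4) {
            unsigned d = (limbs[i] >> shift) & 0xf;
            if (d) started = 1;              /* skip leading zeros */
            if (started) buf[len++] = digits[d];
        }
    if (!started) buf[len++] = '0';
    buf[len] = '\0';
    return len;
}
```

General bases still need division by the base, which is where the fixed-point multiplication trick comes in.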

I will try to get to Karatsuba by this weekend (I'm working during the days).

>>103
If you're going to be compiling from source, you'll be able to spend the 5 minutes configuring it to strip out the unneeded kernel variants.

Name: suigin 2015-12-08 11:55

>>102
Tired, didn't realize you were talking about something else entirely.

Time to retire to a higher plane.

Name: Anonymous 2015-12-08 18:06

llvm is never the solution to anything "suckless"

Name: Anonymous 2015-12-08 18:10

>>86
x86 GET

Name: Anonymous 2015-12-08 19:43

>>104
Checked the code, and it has a very strong data dependency chain (disregarding the loop, which can be fixed fairly easily). It would be fast on a 486, which does not execute 30+ ALU operations in one tick (if you break the data deps). Is there any reason not to write it in C?

Try it. You'll see the overhead of C is miniscule, because your bottleneck isn't the slight C overhead, but the huge latency introduced by carries.

Delayed carry, together with FFT would of course blow your implementation out of water by several orders of magnitude.

Note that heavy-duty kernels (i.e. not naive schoolbook) in GMP et al. are written in asm to improve register scheduling, because the kernels are _large_ (FFT, Toom-3, Montgomery...) with considerable register pressure. While compilers are often better than humans wrt instruction scheduling (especially ICC), they often do a poor job of figuring out the spill schedule inside a hot loop. So the way it's usually done is to let the compiler produce the code, and then a human hacks at it in trial-and-error fashion to squeeze a few more ticks out of it.
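A sketch of the delayed-carry idea, assuming 32-bit digits parked in 64-bit slots: a slot can't overflow until roughly 2^32 additions have accumulated, so the inner loop has no serial carry chain at all, and one normalization pass at the end propagates everything.

```c
#include <stddef.h>
#include <stdint.h>

/* acc[i] += x[i], `reps` times, with no carry propagation. Each slot
 * holds a 32-bit digit with 32 bits of headroom, so this is exact as
 * long as reps stays below 2^32. The adds are independent, so the CPU
 * can run them in parallel instead of serializing on a carry flag. */
static void accumulate(uint64_t *acc, const uint32_t *x, size_t n,
                       uint32_t reps)
{
    for (uint32_t k = 0; k < reps; k++)
        for (size_t i = 0; i < n; i++)
            acc[i] += x[i];
}

/* One deferred carry-propagation pass, restoring 32-bit digits. */
static void normalize(uint64_t *acc, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        acc[i] += carry;
        carry = acc[i] >> 32;
        acc[i] &= 0xffffffffu;
    }
}
```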

Name: Anonymous 2015-12-08 20:05

>>103
I miss the real Cudder.

If you really want to include all the variants -- of which no one will ever use more than 1 of (FUCK YOU FOR MAKING ME DOWNLOAD THIS USELESS BLOAT) --- at least copy the code directly into place so you don't even need to do that unnecessary work on every call.

If it's already in L1, it's not extra work. The only extra work involved would be if someone actually followed your pointless suggestion to write self modifying code where there's no benefit to doing so.

It's as ridiculous as apps which somehow include 10MB of strings in over a dozen different languages. No one is ever going to use it in more than 1 language (probably English), you're just wasting everyone's disk space and bandwidth (including that of your server).

No, you're wasting everyone's time by forcing them to make sure they pick the correct locale at install time and never change it afterward. i18n string tables are fucking tiny compared to the available space on most systems. If you really care so much, the build process for the app will likely permit you to strip out the ones you don't need. Unless the app developer actually makes this difficult there is no point in complaining.

Name: Anonymous 2015-12-09 9:31

>>109
me 2
Cudder is my waifu

Name: Anonymous 2015-12-09 10:03

trips

Name: Anonymous 2015-12-09 10:18

>>111
Sweet!

Name: Anonymous 2015-12-10 3:17

>>1
FUCK OFF YOUR ANIME SUCKS, YOUR HARDWARE SUCKS, GO FUCK CUDDER YOU SHRIMP

Name: Anonymous 2015-12-10 6:15

What would LAC do in a situation like this?

Name: Anonymous 2015-12-10 7:48

>>114
Probably call us a bunch of stack boys.

Name: Cudder !cXCudderUE 2015-12-10 11:13

If it's already in L1, it's not extra work
You forgot the fact that L1 is shared by all the code running on the CPU, and if your code is taking up some portion of it, it got there by forcing some other code out. Not realising this is why microbenchmarking is stupid. Faster in this small part makes everything else slower.

No, you're wasting everyone's time by forcing them to make sure they pick the correct locale at install time and never change it afterward.
Waiting a lot longer (it adds up really quickly if everyone does this) for each download, along with extracting and installing shit I don't need and probably never will, or an extra second or two of nearly no thought?

i18n string tables are fucking tiny compared to the available space on most systems
"The amount of chemicals we dump into the ocean is fucking tiny compared to the amount of water in there". That didn't turn out so well, did it? It's not just YOU who will be doing it, if you're saying everyone should be.

http://i.stack.imgur.com/yDd8C.png

Look at the size of all the Windows language packs. That's not even all of them and it's 1GB+ already. Do you want to wait for another hour or two, paying for the power and bandwidth, downloading data which is absolutely useless to you?

Name: suigin 2015-12-11 11:49

>>116
Are you happy? It's not fully working yet, but it will be soon. Baka.

https://github.com/suiginsoft/hebimath/blob/master/config.mk

As a consequence, I now generate hebimath.h from a simple template processed by an awk one-liner.

Name: suigin 2015-12-13 13:53

>>108
Thanks. I now have a generic version in C. It's still over twice as slow as the assembly optimized version.

It's hard to get it to generate good code without venturing outside of the standard. The largest guaranteed integer type is of course unsigned long long (or uintmax_t, which is a typedef for unsigned long long on most platforms). So on i386 or x86-64, for example, there's no standard way to capture the high 64 bits of a 64x64->128 multiply. You are stuck with doing 32x32->64 bit multiplications in C on these platforms.
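For reference, the portable 32x32->64 workaround looks like this (on GCC/Clang you'd just use the non-standard `unsigned __int128` instead):

```c
#include <stdint.h>

/* 64x64 -> 128-bit multiply in standard C: four 32x32 -> 64-bit
 * partial products, recombined. Result is returned as hi:lo. */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0;
    uint64_t p01 = a0 * b1;
    uint64_t p10 = a1 * b0;
    uint64_t p11 = a1 * b1;

    /* mid cannot overflow: p01 <= (2^32-1)^2 leaves exactly enough
     * headroom for the two 32-bit terms added to it. */
    uint64_t mid = p01 + (p00 >> 32) + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p10 >> 32) + (mid >> 32);
}
```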

Name: Anonymous 2015-12-13 17:31

>>118
long long long cat;

Name: suigin 2015-12-14 13:05

>>108,118
Was going to do a backend specifically for Clang using its language extensions, but it turns out it generates suboptimal code. Instead of keeping track of the carry flag in eflags, it uses sbb to move the carry flag into a GPR. It performs no better than the version written in standard C.

I'm surprised they still can't generate optimal code for arbitrary-precision arithmetic, even with intrinsics, after all these years.
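For reference, the intrinsic-style loop in question, written with GCC/Clang's `__builtin_add_overflow` (whether the compiler fuses the two overflow checks into a single adc chain is exactly the codegen problem described above):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* r = a + b over n limbs, little-endian; returns the final carry.
 * The two overflow checks per limb can never both be true: if the
 * first add wraps, the result is at most 2^64 - 2, so adding the
 * carry bit cannot wrap again. */
static bool addn(uint64_t *r, const uint64_t *a, const uint64_t *b,
                 size_t n)
{
    bool carry = false;
    for (size_t i = 0; i < n; i++) {
        uint64_t t;
        bool c1 = __builtin_add_overflow(a[i], b[i], &t);
        bool c2 = __builtin_add_overflow(t, (uint64_t)carry, &r[i]);
        carry = c1 | c2;
    }
    return carry;
}
```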
