Return Styles: Pseud0ch, Terminal, Valhalla, NES, Geocities, Blue Moon. Entire thread

designing a suckless bignum library

Name: Anonymous 2015-11-16 22:11

Let's design a suckless bignum library. (I'm not part of suckless though, just curious about replacing GMP).

I researched a bit into algorithms and the rundown is this:
* long multiplication: O(n^2)
* karatsuba O(n^1.5)
* Toom-Cook, fourier transform based methods - even faster but only used for numbers 10k digits+ long. Much more complex.

So we should probably use karatsuba for all multiplications. Squaring can be done a bit faster than multiplying two different numbers sometimes.

Now I suggest programming it in assembly, that gives you access to the carry bit (C doesn't get you that). Of course we will use libc and the normal C calling conventions so that it's a regular C library.

What to do about memory management? e.g. if you want to add two numbers do we need to allocate a new 'number' as long as the largest to write the result into or do it destructively "x <- x + y"? Maybe the library should support both - then a calculator program would figure out the best primitives to use for a given computation.

It might be nice to also support things like (big modulus) modular arithmetic and polynomials. stuff like exponentiation and modular inverses have interesting algorithms.

What other integer operations would we want? I don't really want to do anything with arb. prec. real numbers - arithmetic with rationals could be done though.

Name: Anonymous 2015-11-20 7:21

You're fucking welcome.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define swap(__a,__b); { __a = __a ^ __b; __b = __a ^ __b; __a = __a ^ __b; }

char* shift_left(char* src){
int len = strlen(src);
char* moved = calloc(len - 1, sizeof(char));
memcpy(moved, src + 1, len - 1);
return moved;
}

char* add(char* addend, char* addend2){
int len = strlen(addend), a2len = strlen(addend2);
int swapped = 0, result_digits = 0;

if (len < a2len){
swap((int)addend, (int)addend2);
swap(len, a2len);
swapped = 1;
}
char* a2resized = calloc(len + 1, sizeof(char));
memset(a2resized, '0', len);
a2resized[len] = 0;
memcpy(a2resized + len - a2len, addend2, a2len);

int reslen = len + 1;
char* result = calloc(reslen, sizeof(char));
memset(result, '0', reslen);
int carry = 0;
for (int i = 0; i < len; i++){
int respos = reslen - i - 1;
int a1val = addend[len - i - 1] - 48, a2val = a2resized[len - i - 1] - 48;
int val = a1val + a2val + carry;
carry = 0;
if (val >= 10){
result[respos] = (val - 10) + 48;
carry = 1;
}
else{
result[respos] = val + 48;
}
}
free(a2resized);
if (swapped == 1) swap((int)addend, (int)addend2);
if (carry == 1){
result[0] = 49;
result[reslen] = '\0';
}else{
char* shifted_result = shift_left(result);
swap((int)result, (int)shifted_result);
free(shifted_result);
result[reslen-1] = '\0';
}
return result;
}

int main(int argc, char *argv[]){
char* a = "99999999999999999999";
char* b = "99999999999999999999";
char* c;
char* d = "9999999999999990";
swap((int)a, (int)b);
c = add(a, b);
printf("%s + %s = %s\n", a, b, c);
char* res = add(d, c);
printf("%s + %s = %s\n", d, c, res);
return 0;
}

Name: Anonymous 2015-11-20 16:39

op, if you're serious about this, do:

* modular arith only (ie, crypto only). for fast infinite precision, you really need hairy ball of code like GMP, no way to be suckless
* shift-and-add multiplier, then you get modulo for free
* Why not schoolbook & barrett reduction? Sure it works better on superscalar CPUs with ALUs, but it *will* suck horribly on a CPU without multiplier (which is everything really low power these days)
* basics, ie modular add,sub,mul,pow are ~500loc. prime tester (miller-rabin) 100 loc. other bells and whistles (ie actual PKCS+RSA, DHE, maybe even ECDHE) 1kloc. 2kloc all.

What to do about memory management?

The golden rule of suckless C - just forget dynamic memory. Dynamically allocated memory was a mistake. K&R were right. You free memory through return or exit(). Just cap your bignum type to some hardcoded limb count. Sacrifice a bit of stack depth, and and no need for silly realloc() every time you add two numbers. Not to mention then you don't even need libc and can run baremetal easily.

As an example of bignum of this style, here is 60 loc RSA signature verifier for a (very) ROM space constrained embedded MCU bootloader:

#define RSA_BYTES (RSA_BITS/8)
#define DIGIT uint32_t
#define DIGIT_BYTES (sizeof(DIGIT))
#define DIGIT_BITS (DIGIT_BYTES*8)
#define DIGIT_MAX ((DIGIT)(-1))
#define NDIGITS (RSA_BITS/DIGIT_BITS)

static int b_add(DIGIT * restrict r, const DIGIT *x, const DIGIT *y)
{
DIGIT w, carry = 0;
for (int i = 0; i < NDIGITS; i++) {
if ((w = x[i] + carry) < carry)
w = y[i];
else
carry = ((w += y[i]) < y[i]);
r[i] = w;
}
return carry;
}

static int b_mulmod(DIGIT * restrict res, const DIGIT *xsrc, const DIGIT *y, const DIGIT *mod)
{
DIGIT rbuf1[NDIGITS], rbuf2[NDIGITS], xbuf1[NDIGITS], xbuf2[NDIGITS];

DIGIT *r1 = rbuf1;
DIGIT *r2 = rbuf2;
DIGIT *x1 = xbuf1;
DIGIT *x2 = xbuf2;

DIGIT *swp;

memset(rbuf1, 0, sizeof rbuf1);
memcpy(xbuf1, xsrc, sizeof xbuf1);

for (int i = 0; i < NDIGITS; i++) {
for (DIGIT bit = 1; bit; bit += bit) {
if (y[i] & bit) {
if (b_add(r2, r1, x1))
return -1;
if (!b_add(r1, r2, mod)) {
swp = r1;
r1 = r2;
r2 = swp;
}
}
if (b_add(x2, x1, x1))
return -1;

if (!b_add(x1, x2, mod)) {
swp = x1;
x1 = x2;
x2 = swp;
}
}
}
memcpy(res, r1, sizeof rbuf1);
return 0;
}

static int rsa_public(DIGIT *output, const DIGIT *input, const DIGIT *modulus)
{
DIGIT buf2[NDIGITS];
/* buf2 = buf^2 % modulus */
if (b_mulmod(buf2, input, input, modulus))
return -1;
/* buf3 = buf^3 % modulus */
return b_mulmod(output, buf2, input, modulus);
}


(modulus is stored negated, thus no need for b_sub)

Name: Anonymous 2015-11-21 4:24

>>35
>>41
Why not just use libtommath? It's already recommended by suckless. The only problem is that it being pure C, isn't as optimized as it could be if it were hand-coded assembly.

It's right near the top as an alternative to GMP: http://suckless.org/sucks/

Name: Anonymous 2015-11-21 17:17

>>48
Indeed libtom is very close to what
>>42
is suggesting - a good compromise in not being too slow for somewhat general use, but still far from a good choice for heavy (ie non-crypto) number crunching. That's why it might be wiser to strive for some absolute reductionism (fe specialize to embedded and crypto, and then you can do a lot under 2kloc without being too slow) and not directly compete with tommath.

The problem here is that general fast infinite precision is simply collection of dozens of algorithms (libtom is far beyond 2kloc suckless limit :), the more you specialize for various cases (including heuristics to auto choose method to do something, for easier use), the "faster" is the lib overall. The disadvantage is eventually you end up with a kitchen sink like GMP, NTT or ARPREC.

>>43
Engineering is empirical - hearsays and superstitions. You can find rigorous proofs of complexity of individual algorithms, but not for bignum in general, as that depends on what number theoretic tricks you actually decide to implement.

>>48
Assembly
Ah, just drop it. Hand written asm (of which there is not immense amount in GMP, just inner loop kernels) can squeeze say, 10-20% more. But that makes sense only after you exhausted algorithmic optimizations which can be often much faster. For example NTT or ARPREC can be still magnitude faster than GMP in some cases (say, non-modular multiplication), despite being pure C++.

I think that tommath indeed is pretty close to to GMP in terms of implemented tricks, but lacks ntt, and asm kernels IIRC.

Name: Anonymous 2015-11-21 19:00

>>51
Really strong points. Gotta rethink the scope of this somewhat to be able to pull it off.

regarding assembly: But I don't want to have to do some stupid thing that relies on the compiler pattern matching against it to figure out i meant the carry bit. I just want to say 'check the carry bit'.

Newer Posts
Don't change these.
Name: Email:
Entire Thread Thread List