This file lists itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the file projects, or http://www.swox.com/gmp/projects.html for more sizeable projects.
* _mpz_realloc with a small (1 limb) size.
* mpz_XXX(a,a,a).
* For exponents that approach the limits of mp_exp_t, the precision of
  __mp_bases[base].chars_per_bit_exactly is insufficient and mpf_get_str
  aborts. Detect and compensate.
* Make mpz_get_si work properly for the MIPS N32 ABI (and other machines
  that use long long for storing limbs).
* The mpz_*_ui division routines: perhaps make them return the real
  remainder instead? That changes the return type to signed long int.
* mpf_eq is not always correct when one operand is 1000000000... and the
  other operand is 0111111111..., i.e., when the operands are extremely
  close. There is a special case in mpf_sub for this situation; put
  similar code in mpf_eq.
* mpn_divrem cannot throw away the quotient when XSIZE != 0. Document or
  fix.
* __builtin_constant_p is unavailable? Same problem with MacOS X.
* dump_abort in mp?/tests/*.c.
* mpz_get_si returns 0x80000000 for -0x100000000.
* Avoid count_leading_zeros if the bit is set. This is an optimization on
  all machines, and significant on machines with a slow
  count_leading_zeros.
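
  A minimal C sketch of the idea, assuming "the bit" means the most
  significant bit (helper names are hypothetical; in GMP the primitive is
  the count_leading_zeros macro in longlong.h):

      #include <limits.h>

      /* Portable reference count of leading zeros; stands in for a slow
         count_leading_zeros.  Requires x != 0.  */
      static int
      clz_slow (unsigned long x)
      {
        unsigned long top = 1UL << (sizeof (unsigned long) * CHAR_BIT - 1);
        int n = 0;
        while ((x & top) == 0)
          { n++; x <<= 1; }
        return n;
      }

      /* Skip the slow primitive when the high bit is already set, in
         which case the count is known to be 0.  Requires x != 0.  */
      static int
      leading_zeros (unsigned long x)
      {
        unsigned long top = 1UL << (sizeof (unsigned long) * CHAR_BIT - 1);
        if (x & top)
          return 0;
        return clz_slow (x);
      }
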
* count_trailing_zeros is used on more or less uniformly distributed
  numbers. For some CPUs count_trailing_zeros is slow and it's probably
  worth handling the frequently occurring 0 to 2 trailing zeros cases
  specially.
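
  A small C sketch of that special-casing (helper names are hypothetical;
  the real primitive is the count_trailing_zeros macro in longlong.h):

      /* Portable reference count of trailing zeros; stands in for a slow
         count_trailing_zeros.  Requires x != 0.  */
      static int
      ctz_slow (unsigned long x)
      {
        int n = 0;
        while ((x & 1) == 0)
          { n++; x >>= 1; }
        return n;
      }

      /* Peel off the common 0, 1 and 2 trailing zeros cases with cheap
         tests before falling back to the slow primitive.  Requires x != 0.  */
      static int
      trailing_zeros (unsigned long x)
      {
        if (x & 1)
          return 0;                     /* about 1/2 of uniform inputs */
        if (x & 2)
          return 1;                     /* about 1/4 */
        if (x & 4)
          return 2;                     /* about 1/8 */
        return ctz_slow (x);            /* 3 or more trailing zeros */
      }
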
* Change code that uses udiv_qrnnd for inverting limbs to instead use
  mpn_invert_limb.
* Rewrite umul_ppmm to use floating-point for generating the most
  significant limb (if BITS_PER_MP_LIMB <= 52 bits). (Peter Montgomery
  has some ideas on this subject.)
* The umul_ppmm code in longlong.h: add the partial products with fewer
  operations.
* Make mpn_get_str and mpn_set_str run in the sub-O(n^2) range, using
  some divide-and-conquer approach, preferably without using division.
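
  A minimal mpz-level sketch of the divide-and-conquer direction (the
  function name is hypothetical, and it still leans on division, which
  the task would prefer to avoid; the real work would be at the mpn
  level):

      #include <string.h>
      #include <gmp.h>

      /* Fill buf[0..digits-1] with the base-`base' digits of n, padded on
         the left with '0'.  Assumes 0 <= n < base^digits and base <= 10.  */
      static void
      dc_get_str (char *buf, const mpz_t n, size_t digits, int base)
      {
        if (digits <= 32)
          {
            char s[40];                  /* room for up to 32 digits + NUL */
            size_t len;
            mpz_get_str (s, base, n);
            len = strlen (s);
            memset (buf, '0', digits - len);
            memcpy (buf + (digits - len), s, len);
          }
        else
          {
            size_t lo = digits / 2;
            mpz_t pow, q, r;
            mpz_init (pow); mpz_init (q); mpz_init (r);
            mpz_ui_pow_ui (pow, base, lo);                 /* base^(digits/2) */
            mpz_tdiv_qr (q, r, n, pow);                    /* n = q*pow + r */
            dc_get_str (buf, q, digits - lo, base);        /* high digits */
            dc_get_str (buf + (digits - lo), r, lo, base); /* low digits */
            mpz_clear (pow); mpz_clear (q); mpz_clear (r);
          }
      }
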
* mpn_get_str to mpf/get_str. (Talk to Torbjörn about this.)
* mpz_size, mpz_set_ui, mpz_set_q, mpz_clear, mpz_init, mpz_get_ui,
  mpz_scan0, mpz_scan1, mpz_getlimbn, mpz_init_set_ui,
  mpz_perfect_square_p, mpz_popcount, mpf_size, mpf_get_prec,
  mpf_set_prec_raw, mpf_set_ui, mpf_init, mpf_init2, mpf_clear,
  mpf_set_si.
* Write mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21264. On the
  21264, they should run at 4, 3, and 3 cycles/limb respectively, if the
  code is unrolled properly. (Ask Torbjörn for his xm.s and xam.s
  skeleton files.)
* Write mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21164. These
  should use both integer multiplies and floating-point multiplies. For
  the floating-point operations, the single-limb multiplier should be
  split into three 21-bit chunks.
* Improve mpn_add_n and mpn_sub_n for sparc64. The current sparc64 code
  uses MOVcc instructions, which take about 6 cycles on UltraSPARC. The
  correct approach is probably to use conditional branching. That should
  lead to loops that run at 4 cycles/limb. (Torbjörn has code that just
  needs to be finished.)
* mpn_addmul_1, mpn_submul_1, and mpn_mul_1. Should use floating-point
  operations, and split the invariant single-limb multiplier into 21-bit
  chunks. Should give about 18 cycles/limb, but the pipeline will become
  very deep. (Torbjörn has C code that is useful as a starting point.)
* mpn_lshift and mpn_rshift. Should give 2 cycles/limb. (Torbjörn has
  code that just needs to be finished.)
* mpn_addmul_1, mpn_submul_1, and mpn_mul_1. The current development code
  runs at 11 cycles/limb, which is already very good. But it should be
  possible to saturate the cache, which will happen at 7.5 cycles/limb.
* umul_ppmm. Important in particular for mpn_sqr_basecase.
* mpn_mul_basecase and mpn_sqr_basecase for important machines.
* mpn_lshift/mpn_rshift. Will bring time from 1.75 to 1.25 cycles/limb.
* mpn_lshift for shifts by 1. (See Pentium code.)
* count_leading_zeros.
* udiv_qrnnd. (Ask Torbjörn for the file test-udiv-preinv.c as a starting
  point.)
* mpn_add_n and mpn_sub_n. It should just require 3 cycles/limb, but the
  current code propagates carry poorly. The trick is to add carry-in
  later than we do now, decreasing the number of operations used to
  generate carry-out from 4 to 3.
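
  A generic-C illustration of the reformulation (a sketch of the idea
  only; the real gain is in assembly instruction scheduling, which this
  does not capture):

      typedef unsigned long limb_t;      /* stand-in for mp_limb_t */

      /* Reference add_n: compute the word sum and its carry-out first,
         then fold in the incoming carry, which can generate at most one
         further carry.  Returns the final carry-out.  */
      static limb_t
      add_n_ref (limb_t *rp, const limb_t *ap, const limb_t *bp, long n)
      {
        limb_t cy = 0;
        long i;
        for (i = 0; i < n; i++)
          {
            limb_t a = ap[i], b = bp[i];
            limb_t s = a + b;            /* sum without carry-in */
            limb_t c = s < a;            /* carry-out of a + b */
            limb_t r = s + cy;           /* add carry-in last */
            c += r < s;                  /* at most one extra carry */
            rp[i] = r;
            cy = c;
          }
        return cy;
      }
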
* mpn_lshift. The pipeline is now extremely deep, perhaps unnecessarily
  deep. Also, r5 is unused. (Ask Torbjörn for a copy of the current
  code.)
* mpn_rshift based on new mpn_lshift.
* mpn_add_n and mpn_sub_n. Should run at just 3.25 cycles/limb. (Ask for
  xxx-add_n.s as a starting point.)
* mpn_mul_basecase and mpn_sqr_basecase. This should use a "vertical
  multiplication method" to avoid carry propagation, splitting one of the
  operands into 11-bit chunks.
* mpn_mul_basecase and mpn_sqr_basecase. The same comment applies to this
  as to the same functions for Fujitsu VPP.
* count_leading_zeros for 64-bit machines:

      if ((x >> (W_TYPE_SIZE - W_TYPE_SIZE/2)) == 0)
        { x <<= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2; }
      if ((x >> (W_TYPE_SIZE - W_TYPE_SIZE/4)) == 0)
        { x <<= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4; }
      ...
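
  A self-contained version of the full binary search, as a sketch (the
  function name is hypothetical; GMP's actual macro lives in longlong.h
  and returns the count through an argument):

      #include <stdint.h>

      /* Count leading zeros of a 64-bit word by binary search.
         Requires x != 0.  */
      static int
      clz64 (uint64_t x)
      {
        int cnt = 0;
        if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
        if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
        if ((x >> 56) == 0) { x <<= 8;  cnt += 8;  }
        if ((x >> 60) == 0) { x <<= 4;  cnt += 4;  }
        if ((x >> 62) == 0) { x <<= 2;  cnt += 2;  }
        if ((x >> 63) == 0) { cnt += 1; }
        return cnt;
      }
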
* mpz_get_nth_ui. Return the nth word (not necessarily the nth limb).
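
  A hypothetical sketch of the intended semantics for 32-bit words,
  expressed with the existing mpz interface (word n meaning bits
  32n..32n+31 of the absolute value, regardless of the host's limb size):

      #include <gmp.h>

      static unsigned long
      get_nth_word32 (const mpz_t x, unsigned long n)
      {
        mpz_t t;
        unsigned long w;
        mpz_init (t);
        mpz_tdiv_q_2exp (t, x, 32 * n);     /* shift word n to the bottom */
        w = mpz_get_ui (t) & 0xffffffffUL;  /* take its low 32 bits */
        mpz_clear (t);
        return w;
      }
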
* mpz_crr (Chinese Remainder Reconstruction).
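
  A hypothetical sketch of what such a function might compute, shown for
  the two-modulus case (assumes gcd(m1, m2) = 1 and 0 <= r1 < m1; returns
  0 if the moduli are not coprime):

      #include <gmp.h>

      /* Find x with x = r1 (mod m1) and x = r2 (mod m2), 0 <= x < m1*m2.  */
      static int
      crr2 (mpz_t x, const mpz_t r1, const mpz_t m1,
                     const mpz_t r2, const mpz_t m2)
      {
        mpz_t d, t;
        int ok;
        mpz_init (d);
        mpz_init (t);
        ok = mpz_invert (d, m1, m2);    /* d = m1^-1 mod m2, if it exists */
        if (ok)
          {
            mpz_sub (t, r2, r1);
            mpz_mul (t, t, d);
            mpz_mod (t, t, m2);         /* t = (r2 - r1) * m1^-1 mod m2 */
            mpz_mul (x, t, m1);
            mpz_add (x, x, r1);         /* x = r1 + m1*t */
          }
        mpz_clear (d);
        mpz_clear (t);
        return ok;
      }
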
* mpq_set_f for assignment from mpf_t (cf. mpq_set_d).
* Make mpz_init (and mpq_init) do lazy allocation. Set ALLOC(var) to 0,
  and have mpz_realloc special-handle that case. Update functions that
  rely on a single limb (like mpz_set_ui, mpz_[tfc]div_r_ui, and others).
* Add mpf_out_raw and mpf_inp_raw. Make sure the format is portable
  between 32-bit and 64-bit machines, and between little-endian and
  big-endian machines.
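
  One way to get that portability is to define the format in terms of
  bytes rather than limbs. A hypothetical sketch for the magnitude part,
  handled here as an mpz (an mpf_t's sign and exponent would be stored
  separately):

      #include <stdio.h>
      #include <gmp.h>

      /* Write |z| as a 4-byte big-endian length prefix followed by the
         magnitude bytes, most significant first.  The result is
         independent of the host's limb size and endianness.  */
      static void
      write_mag (FILE *fp, const mpz_t z)
      {
        size_t nbytes = (mpz_sizeinbase (z, 2) + 7) / 8;
        size_t i;
        mpz_t t;

        putc ((nbytes >> 24) & 0xff, fp);
        putc ((nbytes >> 16) & 0xff, fp);
        putc ((nbytes >> 8) & 0xff, fp);
        putc (nbytes & 0xff, fp);

        mpz_init (t);
        for (i = nbytes; i-- > 0; )
          {
            mpz_tdiv_q_2exp (t, z, 8 * i);    /* shift byte i down */
            putc (mpz_get_ui (t) & 0xff, fp); /* emit MSB first */
          }
        mpz_clear (t);
      }
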
* gmp_errno.
* gmp_fprintf, gmp_sprintf, and gmp_snprintf.
* mpq input and output functions.
* Extend mpz_jacobi to a full range of inputs. Kevin Ryde is working on
  this.
* Make mpz_div and mpz_divmod use rounding analogous to mpz_mod.
  Document, and list as an incompatibility.
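
  For reference, the rounding difference in question, shown with the
  existing truncating and flooring divisions (mpz_mod gives the flooring,
  non-negative remainder for a positive divisor):

      #include <stdio.h>
      #include <gmp.h>

      int
      main (void)
      {
        mpz_t n, d, q, r;
        mpz_init_set_si (n, -7);
        mpz_init_set_si (d, 2);
        mpz_init (q);
        mpz_init (r);

        mpz_tdiv_qr (q, r, n, d);    /* truncating: q = -3, r = -1 */
        printf ("tdiv: q=%ld r=%ld\n", mpz_get_si (q), mpz_get_si (r));

        mpz_fdiv_qr (q, r, n, d);    /* flooring:   q = -4, r = 1 */
        printf ("fdiv: q=%ld r=%ld\n", mpz_get_si (q), mpz_get_si (r));

        mpz_clear (n); mpz_clear (d); mpz_clear (q); mpz_clear (r);
        return 0;
      }
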
* Call malloc separately for each TMP_ALLOC block, so a redzoning malloc
  debugger could be used during development.
* Make sure that unsigned long int is used for bit counts/ranges, and
  that mp_size_t is used for limb counts.
* mpz_inp_str (etc) doesn't say when it stops reading digits.

Please send comments about this page to tege@swox.com.
Copyright (C) 1999, 2000 Torbjörn Granlund.