AMD K7 MPN SUBROUTINES This directory contains code optimized for the AMD Athalon CPU. The mmx subdirectory has routines using mmx instructions. All Athalons have MMX, the separate directory is just so that configure can omit it if the assembler doesn't support MMX. STATUS Times for the loops, with all code and data in L1 cache. cycles/limb mpn_add/sub_n 1.6 mpn_copyi 0.75 or 1.0 \ varying with data alignment mpn_copyd 0.75 or 1.0 / mpn_divrem_1 42.0 mpn_mod_1 42.0 mpn_l/rshift 1.2 mpn_mul_1 3.4 mpn_addmul/submul_1 3.9 mpn_mul_basecase 4.42 cycles/crossproduct (approx) Prefetching of sources hasn't yet been tried. NOTES cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available. Write-allocate L1 data cache means prefetching of destinations is unnecessary. Floating point multiplications can be done in parallel with integer multiplications, but there doesn't seem to be any way to make use of this. Unsigned "mul"s take 3 cycles executed back-to-back, so long as eax is loaded so it doesn't cause a dependency on the previous mul. This suggests 3 is a limit on the speed of the multiplication routines. The documentation shows mul executing in IEU0 (or maybe in IEU0 and IEU1 together), so it might be that, to get near 3 cycles code has to be arranged so that nothing else is issued to IEU0. A busy IEU0 could explain why some code takes 4 cycles and other apparently equivalent code takes 5. OPTIMIZATIONS Unrolled loops are used to reduce looping overhead. The unrolling is configurable up to 32 limbs/loop for most routines and up to 64 for some. The K7 has 64k L1 code cache so quite big unrolling is allowable. Computed jumps into the unrolling are used to handle sizes not a multiple of the unrolling. An attractive feature of this is that times increase smoothly with operand size, but it may be that some routines should just have simple loops to finish up, especially when PIC adds between 2 and 16 cycles to get eip. Position independent code is implemented using a call to get eip for the computed jumps and a ret is always done, rather than an addl $4,%esp or a popl, so the CPU return address branch prediction stack stays synchronised with the actual stack in memory. Branch prediction, in absence of any history, will guess forward jumps are not taken and backward jumps are taken. Where possible it's arranged that the less likely or less important case is under a taken forward jump. CODING Instructions in general code have been shown grouped if they can execute together, which means up to three direct-path instructions which have no successive dependencies. K7 always decodes three and has out-of-order execution, but the groupings show what slots are available and what dependency chains exist. When there's vector-path instructions an effort is made to get triplets of direct-path instructions in between them, even if there's dependencies, since this maximizes decoding throughput and might save a cycle or two if decoding is the limiting factor. INSTRUCTIONS adcl direct divl 39 cycles back-to-back lodsl,etc vector loop 1 cycle vector (decl/jnz opens up one decode slot) movd reg vector movd mem direct mull 3 cycles when back-to-back with no dependency on a previous mul popl vector (use movl for more than one pop) pushl direct, will pair with a load shrdl %cl vector, 3 cycles, seems to be 3 execute too REFERENCES "AMD Athlon Processor X86 Code Optimization Guide", AMD publication number 22007, revision E, November 1999. Available on-line, http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf "3DNow Technology Manual", AMD publication number 21928F/0-August 1999. This describes the femms and prefetch instructions. Available on-line, http://www.amd.com/K6/k6docs/pdf/21928.pdf "AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD publication number 22466, revision B, August 1999. This describes instructions added in the Athalon processor, such as pswapd and the extra prefetch forms. Available on-line, http://www.amd.com/products/cpg/athlon/techdocs/pdf/22466.pdf ---------------- Local variables: mode: text fill-column: 76 End: