This directory contains mpn functions optimized for DEC Alpha processors.

ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of the instructions that need
special handling by unwinding.  It also says whether $27 is really needed for
computing the gp.  The `.mask M' pseudo op says which registers are saved on
the stack, and at what offset in the frame.

Cray code is very very different...

RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is
   write-through, and a cache line is transferred from the store buffer to
   the off-chip L2 in as much as 15 cycles on most systems.  This delay hurts
   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
   these cycles are pipelined.  Thus, multiply instructions can be issued at
   a rate of one each 21st cycle.

EV5

1. The memory bandwidth of this chip seems excellent, both for loads and
   stores.  Even when the working set is larger than the on-chip L1 and L2
   caches, the performance remains almost unaffected.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
   umulh has a measured latency of 14 cycles and an issue rate of 1 each
   10th cycle.  But the exact timing is somewhat confusing.

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, of which 12
   are memory operations.  This will take at least

	ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles

   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 19 issue cycles.
   The computation is inherently serial, with these dependencies:

	      ldq  ldq
	        \  /\
	 (or)   addq |
	  |\   /   \ |
	  | \ /     \|
	  |  addq  cmpult
	   \   |      |
	    cmpult    |
	        \    /
	         or

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs.  We could replace the `or'
   with a cmoveq/cmovne, which could issue one cycle earlier than the `or',
   but that might waste a cycle on EV4.

   The total depth remains unaffected, since cmov has a latency of 2 cycles.

	     addq
	     /   \
	   addq  cmpult
	     |      \
	   cmpult -> cmovne

Montgomery has a slightly different way of computing carry that requires one
less instruction, but has depth 4 (instead of the current 3).  Since the code
is currently instruction issue bound, Montgomery's idea should save us 1/2
cycle per limb, or bring us down to a total of 17 cycles or 4.25 cycles/limb.
Unfortunately, this method will not be good for the EV6.

EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  One integer multiply instruction can issue each
cycle.  To get optimal speed, we need to pretend we are vectorizing the code,
i.e., minimize the iterative dependencies.

There are two dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling the loop, so
that addresses don't depend heavily on an index variable.  Avoiding
serializing because of carry propagation is trickier; the ultimate
performance of the code will be determined by the number of latency cycles it
takes from accepting carry-in at a vector point until we can generate
carry-out.
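To make the carry-propagation chain concrete, here is a minimal C sketch of
the per-limb work in addmul_1 ({rp,n} += {up,n} * v).  This is neither the
assembly code nor GMP's C code; the function and variable names are invented
for illustration, and a compiler providing unsigned __int128 is assumed.  The
comments name the Alpha instructions each step corresponds to in the
dependency diagram that follows.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: rp[] += up[] * v, returning the final carry limb.  */
    uint64_t
    addmul_1_sketch (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
    {
      uint64_t cy = 0;                        /* carry limb between points */
      for (size_t i = 0; i < n; i++)
        {
          uint64_t ul = up[i];                              /* ldq   */
          unsigned __int128 p = (unsigned __int128) ul * v;
          uint64_t pl = (uint64_t) p;                       /* mulq  */
          uint64_t ph = (uint64_t) (p >> 64);               /* umulh */

          uint64_t rl = rp[i];                              /* ldq   */
          uint64_t t1 = pl + cy;      /* addq: carry-in meets low product  */
          uint64_t c1 = t1 < cy;      /* cmpult: carry from that add       */
          uint64_t t2 = rl + t1;      /* addq: accumulate into the rp limb */
          uint64_t c2 = t2 < t1;      /* cmpult: carry from that add       */

          rp[i] = t2;                                       /* stq   */
          cy = ph + c1 + c2;    /* carry-out feeds the next point's chain  */
        }
      return cy;
    }

The only value carried from one iteration to the next is cy; the loads and
the two multiply halves are independent work, which is exactly what the
diagram below shows being overlapped across consecutive points.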
addmul_1 dependency diagram (needs a 175-column-wide screen):

  [ASCII diagram: for each of several overlapping addmul_1 points it traces
  the ldq / mulq / umulh chain feeding the addq / cmpult / addq / cmpult / stq
  accumulation, with the carry (cy) from each point's add/compare chain
  feeding the next point; the left column gives the issue cycle of each
  instruction and the right column gives per-step latency annotations.]

As many as 6 consecutive points will be under execution simultaneously, or if
we schedule loads even further away, maybe 7 or 8.  But the number of live
quantities is reasonable, and can easily be satisfied.
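As a rough illustration of the unrolling argued for above, here is a
hypothetical 4-way unrolled variant of the earlier sketch (again plain C with
invented names, not the actual loop, and again assuming unsigned __int128):
all addresses are fixed offsets from pointers bumped once per iteration, so
address arithmetic does not serialize the loop, and the four multiplies are
independent of the carry chain.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: 4-way unrolled rp[] += up[] * v, n a multiple of 4.
       The four products use fixed offsets up[0]..up[3]; only the chain
       through cy is serial.  */
    uint64_t
    addmul_1_unrolled_sketch (uint64_t *rp, const uint64_t *up, size_t n,
                              uint64_t v)
    {
      uint64_t cy = 0;
      for (size_t i = 0; i < n; i += 4, up += 4, rp += 4)
        {
          /* Mutually independent multiplies: all can be in flight at once.  */
          unsigned __int128 p0 = (unsigned __int128) up[0] * v;
          unsigned __int128 p1 = (unsigned __int128) up[1] * v;
          unsigned __int128 p2 = (unsigned __int128) up[2] * v;
          unsigned __int128 p3 = (unsigned __int128) up[3] * v;

          /* The serial part: fold carry-in, store, produce carry-out.  */
          unsigned __int128 s;
          s = (unsigned __int128) rp[0] + (uint64_t) p0 + cy;
          rp[0] = (uint64_t) s;
          cy = (uint64_t) (s >> 64) + (uint64_t) (p0 >> 64);
          s = (unsigned __int128) rp[1] + (uint64_t) p1 + cy;
          rp[1] = (uint64_t) s;
          cy = (uint64_t) (s >> 64) + (uint64_t) (p1 >> 64);
          s = (unsigned __int128) rp[2] + (uint64_t) p2 + cy;
          rp[2] = (uint64_t) s;
          cy = (uint64_t) (s >> 64) + (uint64_t) (p2 >> 64);
          s = (unsigned __int128) rp[3] + (uint64_t) p3 + cy;
          rp[3] = (uint64_t) s;
          cy = (uint64_t) (s >> 64) + (uint64_t) (p3 >> 64);
        }
      return cy;
    }

Scheduling the loads and multiplies even further ahead of the carry chain is
what allows the 6 to 8 overlapping points mentioned above.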