From 11da511c784eca003deb90c23570f0873954e0de Mon Sep 17 00:00:00 2001
From: Duncan Wilkie
Date: Sat, 18 Nov 2023 06:11:09 -0600
Subject: Initial commit.

---
 gmp-6.3.0/mpn/ia64/mode1o.asm | 342 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 342 insertions(+)
 create mode 100644 gmp-6.3.0/mpn/ia64/mode1o.asm

diff --git a/gmp-6.3.0/mpn/ia64/mode1o.asm b/gmp-6.3.0/mpn/ia64/mode1o.asm
new file mode 100644
index 0000000..14d5e81
--- /dev/null
+++ b/gmp-6.3.0/mpn/ia64/mode1o.asm
@@ -0,0 +1,342 @@
+dnl  Itanium-2 mpn_modexact_1c_odd -- mpn by 1 exact remainder.
+
+dnl  Contributed to the GNU project by Kevin Ryde.
+
+dnl  Copyright 2003-2005 Free Software Foundation, Inc.
+
+dnl  This file is part of the GNU MP Library.
+dnl
+dnl  The GNU MP Library is free software; you can redistribute it and/or modify
+dnl  it under the terms of either:
+dnl
+dnl    * the GNU Lesser General Public License as published by the Free
+dnl      Software Foundation; either version 3 of the License, or (at your
+dnl      option) any later version.
+dnl
+dnl  or
+dnl
+dnl    * the GNU General Public License as published by the Free Software
+dnl      Foundation; either version 2 of the License, or (at your option) any
+dnl      later version.
+dnl
+dnl  or both in parallel, as here.
+dnl
+dnl  The GNU MP Library is distributed in the hope that it will be useful, but
+dnl  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+dnl  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+dnl  for more details.
+dnl
+dnl  You should have received copies of the GNU General Public License and the
+dnl  GNU Lesser General Public License along with the GNU MP Library.  If not,
+dnl  see https://www.gnu.org/licenses/.
+
+include(`../config.m4')
+
+
+C            cycles/limb
+C Itanium:       15
+C Itanium 2:      8
+
+
+dnl  Usage: ABI32(`code')
+dnl
+dnl  Emit the given code only under HAVE_ABI_32.
+dnl
+define(ABI32,
+m4_assert_onearg()
+`ifdef(`HAVE_ABI_32',`$1')')
+
+
+C mp_limb_t mpn_modexact_1c_odd (mp_srcptr src, mp_size_t size,
+C                                mp_limb_t divisor, mp_limb_t carry);
+C
+C The modexact algorithm is usually conceived as a dependent chain
+C
+C	l = src[i] - c
+C	q = low(l * inverse)
+C	c = high(q*divisor) + (src[i]<c)
+C
+C but with the ia64 multiply-accumulate instructions it can be rearranged as
+C
+C	si = src[i] * inverse
+C	q  = si - c*inverse          (= low((src[i]-c) * inverse))
+C	c  = high(q*divisor + c)
+C
+C leaving only the two xma multiplies on the dependent chain, with the
+C xmpy.l for si done off that chain.  The main loop does this pipelined
+C (hence requiring size>=2), and the calculation of q by the initial entry
+C sequence follows a different scheme.
+C
+C
+C Entry Sequence:
+C
+C In the entry sequence, the critical path is the calculation of the
+C inverse, so this is begun first and optimized.  Apart from that, ar.lc is
+C established nice and early so the br.cloop's should predict perfectly.
+C And the load for the low limbs src[0] and src[1] can be initiated long
+C ahead of where they're needed.
+C
+C
+C Inverse Calculation:
+C
+C The initial 8-bit inverse is calculated using a table lookup.  If it hits
+C L1 (which is likely if we're called several times) then it should take a
+C total of 4 cycles, otherwise hopefully L2 for 9 cycles.  This is
+C considered the best approach, on balance.  It could be done bitwise, but
+C that would probably be about 14 cycles (2 per bit beyond the first
+C couple).  Or it could be taken from 4 bits to 8 with xmpy doubling as
+C used beyond 8 bits, but that would be about 11 cycles.
+C
+C The table is not the same as binvert_limb_table, instead it's 256 bytes,
+C designed to be indexed by the low byte of the divisor.  The divisor is
+C always odd, so the relevant data is every second byte in the table.  The
+C padding lets us use zxt1 instead of extr.u; the latter would cost an
+C extra cycle because it must go down I0, and we're using the first I0 slot
+C to get ip.  The extra 128 bytes of padding should be insignificant
+C compared to typical ia64 code bloat.
+C
+C Having the table in .text allows us to use IP-relative addressing,
+C avoiding a fetch from ltoff.  .rodata is apparently not suitable for
+C IP-relative use; it gets a linker relocation overflow on GNU/Linux.
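The 8-bit table lookup plus multiply-based doubling described in the inverse-calculation comments can be sketched in portable C. This is an illustrative sketch, not GMP's code: the table here is built at runtime (the real one is a zero-padded constant in .text indexed with zxt1), and the helper names `build_table` and `binvert64` are invented for the example. Each Newton step `inv *= 2 - d*inv` doubles the number of correct low bits of the inverse.

```c
#include <stdint.h>

static uint8_t table[128];              /* inverse mod 2^8 of odd d, at index d>>1 */

static void build_table(void)
{
    for (unsigned d = 1; d < 256; d += 2) {
        unsigned inv = d;               /* any odd d has d*d == 1 (mod 8),    */
        inv *= 2 - d * inv;             /* so d is its own 3-bit inverse; two */
        inv *= 2 - d * inv;             /* Newton steps lift that to 12 bits  */
        table[d >> 1] = (uint8_t) inv;  /* keep the low 8 bits                */
    }
}

/* Inverse of an odd d modulo 2^64: 8 bits from the table, then three
   doubling steps (8 -> 16 -> 32 -> 64 bits), analogous to the xmpy-based
   widening the comments describe. */
static uint64_t binvert64(uint64_t d)
{
    uint64_t inv = table[(d & 0xFF) >> 1];
    inv *= 2 - d * inv;                 /* 16 bits */
    inv *= 2 - d * inv;                 /* 32 bits */
    inv *= 2 - d * inv;                 /* 64 bits: inv*d == 1 (mod 2^64) */
    return inv;
}
```

The runtime table build uses the same Newton iteration as the widening steps, just truncated to a byte; the real 256-byte table simply precomputes those 128 values with padding bytes between them.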
+C
+C
+C Load Scheduling:
+C
+C In the main loop, the data loads are scheduled for an L2 hit, which means
+C 6 cycles for the data ready to use.  In fact we end up 7 cycles ahead.  In
+C any case that scheduling is achieved simply by doing the load (and xmpy.l
+C for "si") in the immediately preceding iteration.
+C
+C The main loop requires size >= 2, and we handle size==1 by an initial
+C br.cloop to enter the loop only if size > 1.  Since ar.lc is established
+C early, this should predict perfectly.
+C
+C
+C Not done:
+C
+C Consideration was given to using a plain "(src[0]-c) % divisor" for
+C size==1, but cycle counting suggests about 50 for the sort of approach
+C taken by gcc __umodsi3, versus about 47 for the modexact.  (Both assuming
+C L1 hits for their respective fetching.)
+C
+C Consideration was given to a test for high<divisor ...
+	;;
+
+	C size==1, finish up now
+	xma.hu	f9 = f10, f6, f9	C c = high(q * divisor + c)
+	mov	ar.lc = r2		C I0
+	;;
+	getf.sig r8 = f9		C M2  return c
+	br.ret.sptk.many b0
+
+
+
+.Ltop:
+	C r2	saved ar.lc
+	C f6	divisor
+	C f7	inverse
+	C f8	-inverse
+	C f9	carry
+	C f10	src[i] * inverse
+	C f11	scratch, src[i+1]
+
+	add	r16 = 160, r32
+	ldf8	f11 = [r32], 8		C src[i+1]
+	;;
+	C 2 cycles
+
+	lfetch	[r16]
+	xma.l	f10 = f9, f8, f10	C q = c * -inverse + si
+	;;
+	C 3 cycles
+
+.Lentry:
+	xma.hu	f9 = f10, f6, f9	C c = high(q * divisor + c)
+	xmpy.l	f10 = f11, f7		C si = src[i] * inverse
+	br.cloop.sptk.few.clr .Ltop
+	;;
+
+
+
+	xma.l	f10 = f9, f8, f10	C q = c * -inverse + si
+	mov	ar.lc = r2		C I0
+	;;
+	xma.hu	f9 = f10, f6, f9	C c = high(q * divisor + c)
+	;;
+	getf.sig r8 = f9		C M2  return c
+	br.ret.sptk.many b0
+
+EPILOGUE()
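As a cross-check of the loop comments (`q = c * -inverse + si`, `c = high(q * divisor + c)`), the whole function can be modelled in C with `unsigned __int128` supplying the high product. This is a sketch under the assumption of 64-bit limbs, not GMP's implementation; the names `binv64` and `modexact_1c_odd_model` are invented, and the inverse is seeded arithmetically instead of from the table. The result is zero exactly when the divisor divides {src,n} minus the input carry.

```c
#include <stddef.h>
#include <stdint.h>

/* Inverse of odd d modulo 2^64, seeded with d itself (3 correct bits,
   since d*d == 1 mod 8) and refined by Newton doubling steps. */
static uint64_t binv64(uint64_t d)
{
    uint64_t inv = d;
    for (int i = 0; i < 5; i++)      /* 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits */
        inv *= 2 - d * inv;
    return inv;
}

/* Model of the dependent chain: per limb, si = src[i]*inverse is computed
   off the chain, then q = si - c*inverse and c = high(q*divisor + c). */
static uint64_t modexact_1c_odd_model(const uint64_t *src, size_t n,
                                      uint64_t divisor, uint64_t c)
{
    uint64_t inv = binv64(divisor);
    for (size_t i = 0; i < n; i++) {
        uint64_t si = src[i] * inv;                 /* xmpy.l              */
        uint64_t q  = si - c * inv;                 /* xma.l with -inverse */
        c = (uint64_t) (((unsigned __int128) q * divisor + c) >> 64);  /* xma.hu */
    }
    return c;
}
```

Since `q*divisor` reproduces `src[i] - c` in its low limb, the low halves cancel and only the high half propagates, which is why the loop needs no compare or add-back to handle the borrow.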