Improve udivmod Performance #16

Open
nlordell opened this issue May 7, 2022 · 5 comments · Fixed by #18

Comments

nlordell (Owner) commented May 7, 2022

It looks like udivmod performance can be further improved when dividing two "large" values (i.e. each greater than u128::MAX).

Specifically, the algorithm in primitive-types performs significantly better for these larger numbers.

See #15 (comment) and #15 (comment) for more details and additional resources on fast integer division algorithms (including the libgmp paper).

nlordell (Owner) commented May 8, 2022

Currently, I believe the best way forward is to port the code from intx.

nlordell (Owner) commented May 18, 2022

Ported 128-bit integer division from intx to compare it with Rust's u128 type:

lib.rs
const fn reciprocal_table_item(d9: u8) -> u16 {
    (0x7fd00 / (0x100 | (d9 as u32))) as _
}

const fn reciprocal_table_items() -> [u16; 256] {
    let mut table = [0; 256];

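    // Fill all 256 entries by macro-based unrolling, since `for` loops are
    // not available in `const fn`s.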
    macro_rules! repeat {
        (4; $x:expr) => {
            table[($x) + 0] = reciprocal_table_item(($x) + 0);
            table[($x) + 1] = reciprocal_table_item(($x) + 1);
            table[($x) + 2] = reciprocal_table_item(($x) + 2);
            table[($x) + 3] = reciprocal_table_item(($x) + 3);
        };
        (32; $x:expr) => {
            repeat!(4; ($x) + (4 * 0));
            repeat!(4; ($x) + (4 * 1));
            repeat!(4; ($x) + (4 * 2));
            repeat!(4; ($x) + (4 * 3));
            repeat!(4; ($x) + (4 * 4));
            repeat!(4; ($x) + (4 * 5));
            repeat!(4; ($x) + (4 * 6));
            repeat!(4; ($x) + (4 * 7));
        };
        (256) => {
            repeat!(32; 32 * 0);
            repeat!(32; 32 * 1);
            repeat!(32; 32 * 2);
            repeat!(32; 32 * 3);
            repeat!(32; 32 * 4);
            repeat!(32; 32 * 5);
            repeat!(32; 32 * 6);
            repeat!(32; 32 * 7);
        };
    }
    repeat!(256);

    table
}

/// Reciprocal lookup table.
const RECIPROCAL_TABLE: [u16; 256] = reciprocal_table_items();

/// Full unsigned multiplication 64 x 64 -> 128.
#[inline(always)]
const fn umul(x: u64, y: u64) -> u128 {
    (x as u128) * (y as u128)
}

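/// Splits a `u128` into its low and high 64-bit words, in that order.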
#[inline(always)]
const fn u128w(x: u128) -> [u64; 2] {
    //mem::transmute(x)
    [x as _, (x >> 64) as _]
}

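/// Joins low and high 64-bit words back into a `u128`.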
#[inline(always)]
const fn u128v(lo: u64, hi: u64) -> u128 {
    (lo as u128) + ((hi as u128) << 64)
}

/// Computes the reciprocal (2^128 - 1) / d - 2^64 for normalized d.
///
/// Based on Algorithm 2 from "Improved division by invariant integers".
#[inline]
fn reciprocal_2by1(d: u64) -> u64 {
    debug_assert!(d & 0x8000000000000000 != 0, "must be normalized");

    let d9 = d >> 55;
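    // d is normalized, so the top 9 bits are in 256..=511 and the table
    // lookup below is in bounds.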
    let v0 = RECIPROCAL_TABLE[(d9 - 256) as usize] as u64;
    //let v0 = unsafe { RECIPROCAL_TABLE.get_unchecked((d9 - 256) as usize) } as u64;

    let d40 = (d >> 24) + 1;
    let v1: u64 = ((v0 as u64) << 11)
        .wrapping_sub((v0 * v0).wrapping_mul(d40) >> 40)
        .wrapping_sub(1);

    let v2 = (v1 << 13).wrapping_add(
        v1.wrapping_mul(0x1000000000000000_u64.wrapping_sub(v1.wrapping_mul(d40))) >> 47,
    );

    let d0 = d & 1;
    let d63 = (d >> 1) + d0; // ceil(d/2)
    let e = ((v2 >> 1) & 0_u64.wrapping_sub(d0)).wrapping_sub(v2.wrapping_mul(d63));
    let v3 = (u128w(umul(v2, e))[1] >> 1).wrapping_add(v2 << 31);

    v3.wrapping_sub(u128w(umul(v3, d).wrapping_add(d as _))[1])
        .wrapping_sub(d)
}

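/// Computes the reciprocal for a normalized 128-bit divisor `d`, for use by
/// `udivrem_3by2`.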
#[inline]
fn reciprocal_3by2(d: u128) -> u64 {
    let mut v = reciprocal_2by1(u128w(d)[1]);
    let mut p = u128w(d)[1].wrapping_mul(v);
    p = p.wrapping_add(u128w(d)[0]);
    if p < u128w(d)[0] {
        v = v.wrapping_sub(1);
        if p >= u128w(d)[1] {
            v = v.wrapping_sub(1);
            p = p.wrapping_sub(u128w(d)[1]);
        }
        p = p.wrapping_sub(u128w(d)[1]);
    }

    let t = umul(v, u128w(d)[0]);

    p = p.wrapping_add(u128w(t)[1]);
    if p < u128w(t)[1] {
        v = v.wrapping_sub(1);
        if p >= u128w(d)[1] && (p > u128w(d)[1] || u128w(t)[0] >= u128w(d)[0]) {
            v = v.wrapping_sub(1);
        }
    }
    v
}

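/// Quotient and remainder returned by the division routines.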
struct DivResult<Q, R = Q> {
    quot: Q,
    rem: R,
}

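/// Divides the 128-bit numerator `u` by the normalized 64-bit divisor `d`,
/// given its precomputed reciprocal `v`.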
#[inline]
fn udivrem_2by1(u: u128, d: u64, v: u64) -> DivResult<u64> {
    let mut q = umul(v, u128w(u)[1]);
    q = q.wrapping_add(u);

    q = q.wrapping_add(1 << 64);

    let mut r = u128w(u)[0].wrapping_sub(u128w(q)[1].wrapping_mul(d));

    if r > u128w(q)[0] {
        q = q.wrapping_sub(1 << 64);
        r = r.wrapping_add(d);
    }

    if r >= d {
        q = q.wrapping_add(1 << 64);
        r = r.wrapping_sub(d);
    }

    DivResult {
        quot: u128w(q)[1],
        rem: r,
    }
}

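/// Divides the 192-bit numerator `(u2, u1, u0)` by the normalized 128-bit
/// divisor `d`, given its precomputed reciprocal `v`.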
#[inline]
fn udivrem_3by2(u2: u64, u1: u64, u0: u64, d: u128, v: u64) -> DivResult<u64, u128> {
    let mut q = umul(v, u2);
    q = q.wrapping_add(u128v(u1, u2));

    let mut r1 = u1.wrapping_sub(u128w(q)[1].wrapping_mul(u128w(d)[1]));

    let t = umul(u128w(d)[0], u128w(q)[1]);

    let mut r = u128v(u0, r1).wrapping_sub(t).wrapping_sub(d);
    r1 = u128w(r)[1];

    q = q.wrapping_add(1 << 64);

    if r1 >= u128w(q)[0] {
        q = q.wrapping_sub(1 << 64);
        r = r.wrapping_add(d);
    }

    if r >= d {
        q = q.wrapping_add(1 << 64);
        r = r.wrapping_sub(d);
    }

    DivResult {
        quot: u128w(q)[1],
        rem: r,
    }
}

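/// Computes quotient and remainder of a 128-bit by 128-bit division,
/// normalizing the divisor and dispatching to the 2-by-1 or 3-by-2 routines.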
#[inline]
fn udivrem(x: u128, y: u128) -> DivResult<u128> {
    if u128w(y)[1] == 0 {
        assert!(u128w(y)[0] != 0, "division by 0");

        let lsh = u128w(y)[0].leading_zeros();
        let rsh = (64 - lsh) % 64;
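        // For lsh == 0, a shift by 64 would overflow; rsh wraps to 0 instead
        // and rsh_mask discards the bits that would otherwise be shifted in.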
        let rsh_mask = ((lsh == 0) as u64).wrapping_sub(1);

        let yn = u128w(y)[0] << lsh;
        let xn_lo = u128w(x)[0] << lsh;
        let xn_hi = (u128w(x)[1] << lsh) | ((u128w(x)[0] >> rsh) & rsh_mask);
        let xn_ex = (u128w(x)[1] >> rsh) & rsh_mask;

        let v = reciprocal_2by1(yn);
        let res1 = udivrem_2by1(u128v(xn_hi, xn_ex), yn, v);
        let res2 = udivrem_2by1(u128v(xn_lo, res1.rem), yn, v);
        return DivResult {
            quot: u128v(res2.quot, res1.quot),
            rem: (res2.rem >> lsh) as u128,
        };
    }

    if u128w(y)[1] > u128w(x)[1] {
        return DivResult { quot: 0, rem: x };
    }

    let lsh = u128w(y)[1].leading_zeros();
    if lsh == 0 {
        let q = ((u128w(y)[1] < u128w(x)[1]) as u128) | ((u128w(y)[0] <= u128w(x)[0]) as u128);
        return DivResult {
            quot: q,
            rem: x.wrapping_sub(if q != 0 { y } else { 0 }),
        };
    }

    let rsh = 64 - lsh;

    let yn_lo = u128w(y)[0] << lsh;
    let yn_hi = (u128w(y)[1] << lsh) | (u128w(y)[0] >> rsh);
    let xn_lo = u128w(x)[0] << lsh;
    let xn_hi = (u128w(x)[1] << lsh) | (u128w(x)[0] >> rsh);
    let xn_ex = u128w(x)[1] >> rsh;

    let v = reciprocal_3by2(u128v(yn_lo, yn_hi));
    let res = udivrem_3by2(xn_ex, xn_hi, xn_lo, u128v(yn_lo, yn_hi), v);

    DivResult {
        quot: res.quot as _,
        rem: res.rem >> lsh,
    }
}

#[inline]
pub fn div(x: u128, y: u128) -> u128 {
    udivrem(x, y).quot
}

And benchmarked it:

udiv128/intx/1          time:   [655.94 ps 661.79 ps 668.15 ps]                                                                             
udiv128/llvm/1          time:   [656.48 ps 663.42 ps 671.02 ps]

So they have pretty much equivalent performance; i.e., there is no need to go replacing 128-bit division anywhere - phew.

Will continue porting 256-bit integer division.
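
For reference, here is a minimal sketch of how this comparison could be reproduced with criterion. The group and function names are assumed to correspond to the numbers above ("intx" being the ported div function, "llvm" being native u128 division); the actual benchmark harness may differ.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn udiv128(c: &mut Criterion) {
    // Assumes the `div` function from the snippet above is in scope.
    let mut group = c.benchmark_group("udiv128");
    let (x, y) = (0xfedc_ba98_7654_3210_0123_4567_89ab_cdef_u128, 3_u128);

    // Ported intx division (the `div` function above).
    group.bench_function("intx/1", |b| b.iter(|| div(black_box(x), black_box(y))));
    // Native `u128` division, which LLVM lowers to a compiler builtin.
    group.bench_function("llvm/1", |b| b.iter(|| black_box(x) / black_box(y)));

    group.finish();
}

criterion_group!(benches, udiv128);
criterion_main!(benches);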

nlordell (Owner) commented Aug 3, 2022

Re-opening this issue. Integer division performance was greatly improved with #18, but there are still some cases that can be improved. Specifically, intx provides a faster u256 division implementation that would improve speed in the following cases:

  • #### / ...# faster
  • .### / ..## faster

(Note: in this notation, . represents a 64-bit half-word with a value of 0 and # a half-word with a non-zero value. For example, .### / ..## denotes dividing a value that fits in 192 bits by one that fits in 128 bits.)

nlordell reopened this Aug 3, 2022
nlordell (Owner) commented:

We should port the ruint division implementation, as it performs much better in the following cases:

U256::div/####/####
U256::div/####/.###
U256::div/####/..##
U256::div/####/...#
U256::div/.###/.###
U256::div/.###/..##
U256::div/.###/...#

Note that, in general, ruint scored worse on Apple Silicon, including in the following division cases:

U256::div/..##/..##
U256::div/..##/...#
U256::div/...#/...#

nlordell (Owner) commented:

Reference to the ruint division code: https://github.com/recmo/uint/blob/main/src/algorithms/div/mod.rs
