
array::zip in combination with array::map optimises very poorly #103555

Closed
AuroransSolis opened this issue Oct 26, 2022 · 4 comments · Fixed by #112096
Labels
A-array Area: `[T; N]` A-codegen Area: Code generation A-const-generics Area: const generics (parameters and arguments) C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

AuroransSolis commented Oct 26, 2022

Today for no reason in particular I was curious about the behaviour of arrays when zipped together and then mapped into a single array, roughly:

fn zip_with<T, U, V, F: FnMut(T, U) -> V, const N: usize>(a: [T; N], b: [U; N], mut f: F) -> [V; N] {
    a.zip(b).map(|(a, b)| f(a, b))
}

I wrote some tests for this and disassembled them in Godbolt to see what was going on behind the scenes, and there was a whole lot going on that seemed like it didn't need to be. So I wrote something kinda like what's in the stdlib for mapping and zipping arrays, just without the iterators, to see if I could do any better. Turns out I absolutely can:

use std::mem::{forget, ManuallyDrop, MaybeUninit};
use std::ptr;

struct ZipWithGuard<T, U, const N: usize> {
    a: *mut [T; N],
    b: *mut [U; N],
    moved: usize,
}

impl<T, U, const N: usize> Drop for ZipWithGuard<T, U, N> {
    fn drop(&mut self) {
        for i in self.moved + 1..N {
            unsafe {
                drop(self.a.cast::<T>().add(i).read());
                drop(self.b.cast::<U>().add(i).read());
            }
        }
    }
}

pub fn zip_with_guarded<T, U, V, F: FnMut(T, U) -> V, const N: usize>(
    a: [T; N],
    b: [U; N],
    mut f: F,
) -> [V; N] {
    let aref = &a as *const _ as *mut [ManuallyDrop<T>; N];
    forget(a);
    let bref = &b as *const _ as *mut [ManuallyDrop<U>; N];
    forget(b);
    let mut guard = ZipWithGuard {
        a: aref,
        b: bref,
        moved: 0,
    };
    let mut out: MaybeUninit<[V; N]> = MaybeUninit::uninit();
    let out_ptr = out.as_mut_ptr().cast::<MaybeUninit<V>>();
    while guard.moved < N {
        let i = guard.moved;
        unsafe {
            *out_ptr.add(i) = MaybeUninit::new(f(
                ptr::read(guard.a.cast::<T>().add(i)),
                ptr::read(guard.b.cast::<U>().add(i)),
            ));
        }
        guard.moved += 1;
    }
    unsafe { out.assume_init() }
}

Maybe the design isn't as optimal (or as safe) as it could be, but it performs perfectly well in benchmarks (read: as fast as an unguarded version, and anywhere between 3-25* times faster than the zip_with definition shown at the top of this issue). Here's a rough table of execution times averaged over the primitive ops (+, -, *, &, |, ^):
[Image: table of benchmark execution times]

Not really liking how things are looking for .zip().map(). So the question is: is there something in Rust that needs to change to let these optimisations happen, or do I need to look at putting this in a library somewhere (or maybe the stdlib, wink wink nudge nudge)?

*The 25x comes from the [u8; 8] benchmarks. zip_with benches at 11.230ns, and zip_with_guarded benches at 452.593ps, which seems suspicious but I can't seem to find anything wrong with the result.
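A sub-nanosecond timing often indicates that the compiler constant-folded the benchmarked call away. One guard against this is to route the inputs through std::hint::black_box as well as the output. A minimal stable-Rust sketch of the pattern (illustrative function names, not taken from the benchmark below):

```rust
use std::hint::black_box;

// Illustrative stand-in for the benchmarked function.
pub fn zip_add(a: [u8; 8], b: [u8; 8]) -> [u8; 8] {
    let mut out = [0u8; 8];
    for i in 0..8 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

// Hiding both inputs and output from the optimizer prevents the whole
// computation from being folded to a constant at compile time.
pub fn measured_once() -> [u8; 8] {
    let a = black_box([1u8; 8]);
    let b = black_box([2u8; 8]);
    black_box(zip_add(a, b))
}
```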

rustbot added A-array Area: `[T; N]` A-const-generics Area: const generics (parameters and arguments) C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 26, 2022
inquisitivecrystal added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Oct 26, 2022
the8472 (Member) commented Oct 26, 2022

I suspect the issue here is that array::zip has the wrong shape. It returns [(T, U); N] but vector instructions generally want ([T; N], [U; N]).

a.zip(b).map(f) isn't an iterator type that describes the desired iteration and then performs it once; rather, it's one loop that creates an intermediate result and then a second loop that produces the final output.
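The two shapes can be sketched in stable Rust with plain loops (illustrative, with a fixed length and element type):

```rust
// The shape array::zip + array::map lowers to: loop 1 materializes the
// intermediate [(T, U); N], loop 2 consumes it.
pub fn zip_then_map(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let mut pairs = [(0u32, 0u32); 4];
    for i in 0..4 {
        pairs[i] = (a[i], b[i]);
    }
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = pairs[i].0 + pairs[i].1;
    }
    out
}

// The shape the optimizer wants: one fused loop, no intermediate array.
pub fn zip_with_fused(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
    out
}
```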

Perhaps we could massage the code into a shape that the optimizer can disentangle, but I think it would be better to offer something like your zip_with_guarded, albeit with a nicer name.
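One such shape is available on stable Rust (1.63+) via std::array::from_fn, which builds the output directly and avoids the intermediate array of pairs. A minimal sketch, restricted to Copy element types for simplicity (the guarded version above is what handles non-Copy types and panics in `f`):

```rust
// Fused zip-with for Copy elements: from_fn produces each output element
// in a single pass, so no [(T, U); N] intermediate is ever materialized.
pub fn zip_with_from_fn<T: Copy, U: Copy, V, F: FnMut(T, U) -> V, const N: usize>(
    a: [T; N],
    b: [U; N],
    mut f: F,
) -> [V; N] {
    std::array::from_fn(|i| f(a[i], b[i]))
}
```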

nikic (Contributor) commented Dec 21, 2022

This is missing an actual reproducer...

Tried a few sizes here: https://rust.godbolt.org/z/Kv7sTcTec I assume the report is about the case where this doesn't get unrolled, like with N=128?

AuroransSolis (Author) commented Dec 22, 2022

I'm not entirely sure what you're looking for in a reproducer, but I have a benchmark program set up using criterion to check out the performance of the various functions (see the "Benchmark" section at the bottom of this post). More or less I have zip_with_guarded as shown above (renamed to zip_with), and both of these:

fn zip_map_fl<T: Copy, U: Copy, V, F: FnMut(T, U) -> V, const N: usize>(a: [T; N], b: [U; N], mut f: F) -> [V; N] {
    let mut out: [MaybeUninit<V>; N] = MaybeUninit::uninit_array();
    for i in 0..N {
        out[i] = MaybeUninit::new(f(a[i], b[i]));
    }
    unsafe {
        MaybeUninit::array_assume_init(out)
    }
}

fn zip_map_std<T, U, V, F: Fn(T, U) -> V, const N: usize>(a: [T; N], b: [U; N], f: F) -> [V; N] {
    a.zip(b).map(|(a, b)| f(a, b))
}

Then I set up a lot of benchmarks. These test the speed of some primitive operations (+, -, *, &, |, ^) on arrays of lengths 8, 16, 32, 64, 128, 256, and 512 for types u16, u32, and u64, and lengths up through 128 for u8. This gives a total of 156 benchmarks per function, which reduces down to 26 per function after the primitive tests are averaged for each function, length, and type. The results of this are:
[Image: benchmark results across all lengths and types]
This was run on the latest nightly as of 2022-12-21. I think this is pretty hard evidence that performance using .zip(...).map(...) is not where it should be. zip_with(...) and zip_map_fl(...) are grouped so close together in the bottom section of the graph you can't visually see the difference. But for zip_map_std(...)? It's very obviously slower than the other two. Notice how it takes more time to map together two [u8; 128] than it takes the other methods to map together two [u64; 256]. Even if we just look at shorter inputs like was suggested, it doesn't look good for .zip(...).map(...):
[Image: benchmark results for lengths up through 64]
These are just the results for lengths up through 64, and we can pretty clearly see that .zip(...).map(...) is just always slower.

Benchmark

bench.rs:

#![allow(dead_code)]
#![feature(array_zip, maybe_uninit_array_assume_init, maybe_uninit_uninit_array)]

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::{rngs::OsRng, Rng};
use std::mem::{forget, needs_drop, ManuallyDrop, MaybeUninit};

macro_rules! defbenches {
    ($c:ident, $fn:path, $intype:ty, [$($len:literal),+$(,)?]) => {{
        $(defbenches!(@ops $c, $fn, $intype, $len);)+
    }};
    (@ops $c:ident, $fn:path, $intype:ty, $len:literal) => {{
        let a: [$intype; $len] = OsRng.gen();
        let b: [$intype; $len] = OsRng.gen();
        defbenches!(@final $c, $fn, $intype, $len, +, a, b);
        defbenches!(@final $c, $fn, $intype, $len, -, a, b);
        defbenches!(@final $c, $fn, $intype, $len, *, a, b);
        defbenches!(@final $c, $fn, $intype, $len, &, a, b);
        defbenches!(@final $c, $fn, $intype, $len, |, a, b);
        defbenches!(@final $c, $fn, $intype, $len, ^, a, b);
    }};
    (@final $c:ident, $fn:path, $intype:ty, $len:literal, $op:tt, $a:ident, $b:ident) => {{
        $c.bench_function(
            concat!(stringify!($fn), " | [", stringify!($intype), "; ", stringify!($len), "] | ", stringify!($op)),
            |b| {
                b.iter(|| black_box($fn($a, $b, |a, b| a $op b)));
            },
        );
    }};
}

fn benchmark_primitives(c: &mut Criterion) {
    defbenches!(c, zip_with, u8, [8, 16, 32, 64, 128]);
    defbenches!(c, zip_map_std, u8, [8, 16, 32, 64, 128]);
    defbenches!(c, zip_map_fl, u8, [8, 16, 32, 64, 128]);

    defbenches!(c, zip_with, u16, [8, 16, 32, 64, 128, 256, 512]);
    defbenches!(c, zip_map_std, u16, [8, 16, 32, 64, 128, 256, 512]);
    defbenches!(c, zip_map_fl, u16, [8, 16, 32, 64, 128, 256, 512]);

    defbenches!(c, zip_with, u32, [8, 16, 32, 64, 128, 256, 512]);
    defbenches!(c, zip_map_std, u32, [8, 16, 32, 64, 128, 256, 512]);
    defbenches!(c, zip_map_fl, u32, [8, 16, 32, 64, 128, 256, 512]);

    defbenches!(c, zip_with, u64, [8, 16, 32, 64, 128, 256, 512]);
    defbenches!(c, zip_map_std, u64, [8, 16, 32, 64, 128, 256, 512]);
    defbenches!(c, zip_map_fl, u64, [8, 16, 32, 64, 128, 256, 512]);
}

struct ZipWithGuard<T, U, const N: usize> {
    a: *mut [T; N],
    b: *mut [U; N],
    moved: usize,
}

impl<T, U, const N: usize> Drop for ZipWithGuard<T, U, N> {
    fn drop(&mut self) {
        if needs_drop::<T>() || needs_drop::<U>() {
            for i in self.moved + 1..N {
                unsafe {
                    drop(self.a.cast::<T>().add(i).read());
                    drop(self.b.cast::<U>().add(i).read());
                }
            }
        }
    }
}

pub fn zip_with<T, U, V, F: FnMut(T, U) -> V, const N: usize>(
    a: [T; N],
    b: [U; N],
    mut f: F,
) -> [V; N] {
    let aref = &a as *const _ as *mut [ManuallyDrop<T>; N];
    forget(a);
    let bref = &b as *const _ as *mut [ManuallyDrop<U>; N];
    forget(b);
    let mut guard = ZipWithGuard {
        a: aref,
        b: bref,
        moved: 0,
    };
    let mut out: MaybeUninit<[V; N]> = MaybeUninit::uninit();
    let out_ptr = out.as_mut_ptr().cast::<MaybeUninit<V>>();
    while guard.moved < N {
        let i = guard.moved;
        unsafe {
            *out_ptr.add(i) = MaybeUninit::new(f(
                guard.a.cast::<T>().add(i).read(),
                guard.b.cast::<U>().add(i).read(),
            ));
        }
        guard.moved += 1;
    }
    unsafe { out.assume_init() }
}

fn zip_map_fl<T: Copy, U: Copy, V, F: FnMut(T, U) -> V, const N: usize>(a: [T; N], b: [U; N], mut f: F) -> [V; N] {
    let mut out: [MaybeUninit<V>; N] = MaybeUninit::uninit_array();
    for i in 0..N {
        out[i] = MaybeUninit::new(f(a[i], b[i]));
    }
    unsafe {
        MaybeUninit::array_assume_init(out)
    }
}

fn zip_map_std<T, U, V, F: Fn(T, U) -> V, const N: usize>(a: [T; N], b: [U; N], f: F) -> [V; N] {
    a.zip(b).map(|(a, b)| f(a, b))
}

criterion_group! {
    name = benches;
    config = Criterion::default();
    targets = benchmark_primitives
}

criterion_main! {
    benches
}

Cargo.toml

[dev-dependencies]
criterion = { version = "0.4.0", features = ["html_reports"] }
rand = { version = "0.8.5", features = ["min_const_gen"] }

[[bench]]
name = "bench"
path = "src/bench.rs"
harness = false

scottmcm (Member) commented Feb 7, 2023

Part of this is the classic "too many copies" trio

But some of it is probably just that, being eager, it won't optimize unless it fully unrolls.

workingjubilee added the A-codegen Area: Code generation label Mar 6, 2023
bors closed this as completed in 88160ab May 31, 2023
thomcc pushed a commit to tcdi/postgrestd that referenced this issue Aug 24, 2023
Remove array_zip

`[T; N]::zip` is "eager" but most zips are mapped. This causes poor optimization in generated code. This is a fundamental design issue and "zip" is "prime real estate" in terms of function names, so let's free it up again.

- FCP concluded in rust-lang/rust#80094 (comment)
- Closes rust-lang/rust#80094
- Closes rust-lang/rust#103555

Could use review to make sure we aren't losing any essential codegen tests.
r? `@scottmcm`
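For code migrating off array::zip, zipping the arrays as slices gives the fused single-loop shape directly; a minimal sketch:

```rust
// Iterating the arrays as zipped slices performs one pass with no
// intermediate array of pairs, which the optimizer vectorizes well.
pub fn add_arrays(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let mut out = [0u32; 8];
    for (o, (x, y)) in out.iter_mut().zip(a.iter().zip(b.iter())) {
        *o = x + y;
    }
    out
}
```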