
array::zip in combination with array::map optimises very poorly #103555

Closed
AuroransSolis opened this issue Oct 26, 2022 · 4 comments · Fixed by #112096
Labels
A-array Area: `[T; N]` A-codegen Area: Code generation A-const-generics Area: const generics (parameters and arguments) C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

AuroransSolis commented Oct 26, 2022

Today for no reason in particular I was curious about the behaviour of arrays when zipped together and then mapped into a single array, roughly:

fn zip_with<T, U, V, F: FnMut(T, U) -> V, const N: usize>(a: [T; N], b: [U; N], mut f: F) -> [V; N] {
    a.zip(b).map(|(a, b)| f(a, b))
}

I wrote some tests for this and disassembled them in Godbolt to see what was going on behind the scenes, and there was a whole lot going on that seemed like it didn't need to be. So I wrote something kinda like what's in the stdlib for mapping and zipping arrays, just without the iterators, to see if I could do any better. Turns out I absolutely can:

use std::mem::{forget, ManuallyDrop, MaybeUninit};
use std::ptr;

struct ZipWithGuard<T, U, const N: usize> {
    a: *mut [T; N],
    b: *mut [U; N],
    moved: usize,
}

impl<T, U, const N: usize> Drop for ZipWithGuard<T, U, N> {
    fn drop(&mut self) {
        for i in self.moved + 1..N {
            unsafe {
                drop(self.a.cast::<T>().add(i).read());
                drop(self.b.cast::<U>().add(i).read());
            }
        }
    }
}

pub fn zip_with_guarded<T, U, V, F: FnMut(T, U) -> V, const N: usize>(
    a: [T; N],
    b: [U; N],
    mut f: F,
) -> [V; N] {
    let aref = &a as *const _ as *mut [ManuallyDrop<T>; N];
    forget(a);
    let bref = &b as *const _ as *mut [ManuallyDrop<U>; N];
    forget(b);
    let mut guard = ZipWithGuard {
        a: aref,
        b: bref,
        moved: 0,
    };
    let mut out: MaybeUninit<[V; N]> = MaybeUninit::uninit();
    let out_ptr = out.as_mut_ptr().cast::<MaybeUninit<V>>();
    while guard.moved < N {
        let i = guard.moved;
        unsafe {
            *out_ptr.add(i) = MaybeUninit::new(f(
                ptr::read(guard.a.cast::<T>().add(i)),
                ptr::read(guard.b.cast::<U>().add(i)),
            ));
        }
        guard.moved += 1;
    }
    unsafe { out.assume_init() }
}

Maybe the design isn't as optimal (or as safe) as it could be, but it performs perfectly well in benchmarks (read: as fast as an unguarded version, and anywhere between 3-25* times faster than the zip_with definition shown at the top of this issue). Here's a rough table of execution times averaged over the primitive ops (+, -, *, &, |, ^):
[Image: table of benchmark execution times]

Not really liking how things are looking for .zip().map(). So the question is: is there something in Rust that needs to change to let these optimisations happen, or do I need to look at putting this in a library somewhere (or maybe the stdlib, wink wink nudge nudge)?

*The 25x comes from the [u8; 8] benchmarks. zip_with benches at 11.230ns, and zip_with_guarded benches at 452.593ps, which seems suspicious but I can't seem to find anything wrong with the result.
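A sub-nanosecond timing often indicates that the compiler constant-folded the benchmarked call away. One guard against this is to route the inputs through std::hint::black_box as well as the output. A minimal stable-Rust sketch of the pattern (illustrative function names, not taken from the benchmark below):

```rust
use std::hint::black_box;

// Illustrative stand-in for the benchmarked function.
pub fn zip_add(a: [u8; 8], b: [u8; 8]) -> [u8; 8] {
    let mut out = [0u8; 8];
    for i in 0..8 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

// Hiding both inputs and output from the optimizer prevents the whole
// computation from being folded to a constant at compile time.
pub fn measured_once() -> [u8; 8] {
    let a = black_box([1u8; 8]);
    let b = black_box([2u8; 8]);
    black_box(zip_add(a, b))
}
```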

rustbot added A-array Area: `[T; N]` A-const-generics Area: const generics (parameters and arguments) C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 26, 2022
inquisitivecrystal added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Oct 26, 2022
the8472 (Member) commented Oct 26, 2022

I suspect the issue here is that array::zip has the wrong shape. It returns [(T, U); N] but vector instructions generally want ([T; N], [U; N]).

a.zip(b).map(f) isn't an iterator type that describes the desired iteration and then performs it once; rather, it's one loop that creates an intermediate result and then a second loop that produces the final output.
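The two shapes can be sketched in stable Rust with plain loops (illustrative, with a fixed length and element type):

```rust
// The shape array::zip + array::map lowers to: loop 1 materializes the
// intermediate [(T, U); N], loop 2 consumes it.
pub fn zip_then_map(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let mut pairs = [(0u32, 0u32); 4];
    for i in 0..4 {
        pairs[i] = (a[i], b[i]);
    }
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = pairs[i].0 + pairs[i].1;
    }
    out
}

// The shape the optimizer wants: one fused loop, no intermediate array.
pub fn zip_with_fused(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
    out
}
```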

Perhaps we could massage the code into a shape that the optimizer can disentangle, but I think it would be better to offer something like your zip_with_guarded, albeit with a nicer name.
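One such shape is available on stable Rust (1.63+) via std::array::from_fn, which builds the output directly and avoids the intermediate array of pairs. A minimal sketch, restricted to Copy element types for simplicity (the guarded version above is what handles non-Copy types and panics in `f`):

```rust
// Fused zip-with for Copy elements: from_fn produces each output element
// in a single pass, so no [(T, U); N] intermediate is ever materialized.
pub fn zip_with_from_fn<T: Copy, U: Copy, V, F: FnMut(T, U) -> V, const N: usize>(
    a: [T; N],
    b: [U; N],
    mut f: F,
) -> [V; N] {
    std::array::from_fn(|i| f(a[i], b[i]))
}
```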

nikic (Contributor) commented Dec 21, 2022

This is missing an actual reproducer...

Tried a few sizes here: https://rust.godbolt.org/z/Kv7sTcTec I assume the report is about the case where this doesn't get unrolled, like with N=128?

AuroransSolis (Author) commented Dec 22, 2022

I'm not entirely sure what you're looking for in a reproducer, but I have a benchmark program set up using criterion to check out the performance of the various functions (see the "Benchmark" section at the bottom of this post). More or less I have zip_with_guarded as shown above (renamed to zip_with), and both of these:

fn zip_map_fl<T: Copy, U: Copy, V, F: FnMut(T, U) -> V, const N: usize>(a: [T; N], b: [U; N], mut f: F) -> [V; N] {
    let mut out: [MaybeUninit<V>; N] = MaybeUninit::uninit_array();
    for i in 0..N {
        out[i] = MaybeUninit::new(f(a[i], b[i]));
    }
    unsafe {
        MaybeUninit::array_assume_init(out)
    }
}

fn zip_map_std<T, U, V, F: Fn(T, U) -> V, const N: usize>(a: [T; N], b: [U; N], f: F) -> [V; N] {
    a.zip(b).map(|(a, b)| f(a, b))
}

Then I set up a lot of benchmarks. These test the speed of some primitive operations (+, -, *, &, |, ^) on arrays of lengths 8, 16, 32, 64, 128, 256, and 512 for types u16, u32, and u64, and lengths up through 128 for u8. This gives a total of 156 benchmarks per function, which reduces down to 26 per function after the primitive tests are averaged for each function, length, and type. The results of this are:
[Image: benchmark results across all lengths and types]
This was run on the latest nightly as of 2022-12-21. I think this is pretty hard evidence that performance using .zip(...).map(...) is not where it should be. zip_with(...) and zip_map_fl(...) are grouped so close together in the bottom section of the graph you can't visually see the difference. But for zip_map_std(...)? It's very obviously slower than the other two. Notice how it takes more time to map together two [u8; 128] than it takes the other methods to map together two [u64; 256]. Even if we just look at shorter inputs like was suggested, it doesn't look good for .zip(...).map(...):
[Image: benchmark results for lengths up through 64]
These are just the results for lengths up through 64, and we can pretty clearly see that .zip(...).map(...) is just always slower.

Benchmark

bench.rs:

#![allow(dead_code)]
#![feature(array_zip, maybe_uninit_array_assume_init, maybe_uninit_uninit_array)]

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::{rngs::OsRng, Rng};
use std::mem::{forget, needs_drop, ManuallyDrop, MaybeUninit};

macro_rules! defbenches {
    ($c:ident, $fn:path, $intype:ty, [$($len:literal),+$(,)?]) => {{
        $(defbenches!(@ops $c, $fn, $intype, $len);)+
    }};
    (@ops $c:ident, $fn:path, $intype:ty, $len:literal) => {{
        let a: [$intype; $len] = OsRng.gen();
        let b: [$intype; $len] = OsRng.gen();
        defbenches!(@final $c, $fn, $intype, $len, +, a, b);
        defbenches!(@final $c, $fn, $intype, $len, -, a, b);
        defbenches!(@final $c, $fn, $intype, $len, *, a, b);
        defbenches!(@final $c, $fn, $intype, $len, &, a, b);
        defbenches!(@final $c, $fn, $intype, $len, |, a, b);
        defbenches!(@final $c, $fn, $intype, $len, ^, a, b);
    }};
    (@final $c:ident, $fn:path, $intype:ty, $len:literal, $op:tt, $a:ident, $b:ident) => {{
        $c.bench_function(
            concat!(stringify!($fn), " | [", stringify!($intype), "; ", stringify!($len), "] | ", stringify!($op)),
            |b| {
                b.iter(|| black_box($fn($a, $b, |a, b| a $op b)));
            },
        );
    }};
}

fn benchmark_primitives(c: &mut Criterion) {
    defbenches!(c, zip_with, u8, [8, 16, 32, 64, 128]);
    defbenches!(c, zip_map_std, u8, [8, 16, 32, 64, 128]);
    defbenches!(c, zip_map_fl, u8, [8, 16, 32, 64, 128]);

    defbenches!(c, zip_with, u16, [8, 16, 32, 64, 128, 256, 512]);
    defbenches!(c, zip_map_std, u16, [8, 16, 32, 64, 128, 256, 512]);
    defbenches!(c, zip_map_fl, u16, [8, 16, 32, 64, 128, 256, 512]);

    defbenches!(c, zip_with, u32, [8, 16, 32, 64, 128, 256, 512]);
    defbenches!(c, zip_map_std, u32, [8, 16, 32, 64, 128, 256, 512]);
    defbenches!(c, zip_map_fl, u32, [8, 16, 32, 64, 128, 256, 512]);

    defbenches!(c, zip_with, u64, [8, 16, 32, 64, 128, 256, 512]);
    defbenches!(c, zip_map_std, u64, [8, 16, 32, 64, 128, 256, 512]);
    defbenches!(c, zip_map_fl, u64, [8, 16, 32, 64, 128, 256, 512]);
}

struct ZipWithGuard<T, U, const N: usize> {
    a: *mut [T; N],
    b: *mut [U; N],
    moved: usize,
}

impl<T, U, const N: usize> Drop for ZipWithGuard<T, U, N> {
    fn drop(&mut self) {
        if needs_drop::<T>() || needs_drop::<U>() {
            for i in self.moved + 1..N {
                unsafe {
                    drop(self.a.cast::<T>().add(i).read());
                    drop(self.b.cast::<U>().add(i).read());
                }
            }
        }
    }
}

pub fn zip_with<T, U, V, F: FnMut(T, U) -> V, const N: usize>(
    a: [T; N],
    b: [U; N],
    mut f: F,
) -> [V; N] {
    let aref = &a as *const _ as *mut [ManuallyDrop<T>; N];
    forget(a);
    let bref = &b as *const _ as *mut [ManuallyDrop<U>; N];
    forget(b);
    let mut guard = ZipWithGuard {
        a: aref,
        b: bref,
        moved: 0,
    };
    let mut out: MaybeUninit<[V; N]> = MaybeUninit::uninit();
    let out_ptr = out.as_mut_ptr().cast::<MaybeUninit<V>>();
    while guard.moved < N {
        let i = guard.moved;
        unsafe {
            *out_ptr.add(i) = MaybeUninit::new(f(
                guard.a.cast::<T>().add(i).read(),
                guard.b.cast::<U>().add(i).read(),
            ));
        }
        guard.moved += 1;
    }
    unsafe { out.assume_init() }
}

fn zip_map_fl<T: Copy, U: Copy, V, F: FnMut(T, U) -> V, const N: usize>(a: [T; N], b: [U; N], mut f: F) -> [V; N] {
    let mut out: [MaybeUninit<V>; N] = MaybeUninit::uninit_array();
    for i in 0..N {
        out[i] = MaybeUninit::new(f(a[i], b[i]));
    }
    unsafe {
        MaybeUninit::array_assume_init(out)
    }
}

fn zip_map_std<T, U, V, F: Fn(T, U) -> V, const N: usize>(a: [T; N], b: [U; N], f: F) -> [V; N] {
    a.zip(b).map(|(a, b)| f(a, b))
}

criterion_group! {
    name = benches;
    config = Criterion::default();
    targets = benchmark_primitives
}

criterion_main! {
    benches
}

Cargo.toml

[dev-dependencies]
criterion = { version = "0.4.0", features = ["html_reports"] }
rand = { version = "0.8.5", features = ["min_const_gen"] }

[[bench]]
name = "bench"
path = "src/bench.rs"
harness = false

scottmcm (Member) commented Feb 7, 2023

Part of this is the classic "too many copies" trio

But some of it is probably just that, being eager, it won't optimize unless it fully unrolls.

workingjubilee added the A-codegen Area: Code generation label Mar 6, 2023
bors closed this as completed in 88160ab May 31, 2023
thomcc pushed a commit to tcdi/postgrestd that referenced this issue Aug 24, 2023
Remove array_zip

`[T; N]::zip` is "eager" but most zips are mapped. This causes poor optimization in generated code. This is a fundamental design issue and "zip" is "prime real estate" in terms of function names, so let's free it up again.

- FCP concluded in rust-lang/rust#80094 (comment)
- Closes rust-lang/rust#80094
- Closes rust-lang/rust#103555

Could use review to make sure we aren't losing any essential codegen tests.
r? `@scottmcm`
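For code migrating off array::zip, zipping the arrays as slices gives the fused single-loop shape directly; a minimal sketch:

```rust
// Iterating the arrays as zipped slices performs one pass with no
// intermediate array of pairs, which the optimizer vectorizes well.
pub fn add_arrays(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let mut out = [0u32; 8];
    for (o, (x, y)) in out.iter_mut().zip(a.iter().zip(b.iter())) {
        *o = x + y;
    }
    out
}
```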