Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

array::map stack usage #86912

Open
programmerjake opened this issue Jul 6, 2021 · 12 comments
Open

array::map stack usage #86912

programmerjake opened this issue Jul 6, 2021 · 12 comments
Labels
A-array Area: `[T; N]` A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-heavy Issue: Problems and improvements with respect to binary size of generated code. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@programmerjake
Copy link
Member

I don't know if this is the right place for this, but I'd like to report that array::map uses the stack excessively and operating on larger arrays result in unexpected stack overflows.

Minimal example: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=2d96e8f66b3432a8d1ecf0ca4331f85a

In the example, at some point more than 16 times more stack is used than the original array size.

As far as I've looked into the actual code, the only thing that stood out to me is use of core::mem::transmute_copy instead of core::mem::transmute in core::array::IntoIter::new, but I doubt it's the reason for this.

Originally posted by @PonasKovas in #75243 (comment)

@jonas-schievink jonas-schievink added the I-heavy Issue: Problems and improvements with respect to binary size of generated code. label Jul 6, 2021
@programmerjake
Copy link
Member Author

I did some analysis for the similar code:
https://rust.godbolt.org/z/vssb3dEqx

#![feature(array_map)]

pub fn main() {
    (|| {
        [0u8; 128 * 1024].map(|e| e + 1);
        // Stack overflows, meaning the stack usage was more than 16 times the size of the array!
    })();
}

After some sed magic (I'm on my phone and rustup can't find any toolchains for aarch64-linux-android)...

Functions sorted by stack usage:

917528 core::ptr::read_unaligned::<[core::mem::maybe_uninit::MaybeUninit<u8>; 131072]>
655384 core::ptr::read::<[u8; 131072]>
655384 core::ptr::read::<[core::mem::maybe_uninit::MaybeUninit<u8>; 131072]>
655368 <core::iter::adapters::map::Map<core::array::iter::IntoIter<u8, 131072>, example::main::{closure#0}::{closure#0}>>::new
655368 <core::array::iter::IntoIter<u8, 131072> as core::iter::traits::iterator::Iterator>::map::<u8, example::main::{closure#0}::{closure#0}>
393224 core::array::collect_into_array_unchecked::<core::iter::adapters::map::Map<core::array::iter::IntoIter<u8, 131072>, example::main::{closure#0}::{closure#0}>, 131072>
131096 core::mem::forget::<[u8; 131072]>
131088 example::main::{closure#0}
56 core::mem::transmute_copy::<[u8; 131072], [core::mem::maybe_uninit::MaybeUninit<u8>; 131072]>
56 <usize as core::slice::index::SliceIndex<[core::mem::maybe_uninit::MaybeUninit<u8>]>>::get_unchecked
56 <core::option::Option<usize>>::map::<u8, <core::array::iter::IntoIter<u8, 131072> as core::iter::traits::iterator::Iterator>::next::{closure#0}>
56 <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[core::mem::maybe_uninit::MaybeUninit<u8>]>>::get_unchecked_mut
56 <core::ops::range::Range<usize> as core::iter::range::RangeIteratorImpl>::spec_next
40 core::ptr::slice_from_raw_parts_mut::<u8>
40 core::ptr::slice_from_raw_parts_mut::<core::mem::maybe_uninit::MaybeUninit<u8>>
40 <core::option::Option<u8>>::map::<u8, &mut example::main::{closure#0}::{closure#0}>
40 <core::array::iter::IntoIter<u8, 131072>>::as_mut_slice
40 <core::array::iter::IntoIter<u8, 131072> as core::iter::traits::iterator::Iterator>::next
40 <[core::mem::maybe_uninit::MaybeUninit<u8>]>::get_unchecked_mut::<core::ops::range::Range<usize>>
32 core::ptr::read::<usize>
32 core::ptr::metadata::from_raw_parts_mut::<[u8]>
32 core::ptr::metadata::from_raw_parts_mut::<[core::mem::maybe_uninit::MaybeUninit<u8>]>
32 <usize as core::slice::index::SliceIndex<[core::mem::maybe_uninit::MaybeUninit<u8>]>>::get_unchecked_mut
24 core::ptr::read::<u8>
24 <core::ops::range::Range<usize> as core::iter::traits::iterator::Iterator>::next
24 <core::mem::maybe_uninit::MaybeUninit<[u8; 131072]>>::zeroed
24 <core::iter::adapters::map::Map<core::array::iter::IntoIter<u8, 131072>, example::main::{closure#0}::{closure#0}> as core::iter::traits::iterator::Iterator>::next
24 <core::array::iter::IntoIter<u8, 131072> as core::iter::traits::iterator::Iterator>::next::{closure#0}
24 <[core::mem::maybe_uninit::MaybeUninit<u8>]>::get_unchecked_mut::<usize>
24 <[core::mem::maybe_uninit::MaybeUninit<u8>]>::get_unchecked::<usize>
16 core::mem::forget::<core::array::collect_into_array::Guard<u8, 131072>>
16 <usize as core::iter::range::Step>::forward_unchecked
8 example::main::{closure#0}::{closure#0}
8 example::main
8 core::ptr::write::<usize>
8 core::ptr::drop_in_place::<core::iter::adapters::map::Map<core::array::iter::IntoIter<u8, 131072>, example::main::{closure#0}::{closure#0}>>
8 core::ptr::drop_in_place::<core::array::iter::IntoIter<u8, 131072>>
8 core::ptr::drop_in_place::<core::array::collect_into_array::Guard<u8, 131072>>
8 core::intrinsics::write_bytes::<[u8; 131072]>
8 <core::array::iter::IntoIter<u8, 131072> as core::ops::drop::Drop>::drop
8 <core::array::collect_into_array::Guard<u8, 131072> as core::ops::drop::Drop>::drop
8 <*const u8>::read
8 <*const [u8; 131072]>::read
8 <&mut example::main::{closure#0}::{closure#0} as core::ops::function::FnOnce<(u8,)>>::call_once
0 core::hint::unreachable_unchecked
0 <usize as core::cmp::PartialOrd>::lt
0 <usize as core::clone::Clone>::clone
0 <[core::mem::maybe_uninit::MaybeUninit<u8>]>::as_mut_ptr
0 <*const [core::mem::maybe_uninit::MaybeUninit<u8>]>::as_ptr

@PonasKovas
Copy link
Contributor

Someone analyzed a similar problem on reddit: https://www.reddit.com/r/rust/comments/oeqqf7/unexpected_high_stack_usage/

@nagisa
Copy link
Member

nagisa commented Jul 6, 2021

Alas, the high stack usage when using large types by value, including arrays, is going to be a commonly recurring theme. There's a lot of reliance on optimization here to remove intermediate locals and to inline functions so that copies when passing arguments aren't necessary.

@inquisitivecrystal inquisitivecrystal added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jul 19, 2021
@workingjubilee workingjubilee added the A-array Area: `[T; N]` label Oct 6, 2021
@wrenger
Copy link

wrenger commented Dec 8, 2021

The implementation of array::map seams fairly complex. Maybe a more trivial solution is optimized better?
Something like that:

pub fn map<F, U>(self, mut f: F) -> [U; N]
where
   F: FnMut(T) -> U,
{
    let mut result: [U; N] = unsafe { std::mem::zeroed() };
    for (i, x) in self.into_iter().enumerate() {
        result[i] = f(x);
    }
    result
}

(using the 2021 edition into_iter)

@newpavlov
Copy link
Contributor

The issue seems to be resolved on Nigthly.

@programmerjake
Copy link
Member Author

programmerjake commented Mar 3, 2023

not necessarily, the stack usage went from >16x the array size to >8x the array size, which imho is still excessive.

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e29ee71b25956cfd144f8aa258298740

@newpavlov
Copy link
Contributor

You are using debug mode. Can you create an example which would work in release mode? Ideally it should be a godbolt link with enabled optimizations which clearly demonstrates unnecessary stack usage.

@newpavlov
Copy link
Contributor

newpavlov commented Mar 3, 2023

@wrenger
The complexity comes from panic safety, i.e. the code needs to properly clean initialized elements if f panics and U implements Drop.

@programmerjake
Copy link
Member Author

here's 3x stack usage in release mode:
https://play.rust-lang.org/?version=nightly&mode=release&edition=2021&gist=399ee5cf1e40d4c1be0181971fdd1b25

ideally it would be 2x if neither the source nor destination array of the map were removed.

@programmerjake
Copy link
Member Author

compiler explorer demo of 3x stack usage:
https://rust.godbolt.org/z/zdYh9vxMa

array size of 100kB:

example::func:
        push    r15
        push    r14
        push    r12
        push    rbx
        mov     eax, 300024
        call    __rust_probestack
        sub     rsp, rax // <-- reserve 300k of stack space
        ...

@newpavlov
Copy link
Contributor

I think this link can be a slightly better demonstration: https://rust.godbolt.org/z/Wd7Ee8fzG

We can clearly see that the map-based code for some reason does the following:

  1. Copy input array from reference to stack.
  2. Perform mapping from the copied array to another stack region.
  3. Copy the resulted data to another stack region again and pass reference to it to the extern function.

There are no side effects which prevent compiler from properly optimizing this code.

Somewhat relevant issue: #88930

@khoover
Copy link

khoover commented Apr 15, 2023

I have an example here where stack space for map is always 2*size_of::<[T; N]>() + size_of::<[U; N]>().

  • make_terrain and make_terrain_zeroed both use map, and both exhibit this behaviour (make_terrain_zeroed was checking whether the problem was in map or MaybeUninit::assume_init's move not being elided properly).
  • make_terrain_nomap uses a for i in 0..N loop, and also exhibits the extra stack space being used (I think the elision of assume_init's move isn't happening there, separate issue)
  • make_terrain_nomap_ref replaces assume_init with assume_init_ref and achieves the desired generic stack usage of size_of::<[T; N]>() + size_of::<[U; N]>()
  • make_terrain_inplace is a specialization I'd like to see happen for U: !Drop, T: !Drop, size_of::<T>() == size_of::<U>(), align_of::<T>() == align_of::<U>() whenever specialization is ready (or at least U: Copy, T: Copy for now)

@nikic nikic added I-slow Issue: Problems and improvements with respect to performance of generated code. A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. labels Apr 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-array Area: `[T; N]` A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-heavy Issue: Problems and improvements with respect to binary size of generated code. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

10 participants