Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify and improve AtomicQuadTree #5

Merged
merged 4 commits into from
Jun 24, 2024

Conversation

gonzalobg
Copy link
Collaborator

This PR simplifies the owning AtomicQuadTree and its Const/Non-Const reference types into a single "reference" type that can be easily passed to parallel algorithms by value.

It also performs some initial optimizations and simplifications on AtomicQuadTree and its algorithms:

  • The atomic_calc_mass algorithm is optimized to process one tree node per thread (threads assigned to empty nodes just exit). The last thread to process a node advances to the next one. This allow us to remove the leaf_count, which simplifies the quadtree, atomic_calc_mass, and atomic_insert. This improves performance by 1.9x on GPUs.

src/kernels.h Outdated
tree.next_nodes[0] = is_leaf<Index_t>;
tree.node_status[0].store(NodeStatus::EmptyLeaf, memory_order_relaxed);
}

template<typename T, typename Index_t>
auto clear_tree(System<T>& system, AtomicQuadTreeContainer<T, Index_t> tree) {
auto clear_tree(System<T>& system, AtomicQuadTree<T, Index_t> tree) {
// clear the tree, ready for next iteration
auto r = system.body_indices();
std::for_each_n(
std::execution::par_unseq,
r.begin(), tree.bump_allocator->load(memory_order_acquire),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tree is not properly initialised in the first iteration and the first force step does not calculate acceleration properly. Applying the for loop to the entire tree I think is fine.

Suggested change
r.begin(), tree.bump_allocator->load(memory_order_acquire),
r.begin(), tree.capacity,

Furthermore, I am concerned that this can run over the end of the iterator? I'm not sure how this works with std::views::iota, but e.g. there could be 1000 bodies, so r counts up to 1000, but there may be 4000 tree nodes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the std::views::iota question, std::views::iota(v) (with one argument, not two) generates an integer range that starts at the value and spans until numeric_limits<T>::max(). There are no values of type T greater than numeric_limits<T>::max(), so the number of leafs can't be larger (it'd have already overflown the integer type somewhere else).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tree is not properly initialised in the first iteration and the first force step does not calculate acceleration properly.

I've tried to fix it in this commit (505dcbc) but I'm not 100% sure I did that correctly.

Also, i'm not sure if it is better to always clean the tree up to the capacity, or up to the last allocated node, but we can explore tweaking that in a sub-sequent PR so I've left things "as is".

Start one thread per tree node, and filter out threads that do not
start at a leaf node, instead of preparing and updating a leaf node count.
@gonzalobg
Copy link
Collaborator Author

Rebased on top of main (no other changes).

@limefax limefax merged commit a4c07f7 into UoB-HPC:main Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants