Change the parallelization axis #111
Sounds good. Just to keep it on record: this should definitely be nested threading, such that one can also parallelize over the block pairs if possible. Even in the use-case at hand, having a 2x24 parallelisation is probably beneficial...
I took out all the parallelization in the actual contractions for now and am re-introducing it as focused parallel sections from the bottom up. I thought that we could pull out the parallel regions, do some things redundantly, and only collaborate on the for loops. But that would require a bit more synchronization in the data structures, so I want to do it in small steps. In OpenMP terms we would have a parallel section outside and then use teams for the loops over the correlators? And one would choose the team size to be the total number of threads to get minimal memory usage? And a team size of 1 to have the current state of the code?
Agreed.
I think that teams might be overly complicated for this. Instead, one would have nested parallel regions: the one at the block pair level would spawn a few threads, each of which then opens the inner region for the loops.
The above, suitably modified to have integer-valued loop counters, should be perfectly valid code. Of course, …
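A minimal, self-contained sketch of such nested OpenMP regions, with made-up loop bounds and a placeholder work function (not the actual contraction code), could look like this:

```cpp
#include <omp.h>

// Hypothetical stand-in for the real per-correlator contraction work.
void contract_correlator(int block_pair, int correlator) { /* ... */ }

int main() {
  omp_set_nested(1);  // allow the inner parallel region to actually spawn threads

  int const num_block_pairs = 2;
  int const num_correlators = 100;

  // Outer region: a couple of threads, one per block pair.
#pragma omp parallel for num_threads(2)
  for (int bp = 0; bp < num_block_pairs; ++bp) {
    // Inner region: the remaining threads share the correlator loop,
    // giving e.g. a 2x24 layout on a 48-core node.
#pragma omp parallel for num_threads(24)
    for (int c = 0; c < num_correlators; ++c) {
      contract_correlator(bp, c);
    }
  }
}
```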
Disregarding the complication of synchronizing the cache(s) [which may prevent this simple approach, not sure], it should be sufficient to modify the top-level parallel section and to parallelise the assemble function.
Note that the second parallel section can also be inside a function, no problems there.
PPS: …
I think I got the basic part right, but the tests are slower on my laptop now:
Caveats:
But I guess with all the finer parallelization and the endless locks this is no surprise. We can change the final accumulation to atomics rather easily; I had tried that already. Actually it might still be in there and now have locks around atomics. The memory scaling should be much better now, additional threads would not consume more memory in the cache. But the parallelization is horrible, still:
I have the impression that one should go through the list of correlators without doing anything and just collect the requests, and then build them eagerly upfront with many threads. Then the cache is locked and every thread can happily retrieve things. Does that sound sensible? I guess there is no real alternative if we want to do it in parallel.
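A compressed sketch of that collect-then-build idea, with purely hypothetical key/value types and function names, could be:

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical key and value types standing in for the real cache entries.
using Key = int;
using Value = std::vector<double>;

std::map<Key, Value> cache;

Value build_intermediate(Key k) { return Value(10, static_cast<double>(k)); }

void run(std::vector<Key> const &correlator_requests) {
  // Pass 1: walk the correlator list without computing anything, just collect keys.
  std::set<Key> needed(correlator_requests.begin(), correlator_requests.end());

  // Pre-populate the map serially so the parallel loop never inserts into it.
  std::vector<Key> keys(needed.begin(), needed.end());
  for (Key k : keys) {
    cache[k];
  }

  // Pass 2: build everything eagerly with many threads; distinct keys mean
  // distinct map elements, so no lock is needed here.
#pragma omp parallel for
  for (std::size_t i = 0; i < keys.size(); ++i) {
    cache.at(keys[i]) = build_intermediate(keys[i]);
  }

  // Pass 3: the cache is now effectively read-only and every thread can
  // retrieve from it without locking.
}
```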
You can see the difference here: Q0Q2-optimization...other-parallelization
Is it possible to sort the correlator requests into disjoint sets according to which intermediates they need? Of course, this again goes in the direction of building a dependency tree...
On Travis CI it fails because the older OpenMP cannot do …
I do not think that one can really make disjoint sets out of them in a trivial way. The dependency tree would be needed here. But going through the list twice, once to trigger the cache to build and then to actually retrieve the elements, does not sound too hard, actually.
But this would still involve locking at the building stage, right? Well, I guess it might help somewhat.
No, it would not. Take this one here:

```cpp
std::vector<DilutedFactor> const &get(Key const &time_key,
                                      std::array<ssize_t, 2> const &key) {
  TimingScope<4> timing_scope("DilutedProductFactoryQ0Q2::get");

  std::vector<DilutedFactor> *result;
#pragma omp critical(DilutedProductFactoryQ0Q1_get)
  {
    if (Q0Q2_.count(time_key) == 0 || Q0Q2_.at(time_key).count(key) == 0) {
      build(time_key, key);
    }
    result = &Q0Q2_.at(time_key).at(key);
  }
  return *result;
}
```

Instead I picture something like this:
Then the products could just run just as they are, with OpenMP added:

```cpp
for (ssize_t i = 0; i != ssize(diagram_index_collection); ++i) {
  const auto &c_look = diagram_index_collection[i];
  Tr[{t1, t2}][i] =
      factor_to_trace(df1[{t1, b2}].at({c_look[0]}), df2[{t2, b1}].at({c_look[1]}));
}
```

The cache would just serve whatever it has available, and that would be everything because everything had already been built.
This is how that would look with OpenMP:

```cpp
#pragma omp parallel for
for (ssize_t i = 0; i != ssize(diagram_index_collection); ++i) {
  auto const &c_look = diagram_index_collection[i];
  auto const &value =
      factor_to_trace(df1[{t2}].at({c_look[1]}), df2[{b2, t1, b2}].at({c_look[0]}));
#pragma omp critical(DilutedTrace2Factory_build)
  Tr[{t1, t2}][i] = value;
}
```

And even there we could think of replacing the map with something that would not need to be locked, or pre-populating the map.
Wait, I'm shooting myself in the foot as I don't have a clear plan. I just parallelized the building of the cache elements. So I can remove a bunch of synchronization stuff again.
I agree, I misunderstood what you were planning to do. In some sense, you are processing the dependency tree "depth-by-depth", ignoring disjointness (which is fine).
We have the problem that I have introduced only a Q0Q2 cache but not a Q1Q1 cache. This means that only the charged diagrams have this optimization and the non-charged ones are not included in it. The code therefore is in a similar state as before I started working on it: only the things that the current PhD project needed were fleshed out, the remainder left aside. But I will keep it this way, focus on the changes needed for the I=3 project, and worry about generalizations later. I'll just call them straightforward. There is one more layer that I missed so far in our discussion an hour ago, and it is exactly this cache for the relevant diagrams. The call chain is the following, from top to bottom level.
We have three levels of cache, and the middle one also needs the quantum number indices as keys. There are also the lookup tables, which are not exactly part of this cache. So thinking in terms of the graph goal, we have the dependency information spread out in our caches. The plan as discussed with Bartek today is to try to make the caches first gather all the requests in one iteration of the whole diagram assembly code, and then to eagerly build all the needed things from the bottom to the top level using OpenMP concurrency. I have only sketched this in my head so far; I'll add a second … This brings us also further towards the dependency graph, so this is incremental work towards the full graph.
I have added the explicit `request` and `build_all` calls:

```cpp
df1.request({t0, b1});
df2.request({t1, b2});
df3.request({t2, b3});
df4.request({t3, b0});

df1.build_all();
df2.build_all();
df3.build_all();
df4.build_all();
```

This of course does not change the performance of the code, but I can now work on splitting that further up the chain.
During the train rides I added … One of the troubles that I found is that the trace and factor factory just receive the time slices that the various operators should sit on. There seem to be only a few time combinations for the various objects: for the QQ products you have three times, for the tr(QQ) there are two times, and for the tr(QQQQ) there are four times. But as we only have one source and one sink time slice, we actually only have a handful of combinations there. This means that just parallelizing over the internal time degrees of freedom with threads will not saturate a machine like JUWELS with 48 cores per node. We need to parallelize over the Dirac structures (only one for 3pi) and the momenta (lots for 3pi). But as it is a two-level thing, I would have a parallel for loop for each of these time slice combinations. I wish that they were more easily accessible, but that will be another round of reworking. In the process I have broken the diagrams not needed for 3pi, and I don't even recall what exactly I did there. But they use the … Also it hit me that before I started all this refactoring we basically had everything in lookup tables which told us what we need to compute. Now I have spent effort in making half of the code lazy, only to find that we want to make it eager again. I know that as part of this we have gained more unification and abstraction, but it feels like I am going backward again. Regarding the general strategy, I'd like to discuss that with you, @kostrzewa, when you have some time.
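To make the intended two-level structure concrete, here is a rough sketch with purely hypothetical types (the real factories, keys, and quantum number structures look different): an outer loop over the handful of time-slice combinations and an inner parallel loop over the Dirac-structure/momentum combinatorics.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical descriptors; the real keys and quantum number structures differ.
struct TimeKey { std::array<int, 4> t; };
struct QuantumNumbers { int dirac_structure; int momentum_index; };

void build_one(TimeKey const &tk, QuantumNumbers const &qn) { /* contraction work */ }

void build_all(std::vector<TimeKey> const &time_keys,
               std::vector<QuantumNumbers> const &combinatorics) {
  // Only a handful of time-slice combinations exist (one source and one sink
  // time slice), so this outer loop alone cannot saturate a 48-core node.
  for (auto const &tk : time_keys) {
    // The Dirac-structure/momentum combinatorics is large for 3pi, so this
    // inner loop is where the threads should go.
#pragma omp parallel for
    for (std::size_t i = 0; i < combinatorics.size(); ++i) {
      build_one(tk, combinatorics[i]);
    }
  }
}
```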
```cpp
// Build the diagrams.
for (auto &diagram : diagrams) {
  if (diagram.correlator_requests().empty()) {
    continue;
  }
  TimingScope<1> timing_scope("request diagram", diagram.name());

  for (auto const slice_pair : block_pair) {
    int const t = get_time_delta(slice_pair, Lt);
    diagram.request(t, slice_pair, q);
  }  // End of slice pair loop.
}  // End of diagram loop.

q.build_all();

for (auto &diagram : diagrams) {
  if (diagram.correlator_requests().empty()) {
    continue;
  }
  TimingScope<1> timing_scope("contract diagram", diagram.name());

  for (auto const slice_pair : block_pair) {
    int const t = get_time_delta(slice_pair, Lt);
    diagram.assemble(t, slice_pair, q);
  }  // End of slice pair loop.
}  // End of diagram loop.

q.clear();
```

I now have the delegation pulled out, as far as I can tell. Next I will see where I can insert the concurrency. This time I will make a plan first, though 😄.
This is the call graph: Using the 4⁴ test lattice with the charged diagrams only, I have this profile from a code which is not parallelized at this moment: profile.pdf. The time spent in …
The correlator names file has 2223 lines, so removing a bit of JSON stuff gives us around 2200 correlators that were computed. I had hoped that building the parts would clearly be the most expensive part, but it seems that the scheduling needs to either be parallelized or be pulled out such that it is re-used for every time slice combination. At the moment it is not, although not everything depends on it.
I've added some more grouping elements to the profile such that we can see the time it took for the factors (Q), the products (QQ), and the traces (tr(Q…)) to build. As there are many more combinatoric possibilities for the traces, I think that it is reasonable that they take the most time, whereas there are limited options for the factors and products. Also the …
I have added parallelization to just the build of a single time key:

```cpp
// We populate the whole map such that we can change its elements in a parallel way
// later.
for (ssize_t i = 0; i != ssize(diagram_index_collection); ++i) {
  Tr[time_key][i];
}

#pragma omp parallel for
for (ssize_t i = 0; i != ssize(diagram_index_collection); ++i) {
  auto const &c_look = diagram_index_collection[i];

  auto const &l01 = dpf_.get({b0, t1, b2, t2}, {c_look[1], c_look[2]});
  auto const &l23 = dpf_.get({b2, t3, b4, t4}, {c_look[3], c_look[4]});
  auto const &l45 = dpf_.get({b4, t5, b0, t0}, {c_look[5], c_look[0]});

  auto const &value = factor_to_trace(l01 * l23, l45);

  Tr[time_key][i] = value;
}
```

This directly makes it faster. But then we can also pull out the parallel region a bit into the `build_all`:

```cpp
void build_all() {
  std::vector<Key> unique_requests;
  unique_requests.reserve(requests_.size());
  for (auto const &time_key : requests_) {
    if (Tr.count(time_key) == 0) {
      unique_requests.push_back(time_key);
      Tr[time_key];
    }
  }
  requests_.clear();

#pragma omp parallel
  {
    // Note: every thread executes this loop redundantly; the intent is that
    // the work sharing happens inside build() via an orphaned `omp for`.
    for (auto i = 0; i < ssize(unique_requests); ++i) {
      build(unique_requests[i]);
    }
  }
}
```

And then in the …
Yes, that was what I had thought was the plan from the beginning. I realize that the data structures are not ideal for this.
Sure, lazy evaluation kind of necessarily comes with a memory cost and a loss of control unless one can pre-compute how to partition the problem and the "evaluator" is aware of this partitioning. The nice thing about having a lazy "base" is that it can be used straightforwardly to build dependency lists, as it does now. In other words: in the loop over block pairs, we want to work lazily to derive the dependencies and then eagerly fulfill them.
That's okay, but depending on how many combinations we need to compute for the <P(t+t_i) J_mu(t_i) J_nu(t_i)> correlator (@pittlerf, @marcuspetschlies), it might be beneficial to attempt to keep all of the code running. If there are not too many combinations there, we can always use an old version of the code and use this one for 3pi only for now.
The balance might be a little different on a real configuration; it might be worth running a timing on one of the 24c48 lattices to get an idea.
I managed to get a race condition, I think. So in release mode I just had this here:
Then I ran it with debug mode in GDB and got this here:
I also let it run in debug mode without GDB and it just took very long but did also produce that. As discussed with Bartek, I have changed the …
Could this simply be an ordering problem? There's currently no way to know which diagram triggered this, but I have a feeling it might have been one of the *D ones, for which the smaller traces must already have been built for it to work. This would of course have been taken care of automatically in the lazy implementation.
My run with a single time slice and single thread has finished on JUWELS. I have used this version here:
It has taken 2.17 hours to do one source-sink combination:
We learn that it has taken 77.496 GB of memory at its peak:
This is great news because the planned parallelization in the diagram assembly will not increase the RAM usage, since the threads will share everything. This means that we can get the job done with 90 GB on JUWELS. Together with the 20% reduction that I found on Friday, we likely have it covered pretty well. For some reason I have an assertion failure in the timing code but still got a timing output; I don't fully understand why, but I'm happy for it. The following is the profile output (profile-dot.pdf): This is the branch where we have the Q₂Q₀ optimization in place, but the parallelization is not changed with the requests and … Unfortunately the requesting also takes some time, and at the moment we do that once per time slice. Just taking the numbers from above, 2 hours for one of the 528 combinations will indeed give us 22 hours in total assuming perfect parallelization (528 × 2 h ≈ 1056 hours of work; spread over 48 cores that is roughly 22 hours). But with the requesting and imperfect parallelization we likely need to split up either the correlators or the time slices. I think that the latter makes more sense because then we can compute all the diagrams for a time slice combination in one go. Doing the requesting only once would make sense, but that might be a bit more work in the code. I would suggest that I first try my best with the new parallelization and gather timings as well. Then I adjust it such that one can give a time slice range, and adjust the job generator to produce a few batches. This should allow us to finally start production and let me improve the performance without delaying getting results.
Excellent news, I agree with all your conclusions. One more thing that is not clear to me: the intermediate …
Looks great. For maximum efficiency, however, we might want to run with 2xN threads (where N will be between 16 and 24, I would think). This might mean having to run on the 192 GB nodes, but we have to see. I'll try to find some time tomorrow to look at the logic you mention in HISKP-LQCD/sLapH-projection-NG#26.
At the moment it is not cached. I would think that it would require even more memory, so I'd like to postpone this issue for a bit. There is not much parallelization going on, but we already have a race condition yielding sporadic segmentation faults. I have changed the profiling such that the calls made within parallel regions are colored in red. Also, for the unit test it makes sense to output the profile in a vertical fashion; there is a command line option for that as well now. I was going to add more parallelization there, but I first need to find the cause of that race before complicating the problem.
I think I understood and fixed that race, but I'd appreciate some code review at some point. I have parallelized the whole building part now, as you can see in the red-marked regions here: The assembly can be parallelized as well; I have already started with an …
The job with the devel QOS has not been able to finish within 2 hours. The other one has been at it for 13:16:19 hours now and is currently in a single-threaded section of the code. That could either mean requesting or assembling. But since it is only at 33 GB of memory usage on that given node, I fear that it is still in the requesting phase. The timing level is set to 3, so it should not busy itself with too many timings. I would think that the requesting phase needs to be sped up as the next action, right?
Seems so... I'm still surprised at how slow this seems to be...
How can it be, however, that the requests don't take very much time at all for the test lattice?
In …
Strike that...
As you can see from the last profile from the test lattice on my laptop, the requesting phase is the most expensive part of the program. The number of various entities varies between the test and the 3pi project, but we have around 2100 correlators in the test and 250k in the real thing. So it is a factor of 120 more work to do. On my laptop the test takes 11 seconds there. Assuming O(N) scaling, it should take 21 minutes on JUWELS. Assuming O(N log(N)) we would have 35 minutes. If for some reason the whole thing scales as O(N²), then the expectation is 43 hours. I don't really see where the algorithm should be quadratic; we just iterate through all the correlators, parse them, and schedule the parts into … As far as I recall I have compiled it with …
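For reference, these extrapolations follow from scaling the 11-second test run by the correlator-count ratio 250000/2100 ≈ 120; the small deviations from the quoted figures are just rounding:

```latex
t_{\mathcal{O}(N)}        \approx 11\,\mathrm{s} \times 120 \approx 22\,\mathrm{min},\\
t_{\mathcal{O}(N \log N)} \approx 11\,\mathrm{s} \times 120 \times \frac{\ln 250000}{\ln 2100} \approx 36\,\mathrm{min},\\
t_{\mathcal{O}(N^2)}      \approx 11\,\mathrm{s} \times 120^2 \approx 44\,\mathrm{h}.
```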
Perhaps the missing …
Now that we fixed that stupid O(N²) thing and made it O(N log(N)), it ran through! This is a single block pair (so a few time slices given the dilution), and it took just ten minutes and 74 GB of memory:
And this is the profile for this thing: profile-dot.pdf. We can see that the time spent in … The speed-up of the parallel parts can perhaps still be increased; perhaps there are really silly things in there. But it seems to be within reach now, without reducing the amount of physics that we get.
Looking very good. This means that we can do the physical point lattice with about 2 million core-hours if things go as planned, such that we should count on about two months of real runtime, although perhaps it can be done in one month by going into negative contingent a little.
I have just tried it with and without SMT on JUWELS. For some reason the self time of … It does not really matter, though. The time spent in the building of the diagram is 78 seconds with SMT and 80 seconds without SMT, so we can just ignore the SMT. Also, with 78 seconds per block pair we aim for 14 hours per configuration. I'll submit a job for production now! 🎉
I would guess that SMT will almost certainly hurt you on Skylake for this problem type. Definitely just one thread per core.
Actually, disregard that comment. It's not at all clear cut: since some of the memory traffic involves very small objects, it might be that using SMT actually allows one to saturate the memory bandwidth better. In any case, the difference, up to the self time, is small.
The run using every third source time slice on JUWELS has finished. This is the profile from the run without the 20% reduction, taking around 13 hours: It has taken 11 hours with the 20% reduction in the correlator names. I will have to adjust the analytic prescriptions, and then I can generate an overview of the correlator matrices that have been generated. The assembly of the diagrams has taken several hours now. It is good enough such that we can start production on the regular nodes on JUWELS. I could then also parallelize the assembly and reduction such that we get a little more gains, but I will do that when the contractions are running with the current version. Memory usage is 38 GB with the reduced one. From the profile it seems that caching the QQQQ objects could be beneficial as well. I am not sure whether they would fit into memory, though.
Looks good!
Agreed, production can proceed now with improvements in the future.
Perhaps one could churn through the combinatorics and estimate the memory load? Another avenue for slight optimisation might be to run with 2x24 threads (rather than 48 threads), parallelising in a nested fashion over the block pairs and the combinatorics, respectively. I would hope that this would improve memory locality, reducing cross-socket accesses.
Jobs are queued; I'll see about space to line up more already.
The numbers are easy: I just took all the C6cC correlators, split them into pairs of operators, and saw how many multiplications are needed. Without the cache, we need two multiplications per correlator from the already available QQ objects. Then I saw how many unique QQQQ objects I can form from the first two pairs of the correlators. These need one multiplication each, and then one multiplication with the remaining QQ object. The results are then these:
This is a 26 % decrease in the number of multiplications. We cannot reuse anything from the C4cC because there the outer random vector indices are the same, because we want a trace. I am not sure how large these intermediate QQQQ objects are each. They contain a vector of …
We actually do not need a cache. We just need to order the requests by the QQ objects that they use. Then we can multiply the first two and keep the product, and do all the multiplications with the third. When the second (or the first) changes, we need to form a new intermediate product. This needs constant additional memory. Yet another thing towards the graph.
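A rough sketch of that ordering idea, with purely hypothetical stand-in types (the real QQ objects and their multiplication are of course more involved):

```cpp
#include <algorithm>
#include <array>
#include <vector>

// Hypothetical stand-ins: the real QQ objects are collections of matrices per
// random vector combination, and the "multiplication" is the corresponding
// contraction.
using QQ = double;
using Request = std::array<int, 3>;  // indices of the three QQ factors of one C6cC term

double result = 0.0;

QQ get_qq(int index) { return 1.0 + index; }              // pretend cache lookup
void accumulate(QQ product, QQ third) { result += product * third; }

void process(std::vector<Request> requests) {
  // Order the requests so that terms sharing the same first two QQ factors are adjacent.
  std::sort(requests.begin(), requests.end());

  std::array<int, 2> current{-1, -1};
  QQ product = 0.0;

  for (auto const &r : requests) {
    if (std::array<int, 2>{r[0], r[1]} != current) {
      // Leading pair changed: form the intermediate QQQQ product once ...
      product = get_qq(r[0]) * get_qq(r[1]);
      current = {r[0], r[1]};
    }
    // ... and reuse it for every request with the same leading pair.
    accumulate(product, get_qq(r[2]));
  }
}
```

The point is that only one intermediate product is alive at any time, so the extra memory stays constant no matter how many requests there are.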
@martin-ueding can you remind me what the status of this is? Is it generally applicable now for any type of diagram?
As there is only a Q0Q2 optimization, this will not work for any sort of neutral diagram, and I believe that the C3c and C3cV won't work either. It should not be too hard to extend this, but at the moment only the work needed for three pions is done.
Alright, thanks. So for the rho contractions we should use the tagged version as we did a few months ago.
Actually, which version was that? There's no obviously named tag. @pittlerf, @matfischer, do you remember the exact commit that you used for the rho contractions?
Exactly. With the smaller number of intermediate objects, that should still fit into memory with a reasonable number of threads. Isn't the …
I see, I thought you had partially merged the changes already. Noted.
Hi, I used the last commit from the rho_nf2 branch.
I have merged the Q0Q2 optimization and the change of the parallelization axis, as the Q0Q2 optimization alone was not usable due to the memory scaling.
Ah, yes, we have branches (duh...)
Currently we parallelize over the time slices and keep a cache per thread. This means that the memory usage depends linearly on the number of threads with a pretty big slope. It is so steep that we cannot use it on machines like JUWELS where we just have a few GB of memory per core.
The solution could be the full graph treatment, see #102. But instead we could just try to parallelize over the correlators first. For the 3pi project we have of the order of 250k correlators that we want to compute. We could just parallelize over these instead. This would mean moving a few `#pragma omp for` and also making sure that the cache is synchronized. I will first try that with an `omp critical` and see how bad the performance is. Perhaps we can get a synchronized map from Boost or something. I'll first go through the code and see what we would need to do.
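For the synchronized-map option, a minimal sketch using only the standard library (rather than Boost) and hypothetical key/value types would be a mutex-guarded wrapper around the cache:

```cpp
#include <map>
#include <mutex>
#include <vector>

// Hypothetical key and value types standing in for the real cache entries.
using Key = int;
using Value = std::vector<double>;

class SynchronizedCache {
 public:
  // Returns the cached value, building it under the lock if it is missing.
  Value const &get(Key const &key) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = map_.find(key);
    if (it == map_.end()) {
      it = map_.emplace(key, build(key)).first;
    }
    return it->second;
  }

 private:
  Value build(Key const &key) { return Value(4, static_cast<double>(key)); }

  std::mutex mutex_;
  std::map<Key, Value> map_;
};
```

References into a std::map stay valid across later insertions, so handing out the reference after releasing the lock is safe as long as entries are never modified once built.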