Issues with cA2.09.48 on JUWELS #109
Yeah, one of the inconsistencies... Should be easy enough to account for -> https://github.com/HISKP-LQCD/sLapH-contractions/blob/master/src/global_data_build_IO_names.cpp |
Oh, you know where this hits? |
I would have expected a perambulator or random vector reading failure, however... |
Guess I will be looking into the traceback stuff that we had discussed a few times. |
The IO names are constructed in
Just to be sure:
|
I've answered my second question:
should be
|
Indeed, it would be useful to know where this was triggered without having to fire up a debugger... |
I had tried to compile the software with the

I'm back from lunch and I got this backtrace from gdb:
The offending line is this:

```cpp
auto const size2 = gd.quarkline_lookuptable.at("Q2").at(q[1]).rnd_vec_ids.size();
```

The type of that field is this:

```cpp
DilutedFactorIndicesCollection quarkline_lookuptable;
```

And that type is:

```cpp
using DilutedFactorIndicesCollection =
    std::map<std::string, std::vector<DilutedFactorIndex>>;
```

So it is the second `.at()` call that throws. It could be the dilution specification, I will check that. |
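For illustration only (not code from the repository): a minimal sketch of this failure mode, showing that range-checked `.at()` access on a map or vector throws `std::out_of_range` when an entry was never filled in, without any file I/O being involved.

```cpp
// Minimal sketch of the failure mode seen here: std::map::at and
// std::vector::at are range-checked and throw std::out_of_range when the
// requested key/index was never added to the lookup structure.
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

int main() {
  std::map<std::string, std::vector<int>> lookup;
  lookup["Q2"] = {1, 2, 3};  // only three entries are ever filled in

  try {
    int const value = lookup.at("Q2").at(5);  // index 5 does not exist
    std::cout << value << "\n";
  } catch (std::out_of_range const &e) {
    std::cout << "what(): " << e.what() << "\n";
  }
}
```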
I have updated the dilution specification, it shows up in the logs, but there is still the same exception. I'll run it in GDB again to make sure that it is the same spot. |
Did you add something to the function? We seem to have different line numbers (Q0Q2-optimization branch):

```cpp
[...]
389 int total_number_of_random_combinations_in_trQ1 = 0;
390 for (auto const &q : gd.trace_indices_map.at("trQ1")) {
391   total_number_of_random_combinations_in_trQ1 +=
392       gd.quarkline_lookuptable.at("Q1").at(q[0]).rnd_vec_ids.size();
393 }
[...]
```
|
I have moved the perambulators like they are on QBIG, but that also did not resolve the issue:
|
Are you sure that it is connected to the perambulators? I remember having an issue where the sign of the required momentum in the VdaggerV was different from what was actually written to disk. Could that be the case here? |
It could also be that I have set up the VdaggerV incorrectly. I have never used pre-generated VdaggerV, so perhaps I am doing it wrong. So in

And then in the file I have these options:

```
delta_config = 1
#path_config = /hiskp4/gauges/quenched/wilson_b5.85_L12T24

# eigenvector handling:
number_of_eigen_vec = 660
#path_eigenvectors = /hiskp4/eigensystems/nf211/A40.24/
#name_eigenvectors = eigenvectors

handling_vdaggerv = read
path_vdaggerv = /p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48/cnfg0240
```

From the code I have these snippets that make up the file names:

```cpp
auto const full_path =
    (boost::format("/%s/cnfg%04d/operators.%04d") % path_vdaggerv % config % config)
        .str();

std::string dummy = full_path + ".p_" + std::to_string(op.momentum[0]) +
                    std::to_string(op.momentum[1]) +
                    std::to_string(op.momentum[2]) + ".d_" +
                    to_string(op.displacement);

auto const infile = (boost::format("%s.t_%03d") % dummy % t).str();
```

In case these could not be opened, there would be errors thrown. So either they are built or they can be found. |
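Just to make the effect concrete, here is a quick standalone check (not from the repository; the path and configuration number are taken from the snippet above) of what these lines produce when `path_vdaggerv` itself already ends in `cnfg0240`:

```cpp
// Standalone reproduction of the path construction quoted above.  When
// path_vdaggerv already contains the configuration directory, the format
// string appends "cnfg0240" a second time, so a non-existent path results.
#include <boost/format.hpp>
#include <iostream>
#include <string>

int main() {
  std::string const path_vdaggerv =
      "/p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48/cnfg0240";
  int const config = 240;

  auto const full_path =
      (boost::format("/%s/cnfg%04d/operators.%04d") % path_vdaggerv % config % config)
          .str();

  // Prints "//p/scratch/.../cA2.09.48/cnfg0240/cnfg0240/operators.0240"
  // (note the doubled cnfg0240 directory).
  std::cout << full_path << "\n";
}
```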
Let's try that without the configuration number in the path!

```
path_vdaggerv = /p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48
```

But that, on the other hand, means that there is no clear error when the file cannot be found! |
It seems that the path to the perambulators is not the real problem. Taking the output from some working contraction we can see what happens:
And this is missing in the new case here:
So the actual reading of the files happens later on. It seems that it would crash regardless of whether the actual files are present. I am not sure how that works with the VdaggerV “read” mode here; the other contractions were in “build” mode. I will just try to run it with “build” and see whether that crashes later. |
Hi Martin, The VdaggerV reading happens after the perambulator reading:
|
at least in the case of the rho |
Okay, thank you! This means that something apart from the files is still wrong. I will try to figure that out. |
I have tried your code with the three pion and the rho input file as well; the three pion crashed with the same error and the rho actually did not crash.
|
I have tested that with
works |
Okay, thank you very much for testing this! That means that it is independent of any perambulator or VdaggerV files but there is something wrong prior to that. I will try to triage that further. |
I just realized that you were using a somewhat older version of the code from 2019-07-22. Since then a few things have happened:
I have tried with your version and it also crashes. So I would say that the error has been there all along. On my laptop it takes 384 seconds until the program crashes. I realized that I did not have the cutoffs in place; then all of a sudden it crashed after 6 seconds. The range check now fails because it hits index 7, but the same principle applies. Just running it with

Okay, then we have it pinned down pretty well, I believe. |
These combinations work:

```
momentum_cutoff_0 = 2
operator_list = g5.d0.p0,1
```

```
momentum_cutoff_0 = 4
operator_list = g5.d0.p0,1,2
```

But these ones do not:

```
momentum_cutoff_0 = 2
operator_list = g5.d0.p0,1,2
```

```
momentum_cutoff_0 = 4
operator_list = g5.d0.p0,1,2,3
```

This should not be the case, of course. The cutoffs should just remove all the operators that are not needed. So perhaps the problem is that some operators are not used at all with the given cutoffs and therefore do not appear in some list, letting the indexing run past the end of the vector. I'm reaching the conclusion that I do not want to do anything with that code and would rather replace it outright now. |
Looking just at your output file on JUWELS for guidance,
It seems to me that the problem occurs in the iteration over the "quantum number indices" for
Yes, although one should keep in mind that all the lookup table stuff in the code is solving a rather complicated organisational problem and that replacing it is similarly not so simple. Going from the output of the subduction code back to VdaggerV objects and perambulators is not totally straightforward, I think... Also, one doesn't just have to generate the required dependencies on these basic objects at this stage, but also on intermediate results.

As it is right now, it seems that objects are requested which are not added to some list at an appropriate stage, most likely in the construction of the dependencies.

If a complete redesign is not feasible, one might be able to reshape the construction of the lookup tables by making use of maps with clear names for each required object instead of having to manually ensure unique indexing (as exemplified by the requirement of the |
Sorry, I did not see that comment before.
Yes, a simple print statement which I later removed again. The spot that you name certainly is the one where the code crashes.
I still have not dug into it too deeply, so I likely underestimate the effort required. But in principle the correlators have all the information in them. Also, I have the diagram specifications that tell me which Q0/Q1/Q2 objects are involved at which positions:

```cpp
map["C6cCD"] = {Vertices({0, 2, 4}, {1, 3, 5}),
                TraceSpecs{{{"Q2", 3, 0}, {"Q0", 0, 1}, {"Q2", 1, 2}, {"Q0", 2, 3}},
                           {{"Q0", 4, 5}, {"Q2", 5, 4}}}};
```

Here you see that we have a trace over four factors (Q2 Q0 Q2 Q0) and a second trace over two factors (Q0 Q2). Using this I should be able to populate the operators, the Q objects and the trace(Q…) objects that are needed, bottom up, and then use the indices from these to fill the index tables of the more complex objects. |
I agree, the diagram spec should provide the necessary handle. |
Once the JSON-based correlator input is merged, what's (roughly) the modus operandi if I wanted to run contractions for some new (or old) system? |
For a new system you would run my projection code to obtain the prescriptions. The contraction code contains a utility script which goes through all these prescriptions, extracts the dataset names and compiles them into the list of correlator names. For an existing system, like the integration tests, I used |
What if I don't have Mathematica? Just playing devil's advocate here to understand all steps. What about the fact that QCT does not have real support for the twisted mass symmetries, does that matter? |
Supporting new types of contractions then necessarily also means having to touch the subduction code, correct? |
A third option I forgot is to just generate such a list of diagram names with some other means. Basically the part of the code that I removed (and had some bug in it) could be rewritten in some other language to output such a list.
You could just personally buy an academically discounted version if your institution does not have it; I mean, this has happened with Maple here once already. But if you don't have Mathematica, you could not use the projection at all. This would mean that you would need to either acquire a license or rewrite all this in some other product. Even if you use the old code and just let it generate a bunch of correlation functions, you will need to know how these are to be combined. And once you have figured that out, you can also build such a list. Unless you write code which never explicitly constructs this list but goes to the numeric data right away.
It does not know anything about the Dirac matrices. I explicitly do the gamma-5 trick with pattern matching in the code. But there might be more opportunities to compute fewer correlators.
I would rather say that once you have new operators you need to touch the projection code. The fact that you get different contractions from that is just a consequence. But you are right, the projection code becomes somewhat of a dependency of the contraction code. On the other hand there is a clear interface, the JSON file with the correlator names, so one could substitute it with a different implementation. |
Thanks for the detailed explanation.
Yes, that's what I meant.
For simple systems (i.e., stuff at zero momentum basically) this will be necessary to have I believe. I agree that there's a clear interface now and it should be relatively straightforward to bang something together. |
Well, for zero momentum you can just write them by hand. There is just one correlator per diagram type. On JUWELS I advanced further, now I am at this error message:
Taking a look into that directory shows that not all momenta are actually built.
The one that I want, (0, 1, 1), is not available. But there is the conjugated (0, -1, -1) which I could use. I presume that the conjugation of VdaggerV objects has somehow not been resolved in the same fashion as it has been for the generation of these files. If a file cannot be found, I could take the opposite parity file and do a Hermitian conjugation. Or, when resolving the VdaggerV that I need, I could make sure that I always have a minus sign in the first non-zero momentum component. That is likely the easier method and will use a bit less memory. |
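A minimal sketch of that second option, assuming (as described above) that the file for -p can stand in for +p after a Hermitian conjugation; `MomentumLookup` and `canonicalize` are made-up names for illustration, not the repository's API:

```cpp
// Hypothetical helper: bring a momentum into the convention "first non-zero
// component is negative" and record whether the stored VdaggerV object then
// has to be Hermitian conjugated when it is used.
#include <array>

struct MomentumLookup {
  std::array<int, 3> momentum;  // momentum of the file to read
  bool conjugate;               // true: dagger the object after reading
};

MomentumLookup canonicalize(std::array<int, 3> p) {
  for (int component : p) {
    if (component != 0) {
      if (component > 0) {
        // Flip the sign of the whole momentum and remember the conjugation.
        return {{-p[0], -p[1], -p[2]}, true};
      }
      break;  // first non-zero component is already negative
    }
  }
  return {p, false};  // zero momentum or already canonical
}
```

For example, (0, 1, 1) would be mapped to the existing (0, -1, -1) file with the conjugation flag set, while (0, -1, -1) itself would be read as-is.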
Yes, I agree completely, this is a very good input interface for the purpose, generalizing straightforwardly to multiple flavours and simple enough to write by hand for simple cases. This approach is on the way towards a more general "diagram description language" which I thought might have been optimal for this type of rather general contraction code. At the time, I had the idea of a slightly more elaborate format which also included different modifiers for the quark lines (to support multiple smearing types, for example) and with an explicit listing of temporal and spatial coordinates, providing the ordering that is now provided by the diagram specifications internally. I dropped the idea to pursue this as there seemed to be no interest in the group at the time.
Yes, there is no way around checking both the +p and the -p VdaggerV files, taking the one that exists and tagging that it has to be Hermitian conjugated when used. |
It might make sense to create a map of available VdaggerV objects at startup (when in "read" mode) in order to avoid overly frequent requests to the file system. |
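A minimal sketch of such a startup scan, with made-up function and variable names and assuming the file-name pattern quoted earlier in this issue; the real implementation would have to match whatever the read code actually expects:

```cpp
// Hypothetical sketch: list the VdaggerV directory once at startup (in "read"
// mode) and keep the available file names in a set, so that later lookups are
// in-memory queries instead of repeated attempts to open files.
#include <cstdio>
#include <filesystem>
#include <set>
#include <string>

std::set<std::string> scan_vdaggerv_files(std::string const &path_vdaggerv,
                                          int const config) {
  namespace fs = std::filesystem;

  char subdir[16];
  std::snprintf(subdir, sizeof subdir, "cnfg%04d", config);

  std::set<std::string> available;
  for (auto const &entry : fs::directory_iterator(fs::path(path_vdaggerv) / subdir)) {
    // Remember names such as "operators.0240.p_011.d_000.t_005" (pattern
    // taken from the snippets quoted above; the exact format may differ).
    available.insert(entry.path().filename().string());
  }
  return available;
}

// Usage idea: if the name for +p is not in the set, fall back to the -p file
// and mark the object for Hermitian conjugation.
```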
at least it throws now :) |
Okay, we are all set to tackle the next problem 😄😫. The code reads in the perambulators, the VdaggerV (thanks for your help in the meeting!). It starts the contractions:
But then of course, as we all had guessed, it did not get any further:
I noticed that I did not ask for any memory explicitly, so I have no idea how much I did get. |
I guess I got the full 90 GB; I logged into the node and checked. We likely waste a bunch of memory for the reduction at the end; I wanted to switch that over to atomics. But even then, the per-thread cache is just an idiotic design for machines with this core/memory ratio. I see these options; none of them are great:
|
Try the large memory nodes and see what happens to begin with. |
Wasting resources is not really a problem if the calculation fits into our compute time budget |
Nope, 180 GB of memory are not enough with four threads either 🤷‍♂️. I have just implemented the atomics stuff, but that will not buy much. We have a result variable for every correlator and every time slice. Before, we also had another one per thread, so basically it has been this:
With 96 time slices and four threads the change is large, and that is only part of the memory usage. With roughly 250k correlators that is 4 MB per time slice or thread. Hmm, so I have probably only saved a couple of MB with that change, I guess. |
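For orientation, a rough back-of-the-envelope, assuming one `std::complex<double>` (16 bytes) accumulator per correlator, which appears to be what the 4 MB figure above refers to:

$$
2.5 \times 10^{5}\ \text{correlators} \times 16\,\mathrm{B} \approx 4\,\mathrm{MB}\ \text{per copy},
\qquad
96\ \text{time slices} \times 4\,\mathrm{MB} \approx 0.4\,\mathrm{GB}\ \text{of per-time-slice results}.
$$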
Besides that, I looked at it after it had already run for about 40 minutes, and it was still on the first time slice. Start and end are 14:24:49 and 15:33:58, so it ran for over an hour before filling up the memory. But each thread did not even finish its first time slice combination out of 528. So, optimistically assuming that it missed finishing by just a few GB of memory, we would have an hour for 4 combinations. That would be a fun 132 hours (5.5 days) for the whole configuration: completely infeasible. But we are using just 4 of the 48 cores that the machine has, so if we managed to use all of them without additional overhead it would be ×12 faster. JUWELS does not have a walltime limit, but I guess we have the usual 24 h with quota and 6 h without quota? So we need to achieve ×5.5 to fit into the walltime and potentially have ×12 in resources available. Perhaps we can clear the caches more often; then it might fit, and the recomputation might not kill us. |
I have now added a little change which will clear the QQ cache right after each trace. I have the feeling that this will not really change anything, because everything gets reused, and therefore the stuff that is produced in each trace for a given time slice combination should just be reused completely; the second trace should not add anything to it. I have submitted that on a large memory node on JUWELS, but I expect it to crash. And even if it helps, it still will not allow us to fit 48 threads in there. We need a different strategy for parallelization; this just does not scale. |
Yes, the time limits are applied via QOS at the submit stage. |
Perhaps you could try to run on QBIG just to see the actual memory load as a test (using the devel QOS). The other thing that might be worth doing is to actually fix the memory load output that Markus started; that way you will be able to predict exactly how much memory you require.
and the relevant part of the cache is cleared anyway once a particular time slice pair has been computed, correct? |
There are relatively straightforward memory savings that one can obtain, although only for small numbers of concurrent threads. As the correlators are computed cumulatively on paired time slices, one might load perambulator columns and VdaggerV on given time slices on the fly, only when required, and then discard them. (This needs some synchronisation between threads, as there will be common elements; consider the pairs (0,0), (0,1), (0,2), (0,3), for example, when running with four threads.) Even with 48 threads, this part of the memory load would be reduced markedly, as one would have only a subset of the VdaggerV and perambulator data in memory at any one time. Of course, this will mean that there will be some overhead, although the I/O time for perams and VdaggerV is quite acceptable and there would be at most a couple of hundred read operations (rather than one big one). It does mean, however, that functions like |
Or rather, VdaggerV and perambulator submatrices corresponding to sets of blocks would need to be read on the fly. |
At the level of diagrams, one could use nested parallelism, threading over the correlator requests. |
In fact, how long are the lists of correlator requests on average? These should be huge, right? |
I think this might solve all of our problems, it might even lead to better load balancing. But only if these lists are actually as long as I naively assume that they might be. In practice, for systems with many momenta, one would use few threads over block pairs, but many threads over correlator requests, while for systems with few or a single momentum, one would thread over the block pairs only instead (as done in the past). |
I was thinking about parallelizing over the correlators as well. There are cheap diagrams and there are expensive ones, but in each case there are lots of them:
And I would like this parallelization direction because it would mean that memory consumption would stay constant with the number of threads. The cache would need to be synchronized, but with |
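A minimal sketch of what threading over the correlator requests with one shared, lock-guarded cache could look like; `QQCache`, `Correlator` and the builder callback are made-up names for illustration, not the actual interfaces of the contraction code:

```cpp
// Hypothetical sketch: thread over the (long) list of correlator requests for
// a fixed time slice pair, sharing a single QQ cache guarded by a mutex, so
// memory stays roughly constant in the number of threads.
#include <complex>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <vector>

using Correlator = std::vector<std::complex<double>>;

struct QQCache {
  std::map<std::string, Correlator> data;
  std::mutex guard;

  // Compute-or-fetch under a simple lock; a real implementation would rather
  // use a shared (reader/writer) lock or build outside the critical section.
  template <typename Builder>
  Correlator const &get(std::string const &key, Builder &&build) {
    std::lock_guard<std::mutex> lock(guard);
    auto it = data.find(key);
    if (it == data.end()) {
      it = data.emplace(key, build()).first;
    }
    return it->second;  // std::map references stay valid under later inserts
  }
};

void contract_block_pair(std::vector<std::string> const &requests, QQCache &cache) {
#pragma omp parallel for schedule(dynamic)
  for (std::size_t i = 0; i < requests.size(); ++i) {
    auto const &qq = cache.get(requests[i], [] { return Correlator(96); });
    (void)qq;  // ... contract qq and accumulate into the result of requests[i] ...
  }
}
```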
Shoot, I just realized that there is no |
I did run the version with the Q2Q0 optimization with a single thread and a single time slice combination on JUWELS. So far only the version in the

It was built to keep the caches, so the memory load did not exceed the 90 GB within the two hours. That does not necessarily mean that it would not do so later on, but at least there is some hope that the 24-hour jobs in the

We can get a speed-up of up to 48 on JUWELS, but there are 528 time slice combinations to run. This gives us a lower bound of 22 hours per configuration. Fortunately we can run the time slice combinations individually, and it is just the IO that we would have to re-do every single time. We could also tweak it such that it does not load everything but only what it needs. This way we can split the work into more, smaller jobs at the cost of having to merge and accumulate the HDF5 files that come out.

I will watch the jobs on JUWELS and at the same time proceed with issue #111 to realize that speedup. |
I'll close this ticket because running it on JUWELS now works in principle; it is just too slow, and we have #111 for that issue. |
For lack of a better place I am putting this here. I am trying to run the contractions for cA2.09.48 on JUWELS. I got it compiled and I think that I have set it up correctly in the input file. The paths might be wrong, as the perambulators are not in one subdirectory per random vector. So I get the following here: