Fk refactor #1936
I guess that if you run 1 replica on a GPU it would be faster, right?
(Otherwise it makes no sense that running 1 replica would be slower than running 100.)
n3fit/src/n3fit/layers/DY.py (outdated)
```python
# the masked convolution removes the batch dimension
ret = op.transpose(self.operation(results))
return op.batchit(ret)
self.compute_observable = compute_observable
```
Same comment as before: having these functions as module-level functions would be great.
I would also expect a speed-up and memory reduction since they will be shared by different observables... You might want to put a @tf.function decorator on them.
... unless the speed-up is coming from a memory trade-off by having these functions be observable-specific? But I would hope not...
If that were the case you can still compile them when you attach them to the given layer by doing
`self.compute_observable = tf.function(compute_observable)`
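For concreteness, a minimal sketch of that suggestion, assuming a module-level contraction shared by the observable layers; the names, shapes, the toy Keras layer and the einsum string are illustrative, not the actual n3fit code:

```python
import tensorflow as tf


@tf.function
def compute_dy_observable_many_replicas(pdf, masked_fk):
    # pdf: (batch, replicas, x, flavour); masked_fk: (ndata, x, f, y, g)
    return tf.einsum("brxf,nxfyg,bryg->brn", pdf, masked_fk, pdf)


class ToyDY(tf.keras.layers.Layer):
    def build(self, input_shape):
        # attach the shared, module-level (and already compiled) function
        self.compute_observable = compute_dy_observable_many_replicas
        super().build(input_shape)

    def call(self, inputs):
        pdf, masked_fk = inputs
        return self.compute_observable(pdf, masked_fk)
```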
I think having a decorator or not shouldn't matter, as they're already used in a call which should get compiled, but I can do a test.
I don't think the speedup has to do with them being observable-specific, at least not intentionally.
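A quick illustration of that point (toy names, not the n3fit layers): a plain Python helper called from inside a tf.function-compiled call is traced into the same graph, so it does not strictly need its own decorator.

```python
import tensorflow as tf


def helper(x):
    # no @tf.function here
    return tf.reduce_sum(x * x)


class ToyLayer(tf.keras.layers.Layer):
    @tf.function
    def call(self, x):
        # helper is traced as part of this compiled graph anyway
        return helper(x)


print(ToyLayer()(tf.constant([1.0, 2.0, 3.0])))  # tf.Tensor(14.0, shape=(), dtype=float32)
```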
n3fit/src/n3fit/layers/observable.py (outdated)
```
@@ -42,36 +43,104 @@ def __init__(self, fktable_data, fktable_arr, operation_name, nfl=14, **kwargs):
        super(MetaLayer, self).__init__(**kwargs)

        self.nfl = nfl
        self.num_replicas = None  # set in build
        self.compute_observable = None  # set in build
```
Add a comment about compute_observable being a function with signature (pdf, masked_pdf) that needs to be overwritten by any children of this class.
Actually, maybe it makes sense to make compute_observable an abstract method, and then in DIS.py and DY.py the choice of which (outside) function to use becomes:

```python
def compute_observable(self, pdf, fk):
    if self._one_replica:
        return _compute_dis_observable_one_replica(pdf, fk)
    return _compute_dis_observable_many_replica(pdf, fk)
```
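A hedged, self-contained sketch of how that abstract-method pattern could look; the class and helper names below are illustrative stand-ins, not the actual n3fit hierarchy:

```python
from abc import ABC, abstractmethod

import tensorflow as tf


def _compute_dis_observable_many_replica(pdf, fk):
    # pdf: (batch, replicas, x, flavour); fk: (ndata, x, flavour)
    return tf.einsum("nxf,brxf->brn", fk, pdf)


def _compute_dis_observable_one_replica(pdf, fk):
    # specialised 1-replica path: drop the batch/replica dims, contract, add them back
    res = tf.tensordot(fk, pdf[0, 0], axes=[(1, 2), (0, 1)])  # -> (ndata,)
    return res[None, None, :]


class ObservableBase(ABC):
    @abstractmethod
    def compute_observable(self, pdf, fk):
        """Signature (pdf, fk); must be overwritten by the DIS/DY children."""


class DISLike(ObservableBase):
    def __init__(self, one_replica):
        self._one_replica = one_replica

    def compute_observable(self, pdf, fk):
        if self._one_replica:
            return _compute_dis_observable_one_replica(pdf, fk)
        return _compute_dis_observable_many_replica(pdf, fk)
```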
I added the comment; I wanted to avoid these if statements in the call. It probably doesn't matter, but I thought it looked cleaner. I can make it an abstract method if you prefer.
Thanks for the comments, I addressed them all. I will do some timings after these changes, with vs without a decorator on the compute_observables, as well as 1 replica on the GPU.
I have no idea why, but the 1 replica case became significantly faster: 92s (vs 118 before, both on CPU). 1 replica on the GPU takes 65 seconds, indeed faster than 100 of course, but the scaling is great.
It might be able to compile the functions differently now? Maybe playing with the ...
Indeed, a factor of 100 in exchange for a factor of 1.5!!!!
It looks good, thank you!
I have just one question... why is masking the PDF instead of the fktable faster in the case of one replica? Could this also have been a fluke of the CPU profiling?
It might be worthwhile to test again, because part of the complexity here comes from that. If that limitation is lifted this would be super clear and elegant!!
n3fit/src/n3fit/layers/DIS.py (outdated)
```python
    Same operations as above but a specialized implementation that is more efficient for 1 replica,
    masking the PDF rather than the fk table.
    """
    # TODO: check if that is actually true
```
Since you are already testing the timings, and we are already free from the yoke of 4.0.9 (which was supposed to be 7... so only two versions out, not bad)... why don't you remove the conditional and try to use the "multireplica" version in all of them?
Good point, I just tested using the many-replica version in all the observables. On the GPU it's actually faster, 53 seconds. On the CPU it's way slower, 330 seconds.
There are two factors here: one is einsum vs tensordot, the other is the order of contractions for DIS.
Masking the PDF has the downside that it cannot be precomputed, but the upside that it reduces the number of flavours. Masking the fk table only has to be done once, but it enlarges it (perhaps masking is not the best word in this case; it's expanding it into all flavours with zeroes).
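As a self-contained illustration of that trade-off (the shapes, the flavour basis and the DIS-like contraction below are purely illustrative, not the actual n3fit code), both strategies give the same observable:

```python
import numpy as np
import tensorflow as tf

nfl, nbasis, nx, ndata = 14, 9, 30, 20
basis = np.sort(np.random.choice(nfl, nbasis, replace=False))  # flavours the fktable needs
fk = tf.random.uniform((ndata, nx, nbasis))                    # DIS-like fktable
pdf = tf.random.uniform((1, 1, nx, nfl))                       # (batch, replica, x, flavour)

# Strategy 1: mask the PDF on every call -> smaller contraction (the 1-replica path)
small_pdf = tf.gather(pdf, basis, axis=3)
obs1 = tf.einsum("nxf,brxf->brn", fk, small_pdf)

# Strategy 2: pad the fktable once to all 14 flavours with zeroes -> bigger, but precomputed
scatter = np.zeros((nbasis, nfl), dtype=np.float32)
scatter[np.arange(nbasis), basis] = 1.0
padded_fk = tf.einsum("nxf,fg->nxg", fk, tf.constant(scatter))
obs2 = tf.einsum("nxg,brxg->brn", padded_fk, pdf)

np.testing.assert_allclose(obs1.numpy(), obs2.numpy(), rtol=1e-4)
```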
I've also tested the approach of masking the fk table but using tensordot, for 1 replica on the CPU:
```python
def compute_dy_observable_one_replica(pdf, masked_fk):
    pdf = pdf[0][0]  # yg
    fk_pdf = op.tensor_product(masked_fk, pdf, axes=[(3, 4), (0, 1)])  # nxfyg, yg -> nxf
    observable = op.tensor_product(fk_pdf, pdf, axes=[(1, 2), (0, 1)])  # nxf, xf -> n
    return op.batchit(op.batchit(observable))  # brn
```
This takes 177 seconds on the CPU, and again 53 on the GPU.
I don't fully understand these timings, but unfortunately we'll have to keep the branching, unless you're prepared to completely do away with CPU runs, which I don't think is the case.
Edit, for clarity: 1-replica timings in seconds (CPU \ GPU):

| | einsum | tensordot |
|---|---|---|
| mask pdf | - | 92 \ 65 |
| mask fk | 330 \ 53 | 177 \ 53 |
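For reference, a hedged sketch of an einsum formulation of the snippet above, written with plain TensorFlow in place of n3fit's op wrapper (illustrative only, not the code that produced the timings in the table):

```python
import tensorflow as tf


def compute_dy_observable_one_replica_einsum(pdf, padded_fk):
    # pdf: (batch=1, replica=1, x, flavour); padded_fk: (ndata, x, f, y, g)
    pdf = pdf[0][0]                                    # (x, f), also used as (y, g)
    observable = tf.einsum("nxfyg,xf,yg->n", padded_fk, pdf, pdf)
    return observable[None, None, :]                   # add back batch/replica dims -> (1, 1, ndata)
```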
> Masking the fk table only has to be done once, but it enlarges it (perhaps masking is not the best word in this case, it's expanding it into all flavours with zeroes).

Ah!! Ok, I was misled by the masking word. From this everything else makes sense.
Sadly we cannot avoid CPU runs, because that will still be what most people use.
Could you add basically this comment as a comment at the top of the DIS.py module, for instance, together with the version of TensorFlow that you used?
We can then revisit it in the future. The important thing is that in the 1-replica case using einsum adds a factor of 2 (and the multireplica version needs einsum). And since we already have the branching, we might as well mask the PDF for an extra factor of two.
Thank you for these checks.
Added something, is that what you meant? Perhaps I should change masked_fk to something like padded_fk?
Also did a quick check in multireplica: using the observable code as it is now (so tensordot and masking the pdf), but using einsum in multidense, it comes out at 115 seconds, so also slower.
I've rewritten it as ...
Even better. IMHO mask was ok once the clarification was added.
Oh, I thought this was tracking master. Is it fine for you if I rebase this one on master, and then the fix for 2.16 on this one?
Yep, that's fine. I rebased on master a while ago, but not recently.
Commits: "Add timing comment", "additional comment" (Co-authored-by: Juan M. Cruz-Martinez <[email protected]>)
Greetings from your nice fit 🤖!
Check the report carefully, and please buy me a ☕, or better, a GPU 😉!
The idea
It's a relatively small change, only affecting the observable layers: it changes a bit the order in which indices are contracted, and switches from a boolean mask to a float mask.
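A hedged sketch of what the boolean-mask to float-mask change amounts to (illustrative shapes, not the actual layer code): selecting flavours with a boolean mask is a gather-like op, while a 0/1 float mask turns the same selection into a matmul/einsum that can be folded into the following contraction.

```python
import tensorflow as tf

nx, nfl, nbasis = 30, 14, 9
pdf = tf.random.uniform((nx, nfl))
bool_mask = tf.constant([i < nbasis for i in range(nfl)])  # keep the first nbasis flavours

# boolean mask: gather-like op, output shape (nx, nbasis)
masked_bool = tf.boolean_mask(pdf, bool_mask, axis=1)

# float mask: a (nfl, nbasis) 0/1 matrix, so the same selection becomes a matmul
float_mask = tf.constant(
    [[1.0 if (j == i and i < nbasis) else 0.0 for j in range(nbasis)] for i in range(nfl)]
)
masked_float = pdf @ float_mask                            # also (nx, nbasis)

tf.debugging.assert_near(masked_bool, masked_float)
```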
Performance
Timings for 1000 epochs of the main runcard (NNPDF40_nnlo_as_01180_1000), on Snellius, with 100 replicas on the GPU or 1 replica on the CPU. In brackets, the GPU memory used.
Profile
The validation step will be addressed in #1855 and the gaps in #1802.