Multi dense layer #1905
Conversation
Greetings from your nice fit 🤖! Check the report carefully, and please buy me a ☕, or better, a GPU 😉!
Please update the reference fit for the fitbot report with NNBOT-1c97e2a73-2024-02-16
tbh, since we have regression tests for "exactly the same fit", we might want to keep the fitbot reference fixed between tags (unless big changes happen) so as to show the cumulative change... We should discuss this point in AMS.
Putting it that way, there are two use-cases for the fitbot:
Given that to assess point one we will most likely end up looking at a global fit anyway, using the latest published version as a reference, I think the fitbot provides more value for point 2. We could of course also ask the bot for two reports (assuming that doesn't break the GitHub Actions time constraints)... We can indeed discuss in AMS and leave it for now.
tbh, this is key. There was a time when we were hitting the time constraint (which is 6 hours, I think?) and the bot was the maximum we could get away with under that time. Now it takes about one hour, so we can safely add more reports.
I don't have further comments in addition to those that have been raised above. It looks great to me!
Great, thanks for the reviews everyone, so we leave the fitbot as-is for now and I can merge this? Btw, about the redo-regressions workflow, it's not working perfectly: it creates a new commit but doesn't trigger the other tests again, so if it's the final commit the PR cannot be merged. (Here I put the label on while the tests were just starting, and they do continue, but you have to go to the previous commit to see them.) Not sure what the best solution is, but the simplest, which I did here, is just to make another (trivial) commit.
I think it is fine to force merge the PR provided the previous checks all passed other than the regression. By the time the regression label is used the PRs should be well tested. |
The bot probably doesn't have the right privileges, similar to people who are not members of the NNPDF GitHub organisation. A solution would probably be to create a token with those privileges and allow the GitHub action to use that when pushing.
No, I think GitHub Actions cannot trigger more actions by design.
Yes, because it doesn't have the permissions. I think something like this fine-grained-personal-access-token might solve it. That may make things riskier if we're not careful.
I think we can simply add a
Maybe it has changed or I'm misremembering, but I think at some point it was simply not possible because it could easily cause an infinite recursion. |
I see, perhaps you're right, I didn't look that far into it. |
Main idea: MultiDense
As discussed already in several places, the point of this PR is to merge the multiple replicas in the tightest way possible, which is at the level of the `Dense` layer, here implemented as a `MultiDense` layer.
The essence is this line and the lines around it. We extend the layer's weights from shape `(in_units, out_units)` to shape `(replicas, in_units, out_units)`, or `(r, f, g)` for short.
The initial input at the first layer does not have a replica axis; its shape is `(batch, n_gridpoints, features)`. In this case the linked lines become `einsum("bnf, rfg -> brng")`.
Every layer thereafter will have a replica axis. This simply adds an `"r"` to the first term in the einsum, to give `einsum("brnf, rfg -> brng")`. This applies the weights of replica i to the ith component of the input, that is, to the previous layer's output corresponding to the same replica i. So it acts identically to the previous case, just more optimized.
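A minimal sketch of the two contractions, assuming TensorFlow; all sizes below are made up for illustration, not the PR's defaults:

```python
import tensorflow as tf

# Illustrative sizes: r replicas, batch b, n grid points,
# f input units, g output units (assumptions for this example).
r, b, n, f, g = 100, 1, 50, 8, 25
w1 = tf.random.normal((r, f, g))  # first layer weights, (r, f, g)
w2 = tf.random.normal((r, g, g))  # next layer weights, (r, g, g)

# First layer: the input has no replica axis yet.
x0 = tf.random.normal((b, n, f))                 # (b, n, f)
h = tf.einsum("bnf, rfg -> brng", x0, w1)        # -> (b, r, n, g)

# Later layers: the input carries the replica axis, and replica i's
# weights act only on replica i's slice of the input.
out = tf.einsum("brnf, rfg -> brng", h, w2)      # -> (b, r, n, g)
```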
Weight initialization: MultiInitializer
After all the refactorings before this, it is quite simple to initialize the weights in the same manner as is done now. A list of seeds is given, one per replica, along with an initializer which has a seed of its own, to which the per-replica seeds are added (so we can differentiate the different layers). A custom `MultiInitializer` class takes care of resetting the initializer to a given replica's seed, creating that replica's weights, and stacking everything into a single weight tensor.
Note that many initializers' statistics depend on the shape of the input, so just using a single initializer out of the box will not only give different results because it is seeded differently, it will actually be statistically different.
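A minimal sketch of that idea, assuming a Keras-style initializer; the names `initializer_class`, `base_seed` and `replica_seeds` are illustrative, not necessarily the PR's exact API:

```python
import tensorflow as tf

class MultiInitializer(tf.keras.initializers.Initializer):
    """Sketch: build each replica's weights with its own seed, then
    stack them along a leading replica axis into one tensor."""

    def __init__(self, initializer_class, base_seed, replica_seeds):
        # base_seed differentiates layers; replica_seeds differentiate replicas
        self.initializer_class = initializer_class
        self.seeds = [base_seed + s for s in replica_seeds]

    def __call__(self, shape, dtype=None):
        # `shape` is the single-replica shape, e.g. (in_units, out_units),
        # so shape-dependent statistics (fan-in/fan-out) match those of a
        # per-replica Dense layer exactly.
        per_replica = [
            self.initializer_class(seed=seed)(shape, dtype=dtype)
            for seed in self.seeds
        ]
        return tf.stack(per_replica, axis=0)  # (replicas, in_units, out_units)

# Usage sketch:
init = MultiInitializer(tf.keras.initializers.GlorotUniform,
                        base_seed=7, replica_seeds=[0, 1, 2])
weights = init((8, 25))  # -> shape (3, 8, 25)
```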
Dropout
Naively applying dropout to multi-replica outputs will not consistently mask an equal fraction of each replica.
A simple and sufficient solution is to define dropout without the replica axis, and just broadcast to the replica dimension.
This is actually sort of supported already: you can subclass the `Dropout` layer and override the method `_get_noise_shape`, putting a `None` where you want it to broadcast.
Note that while this would turn off the same components in every replica, there is no meaning or relation to the order of the weights, so that should be completely fine.
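For illustration (the update below explains this ended up being unnecessary), the same broadcast can be obtained with the public `noise_shape` argument, where in Keras a `1` broadcasts an axis and `None` keeps the input's own dimension; the shapes here are assumed:

```python
import tensorflow as tf

# Input assumed to be (batch, replicas, n_gridpoints, features);
# a 1 on the replica axis shares one dropout mask across all replicas.
dropout = tf.keras.layers.Dropout(rate=0.2, noise_shape=(None, 1, None, None))

x = tf.random.normal((1, 100, 50, 25))
y = dropout(x, training=True)  # identical components zeroed in every replica
```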
Update: Actually, this is not necessary at all. I thought dropout always sets a fixed fraction to zero, but it actually works independently per element, so it is completely fine to use the standard dropout.
Integration
I'm not sure what the best way of integrating this into the existing framework is. What I've done now is to create an additional `layer_type`, `"multi_dense"`, that will have to be specified in the runcard to enable this. Previous behaviour with both `layer_type="dense"` and `layer_type="dense_per_flavour"` should be unaffected; the overhead to keep it like that is manageable.
The upside of course is that if later changes become too complicated with this layer, you can always go back to the standard one.
The downside though is that it creates yet another code path, and everything will have to be tested separately.
Alternatively it could just replace the current `"dense"` layer type entirely; not sure if there is a nice middle ground.
Update: After discussing briefly with Roy, we agreed it's not necessary to keep the old dense layer. Later I saw that it actually kind of is, as it is used under the hood in `"dense_per_flavour"` as well. So I have renamed that to `"single_dense"`, and the new layer here to just `"dense"`.
Tests
I have two unit tests: one shows that weight initialization is identical to standard dense layers. The second shows that the output on a test input is the same, up to what I think are round-off errors.
Currently the CI is passing almost completely, with the only exception of a single test in Python 3.11, a regression test, where one of the elements has a relative difference of 0.015, which is bigger than the tolerance of 0.002.
I assume this is just an accumulation of round-off differences; I have no idea what else it could be.
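A sketch of the second check, assuming TensorFlow; the sizes and seeds are illustrative, not the PR's actual test code:

```python
import numpy as np
import tensorflow as tf

x = tf.random.normal((1, 50, 8))  # (batch, n_gridpoints, features)

# One standard Dense layer per replica, each seeded separately.
outputs, kernels = [], []
for seed in (0, 1, 2):
    layer = tf.keras.layers.Dense(
        25, use_bias=False,
        kernel_initializer=tf.keras.initializers.GlorotUniform(seed=seed),
    )
    outputs.append(layer(x))      # (b, n, g) for this replica
    kernels.append(layer.kernel)  # (f, g)

# The einsum-based multi-replica contraction over stacked weights.
weights = tf.stack(kernels, axis=0)                # (r, f, g)
multi = tf.einsum("bnf, rfg -> brng", x, weights)  # (b, r, n, g)

# Both paths should agree up to round-off.
np.testing.assert_allclose(
    multi.numpy(), tf.stack(outputs, axis=1).numpy(), rtol=1e-5
)
```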
Comparison with new baseline: https://vp.nnpdf.science/FGwCRitnQhmqBYBjeQWS7Q==/
Timings
I have done some timing tests on the main runcard (`NNPDF40_nnlo_as_01180_1000`), on Snellius, with 100 replicas on the GPU or 1 replica on the CPU. For completeness I'm also comparing to an earlier PR which made a big difference to performance, and the state of master just before that was merged. I still need to run the current master on the GPU.
Status:
I need to do full fit comparisons; apart from that, it's ready for review.