Multi dense layer #1905

Merged
merged 23 commits into from
Feb 19, 2024

Collaborator

@APJansen APJansen commented Jan 9, 2024

Main idea: MultiDense

As discussed already in several places, the point of this PR is to merge the multiple replicas in the tightest way possible, which is at the level of the Dense layer, here implemented as a MultiDense layer.

The essence is this line and the lines around it. We extend the layer's weights from shape (in_units, out_units) to shape (replicas, in_units, out_units), or (r, f, g) for short.
The initial input at the first layer does not have a replica axis, its shape is (batch, n_gridpoints, features).
In this case the linked lines become einsum("bnf, rfg -> brng").
Every layer thereafter will have a replica axis. This simply adds an "r" to the first term in the einsum, to give einsum("brnf, rfg -> brng"). This applies the weights of replica i to the ith component of the input, that is, to the previous layer's output for the same replica i. So it acts identically to the previous case, just more optimized.
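To make the two einsum expressions concrete, here is a minimal NumPy sketch (not the PR's TensorFlow code). It omits bias and activation, and the sizes and variable names are arbitrary assumptions chosen for illustration. It checks that the stacked contraction matches applying each replica's weight matrix separately.

```python
import numpy as np

rng = np.random.default_rng(0)
replicas, in_units, out_units = 3, 4, 5   # r, f, g (illustrative sizes)
batch, n_gridpoints = 2, 7                # b, n

# One weight matrix per replica, stacked along a leading replica axis: (r, f, g).
kernel = rng.normal(size=(replicas, in_units, out_units))

# First layer: the input has no replica axis yet, shape (b, n, f).
x = rng.normal(size=(batch, n_gridpoints, in_units))
first = np.einsum("bnf,rfg->brng", x, kernel)        # shape (b, r, n, g)

# Later layers: the input already carries a replica axis, shape (b, r, n, f).
kernel2 = rng.normal(size=(replicas, out_units, out_units))
later = np.einsum("brnf,rfg->brng", first, kernel2)  # shape (b, r, n, g)

# Reference: apply each replica's own weights separately, then stack on axis 1.
first_ref = np.stack([x @ kernel[i] for i in range(replicas)], axis=1)
later_ref = np.stack([first[:, i] @ kernel2[i] for i in range(replicas)], axis=1)

print(np.allclose(first, first_ref), np.allclose(later, later_ref))
```

The replica index r appears in both operands and in the output, so it is never summed over: replica i's weights only ever see replica i's activations.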

Weight initialization: MultiInitializer

After all the refactorings before this, it is quite simple to initialize the weights in the same manner as is done now. A list of seeds is given, one per replica, along with an initializer that has a seed of its own, to which the per-replica seeds are added (so that different layers get different weights). A custom MultiInitializer class takes care of resetting the initializer to a given replica's seed, creating that replica's weights, and stacking everything into a single weight tensor.

Note that many initializers' statistics depend on the shape they are given, so using a single initializer out of the box on the full (replicas, in_units, out_units) shape will not only give different results because it is seeded differently, it will actually be statistically different.
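The seeding scheme described above can be sketched as follows. This is an illustration in NumPy, not the actual MultiInitializer class: the helper names `glorot_uniform` and `multi_init`, the seed values, and the choice of a Glorot-uniform distribution are all assumptions. The point is that each replica's kernel is drawn for the 2D shape (in_units, out_units) with seed `base_seed + replica_seed`, and only then stacked.

```python
import numpy as np

def glorot_uniform(shape, seed):
    """Glorot-uniform weights for a single (in_units, out_units) kernel."""
    fan_in, fan_out = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.default_rng(seed).uniform(-limit, limit, size=shape)

def multi_init(shape, base_seed, replica_seeds):
    """Stack one independently seeded kernel per replica into shape (r, f, g)."""
    return np.stack([glorot_uniform(shape, base_seed + s) for s in replica_seeds])

shape = (8, 16)            # (in_units, out_units), illustrative
base_seed = 100            # hypothetical seed of the layer's initializer
replica_seeds = [0, 1, 2]  # hypothetical per-replica seeds

kernel = multi_init(shape, base_seed, replica_seeds)
print(kernel.shape)  # (3, 8, 16)
```

Because the fans are computed from the per-replica 2D shape, each slice `kernel[i]` has the same statistics as a standalone dense kernel, which would generally not hold if a fan-based initializer were called once on the 3D shape.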

Dropout

Naively applying dropout to multi replica outputs will not consistently mask an equal fraction of each replica.

A simple and sufficient solution is to define dropout without the replica axis, and just broadcast to the replica dimension.
This is actually sort of supported already, you can subclass the Dropout layer and override the method _get_noise_shape, putting a None where you want it to broadcast.

Note that while this would turn off the same components in every replica, there is no meaning or relation to the order of the weights, so that should be completely fine.

Update: Actually, this is not necessary at all. What I thought was that dropout always sets a fixed fraction to zero, but actually it works individually per number, so it is completely fine to use the standard dropout.
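As an illustration of the per-element behaviour mentioned in the update, here is a small NumPy sketch of inverted dropout. It is not the Keras implementation; the function name, tensor shape, and rate are assumptions. Each entry is dropped independently with probability `rate`, so every replica loses roughly, but not exactly, the same fraction.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero each element independently with probability `rate`."""
    keep = rng.random(x.shape) >= rate
    # Surviving elements are rescaled so the expected value is unchanged.
    return np.where(keep, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
x = np.ones((4, 3, 50, 8))   # (batch, replicas, gridpoints, features), illustrative
y = dropout(x, rate=0.25, rng=rng)

# Fraction of zeroed entries per replica: close to `rate` for each replica.
dropped_per_replica = (y == 0).mean(axis=(0, 2, 3))
print(dropped_per_replica)
```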

Integration

I'm not sure what the best way of integrating this into the existing framework is. What I've done now is create an additional layer_type, "multi_dense", that has to be specified in the runcard to enable this. Previous behaviour with both layer_type="dense" and layer_type="dense_per_flavour" should be unaffected, and the overhead to keep it like that is manageable.

The upside of course is that if later changes become too complicated with this layer, you can always go back to the standard one.
The downside though is that it creates yet another code path, and everything will have to be tested separately.

Alternatively it could just replace the current "dense" layer type entirely, not sure if there is a nice middle ground.

Update: After discussing briefly with Roy, we agreed it's not necessary to keep the old dense layer. Later I saw that it actually is, as it is used under the hood in "dense_per_flavour" as well. So I have renamed the old layer to "single_dense", and the new layer here is just "dense".

Tests

I have two unit tests. The first shows that weight initialization is identical to that of standard dense layers. The second shows that the output on a test input is the same, up to what I think are round-off errors.

Currently the CI is passing almost completely. The only exception is a single regression test in Python 3.11, where one of the elements has a relative difference of 0.015, which is bigger than the tolerance of 0.002.
I assume this is just an accumulation of round-off differences; I have no idea what else it could be.

Comparison with new baseline: https://vp.nnpdf.science/FGwCRitnQhmqBYBjeQWS7Q==/

Timings

I have done some timing tests on the main runcard (NNPDF40_nnlo_as_01180_1000) on Snellius, with 100 replicas on the GPU or 1 replica on the CPU. For completeness, I also compare to an earlier PR that made a big difference to performance, and to the state of master just before that was merged. I still need to run the current master on the GPU.

| branch | commit hash | 1 replica | 100 replicas | diff 1 | diff 100 |
|---|---|---|---|---|---|
| master | 5eebfba | 96 | 860 | 0 | 0 |
| replica-axis-first | 8cbe0cf | 96 | 505 | -1 | 355 |
| current master | f40ddd9 | 116 | ?? | -19 | ?? |
| multi-dense-layer | d8f28ff | 112 | 304 | 3 | 201 |
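For a unit-free view of the 100-replica numbers, the ratios relative to the old master can be computed as below. This only re-uses the figures from the table; the dictionary and variable names are illustrative, and the "current master" row is left out because its 100-replica timing has not been measured yet.

```python
# 100-replica timings copied from the table above (same units as reported there).
timings_100 = {
    "master": 860,
    "replica-axis-first": 505,
    "multi-dense-layer": 304,
}

# Speedup of each branch relative to the old master.
speedup = {branch: timings_100["master"] / t for branch, t in timings_100.items()}

for branch, factor in speedup.items():
    print(f"{branch}: {factor:.2f}x")
```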

Status:

I need to do full fit comparisons, apart from that it's ready for review.

@APJansen APJansen force-pushed the parallel-prefactor branch from c650cf3 to 5453531 Compare January 9, 2024 14:42
@APJansen APJansen mentioned this pull request Jan 9, 2024
@APJansen APJansen self-assigned this Jan 9, 2024
@APJansen APJansen added n3fit Issues and PRs related to n3fit escience labels Jan 9, 2024
@APJansen APJansen mentioned this pull request Jan 10, 2024
@APJansen APJansen force-pushed the multi-dense-layer branch 2 times, most recently from e0fc2e1 to 472b674 Compare January 11, 2024 10:50
@APJansen APJansen added the run-fit-bot Starts fit bot from a PR. label Jan 11, 2024
@APJansen APJansen added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Jan 11, 2024

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

Base automatically changed from parallel-prefactor to master January 26, 2024 09:59
@APJansen APJansen force-pushed the multi-dense-layer branch 3 times, most recently from 16f8d53 to 4dca563 Compare January 26, 2024 11:35
@APJansen APJansen added run-fit-bot Starts fit bot from a PR. redo-regressions Recompute the regression data labels Feb 16, 2024

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

Member

@RoyStegeman RoyStegeman left a comment

Please update the reference fit for the fitbot report with NNBOT-1c97e2a73-2024-02-16

@scarlehoff
Member

tbh, having regression tests for "exactly the same fit" we might want to keep the fitbot fixed between tags (unless big changes happen) as to show the cumulative change...

We should discuss this point in AMS

@RoyStegeman
Member

Putting it that way there are two use-cases for the fitbot:

  1. checking the cumulative change between tags
  2. checking the impact of a single PR (which in some cases we want to be 0 or otherwise extremely small, not just statistically equivalent)

Given that to assess point one we will most likely end up looking at a global fit anyway while using the latest published version as a reference, I think the fitbot provides more value for point 2.

We could of course also ask the bot for two reports (assuming that doesn't break the github action time constraints)...

We can indeed discuss in AMS and leave it for now

@scarlehoff
Member

scarlehoff commented Feb 17, 2024

We could of course also ask the bot for two reports (assuming that doesn't break the github action time constraints).

tbh, this is key. There was a time when we were hitting the time constraint (which is 6 hours I think?) and the bot is the maximum we could get away with under that time. Now it takes about one hour so we can safely add more reports.

Member

@Radonirinaunimi Radonirinaunimi left a comment

I don't have further comments in addition to those that have been raised above. It looks great to me!

@APJansen
Collaborator Author

Great, thanks for the reviews everyone, so we leave the fitbot as is for now and I can merge this?

Btw, about the redo-regressions workflow, it's not working perfectly, in that it creates a new commit but doesn't trigger the other tests again, so if it's the final commit the PR cannot be merged. (Here I put the label on while the tests were just starting, and they do continue but you have to go to the previous commit to see them). Not sure what the best solution is, but the simplest which I did here is just to make another (trivial) commit.

@scarlehoff
Member

I think it is fine to force merge the PR provided the previous checks all passed other than the regression. By the time the regression label is used the PRs should be well tested.

@APJansen APJansen merged commit 9360153 into master Feb 19, 2024
8 checks passed
@APJansen APJansen deleted the multi-dense-layer branch February 19, 2024 10:05
@RoyStegeman
Member

The bot probably doesn't have the right privileges, similar to people who are not members of the NNPDF github organisation. A solution would probably be to create a token with those privileges and allow the github action to use that when pushing.

@scarlehoff
Member

No, I think github actions cannot trigger more actions by design.

@RoyStegeman
Member

RoyStegeman commented Feb 19, 2024

Yes because it doesn't have the permissions. I think something like this fine-grained-personal-access-token might solve it. That may make things more risky if we're not careful

@scarlehoff
Member

I think we can simply add an on_workflow_call option and do it manually, the same way that the label is added manually...

Yes because it doesn't have the permissions.

Maybe it has changed or I'm misremembering, but I think at some point it was simply not possible because it could easily cause an infinite recursion.

@RoyStegeman
Member

I see, perhaps you're right, I didn't look that far into it.

@scarlehoff scarlehoff removed the run-fit-bot Starts fit bot from a PR. label Jul 16, 2024
Labels
escience n3fit Issues and PRs related to n3fit redo-regressions Recompute the regression data
4 participants