Set as optional the computation of gradient wrt data #51
Comments
This has been a topic of discussion at TII in general, so it is something of interest. How would one go about estimating only a subset of the gradient components? Would it be a user input, or is there an algorithm that automatically finds the most important directions?
Hi @renatomello, my perspective is that your question is potentially interesting, but rather complex: to automatically decide which components of the gradient to evaluate, in general, you have to know something more about your function, and encode it in assumptions. However, I believe @MatteoRobbiati's proposal to be much simpler: data are not trainable, and (usually) you never compute gradients on them in classical ML. If I am correct (@MatteoRobbiati can confirm it), and you want to keep discussing the topic of partial gradient evaluation, I'd suggest opening a dedicated issue (even better, a discussion, which has threads and so on, but they are not enabled on this repo, and an issue can always be converted to a discussion later on).
Yours is an interesting point indeed @renatomello! As pointed out by @alecandido, here I only wanted to underline that usually we don't need the gradient wrt the data. On the other hand, the current implementation I am working on for the shift rule computes the full gradient anyway, slowing down the execution. I think it is important to support differentiation wrt the data, but we should find a way to activate it only when required. Regarding computing only a subset of the gradient, as suggested, I would open a dedicated issue!
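To make the idea concrete, here is a minimal sketch of what such an opt-in could look like. All names (`psr_gradient`, `data_indices`, `wrt_data`) are illustrative assumptions, not Qiboml's actual interface; the only grounded ingredient is the standard two-point shift rule for rotation-like gates.

```python
import numpy as np


def psr_gradient(expectation, angles, data_indices=(), wrt_data=False, shift=np.pi / 2):
    """Two-point parameter-shift gradient of ``expectation(angles)``.

    ``data_indices`` marks the angles that only encode data; their components
    are skipped (left at zero) unless ``wrt_data=True``.
    """
    angles = np.asarray(angles, dtype=float)
    grad = np.zeros_like(angles)
    for i in range(angles.size):
        if i in data_indices and not wrt_data:
            continue  # skip the two extra circuit executions for this angle
        shifted = angles.copy()
        shifted[i] += shift
        plus = expectation(shifted)
        shifted[i] -= 2 * shift
        minus = expectation(shifted)
        grad[i] = 0.5 * (plus - minus)
    return grad
```

Skipping the data components here avoids the two shifted executions per data-encoding angle altogether, rather than computing them and throwing them away.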
Also, I had a discussion with @BrunoLiegiBastonLiegi a couple of days ago, because we are trying to figure out a smarter way to compute more complex differentiations. We came up with the idea of customizing some of Qibo's internals; I am already doing something similar, although in a very rudimentary way, in the PSR pull request. This is a big discussion which I think can benefit from all of your points of view. If you agree, please join our technical meeting next Tuesday. Or, we can also arrange a dedicated meeting.
About the encoding function, that's just a classical function, and it will be part of your classical model within the hybrid architecture. Still, even if it were the exact same parameter in all 3 gates, you would have to compute the partial derivatives wrt all 3 gate parameters, and then sum them together (there is no shortcut, unless you can compute the function analytically, which would make the quantum part, and Qiboml, irrelevant for that application). Instead, about the distinction between trainable and non-trainable parameters, I have my own alternative solution in mind. In any case, most often you just upload trainable and non-trainable parameters in different gates (according to what I've personally seen). So, despite being possibly (but very mildly) incomplete, isn't it enough to just use the `trainable` flag on the gates?
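To spell out the "sum together" step (notation is mine, not from the thread): if the same classical weight $w$ enters three gate angles through an encoding function, $\theta_k = f_k(x; w)$ for $k = 1, 2, 3$, the chain rule gives

$$
\frac{\partial \langle O \rangle}{\partial w}
= \sum_{k=1}^{3} \frac{\partial \langle O \rangle}{\partial \theta_k}\,
\frac{\partial f_k(x; w)}{\partial w} ,
$$

so each quantum partial derivative $\partial \langle O \rangle / \partial \theta_k$ still has to be evaluated (e.g. with the shift rule), and only the classical factors $\partial f_k / \partial w$ come from the classical side of the model.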
Thanks for pinging me on this.
In any QML algorithm I have worked on, data and parameters were combined together to construct the rotation angles.
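For instance (illustrative notation, not necessarily the exact expression meant here), an affine combination of a datum $x$ with trainable weights,

$$
\theta_k = w_k\, x + b_k ,
$$

where $w_k$ and $b_k$ are trainable while $x$ is not.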
I wouldn't say this. There are cases in which I just needed to sum two or three contributions. PS: by "encoding function" I meant the aforementioned case of an angle constructed by combining data and trainable weights.
Ah no, but the point is slightly different: what I meant is whether you have trainable parameters entering that same angle, on top of the data.
So, the original question arose because of the need for the gradients wrt all of the layer's inputs, on top of the layer's own parameters. These are in general required for a composable layer, but we'd like to save them when we explicitly know the gradients are not needed, i.e. when the inputs are directly used as data, for which we don't need to take any gradient (not for training; if needed at all, for specific applications, it will be supported differently at a later time). So, to summarize the situation, there are two structures involved (a classical layer and a quantum layer), each with its own "connectors" (i.e. interface, for external composition).
Essentially, the circuit gradients are only needed wrt trainable parameters, i.e. actual parameters or combinations containing at least one parameter (possibly defined in previous layers). So, the situation is something like the following:

```mermaid
flowchart LR
    data
    subgraph cl["classical layer"]
        dense
        data-through[" "]:::hidden
    end
    subgraph qlwrap["quantum layer"]
        direction TB
        qlparams:::hidden
        subgraph qlinputs
            qldata:::hidden
            qlcombo:::hidden
        end
        subgraph ql
            direction TB
            circuit
        end
    end
    parameters --> qlparams -- "parameters" --> circuit
    data -- "data" --> data-through -- "data" --> qldata -- "data" --> circuit
    data --> cl
    qlinputs -- "combinations" --> circuit
    dense -- "outputs" --> qlinputs
    classDef hidden display: none;
    class qlinputs,ql hidden;
```
In principle, we could treat all inputs and parameters as trainable, compute the gradient wrt all of them, and then discard the components relative to data (since data are obviously not optimized); a minimal sketch of that strategy is shown right after this list. Thus, what should be done is more or less the following:

- Any classical manipulation of parameters that will enter the rotation angles will be encoded in classical layers (no separately traced objects on the quantum side).
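As referenced above, here is a minimal sketch of the "compute everything, then discard" strategy, assuming a Jacobian of the circuit outputs ordered with the trainable parameters first and the data-encoding angles last (names and ordering are illustrative assumptions, not Qiboml's layout):

```python
import numpy as np


def split_jacobian(full_jacobian, n_params):
    """Split a (n_outputs, n_params + n_data) Jacobian into its two blocks.

    Only the parameter block is needed by the optimizer; the data block is the
    part we would ideally avoid computing in the first place.
    """
    full_jacobian = np.asarray(full_jacobian)
    grad_params = full_jacobian[:, :n_params]  # kept for training
    grad_data = full_jacobian[:, n_params:]    # discarded unless explicitly requested
    return grad_params, grad_data
```

Discarding after the fact keeps the interface simple, but it does not save the shifted circuit executions, which is exactly the cost this issue is about.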
There are (many) situations in which the gradient of the loss wrt the data is not needed. Computing it, on the other hand, can be very costly if the data dimension or the number of times the data appear in the model increases. We should set this as an option when dealing with the PSR, or allow the users to specify whether they want the full gradient or only a part of it.
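As a rough cost estimate (my notation, assuming the standard two-point shift rule with two extra circuit executions per shifted angle and a single expectation value), skipping the purely data-encoding angles reduces the number of circuit executions per gradient from

$$
N_{\text{eval}} = 2\,(N_{\theta} + N_{x})
\qquad\text{to}\qquad
N_{\text{eval}} = 2\,N_{\theta},
$$

where $N_{\theta}$ counts the trainable angles and $N_{x}$ the angles that only encode data; the saving grows linearly with the data dimension and with how many times each datum is re-uploaded.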