
Set as optional the computation of gradient wrt data #51

Open
MatteoRobbiati opened this issue Nov 14, 2024 · 8 comments

Comments

@MatteoRobbiati
Contributor

There are (many) situations in which the gradient of the loss wrt data is not needed. Computing it, on the other hand, can be very costly if the data dimension or the number of times the data appear in the model increases. We should make this optional when dealing with PSR, or allow users to specify whether they want the full gradient or only a part of it.
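A minimal sketch of what making the data gradient optional could look like: a generic `expectation(angles)` callable and a user-supplied mask marking which angles are trainable (names and the mask interface are hypothetical, not the current Qiboml API). Every masked-out angle saves two circuit evaluations.

```python
import numpy as np

def psr_gradient(expectation, angles, trainable_mask, shift=np.pi / 2):
    """Parameter-shift-rule gradient of `expectation(angles)`.

    Entries of `trainable_mask` set to False (e.g. data uploads) are
    skipped, saving two circuit evaluations per skipped angle; their
    gradient entries stay at zero.
    """
    grad = np.zeros_like(angles, dtype=float)
    for i, trainable in enumerate(trainable_mask):
        if not trainable:
            continue  # gradient wrt data not requested
        shifted = np.array(angles, dtype=float)
        shifted[i] += shift
        plus = expectation(shifted)       # f(theta_i + s)
        shifted[i] -= 2 * shift
        minus = expectation(shifted)      # f(theta_i - s)
        grad[i] = (plus - minus) / (2 * np.sin(shift))
    return grad
```

With the default shift $s = \pi/2$ this reduces to the textbook rule $(f(\theta + \pi/2) - f(\theta - \pi/2))/2$, which is exact for expectation values that are trigonometric in the angle.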

@renatomello
Contributor

renatomello commented Nov 15, 2024

This is something that has been a topic of discussion at TII in general, so it is of interest. How would one go about estimating only a subset of the gradient components? Would it be a user input, or is there an algorithm that automatically finds the most important directions?

@alecandido
Member

Hi @renatomello, my perspective is that your question is potentially interesting but rather complex: to automatically decide on the evaluation of an incomplete gradient, in general, you have to know something more about your function and encode it in assumptions.
E.g. you may approximate to second order around a point and update the curvature less often than the gradient, using that information to prune the flatter directions in advance. Or you can use some kind of momentum technique, using the gradient at a previous step to decide which parameters to exclude, and reintroduce some of them randomly.
This game is made of assumptions and computational trade-offs, and there may already be research in this direction (given the worldwide focus on AI, I'd argue there is, almost for sure... probably a lot...).

However, I believe @MatteoRobbiati's proposal to be much simpler: data are not trainable, and (usually) you never compute gradients on them in classical ML.
Qiboml may simply be lacking this distinction, so this is just a technical issue.

If I am correct (@MatteoRobbiati can confirm it), and you want to keep discussing the topic of gradient's partial evaluation, I'd suggest opening a dedicated issue (even better, a discussion, which has threads and so on - but they are not enabled on this repo, and an issue can always be converted to a discussion later on).

@MatteoRobbiati
Contributor Author

Yours is an interesting point indeed @renatomello!

As pointed out by @alecandido, here I only wanted to underline that we usually don't need the gradient wrt data. On the other hand, the current implementation I am working on for the shift rule computes the full gradient anyway, slowing down execution. I think it is important to support differentiation wrt data, but we should find a way to activate it only when required.

Regarding the subset of gradient, as suggested, I would open a dedicated issue!

@MatteoRobbiati
Contributor Author

MatteoRobbiati commented Nov 15, 2024

Also, I had a discussion with @BrunoLiegiBastonLiegi a couple of days ago, because we are trying to figure out a smarter way to compute more complex differentiations.

We came up with the idea of customizing Qibo's ParametrizedGate in a way similar to what Mike did with the Parameter class (in particular, we mentioned the possibility of constructing a DifferentiableParametrizedGate(ParametrizedGate) here in Qiboml). Namely, it would be good if every gate contained all the necessary information about the trainable parameters and features that affect it, in the form of a generic function. For example, it can happen that a parameter $a$ is uploaded in 3 different gates, and in each gate in a different way (three different combinations of data and features). If so, the hardware-compatible differentiation rule has to combine three PSRs, each adjusted by multiplying its output by the partial derivative of the "encoding function" wrt $a$.
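The combination described above (one PSR per gate, each weighted by the partial derivative of its encoding function wrt $a$) can be sketched as follows; the three encodings, their partials, and the `expectation` callable are illustrative assumptions, not Qiboml API:

```python
import numpy as np

def total_derivative(expectation, a, x):
    """Chain-rule derivative wrt `a` when `a` enters three gates through
    three different encoding functions f_k(a, x) of data and parameter."""
    # hypothetical angle encodings and their analytic partials wrt `a`
    encodings = [lambda a, x: a * x, lambda a, x: a + x, lambda a, x: 2 * a * x]
    partials  = [lambda a, x: x,     lambda a, x: 1.0,   lambda a, x: 2 * x]

    def angles(a):
        return [f(a, x) for f in encodings]

    total = 0.0
    for k, df in enumerate(partials):
        # PSR for gate k: shift only the k-th angle
        plus, minus = angles(a), angles(a)
        plus[k] += np.pi / 2
        minus[k] -= np.pi / 2
        dE_dtheta = (expectation(plus) - expectation(minus)) / 2
        total += dE_dtheta * df(a, x)  # chain-rule weight d f_k / d a
    return total
```

The sum over $k$ is exactly the chain rule $\partial_a \langle O \rangle = \sum_k (\partial \langle O \rangle / \partial \theta_k)\,(\partial f_k / \partial a)$, with each $\partial \langle O \rangle / \partial \theta_k$ estimated hardware-compatibly via PSR.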

I am already doing something similar, albeit rudimentary, in the PSR pull request.

This is a big discussion which I think can benefit from all of your points of view. If you agree, please join our technical meeting next Tuesday. Or we can arrange a dedicated meeting.

@alecandido
Member

About the encoding function: that's just a classical function, and it will be part of your classical model within the hybrid architecture. Still, even if it were the exact same parameter in all 3 gates, you would have to compute the partial derivatives wrt all 3 parameters and then sum them together (there is no shortcut, unless you can compute the function analytically, which would make the quantum part, and Qiboml, irrelevant for that application).
This is to say that you need no handling of classical "encoding functions" in Qiboml: the integration with the ML frameworks (i.e. exporting gradients) was already all the support required.

Instead, about the distinction between trainable and non-trainable parameters, I have my alternative solution, but qibo-core is still on hold. That solution would go in the opposite direction: instead of adding classes wrapping constants and then surveying the content of the Circuit, the idea was to encode all the parameters in the Circuit itself, separating the trainable ones (possibly with an array of indices).
It is described in qiboteam/qibo-core#22, and I already discussed it with @BrunoLiegiBastonLiegi (see also the issue itself).
The solution should be good enough for all cases: whenever you trace, it is the ML optimization framework that distinguishes data from parameters (since the concept is present at high level, and it will trace the parameters during the whole computation). Whenever you instead estimate the gradient with multiple evaluations (PSR-like), you immediately know which are the trainable parameters, without traversing the Circuit.
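A rough sketch of that flat-array idea, assumed from the description above (not the actual qibo-core interface): all angles live in one array on the circuit, with the trainable ones singled out by an index array, so a PSR-like method knows immediately which entries to shift without traversing the gates.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CircuitParameters:
    """Hypothetical container: all circuit angles in one flat array,
    with trainable entries identified by an index array."""
    values: np.ndarray     # all angles, data and parameters alike
    trainable: np.ndarray  # indices of the trainable entries

    def trainable_values(self):
        # the only entries a PSR-like method needs to shift
        return self.values[self.trainable]

    def update_trainable(self, new):
        # optimizer step touches trainable entries only
        self.values[self.trainable] = new
```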

In any case, most often, you just upload trainable and non-trainable parameters in different gates (according to what I've personally seen). So, despite being possibly (but very mildly) incomplete, isn't it enough to just use the ParametrizedGate.trainable attribute?

@MatteoRobbiati
Contributor Author

MatteoRobbiati commented Nov 15, 2024

Thanks for pointing me to this qibo-core discussion. I agree that such a data structure could greatly simplify the procedure and lighten the storage and manipulation of data and parameters. This could be useful in general.

In any case, most often, you just upload trainable and non-trainable parameters in different gates (according to what I've personally seen). So, despite being possibly (but very mildly) incomplete, isn't it enough to just use the ParametrizedGate.trainable attribute?

In every QML algorithm I have done, data and parameters were combined to construct rotation angles like $\theta = a*x + b$. Maybe I am just a special case of QML user, but I feel this approach (especially among people trying to do applications rather than theoretical QML) is widely adopted. In this case, directly using the trainable attribute is not enough (even though it is the option supported right now in #42).

About the encoding function, that's just a classical function, and it will be part of your classical model within the hybrid architecture. Still, even if it were the exact same parameter in all 3 gates, you have to compute the partial derivatives wrt all the 3 parameters, and then sum together (there is no shortcut, unless you can compute the function analytically - which would make the quantum part, and Qiboml, irrelevant for that application).

I wouldn't say this. There are cases in which I just needed to sum two or three contributions.
One example comes naturally from the differentiation of more-complex-than-rotation gates. For example, in the recently used RBS gates the parameter $\theta$ affects the gate equivalently to a pair of RY rotations. In that case the circuit has to be decomposed, and then the derivative is computed by summing the contributions of the two RY gates. Another example is the integration paper I did with Juan, where, using re-uploading and being interested in the derivative of the expectation value wrt data, we had to apply a proper sequence of shift rules to reconstruct the prediction (and it was worth it, IMHO, to do that with a quantum circuit).
Correct me if I didn't get your point, but being able to access all of this information when applying the shift rules is something I really care about.

PS: as "encoding function" I meant the aforementioned case of an angle constructed as $\theta = a*x + b$.
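The RBS case described above can be written compactly; assuming the decomposition uploads the same $\theta$ into two RY angles $\theta_1 = \theta_2 = \theta$, the chain rule gives

$$\frac{\partial \langle O \rangle}{\partial \theta} = \frac{\partial \langle O \rangle}{\partial \theta_1}\bigg|_{\theta_1 = \theta} + \frac{\partial \langle O \rangle}{\partial \theta_2}\bigg|_{\theta_2 = \theta},$$

where each term on the right-hand side can be evaluated with a standard PSR on the decomposed circuit.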

@alecandido
Member

In any QML algorithm I did, data and parameters were combined together constructing rotation angles like $\theta = a*x + b$. Maybe I am just a special case of QML user, but I feel this approach (especially among people trying to do applications and not theoretical QML) is widely adopted. In this case, the option of directly using the trainable attribute is not enough (even though it is the option supported right now in #42).

Ah no, but the point is that, if $\theta$ contains trainable parameters, then it is trainable, in the sense that you need to take its gradient, because it is a function of parameters whose gradient you eventually want to take.

What I meant is whether you have a U3 in which some parameters are trainable and some others are not (not at all).

@alecandido
Member

For the time being, I'm writing it here. I can later convert it into an issue/discussion of its own.

So, the original question arose from the need for gradients of all the layer's inputs, on top of the layer's own parameters. These are in general required for a composable layer, but we'd like to save them when we explicitly know the gradients are not needed, because the inputs are directly used as data, for which we don't need any gradient (not for training; if needed at all for specific applications, it will be supported differently at a later time).

So, to summarize the situation, there are two structures involved:

  • the circuit
  • the layer

each with its own "connectors" (i.e. interface, for external composition).
The layer interface has three sets of elements:

  1. the inputs
  2. the parameters
  3. the outputs (<- not involved in this discussion)

The Circuit's parts we're interested in are its parameters, and they can be of three types:

  1. pure data
  2. pure parameter
    • dimensions of the space in which the overall model is optimized
  3. combinations of them
    • they can come as outputs of former layers (and we will always enforce this to be the case)

Essentially, the circuit gradients are only needed wrt trainable parameters, i.e. parameters or combinations containing some parameters (even ones defined in previous layers).
Moreover, the pure parameters are the only ones allowed to be the layer's parameters, while both the combinations and pure data will be the layer's inputs.

So, the situation is something like the following:

```mermaid
flowchart LR
  data
  subgraph cl["classical layer"]
    dense
    data-through[" "]:::hidden
  end
  subgraph qlwrap["quantum layer"]
    direction TB
    qlparams:::hidden
    subgraph qlinputs
      qldata:::hidden
      qlcombo:::hidden
    end
    subgraph ql
      direction TB
      circuit
    end
  end
  parameters --> qlparams -- "parameters" --> circuit
  data -- "data" --> data-through -- "data" --> qldata -- "data" --> circuit
  data --> cl
  qlinputs -- "combinations" --> circuit
  dense -- "outputs" --> qlinputs

  classDef hidden display: none;
  class qlinputs,ql hidden;
```

In principle, we could treat all inputs and parameters as trainable, computing the gradient wrt all of them, and then discard those on data (since data are obviously not optimized).
However, this is just expensive, so we want to avoid computing those gradients.

Thus, what should be done is more or less the following:

  1. compute the gradient only wrt the trainable parameters
    • which may be determined by analyzing the inputs, checking which are traced by the ML framework
  2. split the gradient in two parts: one will be used for the parameters, the other for the former layer's outputs (which are not data)
  3. pad the gradient for the inputs with 0s for those inputs which are known to be data (resulting in non-trainable parameters)
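The three steps above could be sketched as follows; the helper name and the layout of the trainable gradient (layer parameters first, then non-data inputs) are assumptions for illustration, not Qiboml API:

```python
import numpy as np

def assemble_gradients(trainable_grad, n_params, input_is_data):
    """Split a gradient computed only over trainable angles and pad it.

    `trainable_grad` holds one entry per trainable angle: first the layer's
    own parameters, then the non-data inputs (former-layer outputs).
    `input_is_data[i]` is True when the i-th layer input is pure data.
    """
    # step 2: split between the layer's parameters and its inputs
    param_grad = trainable_grad[:n_params]
    combo_grad = iter(trainable_grad[n_params:])
    # step 3: pad with zeros where the input is pure data
    input_grad = np.array(
        [0.0 if is_data else next(combo_grad) for is_data in input_is_data]
    )
    return param_grad, input_grad
```

The zero padding keeps the input-gradient shape the composing framework expects, while the expensive circuit evaluations were only spent on the trainable entries (step 1).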

Any classical manipulation of parameters that enters the rotation angles will be encoded in classical layers (no separately traced lambdas with SymPy or any other framework). We may later refine the interface to define these separate layers.
All encodings depending on data will result in inputs, be they floats, booleans, or whatever. If needed, we could even encode booleans in rotation parameters (e.g. implementing a binary encoding through rotations of $0$ or $\pi$ radians), but it may not be required (the quantum layer can process the input at will; in any case, those parameters will not receive a gradient).
