jnez71 t1_ivoyij3 wrote
I suppose it is a bit closer to a secant method like BFGS, which approximates the Hessian required for a Newton step. In other words, these methods use a linear combination of adjacent gradient computations to estimate curvature, which enables more effective updates. The combination is not an average though, and also they integrate gradient computations over successive iterations rather than stopping at each iteration to compute a bunch within some volume.
I don't think your proposed volume-averaging has any theoretical utility as a convex optimizer, especially because there are much better well-known things to do with adjacent gradient computations. The closest common practice I can think of to averaging are GD+momentum type optimizers, which lowpass-filter the GD dynamic along its trajectory. These provide heuristic benefits in the context of nonconvex optimization.
Do you have any links about the volume-averaging or was it just a random thought? Also, make sure not to confuse "averaging gradients from different points in parameter space" with "averaging gradients from different sample losses at the same point in parameter space" (called "batching").
CPOOCPOS OP t1_ivp0tap wrote
thanks for your reply jnez! Yes, i have also had the thought actually of using the average of many local points to estimate the local curvature like needed in the BFGS.
You are right by saying that in a classical sense there are far better things to do with many adjacent gradient computations. But here I am doing machine learning on a quantum computer, and the interesting part is, that it is very cheap to calculate the average (and only the average) of many points. To be more concrete about the computational cost, it only takes linear effort to compute the average of an exponential amount of points.
As a start, when i was developing the idea, i just thought of the procedure as just being a vote on a bunch of local points on which direction they would like to go. But now I am looking for more concrete theoretical arguments on why it would make sense to take the average gradient (since on a quantum computer i wouldn't have this computational overhead like on a classical computer)
jnez71 t1_ivp3sju wrote
Hm, there may be a way to exploit that cheapened average gradient computation to still tell you curvature, which can help a lot.
I am reminded of how a covariance matrix is really just composed of means: cov[g,g] = E[gg'] - E[g]E[g']
(where '
is transpose). If g
is distributed as the gradients in your volume, I suspect that cov[g,g]
is related to the Hessian, and you can get that covariance with basically just averages of g
.
More intuitively I'm thinking, "in this volume, how much on average does the gradient differ from the average gradient." If your quantum computer really makes that volume averaging trivial, then I suspect someone would have come up with this as some kind of "quantum Newton's method."
I think that's all I got for ya. Good luck!
CPOOCPOS OP t1_ivp4z7j wrote
Thanks!! Wish you a good day!!
jnez71 t1_ivp6ril wrote
Oh I should add that from a nonconvex optimization perspective, the volume-averaging could provide heuristic benefits akin to GD+momentum type optimizers. (Edited my first comment to reflect this).
Try playing around with your idea in low dimensions on a classical computer to get a feel for it first. Might help you think of new ways to research it.
Viewing a single comment thread. View all comments