Submitted by CPOOCPOS t3_yql3wl in MachineLearning

Hello, I'm wondering if during learning tasks there is an advantage in taking the average Gradient of a small Volume in our parameter space compared to just taking the Gradient of the one center point in that parameter-space volume.

Edit:

I noticed that i have some people confused, and rightly so, because i didn't mention some aspects. The computational overhead of the average of such many points not important, why is not important but it has to do with the fact that this can be extremely cheap on a quantum computer. My question more precisely would be, given the face that computational overhead can be ignored, are there in general theoretical advantages in taking the average over a volume in comparison to just at a point.

I am a physicist and have only some experience in ML. My thesis is at the moment in Quantum Machine Learning and this question ahs been central to my research for some weeks. But unfortunately i can't find anything online that regards this question. I was wondering if someone here, would have some insights to this question.

38

Comments

You must log in or register to comment.

bluuerp t1_ivov64y wrote

The gradient gives you the optimal improvement direction....if you have 10 positions the gradient of all 10 will point in different directions...so if you take a step after each point you'll zig zag arround a lot. You might even backtrack a bit. If you however take the average of all 10 and do a step you won't be optimal in regards to all points individually, but the path you'll take will be smoother.

So it depends on your dataset. Usually you want to have some smoothing because otherwise you won't converge that easily.

​

The same is true for your example....the center point might not be a good estimate of the surrounding. It could however be that it is close to the average and there isn't that big of a difference.

33

CPOOCPOS OP t1_ivovm1t wrote

Hi and thanks for your reply! I just looked into smoothing and it seems to be a kind of data manipulations. As in, the data we have is smoothend to find trends.

Here I don't have data actually, what I am averaging over is the volume of the parameter space, where the parameters are the learnable parameters of my network.
In other words when i try to update my parameters with GD I would like to average the gradients of all points ( in the parameter space) lying closely to my center point (or the point i would take the gradient of usually

0

jnez71 t1_ivoyij3 wrote

I suppose it is a bit closer to a secant method like BFGS, which approximates the Hessian required for a Newton step. In other words, these methods use a linear combination of adjacent gradient computations to estimate curvature, which enables more effective updates. The combination is not an average though, and also they integrate gradient computations over successive iterations rather than stopping at each iteration to compute a bunch within some volume.

I don't think your proposed volume-averaging has any theoretical utility as a convex optimizer, especially because there are much better well-known things to do with adjacent gradient computations. The closest common practice I can think of to averaging are GD+momentum type optimizers, which lowpass-filter the GD dynamic along its trajectory. These provide heuristic benefits in the context of nonconvex optimization.

Do you have any links about the volume-averaging or was it just a random thought? Also, make sure not to confuse "averaging gradients from different points in parameter space" with "averaging gradients from different sample losses at the same point in parameter space" (called "batching").

4

_d0s_ t1_ivp036a wrote

besides being extremely computationally expensive, how would one define the size of the volume? it's similar to the problem of defining a step size which oversteps when too large or takes forever when too small. likewise defining a very small volume might get us caught in local minima.

i guess this thought is similar to smoothing like another poster mentioned.

5

CPOOCPOS OP t1_ivp0tap wrote

thanks for your reply jnez! Yes, i have also had the thought actually of using the average of many local points to estimate the local curvature like needed in the BFGS.

You are right by saying that in a classical sense there are far better things to do with many adjacent gradient computations. But here I am doing machine learning on a quantum computer, and the interesting part is, that it is very cheap to calculate the average (and only the average) of many points. To be more concrete about the computational cost, it only takes linear effort to compute the average of an exponential amount of points.

As a start, when i was developing the idea, i just thought of the procedure as just being a vote on a bunch of local points on which direction they would like to go. But now I am looking for more concrete theoretical arguments on why it would make sense to take the average gradient (since on a quantum computer i wouldn't have this computational overhead like on a classical computer)

1

jnez71 t1_ivp3sju wrote

Hm, there may be a way to exploit that cheapened average gradient computation to still tell you curvature, which can help a lot.

I am reminded of how a covariance matrix is really just composed of means: cov[g,g] = E[gg'] - E[g]E[g'] (where ' is transpose). If g is distributed as the gradients in your volume, I suspect that cov[g,g] is related to the Hessian, and you can get that covariance with basically just averages of g.

More intuitively I'm thinking, "in this volume, how much on average does the gradient differ from the average gradient." If your quantum computer really makes that volume averaging trivial, then I suspect someone would have come up with this as some kind of "quantum Newton's method."

I think that's all I got for ya. Good luck!

2

jnez71 t1_ivp6ril wrote

Oh I should add that from a nonconvex optimization perspective, the volume-averaging could provide heuristic benefits akin to GD+momentum type optimizers. (Edited my first comment to reflect this).

Try playing around with your idea in low dimensions on a classical computer to get a feel for it first. Might help you think of new ways to research it.

1

onedertainer t1_ivp6xgj wrote

This sounds like maybe mini-batch gradient descent, where you use the average or sum of a batch of points, or maybe something like Adam, where you use averaging over epochs to give your gradient descent some "momentum".

2

bloc97 t1_ivphtmz wrote

Instead of the average, would it be possible to compute the divergence (or laplacian) very quickly? That might lead to a faster higher order optimization method compared to simple gradient descent.

1

CPOOCPOS OP t1_ivpmo8h wrote

>divergence

Hi bloc!! thanks for your answer

By taking the laplacian, you mean taking the laplacian ( Nabla * Nabla * f) of all points and average? Yes this is also possible. Not in a single Go, but i can get the second derivative of all points for each parameter and add them up. How would that help? Or what is a higher order optimisation

1

The_humble_tortoise t1_ivpn0u5 wrote

My take from your question is basically the difference between SGD and mini-batch SGD. Batch SGD in general provides a more robust (less overfitting) and smoother ( less zig-zag) solution; but it really differs between different data. More homogenous data usually work better with SGD; otherwise mini-batch SGD will perform better.

0

bloc97 t1_ivpper1 wrote

I mean having the divergence would definitively help, as we will have additional information about the shape of the parameter landscape with respect to the loss function. The general idea would be to prefer areas with negative divergence, while trying to move and search through zero divergence areas very quickly.

Edit: In a sense, using the gradient alone only gives us information about the shape of the loss function at a single point, while having a laplacian gives us a larger "field of view" on the landscape.

1

hughperman t1_ivqd886 wrote

Consider though, in a linear scheme, taking each gradient step separately is equal the sum of the gradients. Taking the average is equal to the sum of the gradients divided by the number of steps. So you are only adjusting the step by a scale factor of 1/N, nothing more mathemagical.

3

entropyvsenergy t1_ivqltwc wrote

Batching does this, generally and it's a good thing for stability. Reduces the variance of the gradient update proportional to the batch size.

0

danielv134 t1_ivraqgm wrote

This is interesting. So over what can you average over in practice? any point + the points of an axis parallel cube? if so, can you rotate the cube? scale it non-uniformly in different axes? can you average over non-linear measurements, like the length of the gradient? I am highly interested in geometry and optimization, so happy to talk about this, feel free to direct message me.

1

make3333 t1_ivroe1x wrote

gradient descent takes the direction of the minimum at the step size according to the taylor series of degree n at that point. in neural nets we do first degree, as if it was a plane. in a lot of other optimization settings they do second order approx to find the optimal direction

10

Difficult_Ferret2838 t1_ivrom17 wrote

>gradient descent takes the direction of the minimum at the step size according to the taylor series of degree n at that point.

No. Gradient descent is first order by definition.

>in a lot of other optimization settings they do second order approx to find the optimal direction

It still isn't an "optimal" direction.

−3

kdub0 t1_ivs3z5e wrote

Since you're asking about theory, if you're optimizing a non-smooth convex function with a stochastic subgradient method, you usually require something like O(G^2/epsilon^2 + V/epsilon^2) iterations to get an epsilon accurate solution where G is a bound on the true subgradient norm and V is the variance of the estimator. If you use a batch size of B your variance is reduced by a factor of 1/B (under standard assumptions). So, if you run for T iterations your accuracy is roughly O(sqrt(G^2/T + V/(BT))). Often times in practice this bound is pessemistic, especially wrt B.

If your function is smooth, sometimes you can do better. Roughly you get a bound something like O(L/T + sqrt(V/T)) where L is the gradient's Lipschitz constant. Basically you can get a faster rate until you get close enough to the solution to be within the variance and at that point you have no choice but to average out the noise. It is usually the case that the variance term here dominates though even when you make efforts to reduce it. e.g., you need L >= sqrt(VT/B) for the smooth term to dominate, so B needs to be O(T).

There are a bunch of different variance reduction methods that can do better than simply averaging the gradients. There are some examples here https://www.cs.cmu.edu/~pradeepr/convexopt/Lecture_Slides/stochastic_optimization_methods.pdf that work when you are minimizing a function that is the sum of convex functions. They can sometimes be more expensive than just batching the gradient, but if your gradient is expensive to compute they might be useful. It is often pretty easy to come up with domain specific control variates too that can be useful to reduce variance.

3

name_not_acceptable t1_ivsi5i1 wrote

Wouldn't you be better off using the parallel computation to calculate gradient at lots of random points and then see if you can find a better local minima? Ie do they all converge to different points.

1

BitShin t1_ivsjvcr wrote

In addition to providing a smoother descend, computing the average gradient over many samples is very parallelizable. So on a modern GPU, taking the gradient at a single point vs 50 points is not 50 times more expensive. In fact, if your model is small enough, they could take roughly the same amount of time.

1

singularineet t1_ivsrvzj wrote

Yes, since the input point is uncertain due to measurement noise if nothing else, averaging over that distribution would be superior.

If you can average over other interesting distributions, like shifts or rotations or such for images, or even under a local approximation thereof, that would be amazing. Is that possible with quantum?

1

Spiritual-Reply5896 t1_iw8yhoi wrote

It gives you local improvement direction, but can we straightforwardly think about this metaphora of improvement in 3D and generalize it to thousands of dimensions?

Maybe its a little different question, but do you happen to know where to find research on this topic of generalizability of mathematical operations in interpretable geometrical dimensions to extremely high dimensions? Not looking for theory on vector spaces but on the intuitive aspects

1