kdub0 t1_ivs3z5e wrote

Since you're asking about theory: if you're optimizing a non-smooth convex function with a stochastic subgradient method, you usually require something like O(G^2/epsilon^2 + V/epsilon^2) iterations to get an epsilon-accurate solution, where G is a bound on the true subgradient norm and V is the variance of the estimator. If you use a batch size of B, the variance is reduced by a factor of 1/B (under standard assumptions). So, if you run for T iterations your accuracy is roughly O(sqrt(G^2/T + V/(BT))). In practice this bound is often pessimistic, especially with respect to B.
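As a rough sketch of the 1/B variance reduction from batching, here's a toy minibatch stochastic subgradient method on a simple non-smooth objective (the problem, function names, and constants are all mine for illustration, not from any particular reference):

```python
import random
import statistics

# Toy non-smooth objective: f(x) = E_z |x - z| with z ~ Uniform(-1, 1),
# minimized at the median x* = 0. A single-sample stochastic subgradient
# at x is sign(x - z); a minibatch of size B averages B such signs,
# cutting the estimator's variance by roughly 1/B.

def batch_subgradient(x, B, rng):
    """Average of B one-sample subgradients sign(x - z)."""
    samples = [rng.uniform(-1.0, 1.0) for _ in range(B)]
    return sum((1.0 if x > z else -1.0) for z in samples) / B

def sgd(T, B, lr=0.1, seed=0):
    """Stochastic subgradient method with the standard 1/sqrt(t) step size."""
    rng = random.Random(seed)
    x = 2.0
    for t in range(1, T + 1):
        g = batch_subgradient(x, B, rng)
        x -= (lr / t**0.5) * g
    return x

def estimator_variance(x, B, n=2000, seed=1):
    """Empirical variance of the batch-B subgradient estimator at x."""
    rng = random.Random(seed)
    return statistics.variance(batch_subgradient(x, B, rng) for _ in range(n))
```

Measuring `estimator_variance(0.5, B)` for B = 1 vs B = 16 shows the roughly 1/B drop in variance that the bound above exploits.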

If your function is smooth, sometimes you can do better. Roughly, you get a bound something like O(L/T + sqrt(V/(BT))), where L is the gradient's Lipschitz constant. Basically you can get a faster rate until you get close enough to the solution to be within the variance, and at that point you have no choice but to average out the noise. Even so, the variance term here usually dominates, even when you make efforts to reduce it: e.g., you need L >= sqrt(VT/B) for the smooth term to dominate, so B needs to be O(T).
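The two-phase behavior in the smooth case (fast decay, then a noise floor you can only lower by averaging) can be seen in a tiny simulation; the quadratic, the function names, and the constants here are illustrative assumptions of mine:

```python
import random

# Noisy gradient descent on the smooth quadratic f(x) = x^2 / 2
# (so L = 1 and grad f(x) = x), with additive gradient noise of
# variance V/B standing in for a batch-B stochastic gradient.
# The iterate contracts geometrically at first, then plateaus at a
# noise floor whose size shrinks as B grows.

def noisy_gd(T, B, lr=0.1, V=1.0, seed=0):
    """Return the mean |x| over the last 100 iterations (the noise floor)."""
    rng = random.Random(seed)
    x, tail = 10.0, []
    noise_sd = (V / B) ** 0.5
    for t in range(T):
        g = x + rng.gauss(0.0, noise_sd)  # unbiased noisy gradient
        x -= lr * g
        if t >= T - 100:
            tail.append(abs(x))
    return sum(tail) / len(tail)
```

Comparing `noisy_gd(1000, 1)` against `noisy_gd(1000, 100)` shows the plateau dropping with the larger batch, while the initial geometric phase is the same in both runs.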

There are a bunch of different variance reduction methods that can do better than simply averaging the gradients. There are some examples here https://www.cs.cmu.edu/~pradeepr/convexopt/Lecture_Slides/stochastic_optimization_methods.pdf that work when you are minimizing a sum of convex functions. They can sometimes be more expensive than just batching the gradient, but if your gradient is expensive to compute they might be useful. It is often pretty easy to come up with domain-specific control variates, too, which can be useful for reducing variance.
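As one example of the finite-sum variance reduction idea, here's a minimal SVRG-style sketch on a 1-D least-squares problem (my own toy setup, not taken from the linked slides):

```python
import random

# SVRG sketch for minimizing (1/n) * sum_i (a_i*x - b_i)^2 / 2 in 1-D.
# The per-step estimate grad_i(x) - grad_i(snapshot) + full_grad(snapshot)
# is unbiased, and its variance shrinks as x and the snapshot approach
# the minimizer, so a constant step size converges where plain SGD stalls.

def make_problem(n=50, seed=0):
    """Random a_i and noisy targets b_i around a_i * 3."""
    rng = random.Random(seed)
    a = [rng.uniform(0.5, 1.5) for _ in range(n)]
    b = [ai * 3.0 + rng.gauss(0.0, 0.5) for ai in a]
    return a, b

def grad_i(x, a, b, i):
    """Gradient of the i-th summand (a_i*x - b_i)^2 / 2."""
    return a[i] * (a[i] * x - b[i])

def svrg(a, b, epochs=30, inner=100, lr=0.05, seed=1):
    rng = random.Random(seed)
    n = len(a)
    x = 0.0
    for _ in range(epochs):
        snap = x
        # Full gradient at the snapshot, recomputed once per epoch.
        mu = sum(grad_i(snap, a, b, i) for i in range(n)) / n
        for _ in range(inner):
            i = rng.randrange(n)
            g = grad_i(x, a, b, i) - grad_i(snap, a, b, i) + mu
            x -= lr * g
    return x
```

The snapshot's full gradient acts as a control variate here, which is the same trick the domain-specific control variates above generalize: subtract something correlated with the noisy estimate whose expectation you can compute exactly.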