ChuckSeven t1_j8t5r5m wrote
Reply to comment by MustachedSpud in [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
Yeah, it depends. Even the batch size alone makes a difference. But for really big models, I'd assume the number of weights far outweighs the number of activations.
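A rough back-of-the-envelope sketch of that trade-off (hypothetical sizes, a plain stack of dense layers, biases and optimizer state ignored): weight memory is fixed, while activation memory grows with batch size, so for large hidden dimensions weights dominate until the batch gets very big.

```python
def memory_estimate(hidden_dim, num_layers, batch_size, bytes_per_float=4):
    # Weights: one hidden_dim x hidden_dim matrix per layer.
    weight_floats = num_layers * hidden_dim * hidden_dim
    # Activations: one hidden_dim vector per layer, per example in the batch.
    activation_floats = num_layers * hidden_dim * batch_size
    return weight_floats * bytes_per_float, activation_floats * bytes_per_float

# With a large hidden dimension, weights dominate even at sizable batch sizes:
w, a = memory_estimate(hidden_dim=8192, num_layers=48, batch_size=64)
print(w / a)  # ratio is hidden_dim / batch_size = 128.0
```

Under these toy assumptions, activations only catch up with weights once the batch size approaches the hidden dimension.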
MustachedSpud t1_j8t65fh wrote
Yeah, it's very configuration dependent, but larger batch sizes usually learn faster, so there's a tendency to lean into that.