ChuckSeven t1_j8t5r5m wrote
Reply to comment by MustachedSpud in [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
Yeah, it depends. Even the batch size alone makes a difference. But for really big models, I'd assume the number of weights far outweighs the number of activations.
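A rough back-of-the-envelope sketch of that trade-off (hypothetical sizes, a plain stack of dense layers, biases and optimizer state ignored): weight memory is fixed, while activation memory grows with batch size, so for large hidden dimensions weights dominate until the batch gets very big.

```python
def memory_estimate(hidden_dim, num_layers, batch_size, bytes_per_float=4):
    # Weights: one hidden_dim x hidden_dim matrix per layer.
    weight_floats = num_layers * hidden_dim * hidden_dim
    # Activations: one hidden_dim vector per layer, per example in the batch.
    activation_floats = num_layers * hidden_dim * batch_size
    return weight_floats * bytes_per_float, activation_floats * bytes_per_float

# With a large hidden dimension, weights dominate even at sizable batch sizes:
w, a = memory_estimate(hidden_dim=8192, num_layers=48, batch_size=64)
print(w / a)  # ratio is hidden_dim / batch_size = 128.0
```

Under these toy assumptions, activations only catch up with weights once the batch size approaches the hidden dimension.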
MustachedSpud t1_j8t65fh wrote
Yeah, it's very configuration dependent, but larger batch sizes usually learn faster, so there's a tendency to lean into that.