Mine’s in TensorFlow 2.11; I’m sure writing a PyTorch version wouldn’t be hard. The extra steps of the algorithm come to just three lines in my paper. I can share my code, though.
Oh damn, that paper does almost exactly what I do. Huh. Oh well. The implementation is slightly different, though: in contrast, I use both grads from the same timestep and keep an accumulated Ct.
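Since the comment only says the method uses both grads from the same timestep plus an accumulated Ct, here is a minimal, purely hypothetical TensorFlow 2.x sketch of that general shape. The class name, the hyperparameters (lr, beta, eps), and both update rules are placeholder assumptions for illustration; they are not the three lines from the paper mentioned above.

```python
import tensorflow as tf

class SameStepAccumulatorUpdate:
    """Hypothetical sketch only: combines two gradients taken at the same
    timestep and maintains an accumulated statistic Ct per variable."""

    def __init__(self, lr=1e-3, beta=0.9, eps=1e-8):
        self.lr = lr      # step size (assumed)
        self.beta = beta  # decay rate for the accumulator Ct (assumed)
        self.eps = eps    # small constant for numerical stability
        self.C = None     # one accumulator slot per trainable variable

    def apply(self, grads_a, grads_b, variables):
        # grads_a and grads_b are two gradients computed at the same timestep,
        # e.g. of two losses w.r.t. the same variables (an assumption here).
        if self.C is None:
            self.C = [tf.Variable(tf.zeros_like(v)) for v in variables]
        for ga, gb, c, v in zip(grads_a, grads_b, self.C, variables):
            # Accumulate Ct from the current-step gradients (illustrative rule).
            c.assign(self.beta * c + (1.0 - self.beta) * ga * gb)
            # Use the accumulator to scale the parameter update (illustrative rule).
            v.assign_sub(self.lr * ga / (tf.sqrt(tf.abs(c)) + self.eps))
```

In a training loop you would compute grads_a and grads_b at the same step (for example with a persistent tf.GradientTape over two losses) and pass them to apply together with model.trainable_variables.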
LahmacunBear t1_jdo7k0w wrote
Reply to [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
Here’s a thought: the original 175B GPT-3, with the best techniques of the time thrown at it, performed the way it did. Add the ChatGPT training tricks and suddenly the same size performs orders of magnitude better. I doubt current LLMs are fully efficient, i.e. just as with GPT-3 to 3.5, we can keep getting much better results at the same size, and therefore today’s results with much smaller models.