Mine’s in TensorFlow 2.11; I’m sure writing a PyTorch version wouldn’t be hard. The extra steps of the algorithm come to just three lines in my paper. I can share my code, though.
Oh damn, that paper does almost exactly what I do. Huh. Oh well. The implementation is slightly different, though: in contrast, I use both grads from the same timestep and keep an accumulated Ct.
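Since the comment only says the method uses both grads from the same timestep plus an accumulated Ct, here is a minimal, purely hypothetical TensorFlow 2.x sketch of that general shape. The class name, the hyperparameters (lr, beta, eps), and both update rules are placeholder assumptions for illustration; they are not the three lines from the paper mentioned above.

```python
import tensorflow as tf

class SameStepAccumulatorUpdate:
    """Hypothetical sketch only: combines two gradients taken at the same
    timestep and maintains an accumulated statistic Ct per variable."""

    def __init__(self, lr=1e-3, beta=0.9, eps=1e-8):
        self.lr = lr      # step size (assumed)
        self.beta = beta  # decay rate for the accumulator Ct (assumed)
        self.eps = eps    # small constant for numerical stability
        self.C = None     # one accumulator slot per trainable variable

    def apply(self, grads_a, grads_b, variables):
        # grads_a and grads_b are two gradients computed at the same timestep,
        # e.g. of two losses w.r.t. the same variables (an assumption here).
        if self.C is None:
            self.C = [tf.Variable(tf.zeros_like(v)) for v in variables]
        for ga, gb, c, v in zip(grads_a, grads_b, self.C, variables):
            # Accumulate Ct from the current-step gradients (illustrative rule).
            c.assign(self.beta * c + (1.0 - self.beta) * ga * gb)
            # Use the accumulator to scale the parameter update (illustrative rule).
            v.assign_sub(self.lr * ga / (tf.sqrt(tf.abs(c)) + self.eps))
```

In a training loop you would compute grads_a and grads_b at the same step (for example with a persistent tf.GradientTape over two losses) and pass them to apply together with model.trainable_variables.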
LahmacunBear t1_jdo7k0w wrote
Reply to [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
Here’s a thought: the original 175B GPT-3, with the best techniques of the time thrown at it, performed the way it did. Add the ChatGPT training tricks and suddenly the same size performs orders of magnitude better. I doubt current LLMs are fully efficient, i.e. just as with GPT-3 to 3.5, we can keep getting much better results at the same size, and therefore today’s results with much smaller models.