
cameldrv t1_jdldgo8 wrote

I think that fine-tuning has its place, but I don't think you're going to be able to replicate the results of a 175B-parameter model with a 6B one, simply because the larger model empirically just holds so much more information.

If you think about it from an information-theory standpoint, all of that specific knowledge has to be encoded in the model's weights. With 8-bit weights, a 6B-parameter model is about 6 GB. Even with incredible data compression, I don't think you can fit anywhere near the amount of human knowledge that's in GPT-3.5 into that amount of space.
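As a quick back-of-the-envelope (dense weights only, ignoring activations, KV cache, and optimizer state):

```python
# Rough weight-storage estimate: parameter count x bytes per weight.
def model_size_gb(n_params: float, bits_per_weight: int) -> float:
    return n_params * bits_per_weight / 8 / 1e9  # decimal gigabytes

for label, n_params in [("6B", 6e9), ("175B", 175e9)]:
    for bits in (16, 8, 4):
        print(f"{label} model at {bits}-bit: {model_size_gb(n_params, bits):.0f} GB")
# 6B at 8-bit  ->   6 GB
# 175B at 8-bit -> 175 GB
```

So even aggressive 4-bit quantization of a 6B model only buys you ~3 GB of storage to hold everything the model "knows".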

10

Vegetable-Skill-9700 OP t1_jdpr15o wrote

I agree that a 175B model will always outperform a 6B model on general tasks, so maybe the big model is great for demos. But once you're building a product on top of the model, where it's used in a specific way to serve a specific use case, wouldn't it make sense to use a smaller model and fine-tune it on the relevant dataset?
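For example, something like this rough sketch is what I have in mind (model name, dataset file, and hyperparameters are placeholders, and I'm using LoRA adapters via Hugging Face peft rather than full fine-tuning):

```python
# Sketch: adapt a ~6B open model to one narrow use case with LoRA adapters.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "EleutherAI/gpt-j-6b"  # placeholder 6B-class base model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Train small low-rank adapter matrices instead of all 6B weights.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# Hypothetical JSONL file of task-specific examples with a "text" field.
data = load_dataset("json", data_files="usecase_examples.jsonl")["train"]
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-6b-usecase",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=3,
                           learning_rate=1e-4,
                           fp16=True),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```

Only the adapter weights get updated, so the training cost and serving footprint stay a fraction of what the 175B model would need.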

1