
baffo32 t1_jcronvh wrote

- offloading and accelerating (moving some parts to memory-mapped disk or GPU RAM; this can also make loading quicker); see the offloading sketch after this list

- pruning (removing parts of the model that didn't end up impacting outputs after training); see the pruning sketch after this list

- further quantization below 4 bits; see the group-wise quantization sketch after this list

- distilling to a mixture of experts?

- factoring and distilling parts out into heuristic algorithms?

- finetuning to specific tasks (e.g. distilling/pruning away all information related to irrelevant languages or domains); this would likely make it very small
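
a minimal sketch of the offloading idea, assuming the Hugging Face transformers + accelerate stack; the model id and offload folder below are placeholders, not a recommendation:

```python
# Sketch: let accelerate spread layers across GPU, CPU RAM, and disk.
# Assumes transformers + accelerate are installed; "some/causal-lm" and
# the "offload" folder are placeholders.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "some/causal-lm",
    device_map="auto",          # place layers on GPU/CPU automatically
    offload_folder="offload",   # layers that don't fit are memory-mapped to disk
)
```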
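
a minimal pruning sketch using PyTorch's built-in pruning utilities; the single Linear layer and the 30% sparsity level are arbitrary stand-ins, and real work would prune iteratively while checking that outputs don't degrade:

```python
# Sketch: unstructured magnitude (L1) pruning of one layer's weights.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4096, 4096)                            # stand-in for a model weight
prune.l1_unstructured(layer, name="weight", amount=0.3)  # zero the smallest 30%
prune.remove(layer, "weight")                            # make the pruning permanent
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2%}")
```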
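
a naive group-wise 3-bit quantization sketch in plain PyTorch, just to show what "below 4 bits" means mechanically; real schemes (GPTQ-style and friends) are far more careful about which weights share a scale and how quantization error is compensated:

```python
# Sketch: quantize a weight tensor to 3 bits with one float scale per group of 64 weights.
import torch

def quantize_groupwise(w: torch.Tensor, bits: int = 3, group: int = 64):
    qmax = 2 ** bits - 1                       # 8 levels for 3 bits -> indices 0..7
    flat = w.reshape(-1, group)
    scale = flat.abs().max(dim=1, keepdim=True).values / (qmax / 2)
    scale = scale.clamp_min(1e-8)              # avoid division by zero
    q = torch.clamp(torch.round(flat / scale) + qmax // 2, 0, qmax)
    return q.to(torch.uint8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor, bits: int = 3, shape=None):
    qmax = 2 ** bits - 1
    return ((q.float() - qmax // 2) * scale).reshape(shape)

w = torch.randn(128, 128)                      # stand-in for a weight matrix
q, s = quantize_groupwise(w)
w_hat = dequantize(q, s, shape=w.shape)
print("mean abs reconstruction error:", (w - w_hat).abs().mean().item())
```

one scale per small group keeps the quantization error local while the per-weight storage stays near 3 bits (plus a little overhead for the scales).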

EDIT:

- numerous techniques published in papers over the past few years

- distilling into an architecture that isn't limited by, e.g., the constraint of being feed-forward; see the distillation loss sketch below
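
a minimal knowledge-distillation loss sketch; the student could be an MoE, a task-specific model, or a non-feed-forward architecture, and the logits here are random stand-ins for a frozen teacher and a trainable student:

```python
# Sketch: train a smaller student to match a frozen teacher's softened outputs.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # soften both distributions, then match them with KL divergence
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# toy usage with random logits standing in for real model outputs
teacher_logits = torch.randn(8, 32000)                        # frozen teacher
student_logits = torch.randn(8, 32000, requires_grad=True)    # trainable student
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```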

3

Art10001 t1_jcwfyw8 wrote

I heard MoE is bad. I have no sources, sadly.

1

baffo32 t1_jcxqr2i wrote

i visited CVPR last year and people were saying that MoE was mostly what was being used; i haven't tried these things myself though

1