
LetterRip t1_j3n91mt wrote

I'd go with GLM-130B:

> With INT4 quantization, the hardware requirements can further be reduced to a single server with 4 * RTX 3090 (24G) with almost no performance degradation.

https://github.com/THUDM/GLM-130B
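For a sense of what 4-bit loading looks like in practice, here's a minimal sketch using Hugging Face transformers with bitsandbytes. Note that GLM-130B ships its own INT4 quantization scripts in the repo above, and the checkpoint id here is a placeholder, so this only illustrates the general technique:

```python
# Minimal sketch: 4-bit weight quantization via transformers + bitsandbytes.
# GLM-130B uses its own INT4 pipeline (see repo above); "some/causal-lm"
# is a placeholder checkpoint id, not an actual GLM-130B checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some/causal-lm"  # placeholder

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # shard across available GPUs, e.g. 4x RTX 3090
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```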

I'd also look into pruning/distillation; you could probably shrink the model by about half again. Rough sketch of the pruning side below.
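As a toy illustration, PyTorch's built-in pruning utilities give you unstructured magnitude pruning out of the box. A real pipeline would prune a trained model and then fine-tune or distill to recover accuracy; the 50% here is just an example figure:

```python
# Toy sketch of unstructured L1 (magnitude) pruning with PyTorch's
# torch.nn.utils.prune; a real pipeline would fine-tune/distill afterwards.
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Zero out the 50% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Make the pruning permanent (drops the mask, bakes the zeros in).
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")  # ~50%
```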
