CKtalon t1_j13dg5b wrote

Just forget about it.

Yes, it's possible to do it on CPU/RAM (a Threadripper build with >256GB of RAM plus some assortment of 2x-4x GPUs), but the speed is so slow that it's pointless to work with. DeepSpeed or Hugging Face can spread the model across GPU and CPU, but even so, it will be stupid slow, probably MINUTES per token.
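
To give you an idea of what that offloading setup looks like in practice, here's a rough sketch using Hugging Face Transformers + Accelerate (the model name, memory caps, and offload folder are just placeholders, not a recommendation):

```python
# Rough sketch: spreading a large model across GPU and CPU RAM with
# Transformers + Accelerate. Model name and memory limits are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-66b"  # placeholder; a 175B checkpoint needs far more RAM

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",                         # let Accelerate split layers across devices
    max_memory={0: "24GiB", "cpu": "256GiB"},  # cap GPU usage, spill the rest to CPU RAM
    offload_folder="offload",                  # spill to disk if CPU RAM also runs out
)

inputs = tokenizer("The meaning of life is", return_tensors="pt").to("cuda:0")
# Expect this to be painfully slow once most layers live on CPU (or disk).
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```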

We are at least 5 years away from consumer hardware that can run 175B+ models on a single machine (even with 4 GPUs in that machine).

20B models are within the realm of consumer hardware (3090/4090) with INT8; slow, but still possible. A rough sketch of that path is below.
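
Something like this, assuming Transformers with bitsandbytes installed (the 20B checkpoint below is just an example):

```python
# Rough sketch: loading a ~20B model in INT8 on a single 24GB GPU (3090/4090)
# using bitsandbytes through Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"  # example 20B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,  # INT8 weights roughly halve memory vs fp16
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```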
