catch23

catch23 t1_j9b9upb wrote

Could try something like this: https://github.com/Ying1123/FlexGen

This was only released a few hours ago, so there's no way for you to have discovered this previously. Basically makes use of various strategies if your machine has lots of normal cpu memory. The paper authors were able to fit a 175B parameter model on their lowly 16GB T4 gpu (with a machine with 200GB of normal memory).

56