catch23 t1_j9cd5tw wrote
Reply to comment by EuphoricPenguin22 in [D] Large Language Models feasible to run on 32GB RAM / 8 GB VRAM / 24GB VRAM by head_robotics
It does look to be 20-100x slower for those huge models, but that's still bearable if you're the only user on the machine, and still better than nothing if you don't have much GPU memory.
catch23 t1_j9b9upb wrote
Reply to [D] Large Language Models feasible to run on 32GB RAM / 8 GB VRAM / 24GB VRAM by head_robotics
Could try something like this: https://github.com/Ying1123/FlexGen
This was only released a few hours ago, so there's no way you could have found it earlier. It basically offloads model weights, activations, and the KV cache to ordinary CPU memory (and even disk) when your machine has plenty of RAM, streaming them to the GPU as needed. The paper authors were able to fit a 175B-parameter model on a lowly 16GB T4 GPU (on a machine with 200GB of normal memory).
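Rough idea of the offloading trick in PyTorch, in case it helps. This is not FlexGen's actual code, just a minimal sketch of the weight-streaming idea it builds on (FlexGen additionally overlaps transfers with compute, offloads the attention KV cache, and can compress weights to 4 bits); the layer sizes here are made up for illustration:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy "large" model: a stack of big linear layers that live in CPU RAM.
# (FlexGen works with real OPT transformer blocks; plain Linears keep
# this sketch short.)
hidden, num_layers = 4096, 48
cpu_layers = [nn.Linear(hidden, hidden) for _ in range(num_layers)]

if torch.cuda.is_available():
    # Pinned host memory makes the per-layer CPU->GPU copies faster.
    for layer in cpu_layers:
        for p in layer.parameters():
            p.data = p.data.pin_memory()

# One reusable GPU "slot": peak GPU weight memory is a single layer,
# not the whole stack.
gpu_slot = nn.Linear(hidden, hidden).to(device)

@torch.no_grad()
def offloaded_forward(x: torch.Tensor) -> torch.Tensor:
    x = x.to(device)
    for cpu_layer in cpu_layers:
        # Stream this layer's weights from CPU RAM into the GPU slot...
        gpu_slot.weight.copy_(cpu_layer.weight, non_blocking=True)
        gpu_slot.bias.copy_(cpu_layer.bias, non_blocking=True)
        # ...then run it; the full model never resides on the GPU at once.
        x = gpu_slot(x)
    return x

print(offloaded_forward(torch.randn(1, hidden)).shape)  # torch.Size([1, 4096])
```

The tradeoff is exactly the slowdown mentioned elsewhere in this thread: you pay a host-to-GPU transfer for every layer on every forward pass, which is fine for a single user but bad for throughput.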
catch23 t1_j9dxlze wrote
Reply to comment by smallfried in [D] Large Language Models feasible to run on 32GB RAM / 8 GB VRAM / 24GB VRAM by head_robotics
Their benchmark was done on a 16GB T4, which is anything but beefy. The T4 maxes out at 70W and was marketed primarily for model inference; it's also the cheapest GPU offered on Google Cloud.