
catch23 t1_j9b9upb wrote

You could try something like this: https://github.com/Ying1123/FlexGen

This was only released a few hours ago, so you couldn't have found it earlier. It basically uses various offloading strategies when your machine has lots of ordinary CPU memory. The paper's authors were able to fit a 175B-parameter model on a lowly 16GB T4 GPU (in a machine with around 200GB of regular RAM).
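
For intuition, here's a minimal sketch (plain PyTorch, with a made-up toy model and layer sizes) of the basic trick: keep the weights in CPU DRAM and stream them onto the GPU one layer at a time during the forward pass. This is only an illustration of the general idea, not FlexGen's actual code; their scheduler is far more sophisticated.

```python
import torch
import torch.nn as nn


class OffloadedStack(nn.Module):
    """A stack of layers whose weights live in CPU DRAM; each layer is
    copied to the GPU only for the moment it is being executed."""

    def __init__(self, layers, device="cuda"):
        super().__init__()
        self.layers = nn.ModuleList(layers).to("cpu")  # weights stay in DRAM
        self.device = device

    @torch.no_grad()
    def forward(self, x):
        x = x.to(self.device)
        for layer in self.layers:
            layer.to(self.device)   # stream this layer's weights onto the GPU
            x = layer(x)
            layer.to("cpu")         # evict it to make room for the next layer
        return x


if __name__ == "__main__":
    # Toy "model": 24 big MLP blocks that wouldn't all fit on a small GPU at once.
    blocks = [nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()) for _ in range(24)]
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = OffloadedStack(blocks, device=device)
    print(model(torch.randn(1, 4096)).shape)
```

The obvious cost is all the weight traffic over PCIe every step, which is where the slowdown mentioned further down the thread comes from.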

56

smallfried t1_j9dtyf7 wrote

That is very interesting!

The paper isn't on the GitHub page yet, but I'm assuming the hardware requirements are, as mentioned, one beefy consumer GPU (a 3090) and a whole lot of DRAM (>210GB)?

I've played with OPT-175B, and with a bit of twiddling it can actually generate some Python code :)

This is very exciting, as it brings these models into the prosumer hardware range!

8

catch23 t1_j9dxlze wrote

Their benchmark was done on a 16GB T4, which is anything but beefy. The T4 maxes out at 70W power consumption and was marketed primarily for model inference; it's also the cheapest GPU offered by Google Cloud.

6

EuphoricPenguin22 t1_j9c51t7 wrote

Does that increase inference time?

1

catch23 t1_j9cd5tw wrote

It does look to be 20-100x slower for those huge models, but that's still bearable if you're the only user on the machine, and better than nothing if you don't have much GPU memory.

14

EuphoricPenguin22 t1_j9ceqy4 wrote

Yeah, and DDR4 DIMMs are fairly inexpensive compared to upgrading to a GPU with more VRAM.

6

luaks1337 t1_j9cajyf wrote

Yes, at least if I read the documentation correctly.

1