My suggestion is to use 8-bit or 4-bit quantization. You can also use automatic device mapping in Transformers, which can partially offload the model to your CPU (warning: it uses a lot of system memory [RAM]).
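A minimal sketch of what that looks like, assuming you have `transformers`, `accelerate`, and `bitsandbytes` installed (the model ID is just an example; exact arguments can vary between library versions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-6.7b"  # example model; swap in whatever you want to run

tokenizer = AutoTokenizer.from_pretrained(model_id)

# load_in_8bit=True quantizes the weights via bitsandbytes;
# device_map="auto" lets accelerate split layers between the GPU and CPU RAM,
# which is where the system-memory usage comes from
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
)
```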
Unless you plan on quantizing your model or loading it layer by layer, I'm afraid ~2B parameters is the most you'll get at fp16. 10GB of VRAM is not really enough for CV nowadays, let alone NLP. With quantization you can barely run a 7B model; the math works out roughly as shown below.
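Back-of-the-envelope for the weights alone (a hypothetical helper; it ignores activations and the KV cache, which add more on top):

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    # parameters * bytes per parameter, expressed in GB
    return params_billion * (bits_per_param / 8)

print(weight_gb(2, 16))  # 2B @ fp16  -> ~4 GB, fits in 10 GB
print(weight_gb(7, 16))  # 7B @ fp16  -> ~14 GB, won't fit
print(weight_gb(7, 8))   # 7B @ int8  -> ~7 GB, tight but possible
print(weight_gb(7, 4))   # 7B @ 4-bit -> ~3.5 GB
```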
4-bit doesn't matter at the end of the day, since it's not supported out of the box; you'd have to implement it yourself.
ThisIsMyStonerAcount t1_jdeqmjc wrote
What's your end goal?