KerfuffleV2 t1_jaiz1k8 wrote
Reply to comment by bo_peng in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Unfortunately, that doesn't work on the old reddit layout. We just see a garbled mess.
Here's a fixed version of the code/examples:
(not my content)
Strategy examples:

`'cuda:0 fp16 *10 -> cuda:1 fp16 *8 -> cpu fp32'`
= first 10 layers on cuda:0 in fp16, then 8 layers on cuda:1 in fp16, then the rest on the cpu in fp32

`'cuda fp16 *20+'`
= first 20 layers on cuda in fp16, then stream the remaining layers on the same device
    import os
    os.environ['RWKV_JIT_ON'] = '1'
    os.environ['RWKV_CUDA_ON'] = '0' # if '1' then compile CUDA kernel for seq mode (much faster)

    from rwkv.model import RWKV
    from rwkv.utils import PIPELINE, PIPELINE_ARGS

    # download models: https://huggingface.co/BlinkDL
    model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-169m/RWKV-4-Pile-169M-20220807-8023', strategy='cpu fp32')
    pipeline = PIPELINE(model, "20B_tokenizer.json") # find it in https://github.com/BlinkDL/ChatRWKV

    ctx = "\nIn a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese."
    print(ctx, end='')

    def my_print(s):
        print(s, end='', flush=True)

    # For alpha_frequency and alpha_presence, see "Frequency and presence penalties":
    # https://platform.openai.com/docs/api-reference/parameter-details
    args = PIPELINE_ARGS(temperature = 1.0, top_p = 0.7,
                         alpha_frequency = 0.25,
                         alpha_presence = 0.25,
                         token_ban = [0],  # ban the generation of some tokens
                         token_stop = [])  # stop generation whenever you see any token here

    pipeline.generate(ctx, token_count=512, args=args, callback=my_print)
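For reference, the strategy strings from the examples above just get passed to the RWKV constructor the same way. A minimal sketch (the model path here is only a placeholder, not a real file) for splitting layers across two GPUs with the rest on the CPU:

    from rwkv.model import RWKV

    # Placeholder path -- substitute whichever model file you actually downloaded.
    model = RWKV(model='/path/to/your-rwkv-model',
                 strategy='cuda:0 fp16 *10 -> cuda:1 fp16 *8 -> cpu fp32')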
I kind of want to know what happens in the story...
KerfuffleV2 t1_jboquv7 wrote
Reply to comment by Select_Beautiful8 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
If it helps, I was also able to get the 7B model going on a GTX 1060 with 6GB VRAM. The strategy I used was `cuda fp16i8 *16 -> cpu fp32`. Starting out with about 1.2G of VRAM already in use from other programs and the desktop environment, usage went up to about 5.6G, which works out to roughly 0.275G per layer. So on a 6GB card with `fp16i8`, even with totally free VRAM it seems like you could load 21, maybe 22 layers at the maximum, and about half that for the normal `fp16` format. The model: `RWKV-4-Pile-7B-20230109-ctx4096`
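Here's a rough sketch of how I'd turn that number into a strategy string. This is just back-of-the-envelope math; the helper function and the ~0.275G/layer figure come only from my single measurement above, so treat it as an estimate:

    # Rough layer budgeting for an fp16i8 strategy, based on the ~0.275G/layer
    # measured above (roughly double that for plain fp16). Estimate only.
    def estimate_gpu_layers(free_vram_gb, per_layer_gb=0.275, headroom_gb=0.5):
        """How many layers should fit on the GPU, leaving a little headroom."""
        return max(0, int((free_vram_gb - headroom_gb) / per_layer_gb))

    free_vram = 6.0 - 1.2  # 6GB card minus ~1.2G already in use
    n_layers = estimate_gpu_layers(free_vram)
    strategy = f'cuda fp16i8 *{n_layers} -> cpu fp32'
    print(strategy)  # prints 'cuda fp16i8 *15 -> cpu fp32' (close to the *16 I actually used)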
It generates a token every 2-3 seconds, which is too slow for interactive use but still pretty impressive considering the model size and how old the hardware is (my CPU is just a Ryzen 5 1600, too), especially since half the layers are running on the CPU. By the way, it also uses about 14GB of RAM to run, so you'll need a decent amount of system memory available as well.
Tagging /u/bo_peng as well in case this information is helpful for them. (One interesting thing I noticed is that the GPU was only busy about 50% of the time, presumably while the CPU layers were being run. I don't know if it's possible, but if there were some way to run both in parallel, it seems like it would roughly double the speed of token generation.)