KerfuffleV2 t1_jaiz1k8 wrote
Reply to comment by bo_peng in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Unfortunately, that doesn't work on the old reddit layout. We just see a garbled mess.
Here's a fixed version of the code/examples:
(not my content)
Strategy examples:

`'cuda:0 fp16 *10 -> cuda:1 fp16 *8 -> cpu fp32'`
= first 10 layers on cuda:0 in fp16, then 8 layers on cuda:1 in fp16, then the rest on the cpu in fp32

`'cuda fp16 *20+'`
= first 20 layers on cuda in fp16, then stream the remaining layers on the same device
    import os
    os.environ['RWKV_JIT_ON'] = '1'
    os.environ['RWKV_CUDA_ON'] = '0' # if '1' then compile CUDA kernel for seq mode (much faster)

    from rwkv.model import RWKV
    from rwkv.utils import PIPELINE, PIPELINE_ARGS

    # download models: https://huggingface.co/BlinkDL
    model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-169m/RWKV-4-Pile-169M-20220807-8023', strategy='cpu fp32')
    pipeline = PIPELINE(model, "20B_tokenizer.json") # find it in https://github.com/BlinkDL/ChatRWKV

    ctx = "\nIn a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese."
    print(ctx, end='')

    def my_print(s):
        print(s, end='', flush=True)

    # For alpha_frequency and alpha_presence, see "Frequency and presence penalties":
    # https://platform.openai.com/docs/api-reference/parameter-details
    args = PIPELINE_ARGS(temperature = 1.0, top_p = 0.7,
                         alpha_frequency = 0.25,
                         alpha_presence = 0.25,
                         token_ban = [0],  # ban the generation of some tokens
                         token_stop = [])  # stop generation whenever you see any token here

    pipeline.generate(ctx, token_count=512, args=args, callback=my_print)
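For reference, the strategy strings from the examples above just get passed to the RWKV constructor the same way. A minimal sketch (the model path here is only a placeholder, not a real file) for splitting layers across two GPUs with the rest on the CPU:

    from rwkv.model import RWKV

    # Placeholder path -- substitute whichever model file you actually downloaded.
    model = RWKV(model='/path/to/your-rwkv-model',
                 strategy='cuda:0 fp16 *10 -> cuda:1 fp16 *8 -> cpu fp32')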
I kind of want to know what happens in the story...
KerfuffleV2 t1_jboquv7 wrote
Reply to comment by Select_Beautiful8 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
If it helps, I was also able to get the 7B model going on a GTX 1060 with 6GB VRAM. The strategy I used was `cuda fp16i8 *16 -> cpu fp32`. Starting out with about 1.2G of VRAM already in use from other programs and the desktop environment, usage went up to about 5.6G, which works out to roughly 0.275G per layer. So on a 6GB card with `fp16i8`, even with totally free VRAM it seems like you could load 21, maybe 22 layers at the maximum, and about half that for the normal `fp16` format. The model: `RWKV-4-Pile-7B-20230109-ctx4096`
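Here's a rough sketch of how I'd turn that number into a strategy string. This is just back-of-the-envelope math; the helper function and the ~0.275G/layer figure come only from my single measurement above, so treat it as an estimate:

    # Rough layer budgeting for an fp16i8 strategy, based on the ~0.275G/layer
    # measured above (roughly double that for plain fp16). Estimate only.
    def estimate_gpu_layers(free_vram_gb, per_layer_gb=0.275, headroom_gb=0.5):
        """How many layers should fit on the GPU, leaving a little headroom."""
        return max(0, int((free_vram_gb - headroom_gb) / per_layer_gb))

    free_vram = 6.0 - 1.2  # 6GB card minus ~1.2G already in use
    n_layers = estimate_gpu_layers(free_vram)
    strategy = f'cuda fp16i8 *{n_layers} -> cpu fp32'
    print(strategy)  # prints 'cuda fp16i8 *15 -> cpu fp32' (close to the *16 I actually used)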
It generates a token every 2-3 seconds, which is too slow for interactive use but still pretty impressive considering the model size and how old the hardware is (my CPU is just a Ryzen 5 1600, too), especially since half the layers are running on the CPU. By the way, it also uses about 14GB of RAM to run, so you'll need a decent amount of system memory available as well.
Tagging /u/bo_peng as well in case this information is helpful for them. (One interesting thing I noticed is that the GPU was only busy about 50% of the time, presumably while the CPU layers were being run. I don't know if it's possible, but if there were some way to run both in parallel, it seems like it would roughly double the speed of token generation.)