Submitted by granddaddy t3_zjf45w in MachineLearning
granddaddy OP t1_j051ykd wrote
Reply to comment by rafgro in [D] Getting around GPT-3's 4k token limit? by granddaddy
I'm having a hard time wrapping my head around this. Do you think you could elaborate further? Do you have a github repo by chance?
rafgro t1_j05enj4 wrote
Example tokenizer: https://github.com/josephrocca/gpt-2-3-tokenizer. In the most vanilla version you could count occurrences of tokens from the question/task in the document and jump to that place, e.g. if the task is about lung cancer, jump to the book chapter with the most occurrences of "lung" and "cancer". That works well enough, but you can make it more robust by building a simple scoring system (e.g. a higher weight assigned to "lung" than to "cancer"), finding words related to the task words with a word2vec-style model and searching for those with appropriate weights as well, or even splicing a few different high-scoring passages into one prompt.
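Here is a minimal sketch of that vanilla version: split the document into chunks, score each chunk by weighted keyword counts, and feed the best chunk into the prompt. The chunking scheme, whitespace-based word counting, and the weights are illustrative assumptions on my part (not from the comment above, which uses the linked tokenizer instead):

```python
# Sketch: pick the document chunk with the most (weighted) occurrences of the
# task's keywords, then use it as context for a GPT-3 prompt.
import re
from collections import Counter


def score_chunk(chunk: str, keywords: dict[str, float]) -> float:
    """Sum weighted occurrences of each keyword in the chunk."""
    words = Counter(re.findall(r"[a-z']+", chunk.lower()))
    return sum(weight * words[kw] for kw, weight in keywords.items())


def best_chunk(document: str, keywords: dict[str, float], chunk_size: int = 3000) -> str:
    """Split the document into fixed-size character chunks and return the
    highest-scoring one (a crude stand-in for 'jumping to the chapter with
    the most occurrences')."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    return max(chunks, key=lambda c: score_chunk(c, keywords))


# Example task about lung cancer, with "lung" weighted above "cancer"
# (the weights here are made-up numbers to illustrate the scoring idea).
keywords = {"lung": 2.0, "cancer": 1.0}
# context = best_chunk(open("book.txt").read(), keywords)
# prompt = f"{context}\n\nQuestion: What are the early symptoms of lung cancer?"
```

A more robust version would replace the raw counts with scores over word2vec-expanded keyword sets and could concatenate the top few chunks (within the token budget) instead of only the single best one.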
granddaddy OP t1_j05pe9j wrote
Very helpful. Appreciate the link. Is that your repo?
rafgro t1_j05v581 wrote
No, I think it's a fork of AI Dungeon's encoder.