
danilo62 OP t1_jeextek wrote

Oh, so those models can produce fixed-size embeddings of texts? I wasn't aware of that

2

-pkomlytyrg t1_jef4weq wrote

Generally, yes. If you use a model with a long context length (BigBird or OpenAI’s ada-002), you’ll likely be fine unless the articles you’re embedding exceed the token limit. If you’re using BERT or another, smaller model, you have to chunk and average; that can produce fixed-size vectors, but you gotta put the work in haha
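For example, here's a minimal sketch of that chunk-and-average idea, assuming `bert-base-uncased` via Hugging Face `transformers` and mean pooling (the model and pooling choices are illustrative, not the only way to do it):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_long_text(text: str, chunk_size: int = 512) -> torch.Tensor:
    """Return one fixed-size vector for a text of arbitrary length."""
    # Tokenize without truncation, then split into model-sized chunks,
    # reserving 2 slots per chunk for the [CLS]/[SEP] special tokens.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = chunk_size - 2
    chunks = [ids[i:i + step] for i in range(0, len(ids), step)]

    chunk_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            # Re-add [CLS]/[SEP] so each chunk looks like a normal input.
            input_ids = torch.tensor(
                [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]]
            )
            hidden = model(input_ids).last_hidden_state  # (1, seq_len, 768)
            chunk_vectors.append(hidden.mean(dim=1))     # mean-pool tokens

    # Average the per-chunk embeddings into a single 768-dim vector.
    return torch.cat(chunk_vectors).mean(dim=0)

vector = embed_long_text("some long article text ...")
print(vector.shape)  # torch.Size([768])
```

Averaging chunk vectors is the simplest aggregation; weighting chunks by length or using the [CLS] token instead of mean pooling are common variations.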

1

danilo62 OP t1_jefnror wrote

Yeah, I'm gonna try both options (with BERT and the bigger models), but since I'm working with a big dataset I'm not sure I'll be able to use the larger models due to the token and request limits. Thanks for the help
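One way to stretch a big dataset under those request limits is batching with exponential backoff; here's a rough sketch assuming OpenAI's Python client (the batch size and retry counts are arbitrary placeholders):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_articles(articles, batch_size=100, max_retries=5):
    """Embed a list of article strings, batching to respect rate limits."""
    vectors = []
    for i in range(0, len(articles), batch_size):
        batch = articles[i:i + batch_size]
        for attempt in range(max_retries):
            try:
                resp = client.embeddings.create(
                    model="text-embedding-ada-002", input=batch
                )
                vectors.extend(d.embedding for d in resp.data)
                break
            except Exception:
                # Back off and retry on rate-limit or transient errors.
                time.sleep(2 ** attempt)
        else:
            raise RuntimeError(f"batch starting at {i} failed after retries")
    return vectors
```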

1