Submitted by WigglyHypersurface in MachineLearning
dojoteef wrote
This has been studied quite a bit. You can just follow the citation graph of the fastText paper, "Enriching Word Vectors with Subword Information".
For example, people have investigated sampling different subword tokenizations during training ("Stochastic Tokenization with a Language Model for Neural Text Classification") and character-aware embeddings ("CharBERT: Character-aware Pre-trained Language Model").
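If it helps, the core fastText idea is to represent a word by its character n-grams (plus the word itself) and sum or average their embeddings. Here's a rough sketch of the n-gram extraction; the 3-6 range and the boundary markers follow the paper, everything else is just illustrative:

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Character n-grams with fastText-style boundary markers."""
    w = f"<{word}>"
    grams = [w[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]  # fastText also keeps the whole word as a unit

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ..., '<where>']
```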
WigglyHypersurface OP wrote
Thanks, these are helpful. It seems like "embedding bag" is a term used in ML libraries but not always in papers.
Edit: from a quick look, neither of these is actually just an embedding bag; rather, they are different approaches to incorporating subword information.
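For concreteness, by "embedding bag" I mean something like PyTorch's nn.EmbeddingBag, which looks up a bag of embedding rows and reduces them (sum/mean/max) in one call. A minimal sketch (the vocabulary size, IDs, and dimensions are made up):

```python
import torch
import torch.nn as nn

# fastText-style word vector = mean of the word's subword embeddings.
num_subwords, dim = 1000, 16
bag = nn.EmbeddingBag(num_subwords, dim, mode="mean")

# Two "words", each a bag of subword IDs, packed into one flat tensor;
# offsets marks where each word's subword IDs begin.
subword_ids = torch.tensor([3, 41, 58, 7, 12])  # word 1: [3, 41, 58]; word 2: [7, 12]
offsets = torch.tensor([0, 3])

word_vectors = bag(subword_ids, offsets)
print(word_vectors.shape)  # torch.Size([2, 16])
```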