dojoteef t1_j5l399n wrote

This has been studied quite a bit. You can just follow the citation graph of the fastText paper: Enriching Word Vectors with Subword Information
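For context, the fastText idea is to represent a word as the sum of embeddings of its character n-grams. Here's a minimal PyTorch sketch of that (the `char_ngrams` helper, hash bucketing, and n-gram range are simplified illustrative stand-ins, not fastText's actual implementation):

```python
import torch
import torch.nn as nn

def char_ngrams(word, n_min=3, n_max=6):
    # Boundary markers as in the fastText paper: "where" -> "<where>"
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# Hash n-grams into a fixed-size table; fastText also hashes n-grams,
# though with a different hash function and bucket count.
NUM_BUCKETS, DIM = 2 ** 20, 100
table = nn.EmbeddingBag(NUM_BUCKETS, DIM, mode="sum")

def word_vector(word):
    ids = [hash(g) % NUM_BUCKETS for g in char_ngrams(word)]
    # One bag per word: the word vector is the sum of its subword vectors.
    return table(torch.tensor([ids]))

print(word_vector("where").shape)  # torch.Size([1, 100])
```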

For example, people have investigated sampling different subword tokenizations during training (Stochastic Tokenization with a Language Model for Neural Text Classification) and character-aware embeddings (CharBERT: Character-aware Pre-trained Language Model).
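The stochastic-tokenization idea, roughly: at training time, sample one of several valid subword segmentations of each word instead of always using the canonical one, which acts as a regularizer. A toy sketch with a hard-coded candidate set (a real system would enumerate segmentations licensed by its subword vocabulary, e.g. via BPE merges or a unigram LM):

```python
import random

def tokenizations(word):
    # Hypothetical candidate segmentations for illustration only.
    return {
        "unhappiness": [["un", "happiness"],
                        ["un", "happi", "ness"],
                        ["unhapp", "iness"]],
    }.get(word, [[word]])

def sample_tokenization(word):
    # Pick a segmentation at random each training step rather than
    # always emitting the single canonical tokenization.
    return random.choice(tokenizations(word))

print(sample_tokenization("unhappiness"))
```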


WigglyHypersurface OP t1_j5l49mq wrote

Thanks, these are helpful. Seems like "embedding bag" is the term used in ML libraries but not always in papers.

Edit: from a quick look, neither of these is actually just an embedding bag; they're different approaches to incorporating subword information.
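For anyone unfamiliar with the term: in PyTorch the library primitive is `torch.nn.EmbeddingBag`, which looks up a bag of indices and reduces them (sum, mean, or max) in one fused step. A minimal usage sketch; the subword IDs and sizes below are made up for illustration:

```python
import torch
import torch.nn as nn

# 10k subword vocabulary, 64-dim vectors, reduced by summation.
bag = nn.EmbeddingBag(num_embeddings=10_000, embedding_dim=64, mode="sum")

# Two "words" packed into one flat list of (made-up) subword IDs;
# offsets mark where each word's subwords begin.
subword_ids = torch.tensor([17, 256, 9012, 3, 42])  # word 1: 3 ids, word 2: 2 ids
offsets = torch.tensor([0, 3])

vectors = bag(subword_ids, offsets)
print(vectors.shape)  # torch.Size([2, 64]) -- one summed vector per word
```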
