Submitted by WigglyHypersurface in MachineLearning
dojoteef wrote
This has been studied quite a bit. You can just follow the citation graph of the fastText paper, "Enriching Word Vectors with Subword Information".
For example, people have investigated sampling different subword tokenizations during training ("Stochastic Tokenization with a Language Model for Neural Text Classification") and character-aware embeddings ("CharBERT: Character-aware Pre-trained Language Model").
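If it helps, the core fastText idea is to represent a word by its character n-grams (plus the word itself) and sum or average their embeddings. Here's a rough sketch of the n-gram extraction; the 3-6 range and the boundary markers follow the paper, everything else is just illustrative:

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Character n-grams with fastText-style boundary markers."""
    w = f"<{word}>"
    grams = [w[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]  # fastText also keeps the whole word as a unit

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ..., '<where>']
```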
WigglyHypersurface OP wrote
Thanks, these are helpful. It seems like "embedding bag" is a term used in ML libraries but not always in papers.
Edit: from a quick look, neither of these is actually just an embedding bag; rather, they are different approaches to incorporating subword information.
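For concreteness, by "embedding bag" I mean something like PyTorch's nn.EmbeddingBag, which looks up a bag of embedding rows and reduces them (sum/mean/max) in one call. A minimal sketch (the vocabulary size, IDs, and dimensions are made up):

```python
import torch
import torch.nn as nn

# fastText-style word vector = mean of the word's subword embeddings.
num_subwords, dim = 1000, 16
bag = nn.EmbeddingBag(num_subwords, dim, mode="mean")

# Two "words", each a bag of subword IDs, packed into one flat tensor;
# offsets marks where each word's subword IDs begin.
subword_ids = torch.tensor([3, 41, 58, 7, 12])  # word 1: [3, 41, 58]; word 2: [7, 12]
offsets = torch.tensor([0, 3])

word_vectors = bag(subword_ids, offsets)
print(word_vectors.shape)  # torch.Size([2, 16])
```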