Submitted by WigglyHypersurface t3_10jka1r in MachineLearning
One common place where LLM performance falls short is on words split apart by the model's tokenizer. I'm surprised I can't find anyone who has proposed swapping the embedding layer for an embedding bag layer, where the bagged embedding is a sum of embeddings of the token's character n-grams, as in fastText word embeddings (this helps the model learn faster on smaller corpora and yields better representations for rare words). Has anyone seen work that tries this?
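To make the idea concrete, here's a minimal sketch of what I mean, using PyTorch's `nn.EmbeddingBag` with `mode='sum'` to pool hashed character n-grams per token, fastText-style. The class name, bucket count, embedding size, and n-gram range are all illustrative, and Python's built-in `hash` stands in for fastText's FNV hashing (it isn't stable across runs, so a real version would use a fixed hash):

```python
import torch
import torch.nn as nn

class NgramBagEmbedding(nn.Module):
    """Sketch of a fastText-style input layer: each token's embedding is the
    sum of embeddings of its character n-grams, hashed into fixed buckets."""

    def __init__(self, num_buckets=100_000, dim=256, n_min=3, n_max=6):
        super().__init__()
        self.n_min, self.n_max = n_min, n_max
        self.num_buckets = num_buckets
        # mode='sum' gives the fastText-style summation over n-gram vectors
        self.bag = nn.EmbeddingBag(num_buckets, dim, mode='sum')

    def _ngrams(self, token: str):
        # Boundary markers, as in fastText, so prefixes/suffixes are distinct
        s = f"<{token}>"
        grams = [s[i:i + n]
                 for n in range(self.n_min, self.n_max + 1)
                 for i in range(len(s) - n + 1)]
        return grams or [s]

    def forward(self, tokens: list[str]) -> torch.Tensor:
        # Flatten all n-gram hashes; offsets mark where each token's bag starts
        indices, offsets = [], []
        for tok in tokens:
            offsets.append(len(indices))
            indices.extend(hash(g) % self.num_buckets for g in self._ngrams(tok))
        return self.bag(torch.tensor(indices), torch.tensor(offsets))

# One vector per token, built from its character n-grams
emb = NgramBagEmbedding()
vectors = emb(["tokenization", "untokenizable"])  # shape: (2, 256)
```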
dojoteef t1_j5l399n wrote
This has been studied quite a bit. You can just follow the citation graph of the fastText paper, "Enriching Word Vectors with Subword Information".
For example, people have investigated sampling different subword tokenizations during training ("Stochastic Tokenization with a Language Model for Neural Text Classification") and character-aware embeddings ("CharBERT: Character-aware Pre-trained Language Model").
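For a quick feel of the first idea, a common way to sample different tokenizations is SentencePiece's subword regularization (not the exact method of the cited paper); a minimal sketch, assuming a trained SentencePiece model at the placeholder path `"spm.model"`:

```python
import sentencepiece as spm

# Placeholder path to a trained SentencePiece model
sp = spm.SentencePieceProcessor(model_file="spm.model")

text = "untokenizable words hurt LLM performance"
for _ in range(3):
    # enable_sampling draws a different segmentation from the lattice each call,
    # so the model sees varying splits of the same word across training steps
    pieces = sp.encode(text, out_type=str, enable_sampling=True,
                       alpha=0.1, nbest_size=-1)
    print(pieces)
```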