Submitted by WigglyHypersurface t3_10jka1r in MachineLearning
One common place where LLM performance falls short is on words split apart by the model's tokenizer. I'm surprised I can't find anyone who has proposed swapping the embedding layer for an embedding bag layer, where the bagged embedding is a sum of embeddings of the token's character n-grams, as in fastText word embeddings (this helps the model learn faster on smaller corpora and yields better representations for rare words). Has anyone seen work that tries this?
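To make the idea concrete, here's a minimal sketch of what I mean, using PyTorch's `nn.EmbeddingBag` with `mode='sum'` to pool hashed character n-grams per token, fastText-style. The class name, bucket count, embedding size, and n-gram range are all illustrative, and Python's built-in `hash` stands in for fastText's FNV hashing (it isn't stable across runs, so a real version would use a fixed hash):

```python
import torch
import torch.nn as nn

class NgramBagEmbedding(nn.Module):
    """Sketch of a fastText-style input layer: each token's embedding is the
    sum of embeddings of its character n-grams, hashed into fixed buckets."""

    def __init__(self, num_buckets=100_000, dim=256, n_min=3, n_max=6):
        super().__init__()
        self.n_min, self.n_max = n_min, n_max
        self.num_buckets = num_buckets
        # mode='sum' gives the fastText-style summation over n-gram vectors
        self.bag = nn.EmbeddingBag(num_buckets, dim, mode='sum')

    def _ngrams(self, token: str):
        # Boundary markers, as in fastText, so prefixes/suffixes are distinct
        s = f"<{token}>"
        grams = [s[i:i + n]
                 for n in range(self.n_min, self.n_max + 1)
                 for i in range(len(s) - n + 1)]
        return grams or [s]

    def forward(self, tokens: list[str]) -> torch.Tensor:
        # Flatten all n-gram hashes; offsets mark where each token's bag starts
        indices, offsets = [], []
        for tok in tokens:
            offsets.append(len(indices))
            indices.extend(hash(g) % self.num_buckets for g in self._ngrams(tok))
        return self.bag(torch.tensor(indices), torch.tensor(offsets))

# One vector per token, built from its character n-grams
emb = NgramBagEmbedding()
vectors = emb(["tokenization", "untokenizable"])  # shape: (2, 256)
```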
dojoteef t1_j5l399n wrote
This has been studied quite a bit. You can just follow the citation graph of the fastText paper, "Enriching Word Vectors with Subword Information".
For example, people have investigated sampling different subword tokenizations during training ("Stochastic Tokenization with a Language Model for Neural Text Classification") and character-aware embeddings ("CharBERT: Character-aware Pre-trained Language Model").
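For a quick feel of the first idea, a common way to sample different tokenizations is SentencePiece's subword regularization (not the exact method of the cited paper); a minimal sketch, assuming a trained SentencePiece model at the placeholder path `"spm.model"`:

```python
import sentencepiece as spm

# Placeholder path to a trained SentencePiece model
sp = spm.SentencePieceProcessor(model_file="spm.model")

text = "untokenizable words hurt LLM performance"
for _ in range(3):
    # enable_sampling draws a different segmentation from the lattice each call,
    # so the model sees varying splits of the same word across training steps
    pieces = sp.encode(text, out_type=str, enable_sampling=True,
                       alpha=0.1, nbest_size=-1)
    print(pieces)
```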