
terath t1_j5l8t4k wrote

Oh, I see what you mean. I remember there were some character-level language models, but they fell out of favour to subwords, as, I think, the accuracy difference wasn't enough to justify the extra compute required at the character level.

Reviewing the fastText approach, they still end up hashing the character n-grams rather than training an embedding for each. That could introduce the same sorts of inconsistencies you're observing. That said, the final fastText embeddings are already the sum of the character n-gram embeddings, so I'm not clear on how your approach differs from just using the final fastText embeddings.
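For concreteness, here's a minimal sketch of that composition: each n-gram is hashed into a shared bucket table and the word vector is the sum of its bucket embeddings. The bucket count, dimension, and Python's `hash` are stand-ins (fastText defaults to 2M buckets and uses an FNV-style hash, and also adds a dedicated whole-word vector for in-vocabulary words):

```python
import numpy as np

def char_ngrams(word, nmin=3, nmax=6):
    """Character n-grams of a word wrapped in boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(nmin, nmax + 1)
            for i in range(len(w) - n + 1)]

# Hashing trick: n-grams share a fixed table of buckets instead of each
# getting its own row, so distinct n-grams can collide on one embedding.
NUM_BUCKETS, DIM = 100_000, 100  # toy sizes; fastText uses ~2M buckets
rng = np.random.default_rng(0)
bucket_emb = rng.normal(scale=0.1, size=(NUM_BUCKETS, DIM))

def word_vector(word):
    """fastText-style vector: sum of the word's n-gram bucket embeddings."""
    idx = [hash(g) % NUM_BUCKETS for g in char_ngrams(word)]
    return bucket_emb[idx].sum(axis=0)

v = word_vector("language")  # shape (100,)
```

The collisions happen at the `hash(g) % NUM_BUCKETS` step: two unrelated n-grams can land in the same bucket and be forced to share one embedding.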


WigglyHypersurface OP t1_j5ldsn7 wrote

The reason I'm curious is that fastText embeddings tend to work better on small corpora. I'm wondering whether, if you took one of the small-data-efficient LLMs that you can train yourself on a few A100s (like ELECTRA) and swapped its token embeddings for a bag of character n-grams, you'd see further gains on small training sets.
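A rough sketch of what that swap might look like: a PyTorch layer that replaces the usual token-id lookup with a sum over hashed character n-gram embeddings. The module name, bucket count, and sizes are hypothetical, not anything from ELECTRA or fastText:

```python
import torch
import torch.nn as nn

class NgramBagEmbedding(nn.Module):
    """Hypothetical drop-in for a transformer's token embedding layer:
    each token's vector is the sum of its hashed character n-gram embeddings."""

    def __init__(self, num_buckets=100_000, dim=256, nmin=3, nmax=6):
        super().__init__()
        self.emb = nn.EmbeddingBag(num_buckets, dim, mode="sum")
        self.num_buckets, self.nmin, self.nmax = num_buckets, nmin, nmax

    def bucket_ids(self, token):
        # Boundary markers so prefixes/suffixes get distinct n-grams
        w = f"<{token}>"
        grams = [w[i:i + n]
                 for n in range(self.nmin, self.nmax + 1)
                 for i in range(len(w) - n + 1)]
        # Python's hash() stands in for a stable hash like fastText's FNV
        return torch.tensor([hash(g) % self.num_buckets for g in grams])

    def forward(self, tokens):
        ids = [self.bucket_ids(t) for t in tokens]
        offsets = torch.tensor([0] + [len(i) for i in ids[:-1]]).cumsum(0)
        return self.emb(torch.cat(ids), offsets)

# One summed vector per token; these would feed the transformer
# in place of the usual learned token-id embeddings.
vecs = NgramBagEmbedding()(["the", "electra", "embedding"])
print(vecs.shape)  # torch.Size([3, 256])
```

Since the n-gram embeddings get gradients through the sum, rare words would still share parameters with common ones via overlapping n-grams, which is presumably where the small-corpus gains would come from.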
