WigglyHypersurface

WigglyHypersurface OP t1_j5ldsn7 wrote

The reason I'm curious is that FastText embeddings tend to work better on small corpora. I'm wondering whether, if you took one of the small-data-efficient LLMs that you can train yourself on a few A100s (like ELECTRA) and swapped its embeddings for a bag of character n-grams, you'd see further gains on small training sets.
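To make the bag-of-character-n-grams idea concrete, here is a minimal PyTorch sketch of a FastText-style embedding layer that represents a word as the sum of hashed character n-gram vectors. The bucket count, n-gram range, and embedding dimension are illustrative choices, not taken from any particular paper or model.

```python
import torch
import torch.nn as nn


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6):
    """Character n-grams of a word, with FastText-style boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]


class NgramEmbedding(nn.Module):
    """Embed a word as the sum of hashed character n-gram vectors."""

    def __init__(self, num_buckets: int = 2 ** 20, dim: int = 256):
        super().__init__()
        self.num_buckets = num_buckets
        self.table = nn.EmbeddingBag(num_buckets, dim, mode="sum")

    def forward(self, words):
        ids, offsets = [], []
        for word in words:
            offsets.append(len(ids))
            # Python's built-in hash is salted per process; fine for a sketch.
            ids.extend(hash(g) % self.num_buckets for g in char_ngrams(word))
        ids = torch.tensor(ids, dtype=torch.long)
        offsets = torch.tensor(offsets, dtype=torch.long)
        return self.table(ids, offsets)


# "refactored" and "calmed" now share buckets for n-grams like "ed>",
# so the -ed morphology is shared regardless of token frequency.
emb = NgramEmbedding()
vectors = emb(["refactored", "calmed"])  # shape: (2, 256)
```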

1

WigglyHypersurface OP t1_j5l3vlk wrote

I have - the whole point of my post is that this limits information sharing across tokens, depending on the split.

So, for example, if the tokenizer splits the -ed off the end of a rare verb like "refactored" but not off a common verb like "calmed", it splits the representations for that verbal morphology in two, when really those -ed endings serve the same function.
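A quick way to see this kind of asymmetry for yourself is to inspect how a subword tokenizer splits a rare vs. a common -ed verb. The GPT-2 tokenizer here is just an example; the exact splits depend on the tokenizer and its training corpus.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

for word in ["refactored", "calmed"]:
    print(word, "->", tok.tokenize(word))
# A rare verb tends to be broken into several pieces while a frequent verb
# stays as one token, so the shared -ed morphology ends up in different
# embeddings depending on word frequency.
```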

5

WigglyHypersurface t1_iw7qykq wrote

Search for MIWAE and notMIWAE to find the papers on the technique.

If your data is small and tabular, then you can't really beat Bayes. If your data is too big for Bayes but still tabular, then random forest imputation is pretty good. Or, if you have specific hypotheses you know you will test, you can do MICE with SMCFCS.
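For the random forest imputation, one common recipe is scikit-learn's IterativeImputer with a forest as the estimator, in the spirit of missForest. The data and hyperparameters below are placeholders, not a tuned setup.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[rng.random(X.shape) < 0.1] = np.nan  # knock out ~10% of entries

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
```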

The real utility of the (M)IWAE, I think, is when you have non-tabular data with missing values. This is my use case: I have to impute a mixture of audio, string, and tabular data.
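For reference, here is a minimal sketch of the MIWAE objective for the tabular case, assuming a Gaussian decoder and zero-filling of missing inputs to the encoder. Layer sizes are illustrative, and handling audio or strings would of course need different encoders and decoders.

```python
import math

import torch
import torch.nn as nn
from torch.distributions import Normal


class MIWAE(nn.Module):
    def __init__(self, d_in: int, d_latent: int = 8, d_hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.Tanh(), nn.Linear(d_hidden, 2 * d_latent)
        )
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, d_hidden), nn.Tanh(), nn.Linear(d_hidden, 2 * d_in)
        )
        self.prior = Normal(0.0, 1.0)

    def loss(self, x: torch.Tensor, mask: torch.Tensor, k: int = 20) -> torch.Tensor:
        """Negative MIWAE bound; `mask` is 1 where x is observed, 0 where missing."""
        x_filled = torch.where(mask.bool(), x, torch.zeros_like(x))
        mu_z, log_sigma_z = self.encoder(x_filled).chunk(2, dim=-1)
        q_z = Normal(mu_z, log_sigma_z.exp())
        z = q_z.rsample((k,))                        # (k, batch, d_latent)

        mu_x, log_sigma_x = self.decoder(z).chunk(2, dim=-1)
        p_x = Normal(mu_x, log_sigma_x.exp())

        # Likelihood is summed over observed entries only.
        log_px = (p_x.log_prob(x) * mask).sum(-1)    # (k, batch)
        log_pz = self.prior.log_prob(z).sum(-1)
        log_qz = q_z.log_prob(z).sum(-1)

        log_w = log_px + log_pz - log_qz
        bound = torch.logsumexp(log_w, dim=0) - math.log(k)
        return -bound.mean()


# Usage: model = MIWAE(d_in=10); loss = model.loss(x, mask); loss.backward()
```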

2