
madmax_br5 OP t1_j62bm6c wrote

Why not just increase the vocabulary size so that each symbol in a logographic language has one token? Logograms are already a tokenization scheme, in effect! Would love to understand more about the practical effects of a larger vocabulary on model compute requirements.
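For intuition on the compute question: the dominant cost of a bigger vocabulary is the input embedding table and the output softmax projection, both of which grow linearly with vocabulary size. A minimal sketch of that scaling (the vocabulary sizes and hidden dimension below are illustrative, not from this thread):

```python
# Rough parameter count for the parts of a model that scale with
# vocabulary size: the input embedding table (V x H) and the output
# projection back onto the vocabulary (H x V).

def vocab_dependent_params(vocab_size: int, hidden_dim: int) -> int:
    return 2 * vocab_size * hidden_dim

# Illustrative comparison: a ~50k subword vocabulary versus a ~100k
# vocabulary that covers more logograms directly, at hidden_dim=768.
for vocab in (50_000, 100_000):
    print(vocab, f"{vocab_dependent_params(vocab, 768):,}")
# 50000  -> 76,800,000
# 100000 -> 153,600,000
```

Doubling the vocabulary roughly doubles these parameter counts (and the softmax compute per token), while the transformer layers themselves are unaffected.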


PassingTumbleweed t1_j62bzdk wrote

You can totally do that. There are tricks to reduce memory usage, too, such as the embedding factorization used in ALBERT.
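For reference, a minimal PyTorch sketch of the factorization idea from ALBERT (the specific dimensions are illustrative assumptions, not from this thread): instead of one full-rank V x H embedding table, tokens are looked up in a small space of size E and then projected up to the hidden size H.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """ALBERT-style embedding factorization: parameters cost
    V*E + E*H instead of V*H, which matters when V is large."""

    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, embed_dim)  # V x E
        self.project = nn.Linear(embed_dim, hidden_dim)    # E x H

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.project(self.lookup(token_ids))

# With V=100k, E=128, H=768: 100_000*128 + 128*768 ~= 12.9M parameters,
# versus 100_000*768 = 76.8M for a full-rank embedding table.
```

This is why a large (say, logogram-level) vocabulary doesn't have to blow up the parameter budget: the per-token lookup stays cheap, and only the small E x H projection touches the hidden size.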

The best part is, none of these options are precluded by Unicode. Unicode in fact has nothing to do with it!


madmax_br5 OP t1_j62d75y wrote

I get that now, thanks! I'm not an ML expert, so this is very helpful!
