
madmax_br5 OP t1_j62bm6c wrote

Why not just increase the vocabulary size so that each symbol in a logographic language has one token? Logograms are already a tokenization scheme, in effect! Would love to understand more about the practical effects of a larger vocabulary on model compute requirements.
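For intuition on the compute question: the dominant cost of a bigger vocabulary is the input embedding table and the output softmax projection, both of which grow linearly with vocabulary size. A minimal sketch of that scaling (the vocabulary sizes and hidden dimension below are illustrative, not from this thread):

```python
# Rough parameter count for the parts of a model that scale with
# vocabulary size: the input embedding table (V x H) and the output
# projection back onto the vocabulary (H x V).

def vocab_dependent_params(vocab_size: int, hidden_dim: int) -> int:
    return 2 * vocab_size * hidden_dim

# Illustrative comparison: a ~50k subword vocabulary versus a ~100k
# vocabulary that covers more logograms directly, at hidden_dim=768.
for vocab in (50_000, 100_000):
    print(vocab, f"{vocab_dependent_params(vocab, 768):,}")
# 50000  -> 76,800,000
# 100000 -> 153,600,000
```

Doubling the vocabulary roughly doubles these parameter counts (and the softmax compute per token), while the transformer layers themselves are unaffected.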


PassingTumbleweed t1_j62bzdk wrote

You can totally do that. There are tricks to reduce memory usage, too, such as the embedding factorization used in ALBERT.
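For reference, a minimal PyTorch sketch of the factorization idea from ALBERT (the specific dimensions are illustrative assumptions, not from this thread): instead of one full-rank V x H embedding table, tokens are looked up in a small space of size E and then projected up to the hidden size H.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """ALBERT-style embedding factorization: parameters cost
    V*E + E*H instead of V*H, which matters when V is large."""

    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, embed_dim)  # V x E
        self.project = nn.Linear(embed_dim, hidden_dim)    # E x H

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.project(self.lookup(token_ids))

# With V=100k, E=128, H=768: 100_000*128 + 128*768 ~= 12.9M parameters,
# versus 100_000*768 = 76.8M for a full-rank embedding table.
```

This is why a large (say, logogram-level) vocabulary doesn't have to blow up the parameter budget: the per-token lookup stays cheap, and only the small E x H projection touches the hidden size.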

The best part is, none of these options are precluded by Unicode. Unicode in fact has nothing to do with it!


madmax_br5 OP t1_j62d75y wrote

I get that now, thanks! I'm not an ML expert, so this is very helpful!
