Submitted by madmax_br5 t3_10mbct5 in MachineLearning
madmax_br5 OP t1_j62bm6c wrote
Reply to comment by PassingTumbleweed in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Why not just increase the vocabulary size so that each symbol in a logographic language has one token? Logograms are already a tokenization scheme, in effect! I'd love to understand more about the practical effects of a larger vocabulary on model compute requirements.
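(For a rough sense of scale, here's a back-of-envelope sketch of how the embedding table alone grows with vocabulary size. The hidden size of 1024 is just an illustrative assumption, not taken from any particular model:)

```python
# Back-of-envelope: the input embedding table has vocab_size x hidden_dim parameters,
# and a tied output softmax reuses the same matrix.
hidden_dim = 1024  # illustrative hidden size, not from any specific model
for vocab_size in (50_000, 100_000, 200_000):
    params = vocab_size * hidden_dim
    print(f"vocab {vocab_size:>7,}: {params / 1e6:6.1f}M embedding parameters")
```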
PassingTumbleweed t1_j62bzdk wrote
You can totally do that. There are tricks to reduce memory usage, too, such as the embedding factorization used in ALBERT.
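A minimal sketch of what that factorization looks like, assuming PyTorch; the class name and dimensions are illustrative, not ALBERT's actual implementation:

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """ALBERT-style factorization: a V x E lookup table followed by an E -> H projection."""
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, embed_dim)   # V x E table
        self.project = nn.Linear(embed_dim, hidden_dim)     # E -> H projection

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.project(self.lookup(token_ids))

# Illustration: growing the vocab to 200k tokens (hypothetical sizes).
# Full embedding at H=1024: 200_000 * 1024 ≈ 205M parameters.
# Factorized with E=128:    200_000 * 128 + 128 * 1024 ≈ 25.7M parameters.
embedding = FactorizedEmbedding(vocab_size=200_000, embed_dim=128, hidden_dim=1024)
```

Because the hidden size only enters through the small E x H projection, the cost of a much larger vocabulary grows with the (much smaller) embedding dimension rather than the full model width.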
The best part is that none of these options is precluded by Unicode. In fact, Unicode has nothing to do with it!
madmax_br5 OP t1_j62d75y wrote
I get that now, thanks! I'm not an ML expert, so this is very helpful!