Submitted by madmax_br5 t3_10mbct5 in MachineLearning
PassingTumbleweed t1_j62anc3 wrote
You could solve the problem you describe at the tokenization level without moving away from Unicode, which is more about how text is encoded for storage and transmission than about how models consume it.
For example, say you still represent your text as Unicode at rest, but you have a tokenizer that budgets its vocab space such that the average number of tokens per sentence is the same across languages (or whatever your fairness criterion is).
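A minimal sketch of that fairness criterion: measure average tokens per sentence under a given tokenizer, per language, and compare. The toy word-level and character-level tokenizers below stand in for real subword models (e.g. BPE); the sentences and numbers are purely illustrative.

```python
# Hypothetical sketch of the fairness criterion described above:
# compare the average number of tokens per sentence across languages.
# A fairness-aware tokenizer would adjust its vocab budget until
# these averages are roughly equal.

def avg_tokens_per_sentence(sentences, tokenize):
    counts = [len(tokenize(s)) for s in sentences]
    return sum(counts) / len(counts)

# Toy tokenizers standing in for real subword models:
word_level = str.split  # suits space-delimited languages
char_level = list       # suits logographic scripts

english = ["the cat sat on the mat", "tokenizers split text"]
chinese = ["猫坐在垫子上", "分词器切分文本"]

print(avg_tokens_per_sentence(english, word_level))  # 4.5
print(avg_tokens_per_sentence(chinese, char_level))  # 6.5
```

A real version would plug in a trained subword tokenizer and a large parallel corpus, then tune the vocabulary allocation until the per-language averages converge.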
madmax_br5 OP t1_j62bm6c wrote
Why not just increase the vocabulary size so that each symbol in a logographic language has one token? Logograms are already a tokenization scheme, in effect! Would love to understand more about the practical effects of a larger vocabulary on model compute requirements.
PassingTumbleweed t1_j62bzdk wrote
You can totally do that. There are also tricks to reduce the memory cost of a large vocabulary, such as the embedding factorization used in ALBERT.
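The embedding factorization mentioned above can be sketched with a simple parameter count: instead of a direct vocab_size × hidden embedding matrix, ALBERT uses a vocab_size × E matrix followed by an E × hidden projection, with E much smaller than hidden. The dimensions below are illustrative, not ALBERT's exact configuration.

```python
# Sketch of ALBERT-style factorized embeddings, counting parameters only.
# Direct: one vocab_size x hidden lookup table.
# Factorized: vocab_size x e lookup, then an e x hidden projection.

def direct_params(vocab, hidden):
    return vocab * hidden

def factorized_params(vocab, hidden, e):
    return vocab * e + e * hidden

# Illustrative sizes (not ALBERT's published config):
vocab, hidden, e = 100_000, 768, 128

print(direct_params(vocab, hidden))         # 76800000
print(factorized_params(vocab, hidden, e))  # 12898304
```

The savings grow with vocabulary size, which is exactly why the trick matters if you expand the vocab to give each logogram its own token.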
The best part is, none of these options are precluded by Unicode. Unicode in fact has nothing to do with it!
madmax_br5 OP t1_j62d75y wrote
I get that now, thanks! Not an ML expert so this is very helpful!