Submitted by madmax_br5 t3_10mbct5 in MachineLearning
madmax_br5 OP t1_j629re3 wrote
Reply to comment by gradientpenalty in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Right, but BPE is designed to compress alphabetic languages (multiple letters per word), whereas logographic languages are already compressed (one or more words per symbol, but a larger total symbol inventory). I suppose I don't see why efficiency at this step matters so much, or why it's necessary at all. What is the relationship between vocabulary size and a model's computational requirements? If the model's input is ultimately an embedding with a fixed number of dimensions, does the token vocabulary size really make much practical difference?
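For what it's worth, here is a rough back-of-the-envelope sketch (all sizes are assumptions for illustration, not from the thread) of where vocabulary size actually shows up in a decoder-only transformer: it sets the size of the embedding table and the output logits projection, while the per-layer cost for a fixed context length depends only on the hidden dimension.

```python
# Back-of-the-envelope sketch of where vocabulary size enters the compute.
# All sizes below are assumed for illustration, not taken from the thread.

def vocab_dependent_params(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Parameters attributable to the vocabulary: the input embedding table and
    the output (unembedding) projection, each of shape (vocab_size, d_model),
    unless the two are weight-tied."""
    tables = 1 if tied else 2
    return tables * vocab_size * d_model

def output_logits_macs(seq_len: int, d_model: int, vocab_size: int) -> int:
    """Approximate multiply-accumulates for the final logits matmul per forward
    pass: (seq_len x d_model) @ (d_model x vocab_size)."""
    return seq_len * d_model * vocab_size

d_model, seq_len = 4096, 2048  # assumed hidden size and context length

for vocab in (32_000, 100_000, 256_000):
    params = vocab_dependent_params(vocab, d_model)
    macs = output_logits_macs(seq_len, d_model, vocab)
    print(f"vocab={vocab:>7,}: ~{params / 1e6:,.0f}M vocab-dependent params, "
          f"~{macs / 1e9:,.0f}G MACs in the logits matmul")
```

Under these assumed sizes, growing the vocabulary only inflates the embedding and output layers (and the final softmax matmul); the rest of the network just sees a sequence of d_model-dimensional vectors, which is the crux of the question above.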