Submitted by madmax_br5 t3_10mbct5 in MachineLearning
madmax_br5 OP t1_j629re3 wrote
Reply to comment by gradientpenalty in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Right, but BPE is designed to compress alphabetic languages (multiple letters per word), whereas logographic languages are already compressed (one or more words per symbol, but a larger total symbol inventory). I suppose I don't see why efficiency at this step matters so much, or why it's necessary at all. What is the relationship between vocabulary size and a model's computational requirements? If the model's input is ultimately an embedding with a fixed number of dimensions, does the token vocabulary size really make much practical difference?
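For what it's worth, here is a rough back-of-the-envelope sketch (all sizes are assumptions for illustration, not from the thread) of where vocabulary size actually shows up in a decoder-only transformer: it sets the size of the embedding table and the output logits projection, while the per-layer cost for a fixed context length depends only on the hidden dimension.

```python
# Back-of-the-envelope sketch of where vocabulary size enters the compute.
# All sizes below are assumed for illustration, not taken from the thread.

def vocab_dependent_params(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Parameters attributable to the vocabulary: the input embedding table and
    the output (unembedding) projection, each of shape (vocab_size, d_model),
    unless the two are weight-tied."""
    tables = 1 if tied else 2
    return tables * vocab_size * d_model

def output_logits_macs(seq_len: int, d_model: int, vocab_size: int) -> int:
    """Approximate multiply-accumulates for the final logits matmul per forward
    pass: (seq_len x d_model) @ (d_model x vocab_size)."""
    return seq_len * d_model * vocab_size

d_model, seq_len = 4096, 2048  # assumed hidden size and context length

for vocab in (32_000, 100_000, 256_000):
    params = vocab_dependent_params(vocab, d_model)
    macs = output_logits_macs(seq_len, d_model, vocab)
    print(f"vocab={vocab:>7,}: ~{params / 1e6:,.0f}M vocab-dependent params, "
          f"~{macs / 1e9:,.0f}G MACs in the logits matmul")
```

Under these assumed sizes, growing the vocabulary only inflates the embedding and output layers (and the final softmax matmul); the rest of the network just sees a sequence of d_model-dimensional vectors, which is the crux of the question above.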