Submitted by madmax_br5 t3_10mbct5 in MachineLearning
ww3ace t1_j624na0 wrote
I don’t think any modern SOTA language model uses Unicode for tokenization.
madmax_br5 OP t1_j625fr2 wrote
The token counts in my example were copied directly from OpenAI's tokenizer, so even if it isn't Unicode-based, it is still representing logographs very inefficiently.
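For reference, here is a minimal sketch of how one might reproduce this kind of comparison with OpenAI's open-source tiktoken library. The encoding name and example strings are assumptions for illustration, not taken from the original post:

```python
# Minimal sketch: compare token counts for an English sentence and a Chinese
# rendering of it using OpenAI's tiktoken library. The encoding name
# ("cl100k_base") and the example strings are illustrative assumptions.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "The quick brown fox jumps over the lazy dog."
chinese = "敏捷的棕色狐狸跳过了懒狗。"  # rough Chinese rendering of the same sentence

for label, text in [("English", english), ("Chinese", chinese)]:
    tokens = enc.encode(text)
    # Characters per token gives a rough sense of encoding efficiency.
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens "
          f"({len(text) / len(tokens):.2f} chars/token)")
```

Logographic text typically comes out with far fewer characters per token, which is the inefficiency being described.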