Submitted by madmax_br5 t3_10mbct5 in MachineLearning
ww3ace t1_j624na0 wrote
I don’t think any modern SOTA language model uses Unicode for tokenization.
madmax_br5 OP t1_j625fr2 wrote
The token counts in my example were copied directly from OpenAI's tokenizer, so even if it isn't Unicode-based, it is still representing logographs very inefficiently.
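For reference, here is a minimal sketch of how one might reproduce this kind of comparison with OpenAI's open-source tiktoken library. The encoding name and example strings are assumptions for illustration, not taken from the original post:

```python
# Minimal sketch: compare token counts for an English sentence and a Chinese
# rendering of it using OpenAI's tiktoken library. The encoding name
# ("cl100k_base") and the example strings are illustrative assumptions.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "The quick brown fox jumps over the lazy dog."
chinese = "敏捷的棕色狐狸跳过了懒狗。"  # rough Chinese rendering of the same sentence

for label, text in [("English", english), ("Chinese", chinese)]:
    tokens = enc.encode(text)
    # Characters per token gives a rough sense of encoding efficiency.
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens "
          f"({len(text) / len(tokens):.2f} chars/token)")
```

Logographic text typically comes out with far fewer characters per token, which is the inefficiency being described.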