Submitted by Lesterpaintstheworld t3_11wddua in singularity
KerfuffleV2 t1_jd57jq9 wrote
Reply to comment by Lesterpaintstheworld in The internal language of LLMs: Semantically-compact representations by Lesterpaintstheworld
Be sure you're looking at the number of tokens when you're considering conciseness, since that's what actually matters. I.e. an emoji may have a compact representation on the screen, but that doesn't necessarily mean it tokenizes efficiently.
Just as an example, "🧑🏾🚀" from one of the other comments is actually 11 tokens, while the word "person" is just one token.
You can experiment here: https://platform.openai.com/tokenizer (non-OpenAI models will likely use a different tokenizer or tokenize text differently, but it'll give you an idea at least.)
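If you'd rather check token counts in code than in the web tool, something like this works. It's a minimal sketch using OpenAI's tiktoken package (not mentioned above); the exact counts depend on which encoding the model actually uses, so treat the numbers as illustrative.

```python
# Sketch: compare token counts for different strings (assumes the tiktoken package).
import tiktoken

# cl100k_base is one encoding family; GPT-3-era models use r50k/p50k instead,
# so counts will differ from the web tokenizer depending on which you pick.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["person", "🧑🏾🚀"]:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(tokens)} tokens -> {tokens}")
```

The point is just that "short on screen" and "short in tokens" are different things, and the second is what the model and the context window actually see.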
Also relevant: these models are trained to autocomplete text based on the probabilities of the text they were trained on. If you start using, or asking them to generate, text in an unusual format, it may well cause them to produce much lower quality answers (or to understand less of what the user says in response).
Lesterpaintstheworld OP t1_jd59bgp wrote
Two very good considerations indeed, thanks :)