nic333rice
nic333rice t1_iqw529i wrote
Interesting data! I’m a bit skeptical about the graph for the Chinese language. It suggests that on average 95% of a book can be understood if one knows 10,000 Chinese words. 95% seems a bit high to me. Is it possible that the analysis only took Chinese characters into account?
In Chinese, words are composed of characters, and many different words share the same characters. So one might be familiar with every character in a word but still not know the meaning of the word, i.e. of that particular combination of characters.
Edit: I want to add that in Chinese writing there are no spaces between words like there are in English, so finding the boundaries between words is not trivial.
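To make the difference concrete, here is a toy Python sketch. It is not the method used for the graph; the sentence, the tiny vocabulary, and the helper names `max_match` and `coverage` are all made up for illustration. It splits a sentence with greedy forward maximum matching (one simple way to guess word boundaries) and then compares character coverage with word coverage.

```python
def max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest vocabulary entry; fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def coverage(tokens, known):
    """Fraction of tokens that appear in the set of known items."""
    return sum(t in known for t in tokens) / len(tokens)


# Hypothetical example sentence and segmentation vocabulary.
text = "我喜欢火车和手机"
vocab = {"我", "喜欢", "火车", "和", "手机"}
words = max_match(text, vocab)

# A reader who knows every character, but only some of the words.
known_chars = {"我", "喜", "欢", "火", "车", "和", "手", "机"}
known_words = {"我", "喜欢", "和"}

char_cov = coverage(list(text), known_chars)  # counted per character
word_cov = coverage(words, known_words)       # counted per segmented word

print(words)
print(f"character coverage: {char_cov:.0%}, word coverage: {word_cov:.0%}")
```

In this toy case the reader recognizes every character but only three of the five words, so a character-based count would overstate how much is actually understood. Real segmenters are far more sophisticated than greedy matching, which can pick the wrong boundary when several splits are possible.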
nic333rice t1_iqwd1f8 wrote
Reply to comment by orgtre in The returns to learning the most common words, by language [OC] by orgtre
Ahhh so it was tokenized. That’s nice to hear. Thanks for the elaborate answer! :)