madmax_br5 OP t1_j62d75y wrote
Reply to comment by PassingTumbleweed in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
I get that now, thanks! Not an ML expert so this is very helpful!
madmax_br5 OP t1_j62bm6c wrote
Reply to comment by PassingTumbleweed in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Why not just increase the vocabulary size so that each symbol in a logographic language has one token? Logograms are already a tokenization scheme, in effect! Would love to understand more about the practical effects of a larger vocabulary on model compute requirements.
madmax_br5 OP t1_j62b2jq wrote
Reply to comment by float16 in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Yes, this is my point - the tokenizer OpenAI uses is optimized for European languages, as it is an alphabetic tokenizer designed around consonants and vowels. I'm wondering why they don't move away from BPE altogether and just increase the vocabulary size to give each symbol in each logographic language its own token. This problem must eventually be solved for multilingual models to have similar cost and capabilities across languages.
So the real question is what is the best tokenization approach to use for a truly multilingual model, and why?
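To make the disparity concrete, here's a toy sketch of why byte-level fallback penalizes logographic scripts. Each CJK character is 3 bytes in UTF-8, so a byte-level tokenizer with no learned merges for those sequences spends roughly 3 tokens per character, while a common English word often compresses to a single token. (Illustrative only — real BPE vocabularies do merge some multi-byte sequences, so actual counts vary.)

```python
# Toy comparison: byte-level fallback cost for a logographic script
# vs. an alphabetic one.

def byte_fallback_tokens(text: str) -> int:
    """Worst-case token count if every UTF-8 byte becomes one token."""
    return len(text.encode("utf-8"))

def char_level_tokens(text: str) -> int:
    """Token count if every symbol had its own vocabulary entry."""
    return len(text)

english = "hello"  # 5 ASCII characters -> 5 bytes
chinese = "你好"    # 2 characters, 3 UTF-8 bytes each -> 6 bytes

print(byte_fallback_tokens(english))  # 5
print(byte_fallback_tokens(chinese))  # 6
print(char_level_tokens(chinese))     # 2
```

So in the worst case the logographic text costs 3x more tokens per symbol than a per-symbol vocabulary would.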
madmax_br5 OP t1_j62anqr wrote
Reply to comment by CKtalon in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
What would be the practical impacts of a larger vocabulary? There seems to ultimately be no way around this if you want a truly multilingual model; your vocabulary needs to be at least as large as the full set of symbols in all the languages in the corpus. But it would seem that the computational costs of this would be limited to the very beginning and very end of the model, which seems computationally insignificant compared to the attention layers that operate in vector space. In fact, doesn't a larger input vocabulary result in fewer net tokens to vectorize in the first place? If the vector space of the embedding has a fixed dimensionality (which I believe it does in the case of GPT3), then isn't each token the same mathematical size once embedded?
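A back-of-envelope calculation supports the intuition that vocabulary size only shows up at the edges of the model. Using the publicly reported GPT-3 figures (d_model = 12288, ~175B total parameters, ~50k BPE vocabulary) and a hypothetical 250k "one token per symbol" vocabulary:

```python
# Back-of-envelope: where vocabulary size shows up in parameter count.
# GPT-3 figures (d_model=12288, ~175B params, ~50k vocab) are public;
# the 250k vocabulary is a hypothetical for comparison.

d_model = 12288
total_params = 175e9

def embedding_params(vocab_size: int) -> float:
    # Input embedding matrix is vocab_size x d_model; the output
    # (unembedding) layer is the same shape, often weight-tied.
    return vocab_size * d_model

for vocab in (50_257, 250_000):
    frac = embedding_params(vocab) / total_params
    print(f"vocab {vocab:>7,}: {embedding_params(vocab) / 1e9:.2f}B params "
          f"({frac:.1%} of total)")
```

Even a 5x larger vocabulary keeps the embedding matrix under ~2% of total parameters, so the cost concern is more about training the rarer rows well than about raw compute.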
madmax_br5 OP t1_j629re3 wrote
Reply to comment by gradientpenalty in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Right, but BPE is designed to compress alphabetic languages (multiple letters per word), whereas logographic languages are already compressed (one or more words per symbol, but more net symbols). I suppose I don't get the reason behind obsessing over efficiency at this step and why it is necessary. What is the relationship between vocabulary size and model computational requirements? If the model input is ultimately an embedding of a fixed number of dimensions, does the token vocabulary size really make much practical difference?
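On the fixed-dimensionality point: a toy lookup table shows that vocabulary size changes the *height* of the embedding table, not the width of the vectors the rest of the model sees, so per-token attention cost is unaffected. (A sketch with made-up random vectors, not a real model.)

```python
import random

# Toy embedding table: every token ID maps to a vector of fixed
# dimension. Growing the vocabulary adds rows, but each embedded
# token is the same mathematical size downstream.

def make_embedding_table(vocab_size: int, dim: int):
    random.seed(0)
    return [[random.random() for _ in range(dim)] for _ in range(vocab_size)]

small = make_embedding_table(vocab_size=100, dim=8)
large = make_embedding_table(vocab_size=100_000, dim=8)

# A token from either vocabulary embeds to the same-size vector:
print(len(small[42]), len(large[42]))  # 8 8
```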
madmax_br5 OP t1_j625fr2 wrote
Reply to comment by ww3ace in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
The token counts in my example were copied directly from OpenAI's tokenizer, so even if it is not Unicode-based, it is still representing logographs very inefficiently.
Submitted by madmax_br5 t3_10mbct5 in MachineLearning
madmax_br5 t1_j3ioy0f wrote
I don't know, but I found this awesome open source site for pruning and fine-tuning various models you may find interesting: https://sparsezoo.neuralmagic.com/
madmax_br5 t1_j3ik7wh wrote
Reply to comment by mandogbeer in [Project] Major drawback/limitation of GPT-3 by trafalgar28
It kind of depends on what the use case is. If it's simply to query against a large amount of information, you can just create embeddings of the information in chunks and add them to a vector store index (https://gpt-index.readthedocs.io/en/latest/guides/index_guide.html). Then you embed your query using the same model, the relevant chunks are returned, and you can synthesize a response from those chunks.
So let's say the use-case is to create a conversational tutor assistant for a textbook. Obviously, you can't put the whole textbook in the prompt. So you feed it in one paragraph at a time into the embeddings model, and store all these embeddings (along with the text they relate to) in a vector database like Weaviate or Pinecone. Then, when the user asks a question, you embed the query using the same embeddings model and do a cosine similarity search using your vector database (a common feature of vector DBs), asking it to return, say, the top 5 most relevant chunks. Now you have some short context you can feed into normal GPT-3, with a prompt like "given the following context, create a bullet point summary" or "given the following context, create a simplified analogy using real-world examples."
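The retrieval step above can be sketched in a few lines. The vectors here are toy numbers; in practice they'd come from an embeddings model, and a vector DB like Weaviate or Pinecone would do this search at scale.

```python
import math

# Minimal retrieval sketch: rank precomputed chunk embeddings by
# cosine similarity to a query embedding, then keep the top k.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Precomputed ahead of time: chunk text -> toy embedding vector.
chunks = {
    "Photosynthesis converts light into chemical energy.": [0.9, 0.1, 0.0],
    "The mitochondria is the powerhouse of the cell.":     [0.2, 0.8, 0.1],
    "Cell walls give plant cells structural rigidity.":    [0.1, 0.3, 0.9],
}

query_embedding = [0.85, 0.15, 0.05]  # pretend-embedded user question

# The only runtime work is embedding the query and ranking:
top = sorted(chunks, key=lambda c: cosine(chunks[c], query_embedding),
             reverse=True)[:2]
for text in top:
    print(text)
```

The returned chunks then become the "given the following context, ..." portion of the GPT-3 prompt.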
Embeddings are basically the first half of the transformer. Language transformers essentially have two halves - the first half understands the input and encodes it into a set of numbers the model can understand. The second half takes that understanding and predicts a probable next word. When you think about this from a computational perspective, the first half only runs once, and the second half runs hundreds of times (once per output token). So you end up with only a fraction of a percent of the computation time spent on understanding (embedding) the input, and most of the time iteratively generating tokens. What semantic search in vector space lets you do is essentially compare items after only step 1, and THEN produce an output once you've gathered the necessary context. But of course you perform the embedding on your data ahead of time, so the only real compute that is needed at runtime is the embedding of the user's query, which is cheap.
madmax_br5 t1_j3iii5y wrote
Reply to comment by Bulky_Highlight_3352 in [Project] Major drawback/limitation of GPT-3 by trafalgar28
Actually GPT-index is a more robust framework for this, and plays well with langchain: https://github.com/jerryjliu/gpt_index
madmax_br5 OP t1_iw7a5vq wrote
Reply to comment by drive2fast in Milwaukee straight snips, bought in 2005 and used on hundreds of projects. Still sharp and tight. by madmax_br5
I thought their screw bits actually tested very well on Project Farm? But yes, definitely a “do your research” brand for hand tools.
madmax_br5 OP t1_iw79xkp wrote
Reply to comment by diegobomber in Milwaukee straight snips, bought in 2005 and used on hundreds of projects. Still sharp and tight. by madmax_br5
Evidently this pair was one of the first to roll off the line; it’s possible they’ve cheapened them over time to save cost. I’m also surprised how long they lasted!
madmax_br5 OP t1_iw53fny wrote
Reply to comment by shredsickpow in Milwaukee straight snips, bought in 2005 and used on hundreds of projects. Still sharp and tight. by madmax_br5
Maybe I'm mistaken, but I definitely bought these long before I moved cross-country in 2011, since I used them extensively on a loft buildout in 2010. Of that I can be sure. So it's possible I bought them in 2010 and got them confused with a previous pair I purchased earlier.
madmax_br5 t1_ircn0f5 wrote
Reply to Foam/paste that can be injected into a hole and when it hardens, it's as strong as hardwood? by flying-benedictus
Sounds like you may have metal studs. Try a self-tapping screw first and see if that grabs. Then patch the hole with high strength patch. If the self-tapping screw doesn't work, drill a larger hole and use a plywood backing piece as others have suggested, then patch the hole.
madmax_br5 OP t1_j63mi7f wrote
Reply to comment by suflaj in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Thank you, this is very helpful!