Submitted by RadioFreeAmerika t3_122ilav in singularity
RadioFreeAmerika OP t1_jdqgvof wrote
Reply to comment by turnip_burrito in Why is maths so hard for LLMs? by RadioFreeAmerika
There's something to it, but they still fail at the simplest maths questions from time to time. So far, I haven't gotten a single LLM to correctly write me a sentence with eight words in it on the first try. Most get it correct on the second try, though.
throwawaydthrowawayd t1_jdqisag wrote
Remember, the text of an LLM is literally the thought process of the LLM. Trying to have it instantly write an answer to what you ask makes it nigh impossible to accomplish the task. Microsoft and OpenAI have said that the chatbot format degrades the AI's intelligence, but it's the format that is the most useful/profitable currently. If a human were to try to write a sentence with 8 words, they'd mentally retry multiple times, counting over and over, before finally saying an 8-word sentence. In a chat format, the AI can't do this.
ALSO, the AI does not speak English. It gets handed a bunch of vectors, which do not directly correspond to word count, and it thinks about those vectors before handing back a number. The fact that these vectors + a number directly translate into human language doesn't mean it's going to have an easy time figuring out how many vectors add up to 8 words. That's just a really hard task for LLMs to learn.
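To make that mismatch concrete (the "bunch of vectors" start life as tokens), here's a rough sketch using OpenAI's open-source tiktoken tokenizer, assuming it's installed; the exact splits are just illustrative:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4-era models

sentence = "The indefatigable researcher reorganized the laboratory yesterday."
tokens = enc.encode(sentence)

print("words :", len(sentence.split()))    # 7 whitespace-separated words
print("tokens:", len(tokens))              # typically more than 7 model-visible tokens
print([enc.decode([t]) for t in tokens])   # how the model actually "sees" the sentence
```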
RadioFreeAmerika OP t1_jdqky02 wrote
Ah, okay, thanks. I have to look more into this vector-number representation.
For the chatbot thing, why can't the LLM generate a non-displayed output, "test it", and try again until it is confident it is right, and only then display it? Ideally with a time limit that at some point makes it just display what it has, with a qualifier. Or, if the confidence is still very low, just state that it doesn't know.
throwawaydthrowawayd t1_jdqqsur wrote
> For the chatbot thing, why can't the LLM generate a non-displayed output, "test it", and try again
You can! There are systems designed around that. OpenAI even internally had GPT-4 using a multi-stage response system (a read-execute-print loop, they called it) while testing, to give it more power. There are also the "Reflexion" posts on this sub lately, where they have GPT-4 improve on its own writing. But, A, it's too expensive. Using a reflective system means lots of extra words, and each word costs more electricity.
And B, LLMs currently love to get sidetracked. People use the word "hallucinations" to describe the LLM just making things up, or acting like you asked a different question, or many other things. Adding an internal thought process dramatically increases the chances of the LLM going off the rails. There are solutions to this (usually, papers on it will describe their solutions as "grounding" the AI), but once again, they cost more money.
So that's why all these chatbots aren't as good as they could be. It's just not worth the electricity to them.
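For the curious, a minimal sketch of what such a hidden draft-and-check loop could look like. The `generate()` call is a hypothetical stand-in for whatever LLM API you'd use; this isn't how any particular product implements it:

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to whatever LLM you're using."""
    raise NotImplementedError  # plug in your model of choice

def eight_word_sentence(max_attempts: int = 5) -> str:
    """Draft a sentence, check it outside the model, and retry until it passes."""
    prompt = "Write a single sentence containing exactly eight words."
    draft = ""
    for _ in range(max_attempts):
        draft = generate(prompt)
        if len(draft.split()) == 8:  # the "test it" step, done with plain code
            return draft
        # Feed the failure back in as a hidden correction step.
        prompt = (
            f"Your previous attempt was: {draft!r}. "
            f"It has {len(draft.split())} words. "
            "Rewrite it so it has exactly eight words."
        )
    return draft + " (low confidence: couldn't verify within the attempt budget)"
```

Every pass through that loop is another full model call, which is exactly the extra-words-equals-extra-electricity problem.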
RadioFreeAmerika OP t1_jdr46f0 wrote
Very insightful! Seems like even without groundbreaking stuff, more efficient hardware will likely make the solutions you mentioned more feasible in the future.
turnip_burrito t1_jdsoxo1 wrote
Yeah, we're really waiting for electricity costs to fall if we want to implement things like this in reality.
Right now, at a rate of roughly $0.10 per 1,000 tokens per minute per LLM, it costs about $6 per hour to run a single LLM. If you have some ensemble of LLMs checking each other's work and working in parallel, say 10 LLMs, that's $60/hr, or $1,440/day. Yikes, I can't afford that. And that will maybe have performance and problem-solving somewhere between a single LLM and one human.
Once the cost falls by a factor of 100, that's $14.40/day. Expensive, but much more reasonable.
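Spelling the arithmetic out (same rough numbers as above, not a quoted price list):

```python
# Back-of-the-envelope numbers from the comment above.
rate_per_llm_minute = 0.10            # dollars per LLM per minute (~1,000 tokens/minute)
num_llms = 10                         # ensemble checking each other's work
hours_per_day = 24

cost_per_llm_hour = rate_per_llm_minute * 60            # $6.00 per LLM per hour
ensemble_per_hour = cost_per_llm_hour * num_llms        # $60.00 per hour
ensemble_per_day = ensemble_per_hour * hours_per_day    # $1,440.00 per day
after_100x_drop = ensemble_per_day / 100                # $14.40 per day

print(f"single LLM     : ${cost_per_llm_hour:.2f}/hr")
print(f"10-LLM ensemble: ${ensemble_per_hour:.2f}/hr = ${ensemble_per_day:,.2f}/day")
print(f"after a 100x cost drop: ${after_100x_drop:.2f}/day")
```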
RadioFreeAmerika OP t1_jdufzz4 wrote
But even at $60/hr, this might already be profitable if you replace a job that has a higher hourly wage. Lawyers, for example. At $14.40 a day, you beat minimum wage. For toying around, yeah, that's a bit expensive.
turnip_burrito t1_jduhcoa wrote
Yeah, for an individual it's no joke.
For a business it may be worth it, depending on the job.
turnip_burrito t1_jdqhcoi wrote
I'd have trouble making a sentence with 8 words in one try too if you just made me blast words out of my mouth without letting me stop and think.
I don't think this is a weakness of the model, basically. Or if it is, then we also share it.
The key is that if you think about how you, as a person, approach the problem of making a sentence with 8 words, you will see how to design a system where the model can do it too.
RadioFreeAmerika OP t1_jdqlcsd wrote
I also don't think it is a weakness of the model, just a current limitation I didn't expect, given my quite limited knowledge of LLMs. I am trying to gain some more insights.
FoniksMunkee t1_jdqs9x9 wrote
It's a limitation of LLMs as they currently stand. They can't plan ahead, and they can't backtrack.
So a human doing a problem like this would start, see where they get to, and perhaps try something else. But LLMs can't. MS wrote a paper on the state of GPT-4, and they made this observation about why LLMs suck at math.
"Second, the limitation to try things and backtrack is inherent to the next-word-prediction paradigm that the model operates on. It only generates the next word, and it has no mechanism to revise or modify its previous
output, which makes it produce arguments “linearly”. "
They also argue that the model was probably not trained on as much mathematical data as code, and that more training will help. But they also said that the issue above "...constitutes a more profound limitation."
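To see what "generates the next word with no mechanism to revise" means mechanically, here's a toy greedy-decoding loop. `next_token_distribution()` is a hypothetical stand-in for the model, not a real API:

```python
def next_token_distribution(context: list[str]) -> dict[str, float]:
    """Hypothetical stand-in for the model: context in, next-token probabilities out."""
    raise NotImplementedError

def generate_linearly(prompt_tokens: list[str], max_new_tokens: int = 50) -> list[str]:
    """Each token is chosen once and appended; nothing earlier is ever revised."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(context)
        token = max(probs, key=probs.get)  # greedy pick of the most likely next token
        context.append(token)              # committed for good: there is no backtracking step
        if token == "<end>":
            break
    return context
```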
turnip_burrito t1_jdqrxre wrote
To be fair, the model does have weaknesses. Just this particular one maybe has a workaround.
shillingsucks t1_jdrjmcc wrote
Not typing with any sort of confidence but just musing.
Couldn't it be said that humans cheat mentally as well for this type of task? As in, I am not aware of anyone who knows how a sentence they are thinking or speaking will end while they are in the middle of it. For us, we would need to make a mental structure that needs to be filled, and then come up with a sentence that matches the framework.
If the AI often gets it right on the 2nd try, it makes me wonder if there is a way to frame the question initially so that it has the right framework to get it right on the first guess.
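One way to test that idea (a purely hypothetical prompt, not something anyone in the thread has tried): hand the model the framework up front instead of making it invent one mid-sentence.

```python
# Hypothetical "framework first" prompt: the eight-slot structure is fixed before
# the model writes anything, so it only fills slots instead of planning length on the fly.
framework_prompt = (
    "Here is a template with exactly eight blanks:\n"
    "____ ____ ____ ____ ____ ____ ____ ____.\n"
    "Replace each blank with one word so the result is a grammatical sentence, "
    "then output only the finished sentence."
)

def has_eight_words(sentence: str) -> bool:
    """External check; the framework helps, but you'd still want to verify."""
    return len(sentence.split()) == 8
```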