
turnip_burrito t1_jdqgloh wrote

GPT4 is actually really good at arithmetic.

These models are also very capable at math and counting if you know how to use them correctly.

5

RadioFreeAmerika OP t1_jdqgvof wrote

There's something to it, but they still fail at the simplest maths questions from time to time. So far, I haven't gotten a single LLM to correctly write me a sentence with eight words in it on the first try. Most get it right on the second try, though.

3

throwawaydthrowawayd t1_jdqisag wrote

Remember, the text an LLM produces is literally its thought process. Asking it to instantly write an answer makes the task nigh impossible. Microsoft and OpenAI have said that the chatbot format degrades the AI's intelligence, but it's the format that is the most useful/profitable right now. If a human were to try to write a sentence with exactly 8 words, they'd mentally retry multiple times, counting over and over, before finally saying an 8-word sentence. In the chat format, the AI can't do this.

ALSO, the AI does not speak English. It gets handed a bunch of vectors, which do not directly correspond to word count, and it thinks about those vectors before handing back a number. The fact that these vectors plus a number translate directly into human language doesn't mean it will have an easy time figuring out how many vectors add up to 8 words. That's just a really hard task for LLMs to learn.
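For a rough picture of that mismatch, here's a small sketch using OpenAI's tiktoken tokenizer (assuming it's installed; the example sentence is arbitrary):

```python
# pip install tiktoken -- OpenAI's open-source tokenizer library
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4
sentence = "The indefatigable researcher recalibrated everything overnight."

tokens = enc.encode(sentence)
print("words: ", len(sentence.split()))   # 6 words
print("tokens:", len(tokens))             # typically more than 6
print([enc.decode([t]) for t in tokens])  # the pieces the model actually sees
```

Long or rare words get split into several tokens, so "count the words" isn't something the model can do by counting its own inputs.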

9

RadioFreeAmerika OP t1_jdqky02 wrote

Ah, okay, thanks. I have to look more into this vector-number representation.

For the chatbot thing, why can't the LLM generate a non-displayed output, "test it", and try again until it is confident it is right, and only then display it? Ideally with a timer that at some point makes it just display what it has, with a qualifier. Or, if its confidence is still very low, just state that it doesn't know.

2

throwawaydthrowawayd t1_jdqqsur wrote

> For the chatbot thing, why can't the LLM generate a non-displayed output, "test it", and try again

You can! There are systems designed around that. OpenAI even had GPT-4 using a multi-stage response system internally (a read-eval-print loop, they called it) while testing, to give it more power. There are also the "Reflexion" posts on this sub lately, where they have GPT-4 improve on its own writing. But, A, it's too expensive. Using a reflective system means lots of extra tokens, and each token costs more electricity.

And B, LLMs currently love to get sidetracked. Researchers use the word "hallucinations" for when the LLM just starts making things up, or acts like you asked a different question, or goes wrong in many other ways. Adding an internal thought process dramatically increases the chances of an LLM going off the rails. There are solutions to this (papers usually describe their solutions as "grounding" the AI), but once again, they cost more money.

So that's why all these chatbots aren't as good as they could be. It's just not worth the electricity to them.
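For what it's worth, a minimal sketch of that draft-and-check idea, with the verification done in plain code so the model's own miscounting can't fool it (`ask_llm` is a hypothetical stand-in for whatever API call you use):

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM API of choice here")

def sentence_with_n_words(n: int, max_tries: int = 5) -> str:
    draft = ""
    for _ in range(max_tries):
        draft = ask_llm(f"Write one sentence containing exactly {n} words.")
        if len(draft.split()) == n:  # verify the word count outside the model
            return draft
    # give up and show the last attempt with a qualifier
    return f"(not verified after {max_tries} tries) {draft}"
```

Every failed draft is more tokens you pay for, which is exactly the expense problem.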

5

RadioFreeAmerika OP t1_jdr46f0 wrote

Very insightful! Seems like even without groundbreaking stuff, more efficient hardware will likely make the solutions you mentioned more feasible in the future.

2

turnip_burrito t1_jdsoxo1 wrote

Yeah, we're really waiting for electricity costs to fall if we want to implement things like this in reality.

At the current rate of roughly $0.10 per 1,000 tokens, with an LLM producing about 1,000 tokens per minute, a single LLM costs $6 per hour to run. If you have some ensemble of LLMs checking each other's work and working in parallel, say 10 LLMs, that's $60/hr, or $1,440/day. Yikes, I can't afford that. And that will maybe have performance and problem-solving somewhere between a single LLM and one human.

Once the cost falls by a factor of 100, that's $14.40/day. Expensive, but much more reasonable.
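Spelled out (these rates are my rough assumptions above, not quoted prices):

```python
rate_per_1k_tokens = 0.10  # $ per 1,000 tokens (assumed)
tokens_per_minute = 1000   # rough output rate per LLM (assumed)

per_llm_hour = rate_per_1k_tokens * (tokens_per_minute / 1000) * 60  # $6.00
ensemble_hour = per_llm_hour * 10   # 10 LLMs: $60/hour
ensemble_day = ensemble_hour * 24   # $1,440/day
print(ensemble_day / 100)           # after a 100x cost drop: $14.40/day
```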

3

RadioFreeAmerika OP t1_jdufzz4 wrote

But even at $60/h, this might already be profitable if it replaces a job with a higher hourly wage (lawyers, for example). At $14.40/h, you beat minimum wage. For toying around, yeah, that's a bit expensive.

2

turnip_burrito t1_jduhcoa wrote

Yeah, for an individual it's no joke.

For a business it may be worth it, depending on the job.

2

turnip_burrito t1_jdqhcoi wrote

I'd have trouble making a sentence with 8 words in one try too if you just made me blast words out of my mouth without letting me stop and think.

I don't think this is a weakness of the model, basically. Or if it is, then we also share it.

The key is this: if you think about how you, as a person, approach the problem of making a sentence with 8 words, you will see how to design a system where the model can do it too.

8

RadioFreeAmerika OP t1_jdqlcsd wrote

I also don't think it is a weakness of the model, just a current limitation I didn't expect given my quite limited knowledge about LLMs. I'm trying to gain some more insight.

0

FoniksMunkee t1_jdqs9x9 wrote

It's a limitation of LLMs as they currently stand. They can't plan ahead, and they can't backtrack.

So a human doing a problem like this would start, see where they get to, and perhaps try something else. But LLMs can't. MS wrote a paper on the state of GPT-4 and they made this observation about why LLMs suck at math:

"Second, the limitation to try things and backtrack is inherent to the next-word-prediction paradigm that the model operates on. It only generates the next word, and it has no mechanism to revise or modify its previous

output, which makes it produce arguments “linearly”. "

They also argue that the model was probably not trained on as much mathematical data as code, and that more training will help. But they said the issue above "...constitutes a more profound limitation.".
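A toy picture of that "linear" generation (the `toy_model` here is a made-up stand-in, not a real LLM):

```python
import random

def toy_model(tokens):
    """Stand-in for an LLM: fake scores for what the next token should be."""
    vocab = ["the", "cat", "sat", "on", "a", "mat", "."]
    return {w: random.random() for w in vocab}

tokens = ["the"]
for _ in range(7):
    scores = toy_model(tokens)
    tokens.append(max(scores, key=scores.get))  # append-only: no revising, no backtracking
print(" ".join(tokens))
```

Once a token is appended, no later step can go back and change it; that's the linearity the paper is pointing at.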

6

turnip_burrito t1_jdqrxre wrote

To be fair, the model does have weaknesses. Just this particular one maybe has a workaround.

2

shillingsucks t1_jdrjmcc wrote

Not typing with any sort of confidence, just musing.

Couldn't it be said that humans cheat mentally at this type of task as well? I'm not aware of anyone who knows how a sentence they are thinking or speaking will end while they are in the middle of it. We would need to build a mental structure to be filled in, and then come up with a sentence that matches that framework.

If the AI often gets it right on the second try, it makes me wonder if there's a way to frame the question initially so that it has the right framework to get it right on the first guess.
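For example, something like this scaffolded prompt might supply that framework up front (the wording is illustrative, not tested):

```python
n = 8
prompt = (
    f"First, write {n} numbered slots, one per line.\n"
    f"Next, fill each slot with exactly one word.\n"
    f"Finally, join the slots into a single sentence of exactly {n} words."
)
print(prompt)
```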

1

Cryptizard t1_jdqtbnd wrote

It's really not. Just pick any two large numbers and ask it to multiply them. It will get the first couple of digits of the result right, but then it just goes off the rails.
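Easy to test yourself, since Python integers are exact at any size (the model's answer below is a placeholder to paste over):

```python
import random

a = random.randint(100000, 999999)
b = random.randint(100000, 999999)
print(f"Ask the model: what is {a} * {b}?")

llm_answer = 0  # paste the model's reply here (placeholder)
exact = a * b   # Python bigints make this exact
print("exact:", exact, "| model correct:", llm_answer == exact)
```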

1

turnip_burrito t1_jdse82g wrote

I've done this like 8 or 9 times with crazy things like 47t729374^3/37462 - 736262636^2/374 and it has gotten them all exactly right, or right to 4 or 7 sig figs (always due to rounding, which it acknowledges).

Maybe I just got lucky 8 or 9 times in a row.

1

Cryptizard t1_jdsg57p wrote

How does "exactly right" square with "4 sig figs"? That's another way of saying wrong.

1

turnip_burrito t1_jdsninw wrote

Why even point this out?

If you reread my reply, you would see I said "exactly right OR right to 4 or 7 sig figs". I didn't say 4 or 7 sig figs was exactly right. I'm going to give you the benefit of the doubt and assume you just misread the reply.

1

Cryptizard t1_jdsooyh wrote

I'm sorry, from my perspective here is how our conversation went:

You: GPT4 is really good at arithmetic.

Me: It's not though, it gets multiplication wrong for any number with more than a few digits.

You: I tried it a bunch and it gets the first few digits right.

Me: Yeah, but the first few digits right is not right. It is wrong. Like I said.

You can't claim you are good at math if you only get a few significant digits of a calculation right. That is not good at math. It is bad at math. I feel like I am taking crazy pills.

1

turnip_burrito t1_jdspnv6 wrote

It's good at math; it just gave a rounded answer.

Most of the time it was actually absurdly accurate (0.0000001% error), and the 4 sig fig rounding only happened once or twice.

It is technically wrong. But so is a calculator's answer. The calculator cannot give an exact decimal representation either. So is it bad at math?

0

Cryptizard t1_jdsq1sy wrote

No, I'm sorry, you are confused, my dude. Give it two 6-digit numbers to multiply and it only gets the first 3-4 digits correct. That is 0.1-1% error. I just did it 10 times and it is the same every time.
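As a sanity check on that range (the numbers here are made up for illustration):

```python
exact = 123456 * 654321   # = 80,779,853,376, computed exactly
approx = 80_700_000_000   # a hypothetical answer with only the first 3 digits right
print(abs(exact - approx) / exact)  # ~0.001, i.e. ~0.1% error
```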

3

turnip_burrito t1_jdsqq3f wrote

I just tried a couple times now and you're right. That's weird.

When I tried these things about a week and a half ago, it did perform the way I described. Either I got lucky or something changed.

0