
jamesj OP t1_ja2acnm wrote

Hey, I appreciate you taking the time to engage with the article and share your thoughts. I'll respond to a few things.

>The first two elements of that is the definition for any model, which is exactly what both AI and deterministic regression algorithms all do.

Yes, under the framework used in the article, an agent using linear regression might be a little intelligent: it can take past state data, use it to make predictions about the future state, and use those predictions to act. That would be more intelligent than an agent that takes random actions.
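To make that concrete, here's a minimal sketch of what I mean (the toy environment and policy names are my own invention, not from the article): an agent that fits a line to its past observations and acts on the predicted next state.

```python
import numpy as np

# Toy example: an agent observes a scalar state over time, fits a linear
# model to the past trajectory, and acts on its prediction of the next state.
rng = np.random.default_rng(0)
t = np.arange(20, dtype=float)
states = 0.5 * t + rng.normal(scale=0.2, size=t.shape)  # noisy upward trend

# Fit state ~ a*t + b on the history (ordinary least squares).
a, b = np.polyfit(t, states, deg=1)
predicted_next = a * (t[-1] + 1) + b

# A trivial policy: act based on the predicted future state.
action = "prepare_for_high" if predicted_next > states[-1] else "prepare_for_low"
print(predicted_next, action)
```

Even this trivial predict-then-act loop beats a random policy whenever the environment has exploitable structure, which is the sense in which it counts as "a little intelligent."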

>I'm not saying it's a bad paper or theory, but that this essay doesn't really justify why it brings it up so much

Yes, that is a fair point. I was worried that spending more time on it would have made the article even longer than it already is. But one justification is that it is a good, practical definition of intelligence: it demystifies intelligence by reducing it to the kind of information processing that must be taking place. It is built on information-theoretic work on information bottlenecks, and is directly related to the motivation for autoencoders.
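For readers who haven't seen the autoencoder connection: an autoencoder is trained to reproduce its input through a narrow bottleneck, so minimizing reconstruction error forces it to find a compressed representation. A minimal numpy sketch (a linear autoencoder, which is essentially PCA; all dimensions and learning-rate choices here are arbitrary):

```python
import numpy as np

# Linear autoencoder: compress 10-d inputs through a 2-d bottleneck and
# reconstruct them. Minimizing reconstruction error forces the weights to
# capture the main regularities in the data -- compression as learning.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 2:] = X[:, :2] @ rng.normal(size=(2, 8))  # data really lives on 2 dims

W_enc = rng.normal(scale=0.1, size=(10, 2))
W_dec = rng.normal(scale=0.1, size=(2, 10))
lr = 0.01
for _ in range(500):
    Z = X @ W_enc            # encode: 10 -> 2 (the bottleneck)
    X_hat = Z @ W_dec        # decode: 2 -> 10
    err = X_hat - X
    # Gradient descent on mean squared reconstruction error.
    W_dec -= lr * Z.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

print("reconstruction MSE:", np.mean(err ** 2))
```

Because the data secretly lives on a 2-d subspace, the 2-d bottleneck can reconstruct it well; a 1-d bottleneck couldn't. That trade-off is the information-bottleneck idea in miniature.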

>The problem is that Schmidhuber 2008 only exists as a preprint and later as a conference paper -- it was never peer-reviewed.

The paper isn't an experiment with data; it was first presented at a conference to put forward an interpretation. It's been cited 189 times. I think it is worth reading, and the ideas can be understood pretty easily. But it isn't the only paper that discusses the connection between compression, prediction, and intelligence. Not everyone talks in the language of compression; they may use words like elegance, parameter efficiency, or information bottlenecks, but we are talking about the same ideas. This paper has some good references; it states, "Several authors [1,5,6,11,7,9] have suggested the relevance of compression to intelligence, especially the inductive inferential (or inductive learning) part of intelligence. M. Hutter even proposed a compression contest (the Hutter prize) which was “motivated by the fact that being able to compress well is closely related to acting intelligently”."
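The tightest version of the compression–prediction link is Shannon's source coding result: a model that assigns probability p to the symbol that actually occurs can encode that symbol in about -log2(p) bits (e.g. with an arithmetic coder), so better prediction literally means shorter code length. A toy illustration, with a made-up sequence and two hypothetical predictors:

```python
import math

text = "abababababababab"

def code_length_bits(text, predict):
    # Total code length under an idealized arithmetic coder is the
    # sum of -log2 p(next symbol | history) over the sequence.
    return sum(-math.log2(predict(text[:i], c)) for i, c in enumerate(text))

# Predictor 1: knows nothing, assigns uniform probability over {a, b}.
uniform = lambda history, c: 0.5

# Predictor 2: has learned the alternating pattern (with a little smoothing).
def pattern(history, c):
    if not history:
        return 0.5
    expected = "b" if history[-1] == "a" else "a"
    return 0.9 if c == expected else 0.1

print(code_length_bits(text, uniform))  # 16.0 bits
print(code_length_bits(text, pattern))  # ~3.3 bits
```

The predictor that has "understood" the pattern compresses the sequence roughly five times better, which is exactly the equivalence the Hutter prize is built on.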

>The equation E = mc2 For the newbies out there, this is what's called a red flag.

I was trying to use an example people would be familiar with. All the example points out is that the equations of physics are highly compressed representations of the data of past physical measurements that allow us to predict lots of future physical measurements. The same could be said of Maxwell's equations, the Standard Model, or any successful physical theory. Most physicists prefer more compressed mathematical descriptions, though they would usually call them more elegant rather than use the language of compression.
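A standard formalization of this intuition (Rissanen's minimum description length principle, which the article doesn't invoke by name but which matches its framing) says to prefer the theory that minimizes the cost of stating the theory plus the cost of describing the data given the theory:

```latex
% MDL: the best theory minimizes total description length.
% L(T)        = bits needed to state the theory (e.g. a physical equation)
% L(D \mid T) = bits needed to encode the observed data given the theory
T^{*} = \operatorname*{arg\,min}_{T} \; \bigl[\, L(T) + L(D \mid T) \,\bigr]
```

A compact equation that nails the data scores well on both terms, which is one way to cash out what physicists mean by "elegant."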

>This is completely the wrong way to think about it if you're trying to understand these things, so I hope he actually knows this.

I don't think it is wrong to say that what the transformer "knows" about the images in its dataset has been compressed into its weights. In a very real sense, a transformer is a very lossy compression algorithm: it takes in a huge dataset and learns weights which represent patterns in that dataset. So no, I'm not saying that literally every image in the dataset was compressed down to 1.2 bytes each. I'm saying that whatever SD learned about the relationships between the pixels in an image and their text labels is stored in its weights at about 1.2 bytes per dataset image. And you can actually use those weights as a good image compression codec. The fact that it has to do this in a limited number of parameters is one of the things that forces it to learn higher-level patterns rather than rely on memorization or other simpler strategies. Ilya Sutskever talks about this, and was part of a team that published on it, basically showing that there is a sweet spot in the data-to-parameter ratio: adding parameters improves performance up to a point, beyond which adding even more decreases it. His explanation is that by limiting the number of parameters, the model is forced to generalize. In Schmidhuber's language, the network is forced to make more compressed representations, so it overfits less and generalizes better.
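The back-of-envelope behind a figure like 1.2 bytes per image is just weight storage divided by dataset size. The numbers below are illustrative placeholders, not the article's exact figures:

```python
# Rough arithmetic for "bytes of weights per training image".
# Illustrative placeholder numbers -- swap in the real model/dataset sizes.
n_params = 1.0e9           # ~1B parameters (hypothetical)
bytes_per_param = 2        # fp16 storage
n_training_images = 2.0e9  # ~2B image-text pairs (hypothetical)

bytes_per_image = n_params * bytes_per_param / n_training_images
print(bytes_per_image)     # 1.0 byte per image at these assumed sizes
```

Any model in this regime clearly cannot be memorizing pixels; whatever it retains per image has to be a highly distilled summary spread across shared weights.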

>First, this is the connectivist problem/fallacy in early AI and cog sci -- the notion that because small neuronal systems could be emulated somewhat with neural nets, and because neural nets could do useful biological-looking things, that then the limiting factor to intelligence/ability is simple scale

My argument about this doesn't come from ML systems mimicking biology. It comes from looking at exponential graphs of cost, performance, model parameters, and so on, and projecting that the exponential growth will likely continue for a while. The first airplane didn't fly like a bird; it did something a lot simpler. In the same way, I'd bet the first AGI will be a lot simpler than a brain. I could be wrong about that.
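For what "projecting the exponential" means mechanically: fit a line to the quantity in log space and extend it. A toy sketch with made-up parameter counts (not real data, just the method):

```python
import numpy as np

# Made-up parameter counts by year, purely to illustrate the method.
years = np.array([2018, 2019, 2020, 2021, 2022], dtype=float)
params = np.array([1e8, 1e9, 1e10, 1e11, 1e12])  # hypothetical 10x/year trend

# Exponential growth is a straight line in log space.
slope, intercept = np.polyfit(years, np.log10(params), deg=1)
forecast_2025 = 10 ** (slope * 2025 + intercept)
print(f"projected parameters in 2025: {forecast_2025:.2e}")
```

Of course, the whole disagreement is about whether the trend line holds, which is why I frame this as uncertainty rather than a confident forecast.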

But I'm not even claiming that scaling transformers will lead to AGI, or that AGI will definitely be developed soon. All I'm saying is that there is significant expert uncertainty about when AGI will be developed, and it is possible that it could be developed soon. If it were, that would probably be the most difficult kind of AGI to align, which is a concern.
