visarga t1_j0tz7zh wrote
Reply to comment by Accomplished_Diver86 in Prediction: De-facto Pure AGI is going to be arriving next year. Pessimistically in 3 years. by Ace_Snowlight
What current AIs lack is a playground. The AI playground needs games, simulations, code execution, databases, search engines, and other AIs. Using them, the AI would get to work on solving problems. Initially we collect problems, then we generate more and more of them - coding, math, science, anything that can be verified. We add the experiments to the training set and retrain the models. Eventually we make models that can invent new tasks, solve them, evaluate the solutions for errors and significance, and do all of this on their own, using just electricity and GPUs.
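A toy sketch of that loop (every name here is a placeholder - a real system would plug in an LLM for generation/solving and real verifiers like unit tests, simulators, or proof checkers):

```python
import random

# Toy stand-in for the generate -> solve -> verify -> collect loop described above.
# The arithmetic task and all function names are illustrative, not a real system.

def generate_task():
    a, b = random.randint(0, 99), random.randint(0, 99)
    return f"What is {a} + {b}?", a + b          # prompt plus a checkable reference

def attempt(prompt):
    # Crude stand-in for a model's answer: parses the prompt, sometimes gets it wrong,
    # the way a real LLM sometimes would.
    a, b = [int(x) for x in prompt.removeprefix("What is ").rstrip("?").split(" + ")]
    return a + b + random.choice([0, 0, 0, 1])

experience = []
for _ in range(1000):
    prompt, reference = generate_task()
    answer = attempt(prompt)
    if answer == reference:                       # only verified solutions become training data
        experience.append((prompt, answer))

print(f"{len(experience)} verified examples ready to add to the next training run")
```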
Why? This adds something the AI lacks - experience. AI is well read but has no experience of its own. If we let the model collect its own experience, it becomes a different thing. For example, after training on about a thousand tasks, GPT-3 learned to solve new tasks at first sight, and after training on code it picked up multi-step reasoning (chain of thought). Both of these - supervised multi-task data and code - are collections of solved problems, samples of experience.
Ace_Snowlight OP t1_j0tzems wrote
Yea, we're lacking on so many fronts, but the tools already exist and are improving at a super-fast rate (maybe not accessible to everyone yet). They can already be used to start meeting these requirements in impressive ways, even if they're still far from good enough.
Tyanuh t1_j0vwmsi wrote
This is an interesting thought.
I would say what it also lacks is the ability to associate information about a concept through multiple "senses".
Once AI gets the ability to associate visual input with verbal input, for example, it will slowly build up a network of connections that is, in a sense, embodied, and actually connected to 'being' in an ontological sense.
visarga t1_j0whrvn wrote
DALL-E 1, Flamingo and Gato are like that. It is possible to concatenate the image tokens with the text tokens and have a single model learn cross-modal inference.
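A toy sketch of that shared-sequence idea (my own illustration, not the actual DALL-E/Gato code): text tokens and discretised image tokens share one vocabulary and one autoregressive transformer trained with next-token prediction.

```python
import torch
import torch.nn as nn

# Text vocab and image-code vocab merged into a single token space (sizes are arbitrary).
TEXT_VOCAB, IMAGE_VOCAB, D = 1000, 512, 64
VOCAB = TEXT_VOCAB + IMAGE_VOCAB

embed = nn.Embedding(VOCAB, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)   # used as a decoder via a causal mask
head = nn.Linear(D, VOCAB)

# Pretend these came from a text tokenizer and a VQ image tokenizer respectively.
text_tokens = torch.randint(0, TEXT_VOCAB, (1, 16))
image_tokens = torch.randint(0, IMAGE_VOCAB, (1, 32)) + TEXT_VOCAB  # offset into shared vocab
seq = torch.cat([text_tokens, image_tokens], dim=1)

inputs, targets = seq[:, :-1], seq[:, 1:]
L = inputs.size(1)
causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)

logits = head(backbone(embed(inputs), mask=causal_mask))
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
```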
Another way is to use a very large collection of text-image pairs and train a pair of models to match the right text to the right image (CLIP).
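The core of that contrastive objective is tiny - roughly this (my own toy version in PyTorch, not OpenAI's code):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits))                # i-th image matches i-th text
    loss_i = F.cross_entropy(logits, targets)          # images -> texts
    loss_t = F.cross_entropy(logits.t(), targets)      # texts -> images
    return (loss_i + loss_t) / 2

# Random embeddings standing in for the outputs of the image and text encoders.
print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)))
```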
They both display generalisation; for example, CLIP works as a zero-shot image classifier, which is very convenient, and it can guide diffusion models to generate images.
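Zero-shot classification with it is only a few lines, e.g. via the Hugging Face transformers wrappers (API from memory; the labels and file name are made up):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("some_photo.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)  # image-to-text similarities
print(dict(zip(labels, probs[0].tolist())))
```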
The BLIP model can even generate captions; this has been used to fix low-quality captions in training sets.
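Rough idea of how that looks with the transformers library (model name and API from memory, so double-check the docs):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("some_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)   # autoregressively decode a caption
print(processor.decode(out[0], skip_special_tokens=True))
```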