visarga t1_it4lygh wrote
Reply to comment by ReasonablyBadass in A YouTube large language model for a scant $35 million. by Angry_Grandpa_
Visual data can be described in text, and maybe it's better to do so, since it avoids overfitting to irrelevant visual details. We already have good captioning models for images and video, and we can pair them with speech recognition models to turn a whole video into text. Just imagine a model trained on YT videos playing the sports-commentator role - wouldn't it be great to have a virtual commentator for your vids?
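To make that concrete, here's a minimal sketch of what such a video-to-text pipeline could look like. This is just an illustration, not anything from the post: the model names are arbitrary examples, and it assumes the HuggingFace `pipeline` API plus OpenCV for frame grabbing.

```python
import cv2
from PIL import Image
from transformers import pipeline

# Example models - swap in whatever captioning/ASR models you prefer
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def describe_video(path, every_n_seconds=5):
    """Turn a video into text: timestamped frame captions plus a speech transcript."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps * every_n_seconds)
    captions, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # OpenCV decodes to BGR; vision models expect RGB
            img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            captions.append((idx / fps, captioner(img)[0]["generated_text"]))
        idx += 1
    cap.release()
    # ffmpeg (used under the hood) pulls the audio track out of the container;
    # chunking lets Whisper handle audio longer than 30 seconds
    transcript = asr(path, chunk_length_s=30)["text"]
    return captions, transcript
```

Interleave the captions and the transcript by timestamp and you get exactly the kind of text stream a language model could train on.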
But I'm excited about training on massive video for another reason - it contains a trove of procedural knowledge: how to do things, step by step. That means you can finetune such a model later to automate almost anything you want. Suddenly your clumsy robot has GPT-3-level smarts at practical tasks that are rarely written down anywhere.
There was a recent paper where, with just 5 hours of robot video and proprioception data, they trained a transformer to manipulate a toy kitchen and complete tasks. Pretty amazing when you consider Wozniak's coffee test for AI: a robot enters a random kitchen and has to make a cup of coffee. There are millions of kitchens on YT - millions of everything, in fact.
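For anyone curious what "transformer over video + proprioception" means in practice, here's a toy behavior-cloning sketch in PyTorch. To be clear, this is not the paper's architecture - every dimension and name below is made up - it just shows the basic idea: encode a window of (frame embedding, joint state) tokens causally and regress the recorded actions.

```python
import torch
import torch.nn as nn

class BehaviorTransformer(nn.Module):
    """Maps a window of (visual feature, proprioception) tokens to per-step actions."""
    def __init__(self, vis_dim=512, prop_dim=16, act_dim=8, d_model=256, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(vis_dim + prop_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, vis, prop):
        # vis: (B, T, vis_dim), prop: (B, T, prop_dim) -> actions: (B, T, act_dim)
        x = self.embed(torch.cat([vis, prop], dim=-1))
        # causal mask so step t only sees steps <= t, like at deployment time
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.encoder(x, mask=mask))

model = BehaviorTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One training step on a hypothetical batch of demonstration windows
vis = torch.randn(32, 64, 512)    # precomputed frame embeddings
prop = torch.randn(32, 64, 16)    # joint angles, gripper state, etc.
actions = torch.randn(32, 64, 8)  # recorded motor commands (the targets)
loss = nn.functional.mse_loss(model(vis, prop), actions)
loss.backward()
opt.step()
```

The point is how little machinery it takes: the hard part is the data, and YT has it in bulk.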
Looks like "learning to act" is going to be as successful as learning to generate text and images. Maybe handymen won't be the last to be automated after all.