SkyeandJett t1_jee2yc5 wrote
Reply to comment by Relevant_Ad7319 in Language Models can Solve Computer Tasks (by recursively criticizing and improving its output) by rationalkat
I don't want to say that's trivial, but it is easily solved. However, that's more or less irrelevant. GUIs are for humans. GPT accesses things directly through a CLI or API. This paper more or less confirms what everyone else has been saying and experimenting with. GPT-4 might not be AGI, but enhanced with memory, chain of thought, task generation and prioritization, self-checking and correction, etc., it probably is. Now give it access to tools (things like TaskMatrix are coming soon) and frankly it becomes an extremely powerful autonomous agent. You tell it what you need and it just...does it. This is all going to come together very quickly. Then drop an immensely more powerful core into the system, i.e. GPT-5, and things start getting stupid.
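For anyone wondering what "self-checking and correction" could look like in practice, here is a rough sketch of a critique-then-improve loop. It is not the paper's actual implementation; `generate`, `critique`, and `improve` are hypothetical callables standing in for model calls, and `max_rounds` is an arbitrary cap I picked.

```python
from typing import Callable, Optional


def critique_and_improve(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    improve: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> str:
    """Draft an answer, then repeatedly critique and revise it.

    All three callables are hypothetical stand-ins for model calls:
    - generate(task) returns a first draft
    - critique(task, answer) returns feedback, or None if nothing is wrong
    - improve(task, answer, feedback) returns a revised answer
    """
    answer = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, answer)
        if feedback is None:
            # The critic found nothing to fix, so stop early.
            break
        answer = improve(task, answer, feedback)
    return answer
```

With toy functions in place of a model, a draft of `ls` that gets the feedback "add -l" once would come back as `ls -l` after one round.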
Itchy-mane t1_jeece6h wrote
I literally sold all my AGIX coins after seeing TaskMatrix. Shit looks revolutionary when paired with GPT-4.
Relevant_Ad7319 t1_jee3l13 wrote
But not everything has an API. I think we need GPT to simulate mouse and keyboard inputs like a human in order to automate everything a human can do on a computer.
EDIT: No idea why I get downvoted for this 🤷‍♂️ This sub is strange
falldeaf t1_jeehm0d wrote
I bet it will be possible with the multimodal version! Essentially just give it the ability to take screenshots and an API for choosing the mouse position. It'd be interesting to know if that could work in a one-shot fashion.
WonderFactory t1_jeg6rye wrote
It's too slow at inference for something like that. It's probably far easier to do it the other way around: if you want your software to interface with GPT-4, build some sort of scripting interface into your app.
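A minimal sketch of what such a scripting interface might look like, assuming the app exposes a few named commands that a model can call with plain text lines. `ScriptingInterface` and the `rename` command are made up for illustration and do not come from any real app.

```python
import shlex
from typing import Callable, Dict, List


class ScriptingInterface:
    """Hypothetical text command layer an app could expose to an LLM."""

    def __init__(self) -> None:
        self._commands: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, func: Callable[..., str]) -> None:
        """Expose an app function under a command name."""
        self._commands[name] = func

    def describe(self) -> List[str]:
        """List the command names, e.g. to include in a prompt."""
        return sorted(self._commands)

    def run(self, line: str) -> str:
        """Parse one shell-style line and dispatch it to a command.

        Errors come back as text so the caller (a model) can read them
        and try again.
        """
        parts = shlex.split(line)
        if not parts:
            return "error: empty command"
        name, args = parts[0], parts[1:]
        func = self._commands.get(name)
        if func is None:
            return f"error: unknown command {name!r}"
        try:
            return func(*args)
        except TypeError as exc:
            # Usually a wrong number of arguments; this also catches
            # TypeErrors raised inside the command itself.
            return f"error: {exc}"
```

Usage would be something like `app.register("rename", rename_file)` followed by `app.run('rename "my file.txt" notes.txt')`, with the model only ever producing the text line.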
falldeaf t1_jeg8xaf wrote
It would be slower, but I'd disagree that it's too slow for that to work. In fact, I bet it could write something like AutoHotkey scripts to accomplish what it needs to do. You wouldn't have to have video and slowly move your mouse across the screen. You could get a screenshot, figure out where to move the mouse, then move the mouse to those coordinates and press the left mouse button, take a screenshot to confirm the app is open, etc.
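Roughly, the screenshot, locate, click, confirm loop described above could be sketched like this. Every callable here is a hypothetical placeholder: `locate` and `is_confirmed` would be calls to an image-capable model, and `take_screenshot` and `click` would come from whatever automation tool is available.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

Point = Tuple[int, int]


@dataclass
class StepResult:
    clicked_at: Optional[Point]  # last coordinates clicked, if any
    confirmed: bool              # whether the follow-up screenshot looked right


def click_and_confirm(
    take_screenshot: Callable[[], Any],
    locate: Callable[[Any], Optional[Point]],
    click: Callable[[Point], None],
    is_confirmed: Callable[[Any], bool],
    max_attempts: int = 3,
) -> StepResult:
    """Screenshot, find the target, click it, then screenshot again to verify.

    No video and no gradual mouse movement: each attempt is two still
    images and one click.
    """
    last_click: Optional[Point] = None
    for _ in range(max_attempts):
        target = locate(take_screenshot())
        if target is None:
            # Target not visible in this screenshot; try again.
            continue
        click(target)
        last_click = target
        if is_confirmed(take_screenshot()):
            return StepResult(clicked_at=target, confirmed=True)
    return StepResult(clicked_at=last_click, confirmed=False)
```

The slow part is the two model calls per attempt, which is why this should still be workable at current inference speeds for tasks that only need a handful of clicks.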
Having said that, anything that can be accomplished by opening a terminal should just be done there, as it would be faster. In the short term, though, there are lots of applications designed for humans that it would be great for LLMs to be able to interface with. Maybe in the long term they'll just write their own applications to accomplish something we'd normally need a GUI for. Maybe there will be interfaces that have a human-viewable component but most of the controls will be gone. Like imagine a 3D modelling application that just has a viewer with a few buttons to move the view around (it'll be easier to just spin the object to an angle yourself than to say it). But you'll have pointing and painting tools to help collaborate with the AI. ::draw a circle around a part of the mesh:: Make this area a little rougher. ::point to a leg, then draw a line coming out in a curve:: Have a tooth-like spike come out right here. Etc.
It'll be neat to see where this all goes. I suspect that UIs will radically change, but in the near term I'm sure there will be stop-gaps using current tech, too.
CommunismDoesntWork t1_jef7r37 wrote
Unix adopted the philosophy that text is the ultimate API, which is why everything on Linux can be done through the CLI, including moving the mouse. And LLMs are very good at using text. So everything sort of does have an API.
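As a small illustration of mouse control being just text, here is a sketch that builds `xdotool` command lines. It assumes an X11 session with `xdotool` installed (it is a separate tool, not part of Linux itself), and the helper names are made up. Nothing is executed here; the functions only produce the text a model would need to emit.

```python
import shlex
from typing import List


def mouse_move_command(x: int, y: int) -> List[str]:
    """Build the argv for `xdotool mousemove X Y` (absolute screen coordinates)."""
    if x < 0 or y < 0:
        raise ValueError("coordinates must be non-negative")
    return ["xdotool", "mousemove", str(x), str(y)]


def left_click_command() -> List[str]:
    """Build the argv for `xdotool click 1` (button 1 is the left button)."""
    return ["xdotool", "click", "1"]


def as_shell_line(argv: List[str]) -> str:
    """Render an argv list as a single, safely quoted shell line."""
    return shlex.join(argv)


if __name__ == "__main__":
    # Print the lines instead of running them; passing the argv lists to
    # subprocess.run would actually move and click the mouse.
    print(as_shell_line(mouse_move_command(640, 360)))
    print(as_shell_line(left_click_command()))
```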
Relevant_Ad7319 t1_jefsg9h wrote
Oh that’s cool I didn’t know that
arckeid t1_jeegitn wrote
I think this is a good way not just to make the AI, but to help humans stay in sync. To me it looks like the advancements are already so fast.
SgathTriallair t1_jeerghs wrote
The task paper addressed this. If it can see the screen, then in most cases a keyboard and mouse API will be the best option.
It knows where to click on the screen because it is trained to understand images just like it understands text. So it will know that a trash can means you want to delete data, the same way we know that.
CaliforniaMax02 t1_jeetc75 wrote
There are a lot of tools that automate complex mouse and keyboard tasks and processes normally done manually (UiPath, Blue Prism, Automation Anywhere, etc.), and this could be interfaced with them.
They can automatically open email attachments, copy text, open an Excel (or any other) window, enter the text in a structured way, etc.
Relevant_Ad7319 t1_jeftc4u wrote
It should be able to switch between doing taxes, browsing the web, and playing Valorant within minutes, just like a human can. That's not possible with UiPath etc.
Sure, in theory you can find or write an API for every task that you want it to do, but for me that's not what an AGI is.