basilgello t1_jeecmqt wrote on March 31, 2023 at 12:18 PM

Reply to comment by Relevant_Ad7319 in Language Models can Solve Computer Tasks (by recursively criticizing and improving its output) by rationalkat

Correct, GPT-4 is not meant to accept videos as input, and probably not screencasts either, but rather prompts that explain the task step by step. For example, look at Table 6 on page 18: it is a LangChain-like prompt. First they define the actions and tools, and then the language model produces output that is actually a high-level API call in some form. Using RPA as the API, you get a mouse clicker driven by the HTML context. Another thing: the HTML pages are crafted manually, and the system still does not understand unseen pages.
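
To make the "define actions, then let the model emit an API call" idea concrete, here is a rough Python sketch. The action names, the prompt wording, and the `name("arg", ...)` call syntax are all my own assumptions for illustration, not the format from the paper's table, and the model output is hard-coded instead of coming from a real LLM.

```python
import re

# Hypothetical action list; names and descriptions are illustrative only.
ACTIONS = {
    "click": "click(selector): click the element matching the selector",
    "type": "type(selector, text): type text into the matching element",
}

def build_prompt(task, html):
    """Assemble a tool-style prompt: available actions, page HTML, then the task."""
    tools = "\n".join(f"- {desc}" for desc in ACTIONS.values())
    return (
        "You can use these actions:\n"
        f"{tools}\n\n"
        f"HTML:\n{html}\n\n"
        f"Task: {task}\n"
        "Next action:"
    )

# Assumed output format: an action name followed by double-quoted arguments.
CALL_RE = re.compile(r'^\s*(\w+)\((.*)\)\s*$')
ARG_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')

def parse_action(model_output):
    """Turn a line such as click("#submit") into (name, [args])."""
    match = CALL_RE.match(model_output)
    if match is None:
        raise ValueError(f"not an action call: {model_output!r}")
    name, raw_args = match.groups()
    if name not in ACTIONS:
        raise ValueError(f"unknown action: {name}")
    return name, ARG_RE.findall(raw_args)

def dispatch(name, args, log):
    """Stand-in for the RPA layer: record the call instead of moving a mouse."""
    log.append((name, tuple(args)))

if __name__ == "__main__":
    html = '<input id="name"><button id="submit">Send</button>'
    print(build_prompt("Enter a name and submit the form", html))
    executed = []
    # Hard-coded stand-ins for what a language model might return.
    for output in ['type("#name", "Alice")', 'click("#submit")']:
        dispatch(*parse_action(output), executed)
    print(executed)
```

In a real setup `dispatch` would hand the parsed call to an RPA or browser automation tool. Note that the model only ever sees the HTML text, which fits the point above about unseen pages being a problem.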