
basilgello t1_jeecmqt wrote

Correct, GPT-4 is not meant to accept videos as input. And the input here is probably not screencasts either, but prompts explained step by step. For example, look at Table 6 on page 18: it is a LangChain-like prompt. First, they define the actions and tools, and then the language model emits output that is effectively a high-level API call in some form. Using RPA as the API, you get a mouse clicker driven by the HTML context. Another thing: the HTML pages are crafted manually, and the system still does not understand unseen pages.
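To make the "define actions and tools, then the model outputs an API call" idea concrete, here is a rough Python sketch. It is not the prompt from the paper and not LangChain's actual API: the `ACTIONS` registry, `build_prompt`, `parse_action`, `dispatch`, and the `FakeRPA` class are all hypothetical names made up for illustration, and the model's reply is a hard-coded string rather than a real model call.

```python
import re

# Hypothetical action registry: the "tools" that the prompt advertises.
ACTIONS = {
    "click": "click(element_id): click the HTML element with this id",
    "type": "type(element_id, text): type text into the element with this id",
}


def build_prompt(task, html):
    """List the allowed actions first, then the page HTML and the task."""
    tool_lines = "\n".join(f"- {desc}" for desc in ACTIONS.values())
    return (
        "You can use these actions:\n"
        f"{tool_lines}\n\n"
        f"HTML:\n{html}\n\n"
        f"Task: {task}\n"
        "Action:"
    )


def parse_action(model_output):
    """Parse a reply such as 'click(submit-btn)' into (name, [args]).

    Assumption: arguments contain no commas; a real parser would need more care.
    """
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", model_output)
    if match is None:
        raise ValueError(f"not an action call: {model_output!r}")
    name, raw_args = match.groups()
    if name not in ACTIONS:
        raise ValueError(f"unknown action: {name}")
    args = []
    if raw_args.strip():
        args = [a.strip().strip("'\"") for a in raw_args.split(",")]
    return name, args


class FakeRPA:
    """Stand-in for an RPA backend; it only records what it was asked to do."""

    def __init__(self):
        self.log = []

    def click(self, element_id):
        self.log.append(("click", element_id))

    def type(self, element_id, text):
        self.log.append(("type", element_id, text))


def dispatch(rpa, model_output):
    """Turn the model's text reply into a call on the RPA backend."""
    name, args = parse_action(model_output)
    getattr(rpa, name)(*args)


if __name__ == "__main__":
    html = '<input id="search-box"><button id="submit-btn">Go</button>'
    print(build_prompt("Search for cats", html))
    rpa = FakeRPA()
    # Hard-coded replies standing in for what a language model might return.
    dispatch(rpa, 'type(search-box, "cats")')
    dispatch(rpa, "click(submit-btn)")
    print(rpa.log)
```

The point of the sketch is that the model never sees pixels or video: it only sees text (the action list plus HTML) and answers with text, and everything that looks like "using the computer" happens in the dispatcher. It also shows why unseen pages are a problem, since the reply is only useful if the ids in it actually exist in the HTML that was given.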
