TL;DR: A trinity of environments, tools, and benchmarks for general virtual agents
AgentStudio targets the desiderata of robust, general, and open-ended virtual agents by providing:
(1) a lightweight, interactive environment with highly generic observation and action spaces, e.g., video observations and GUI/API actions (a minimal interaction loop is sketched below);
(2) tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos;
(3) online benchmark tasks that evaluate both GUI interactions and function calling, with auto-evaluation and language feedback;
(4) three benchmark datasets, GroundUI, IDMBench, and CriticBench, covering the fundamental agent abilities of GUI grounding, learning from videos, and success detection.
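To make the generic observation and action spaces concrete, here is a minimal sketch of the interaction loop. Every name below (`Observation`, `Action`, `run_episode`, `env.reset`, `env.step`, `env.evaluate`) is a hypothetical stand-in, not the actual AgentStudio API; please refer to the code repository for the real interface.

```python
# A minimal sketch of the generic observation/action loop described above.
# All class and method names are illustrative assumptions, not the actual
# AgentStudio API.
from dataclasses import dataclass
from typing import Any

@dataclass
class Observation:
    frames: list             # recent screen frames (video observation)
    instruction: str         # natural-language task description

@dataclass
class Action:
    kind: str                # "gui" (mouse/keyboard) or "api" (function call)
    payload: dict[str, Any]  # e.g. {"op": "click", "x": 120, "y": 48}

def run_episode(env, agent, max_steps: int = 50) -> bool:
    """Observe, act, and repeat until the episode ends; an auto-evaluator
    decides success, and the environment may return language feedback."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)                 # observation -> action
        obs, done, feedback = env.step(action)  # hypothetical step signature
        if done:
            return env.evaluate()               # auto-evaluator verdict
    return False
```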
For more details on AgentStudio environments, tools, and benchmarks, please refer to our paper and code.
Resources
All files for the online benchmark tasks, together with the images for the three datasets, are available on Google Drive. We also provide the three datasets on Hugging Face. The JSONL files for GroundUI-1K and Trajectory-Lite can also be found in our GitHub repository. Please feel free to open a GitHub issue if you have any questions or comments, or want to submit new benchmark results.
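As an illustration, the datasets can be loaded with the Hugging Face `datasets` library or read directly from the JSONL files. The repository ID `agent-studio/GroundUI-1K` and the local file name `groundui-1k.jsonl` below are assumptions for illustration; check the Hugging Face page and GitHub repository for the exact identifiers.

```python
# A sketch of loading the benchmark data. The Hugging Face repository ID
# "agent-studio/GroundUI-1K" and the file name "groundui-1k.jsonl" are
# assumed for illustration; verify them against the links above.
import json
from datasets import load_dataset

# Option 1: pull the dataset from Hugging Face.
groundui = load_dataset("agent-studio/GroundUI-1K", split="train")
print(groundui[0])  # inspect a single example

# Option 2: read the JSONL file shipped in the GitHub repository.
with open("groundui-1k.jsonl") as f:
    examples = [json.loads(line) for line in f]
print(len(examples))
```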