AgentStudio

A Toolkit for Building General Virtual Agents

Longtao Zheng1*, Zhiyuan Huang3*, Zhenghai Xue1, Xinrun Wang1, Bo An1,2, Shuicheng Yan2

1Nanyang Technological University, Singapore   2Skywork AI, Singapore   3ETH Zurich   (*Equal contribution)

TL;DR: A trinity of environments, tools, and benchmarks for general virtual agents

AgentStudio targets the desiderata for robust, general, and open-ended virtual agents by providing: (1) a lightweight, interactive environment with highly generic observation and action spaces, e.g., video observations and GUI/API actions, (2) tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos, (3) online benchmark tasks that evaluate both GUI interactions and function calling with auto-evaluation and language feedback, and (4) three benchmark datasets: GroundUI, IDMBench, and CriticBench, for fundamental agent abilities, including GUI grounding, learning from videos, and success detection.

For more details on AgentStudio environments, tools, and benchmarks, please refer to our paper and code.

Resources

All the files for the online benchmark tasks and the images for the three datasets are available on Google Drive. We also provide the three datasets on Hugging Face. The jsonl files of GroundUI-1K and Trajectory-Lite can also be found in our GitHub repository. Please feel free to raise a GitHub issue if you have any questions or comments, or if you want to submit new benchmark results.

AgentStudio Online Benchmark Leaderboard

The online benchmark suite consists of 205 tasks. These tasks span API usage, such as the terminal and Gmail, and GUI software, such as VS Code, within the AgentStudio environment. Solving them requires various fundamental agent abilities, including general grounding across a complex action space.

| Model | Single API | Single GUI | Compositional | Total | Date |
|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 82.0 | 20.0 | 25.0 | 36.6 | 2024-10-02 |
| gpt-4o-2024-08-06 | 72.0 | 24.2 | 23.3 | 35.6 | 2024-10-02 |
| gemini-1.5-pro-001 | 36.0 | 13.6 | 5.0 | 16.6 | 2024-10-02 |
| gemini-1.5-flash-001 | 28.0 | 9.5 | 6.7 | 13.2 | 2024-10-02 |


Single API (Details)

Single-API tasks can be accomplished through direct API calls, using a text-only observation space.
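As a rough illustration (not AgentStudio's actual API), a single-API task such as creating a folder can be solved with one tool call executed in the terminal; the names `ToolCall` and `run_shell` below are hypothetical.

```python
# Hypothetical sketch of a single-API action: the agent reads a text-only
# observation (instruction plus prior tool output) and responds with a direct
# API/tool call instead of GUI operations. `ToolCall` and `run_shell` are
# illustrative names, not AgentStudio's actual interface.
import subprocess
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str        # which API/tool to invoke
    arguments: dict  # JSON-serializable arguments

def run_shell(command: str) -> str:
    """Execute a terminal command and return its output as the next text observation."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

# Example: an OS task ("create a folder named 'reports'") solved with a single
# API call rather than mouse/keyboard actions.
action = ToolCall(name="run_shell", arguments={"command": "mkdir -p reports"})
observation = run_shell(**action.arguments)  # fed back to the agent as text
```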

| Model | OS | Google Docs | Google Calendar | Gmail | Date |
|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 94.7 | 42.9 | 90.9 | 76.9 | 2024-10-02 |
| gpt-4o-2024-08-06 | 100.0 | 28.6 | 81.8 | 46.2 | 2024-10-02 |
| gemini-1.5-pro-001 | 68.4 | 0.0 | 18.2 | 23.1 | 2024-10-02 |
| gemini-1.5-flash-001 | 52.6 | 14.3 | 9.1 | 15.4 | 2024-10-02 |


Single GUI (Details)

Single-GUI tasks involve common daily applications where agents are provided with screenshots as well as text observations. These tasks can be accomplished through GUI operations or API calls.

| Model | GIMP | OS | VSCode | LibreOffice Impress | LibreOffice Calc | LibreOffice Writer | Date |
|---|---|---|---|---|---|---|---|
| gpt-4o-2024-08-06 | 0.0 | 94.7 | 15.0 | 13.3 | 0.0 | 0.0 | 2024-10-02 |
| claude-3-5-sonnet-20240620 | 0.0 | 94.7 | 5.0 | 0.0 | 0.0 | 0.0 | 2024-10-02 |
| gemini-1.5-pro-001 | 0.0 | 63.2 | 5.0 | 0.0 | 0.0 | 0.0 | 2024-10-02 |
| gemini-1.5-flash-001 | 0.0 | 47.4 | 0.0 | 0.0 | 0.0 | 0.0 | 2024-10-02 |


Data Sample of Task Configuration

Task configuration with simplified evaluation/reset/cleanup procedures; a sketch of how such a configuration could be serialized follows the table below.

| Key | Value |
|---|---|
| Task ID | 08aced46-45a2-48d7-993b-ed3fb5b32302 |
| Instruction | Give the slide 2 a right aligned title, "Note". |
| Visual | True |
| Max Steps | 30 |
| Max Time | 60.0 |
| Evaluation Procedure | Compare between "ref.pptx" and "target.pptx" |
| Reset Procedure | 1. Create folder structure, 2. Copy file, 3. Open PPTX file |
| Cleanup Procedure | 1. Delete folder structure, 2. Kill LibreOffice process |
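Below is a minimal sketch of how such a configuration could be serialized as a JSONL record; the field names, procedure entries, and paths are illustrative assumptions and may differ from the actual AgentStudio task schema.

```python
# Illustrative task configuration record (field names and procedure entries are
# assumptions, not the exact AgentStudio schema).
task_config = {
    "task_id": "08aced46-45a2-48d7-993b-ed3fb5b32302",
    "instruction": 'Give the slide 2 a right aligned title, "Note".',
    "visual": True,        # screenshots are part of the observation space
    "max_steps": 30,       # step budget before the episode terminates
    "max_time": 60.0,      # wall-clock budget (seconds)
    "eval_procedure": [    # auto-evaluation: compare the edited file with a reference
        {"compare_pptx": {"ref": "ref.pptx", "target": "target.pptx"}},
    ],
    "reset_procedure": [   # bring the environment into a known initial state
        {"create_folder": {"path": "tasks/pptx"}},
        {"copy_file": {"src": "ref.pptx", "dst": "tasks/pptx/target.pptx"}},
        {"open_file": {"path": "tasks/pptx/target.pptx"}},
    ],
    "cleanup_procedure": [ # undo side effects after evaluation
        {"delete_folder": {"path": "tasks/pptx"}},
        {"kill_process": {"name": "soffice"}},
    ],
}
```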

GroundUI Leaderboard

UI grounding with accurate coordinates is one of the main challenges for human-like computer agents, since not all interactable elements are readily available. It has also been validated that current models can already generate correct high-level plans in text space, but struggle to ground them into accurate actions. However, few existing benchmarks provide evaluation results on UI grounding capabilities across different applications and platforms. In AgentStudio, we systematically re-organize existing datasets, plus self-collected data, into 18K diverse and realistic samples with re-captioned, clear instructions to benchmark UI grounding.

| Model | Web | Desktop | Mobile | Total | Date |
|---|---|---|---|---|---|
| SeeClick | 64.3 | 44.3 | 73.7 | 61.1 | 2024-06-06 |
| gemini-1.5-pro-001 | 31.2 | 24.3 | 51.3 | 35.2 | 2024-08-17 |
| CogAgent | 25.3 | 15.7 | 35.7 | 25.5 | 2024-06-06 |
| claude-3-5-sonnet-20240620 | 13.0 | 14.0 | 26.3 | 17.3 | 2024-08-17 |
| gpt-4o-2024-05-13 | 7.5 | 8.3 | 26.3 | 13.4 | 2024-06-06 |
| gpt-4-turbo-2024-04-09 | 5.3 | 11.0 | 23.0 | 12.3 | 2024-06-06 |
| gemini-1.5-flash-001 | 0.5 | 4.3 | 26.3 | 9.4 | 2024-06-06 |
| CogVLM2-Llama3-chat-19B | 2.5 | 2.7 | 5.3 | 3.4 | 2024-06-06 |
| Gemini-1.0 Pro | 0.5 | 0.3 | 5.0 | 1.8 | 2024-06-06 |
| MiniCPM-Llama3-V 2.5 | 0.0 | 0.3 | 2.7 | 0.9 | 2024-06-06 |
| Qwen-VL-Chat | 0.0 | 0.0 | 0.0 | 0.0 | 2024-06-06 |
| PaliGemma-3B-896 | 0.0 | 0.0 | 0.0 | 0.0 | 2024-08-17 |
| PaliGemma-3B-mix-448 | 0.0 | 0.0 | 0.0 | 0.0 | 2024-08-17 |


Data Sample

We collected screenshots from the test sets of existing datasets across web, desktop, and mobile devices, together with additional screenshots collected with the AgentStudio toolkit. We augment the instructions into detailed and unambiguous ones with the help of GPT-4o. These data add up to an 18K-sample UI grounding dataset. For efficient benchmarking, we conduct experiments on a subset, GroundUI-1K, which contains 400, 300, and 300 samples for web, desktop, and mobile devices, respectively.
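For illustration, a GroundUI record pairs a screenshot and an instruction with the target element's bounding box, and a prediction can be scored by checking whether the predicted click point falls inside that box. The field names and scoring rule below are assumptions rather than the official evaluation code.

```python
# Sketch of a GroundUI-style sample and a simple point-in-box grounding check.
# Field names and the exact metric are illustrative assumptions.
sample = {
    "image": "web/screenshot_0001.png",
    "instruction": "Click the 'Sign in' button at the top-right corner.",
    "bbox": [1180, 24, 1260, 56],  # [x1, y1, x2, y2] in pixels
    "platform": "web",
}

def is_correct(pred_xy: tuple[float, float], bbox: list[float]) -> bool:
    """Return True if the predicted (x, y) click point lies within the target bbox."""
    x, y = pred_xy
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

print(is_correct((1200, 40), sample["bbox"]))  # True
```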

IDMBench Leaderboard

Unlocking the ability to learn from videos is a key capability for next-generation computer agents to achieve generalization and lifelong learning. Therefore, we present the first dataset designed to measure the ability to learn how to act from videos. Specifically, we evaluate current multimodal models as inverse dynamics models, predicting actions from unlabeled videos in two settings. In the first, we provide the model with two neighboring screenshots, one before and one after an action, and ask it to predict the action that occurred in between. In the second, more general setting, we provide the model with the states for multiple actions (which can be viewed as video frames) and ask it to predict all the actions within the frames.
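To make the first setting concrete, the sketch below shows one way a pair of neighboring screenshots could be turned into an action-prediction query; the prompt wording and message format are illustrative, not the exact prompts used in IDMBench.

```python
# Rough sketch of an IDM-Single query: given the screenshots immediately before
# and after a single action, the model is asked to recover that action.
def build_idm_single_query(frame_before: str, frame_after: str, action_space: list[str]):
    """Build a multimodal query asking which action was taken between two frames."""
    return [
        {"type": "text", "text": (
            "You are given two consecutive screenshots of a GUI. "
            f"Choose the action executed between them from: {action_space}. "
            "Answer with the action and its arguments, e.g. click(x=640, y=360)."
        )},
        {"type": "image", "path": frame_before},
        {"type": "image", "path": frame_after},
    ]

query = build_idm_single_query(
    "traj_07/step_03.png", "traj_07/step_04.png",
    action_space=["click", "double_click", "type", "scroll", "press_key"],
)
```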

IDM-Single (Accuracy %)


| Model | Mind2Web | AITW | VWA | AgentStudio | Total | Date |
|---|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 73.0 | 56.0 | 50.0 | 72.0 | 61.4 | 2024-08-17 |
| gpt-4o-2024-05-13 | 70.0 | 56.0 | 45.0 | 78.0 | 60.0 | 2024-06-06 |
| gemini-1.5-pro-001 | 62.0 | 51.0 | 46.0 | 48.0 | 52.3 | 2024-08-17 |
| gemini-1.5-flash-001 | 65.0 | 34.0 | 31.0 | 60.0 | 45.7 | 2024-08-17 |
| Qwen-VL-Chat | 37.0 | 20.0 | 5.0 | 20.0 | 20.6 | 2024-06-06 |


IDM-Multiple (Accuracy %)


| Model | Mind2Web | AITW | VWA | AgentStudio | Total | Date |
|---|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 18.0 | 8.0 | 7.0 | 22.2 | 12.5 | 2024-08-17 |
| gpt-4o-2024-05-13 | 13.0 | 8.0 | 2.0 | 20.0 | 9.3 | 2024-06-06 |
| gemini-1.5-pro-001 | 0.0 | 0.0 | 1.0 | 2.2 | 0.6 | 2024-08-17 |
| Qwen-VL-Chat | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2024-06-06 |
| gemini-1.5-flash-001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2024-08-17 |


IDM-Multiple (Edit Distance)


| Model | Mind2Web | AITW | VWA | AgentStudio | Total | Date |
|---|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 2.0 | 2.1 | 2.9 | 1.6 | 2.3 | 2024-08-17 |
| gpt-4o-2024-05-13 | 2.1 | 2.2 | 3.5 | 2.0 | 2.5 | 2024-06-06 |
| gemini-1.5-pro-001 | 6.0 | 4.4 | 7.0 | 3.8 | 5.5 | 2024-08-17 |
| Qwen-VL-Chat | 5.1 | 15.4 | 5.8 | 6.3 | 8.4 | 2024-06-06 |
| gemini-1.5-flash-001 | 294.5 | 7.2 | 7.2 | 7.8 | 90.6 | 2024-08-17 |
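The edit-distance numbers above measure how far the predicted action sequence is from the ground truth (lower is better). Below is a minimal sketch, assuming each action is compared as a single token with standard Levenshtein distance; the action normalization used in the official benchmark may differ.

```python
# Levenshtein distance between predicted and ground-truth action sequences,
# treating each action as one token (a sketch of the IDM-Multiple metric).
def edit_distance(pred: list[str], gold: list[str]) -> int:
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(edit_distance(["click(100,200)", "type('hi')"],
                    ["click(100,200)", "type('hello')", "press('enter')"]))  # 2
```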


Data Sample of IDM-Single



Data Sample of IDM-Multiple


CriticBench Leaderboard

The ability to self-evaluate and learn from environment interactions is one of the core abilities of agents. However, there are currently few benchmarks that measure the ability of computer agents to judge whether a trajectory is successful.
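To illustrate the setup, the sketch below builds a success-detection query from a task instruction and a trajectory (screenshots, optionally interleaved with the actions taken) and scores predictions with simple accuracy; the prompt wording and parsing are assumptions, not the official evaluation code.

```python
# Sketch of a CriticBench-style success-detection query and its accuracy metric.
def build_critic_query(instruction: str, frames: list[str],
                       actions: list[str] | None = None) -> list[dict]:
    content = [{"type": "text", "text": (
        f"Task: {instruction}\n"
        "Below is the trajectory of an agent attempting this task. "
        "Did the agent complete the task successfully? Answer 'success' or 'failure'."
    )}]
    for i, frame in enumerate(frames):
        content.append({"type": "image", "path": frame})
        if actions is not None and i < len(actions):
            content.append({"type": "text", "text": f"Action {i}: {actions[i]}"})
    return content

def accuracy(predictions: list[str], labels: list[str]) -> float:
    correct = sum(p.strip().lower() == l for p, l in zip(predictions, labels))
    return 100.0 * correct / len(labels)
```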

With Observation-Action Pairs (Accuracy %)


| Model | Web | Desktop | Mobile | Total | Date |
|---|---|---|---|---|---|
| gemini-1.5-pro-001 | 75.3 | 88.9 | 70.0 | 76.7 | 2024-08-17 |
| gemini-1.5-flash-001 | 72.3 | 83.9 | 72.7 | 74.8 | 2024-08-17 |
| claude-3-5-sonnet-20240620 | 72.2 | 100.0 | 61.9 | 75.9 | 2024-08-17 |
| gpt-4o-2024-05-13 | 69.1 | 93.1 | 68.2 | 73.6 | 2024-06-06 |
| Qwen-VL-Chat | 51.7 | 48.7 | 49.0 | 50.2 | 2024-06-06 |


With Observations Only (Accuracy %)


| Model | Web | Desktop | Mobile | Total | Date |
|---|---|---|---|---|---|
| gemini-1.5-pro-001 | 68.8 | 89.7 | 65.5 | 72.5 | 2024-08-17 |
| gemini-1.5-flash-001 | 70.2 | 80.8 | 70.0 | 72.1 | 2024-08-17 |
| claude-3-5-sonnet-20240620 | 67.4 | 96.0 | 63.3 | 71.4 | 2024-08-17 |
| gpt-4o-2024-05-13 | 65.2 | 92.3 | 66.7 | 70.6 | 2024-06-06 |
| Qwen-VL-Chat | 53.1 | 59.2 | 51.0 | 53.4 | 2024-06-06 |


Data Sample

We collect trajectories from existing environments, such as AITW, Mind2Web, and VisualWebArena, as well as from AgentStudio's real-world environments, resulting in diverse trajectories of both humans and agents across web, desktop, and mobile environments. Since most human trajectories are successful, we balance the dataset by labeling partial trajectories as failure cases.
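A minimal sketch of this balancing idea: a successful trajectory can be cut short at a random step and the resulting prefix labeled as a failure. The cut point and labeling scheme below are illustrative assumptions.

```python
import random

def make_failure_case(trajectory: list[dict]) -> dict:
    """Truncate a successful trajectory and label the partial prefix as a failure."""
    assert len(trajectory) >= 2, "need at least two steps to truncate"
    cut = random.randint(1, len(trajectory) - 1)  # keep a prefix, drop the rest
    return {"steps": trajectory[:cut], "label": "failure"}

def make_success_case(trajectory: list[dict]) -> dict:
    return {"steps": trajectory, "label": "success"}
```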

AgentStudio Environment

AgentStudio provides an interactive, realistic, and lightweight environment with generic observation and action spaces, enabling agents to interact with arbitrary software. The observation space incorporates multiple modalities, ranging from screen recordings (videos) and screenshots (images) to code execution results (text). Agents can act through human-computer interfaces (e.g., keyboard-mouse operations) to control third-party applications, and perform function calling to interact with APIs. These features expand the task space to massively open-domain and real-world tasks typically performed by humans. The interactive nature of online environments allows agents to learn through trial and error, which is enhanced by the language feedback on failure reasons provided by our environment.
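Below is a hypothetical interaction loop illustrating the observation and action spaces described above; the class, method, and action names are illustrative stand-ins, not AgentStudio's actual Python API.

```python
# Dummy environment: real observations would be screenshots/recordings plus text,
# and real actions would be keyboard/mouse operations or code/API calls.
class DummyDesktopEnv:
    def reset(self, task_config: dict) -> dict:
        # Run the task's reset procedure, then return the first observation.
        return {"screenshot": "step_0.png", "text": ""}

    def step(self, action: str) -> tuple[dict, bool]:
        # Execute the action and return the next observation and a done flag.
        return {"screenshot": "step_1.png", "text": f"executed: {action}"}, True

    def evaluate(self) -> tuple[float, str]:
        # Auto-evaluate the task and return a score plus language feedback.
        return 1.0, "Slide 2 title matches the reference."

def dummy_agent(obs: dict) -> str:
    # A real agent would query a multimodal model with the observation here.
    return "mouse.click(1200, 40)"

env = DummyDesktopEnv()
obs = env.reset({"task_id": "example"})
done = False
while not done:
    obs, done = env.step(dummy_agent(obs))
score, feedback = env.evaluate()  # language feedback on failure reasons, if any
```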

For a detailed comparison with existing environments and benchmarks, please refer to our paper.

AgentStudio Tools


Online Benchmark GUI

The figure below shows the process of running our online benchmark with the AgentStudio toolkit. You can select tasks, execute them with agents, and evaluate the agents' performance afterwards. You can also easily create your own agents, tasks, and evaluators with our toolkit.


GroundUI Annotator

Here is an example of recording single-step GUI grounding data on macOS.


Trajectory Recorder/Editor

Here is an example video of recording and editing video trajectories with action labels.

Citation

If you find the data or code useful, please consider citing us:

@article{zheng2024agentstudio,
  title={AgentStudio: A Toolkit for Building General Virtual Agents},
  author={Longtao Zheng and Zhiyuan Huang and Zhenghai Xue and Xinrun Wang and Bo An and Shuicheng Yan},
  journal={arXiv preprint arXiv:2403.17918},
  year={2024}
}

Website template from SWE-bench