AgentStudio

A Toolkit for Building General Virtual Agents

Longtao Zheng1*, Zhiyuan Huang3*, Zhenghai Xue1, Xinrun Wang1, Bo An1,2, Shuicheng Yan2

1Nanyang Technological University, Singapore   2Skywork AI, Singapore   3ETH Zurich   (*Equal contribution)

TL;DR: A trinity of environments, tools, and benchmarks for general virtual agents

AgentStudio targets the desiderata for robust, general, and open-ended virtual agents by providing: (1) a lightweight, interactive environment with highly generic observation and action spaces, e.g., video observations and GUI/API actions, (2) tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos, (3) online benchmark tasks that evaluate both GUI interactions and function calling with auto-evaluation and language feedback, and (4) three benchmark datasets: GroundUI, IDMBench, and CriticBench, for fundamental agent abilities, including GUI grounding, learning from videos, and success detection.

For more details on AgentStudio environments, tools, and benchmarks, please refer to our paper and code.

Resources

All the files for the online benchmark tasks and the images for the three datasets are available on Google Drive. We also provide the three datasets on Hugging Face. The jsonl files of GroundUI-1K and Trajectory-Lite can also be found in our GitHub repository. Please feel free to raise a GitHub issue if you have any questions or comments, or want to submit new benchmark results.
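As a quick start, the Hugging Face copies can be loaded with the datasets library. This is only a sketch: the repository id and split name below are placeholders, so substitute the actual names listed on our Hugging Face page.

```python
# Minimal sketch: load a GroundUI split from Hugging Face.
# "agent-studio/GroundUI-1K" is a placeholder repository id and "train" a
# placeholder split -- substitute the names from our Hugging Face page.
from datasets import load_dataset

ds = load_dataset("agent-studio/GroundUI-1K", split="train")
print(ds[0])  # one sample, e.g., an instruction with its screenshot and target element
```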

AgentStudio Online Benchmark Leaderboard

The online benchmark suite consists of 205 tasks. These tasks span API usage (e.g., terminal and Gmail) and GUI software (e.g., VS Code) in the AgentStudio environment. Solving them requires various fundamental agent abilities, including grounding actions in a complex action space.

| Model | Single API | Single GUI | Compositional | Total | Date |
|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 82.0 | 20.0 | 25.0 | 36.6 | 2024-10-02 |
| gpt-4o-2024-08-06 | 72.0 | 24.2 | 23.3 | 35.6 | 2024-10-02 |
| gemini-1.5-pro-001 | 36.0 | 13.6 | 5.0 | 16.6 | 2024-10-02 |
| gemini-1.5-flash-001 | 28.0 | 9.5 | 6.7 | 13.2 | 2024-10-02 |


Single API (Details)

Single-API tasks can be accomplished through direct API calls and use a text-only observation space.

| Model | OS | Google Docs | Google Calendar | Gmail | Date |
|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 94.7 | 42.9 | 90.9 | 76.9 | 2024-10-02 |
| gpt-4o-2024-08-06 | 100.0 | 28.6 | 81.8 | 46.2 | 2024-10-02 |
| gemini-1.5-pro-001 | 68.4 | 0.0 | 18.2 | 23.1 | 2024-10-02 |
| gemini-1.5-flash-001 | 52.6 | 14.3 | 9.1 | 15.4 | 2024-10-02 |


Single GUI (Details)

Single-GUI tasks involve common daily applications where agents are provided with screenshots as well as text observations. These tasks can be accomplished through either GUI operations or API calls.
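For intuition, the snippet below shows the kind of keyboard-mouse operations such tasks involve, expressed with the common pyautogui library. It is illustrative only; AgentStudio's own action interface is defined in the repository.

```python
# Illustrative GUI actions using pyautogui (not AgentStudio's actual action
# interface): click a coordinate, type text, and trigger a keyboard shortcut.
import pyautogui

pyautogui.click(x=640, y=360)        # click a GUI element at pixel coordinates
pyautogui.write("Quarterly Report")  # type text into the focused field
pyautogui.hotkey("ctrl", "s")        # save via a keyboard shortcut
```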

| Model | GIMP | OS | VSCode | LibreOffice Impress | LibreOffice Calc | LibreOffice Writer | Date |
|---|---|---|---|---|---|---|---|
| gpt-4o-2024-08-06 | 0.0 | 94.7 | 15.0 | 13.3 | 0.0 | 0.0 | 2024-10-02 |
| claude-3-5-sonnet-20240620 | 0.0 | 94.7 | 5.0 | 0.0 | 0.0 | 0.0 | 2024-10-02 |
| gemini-1.5-pro-001 | 0.0 | 63.2 | 5.0 | 0.0 | 0.0 | 0.0 | 2024-10-02 |
| gemini-1.5-flash-001 | 0.0 | 47.4 | 0.0 | 0.0 | 0.0 | 0.0 | 2024-10-02 |


Data Sample of Task Configuration

Task configuration with simplified evaluation/reset/cleanup procedures.

| Key | Value |
|---|---|
| Task ID | 08aced46-45a2-48d7-993b-ed3fb5b32302 |
| Instruction | Give the slide 2 a right aligned title, "Note". |
| Visual | True |
| Max Steps | 30 |
| Max Time | 60.0 |
| Evaluation Procedure | Compare between "ref.pptx" and "target.pptx" |
| Reset Procedure | 1. Create folder structure, 2. Copy file, 3. Open PPTX file |
| Cleanup Procedure | 1. Delete folder structure, 2. Kill LibreOffice process |
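To make the structure concrete, the sample above can be written roughly as the following configuration. The field names are illustrative; the exact schema used by AgentStudio is defined in the repository.

```python
# A sketch of the task configuration above as a Python dict. Field names are
# illustrative; the exact schema used by AgentStudio lives in the repository.
task_config = {
    "task_id": "08aced46-45a2-48d7-993b-ed3fb5b32302",
    "instruction": 'Give the slide 2 a right aligned title, "Note".',
    "visual": True,        # screenshots are part of the observation
    "max_steps": 30,
    "max_time": 60.0,      # time limit for the episode
    "eval": {"type": "compare_pptx", "ref": "ref.pptx", "target": "target.pptx"},
    "reset": ["create_folder_structure", "copy_file", "open_pptx"],
    "cleanup": ["delete_folder_structure", "kill_libreoffice_process"],
}
```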

GroundUI Leaderboard

UI grounding with accurate coordinates is one of the main challenges for human-like computer agents, since not all interactable elements are readily available to the agent. It has also been shown that current models can already generate correct high-level plans in text, but struggle to ground them into accurate actions. However, few existing benchmarks evaluate UI grounding capabilities across different applications and platforms. In AgentStudio, we systematically reorganize existing datasets, together with self-collected data, into 18K diverse and realistic samples with recaptioned, clear instructions to benchmark UI grounding.

| Model | Web | Desktop | Mobile | Total | Date |
|---|---|---|---|---|---|
| SeeClick | 64.3 | 44.3 | 73.7 | 61.1 | 2024-06-06 |
| gemini-1.5-pro-001 | 31.2 | 24.3 | 51.3 | 35.2 | 2024-08-17 |
| CogAgent | 25.3 | 15.7 | 35.7 | 25.5 | 2024-06-06 |
| claude-3-5-sonnet-20240620 | 13.0 | 14.0 | 26.3 | 17.3 | 2024-08-17 |
| claude-3-5-sonnet-20241022 | 9.5 | 12.7 | 29.0 | 16.3 | 2024-10-23 |
| gpt-4o-2024-05-13 | 7.5 | 8.3 | 26.3 | 13.4 | 2024-06-06 |
| gpt-4-turbo-2024-04-09 | 5.3 | 11.0 | 23.0 | 12.3 | 2024-06-06 |
| gemini-1.5-flash-001 | 0.5 | 4.3 | 26.3 | 9.4 | 2024-06-06 |
| CogVLM2-Llama3-chat-19B | 2.5 | 2.7 | 5.3 | 3.4 | 2024-06-06 |
| Gemini-1.0 Pro | 0.5 | 0.3 | 5.0 | 1.8 | 2024-06-06 |
| MiniCPM-Llama3-V 2.5 | 0.0 | 0.3 | 2.7 | 0.9 | 2024-06-06 |
| Qwen-VL-Chat | 0.0 | 0.0 | 0.0 | 0.0 | 2024-06-06 |
| PaliGemma-3B-896 | 0.0 | 0.0 | 0.0 | 0.0 | 2024-08-17 |
| PaliGemma-3B-mix-448 | 0.0 | 0.0 | 0.0 | 0.0 | 2024-08-17 |


Data Sample

We collect screenshots from the test sets of existing datasets across web, desktop, and mobile devices, together with additional screenshots collected using the AgentStudio toolkits. We augment the instructions into detailed, unambiguous ones with the help of GPT-4o. Altogether, this yields an 18K-sample UI grounding dataset. For efficient benchmarking, we conduct experiments on a subset, GroundUI-1K, which contains 400, 300, and 300 samples for web, desktop, and mobile devices, respectively.
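Grounding predictions are typically scored by whether the predicted coordinate lands inside the annotated element's bounding box. A minimal sketch of that accuracy computation, on illustrative data, is below; see the paper for the exact criterion used in GroundUI.

```python
# Minimal sketch of click-in-bounding-box accuracy, a common way to score UI
# grounding predictions (see the paper for the exact criterion used here).
def hit(pred_xy, bbox):
    """bbox = (left, top, right, bottom) in pixels."""
    x, y = pred_xy
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

preds = [(105, 220), (640, 80)]                    # illustrative predictions
bboxes = [(90, 200, 180, 240), (10, 10, 60, 40)]   # illustrative ground truth
accuracy = sum(hit(p, b) for p, b in zip(preds, bboxes)) / len(preds)
print(f"grounding accuracy: {accuracy:.1%}")       # 50.0%
```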

IDMBench Leaderboard

Unlocking the ability to learn from videos is a key capability for next-generation computer agents to achieve generalization and lifelong learning. Therefore, we present the first dataset designed to measure the ability to learn how to act from videos. Specifically, we evaluate current multimodal models as inverse dynamics models (IDMs), predicting actions from unlabeled videos in two settings. In the first (IDM-Single), we provide the model with two neighboring screenshots, one before and one after an action, and ask it to predict the action that occurred in between. In the second, more general setting (IDM-Multiple), we provide the model with the states surrounding multiple actions (which can be viewed as video frames) and ask it to predict all the actions within the frames.
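A sketch of how a multimodal model can be queried in the IDM-Single setting is shown below. The prompt wording and the OpenAI-style message format are assumptions made for illustration, not the exact protocol from the paper.

```python
# Illustrative IDM-Single query: show the model the screenshots before and
# after one action and ask which action was taken. Prompt wording and the
# OpenAI-style message format are assumptions, not the paper's exact protocol.
import base64

def image_block(path):
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "These are screenshots before and after a single action. "
                                 "Which action (e.g., click, type, scroll) was taken?"},
        image_block("before.png"),   # hypothetical file names
        image_block("after.png"),
    ],
}]
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
```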

IDM-Single (Accuracy %)


| Model | Mind2Web | AITW | VWA | AgentStudio | Total | Date |
|---|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 73.0 | 56.0 | 50.0 | 72.0 | 61.4 | 2024-08-17 |
| gpt-4o-2024-05-13 | 70.0 | 56.0 | 45.0 | 78.0 | 60.0 | 2024-06-06 |
| gemini-1.5-pro-001 | 62.0 | 51.0 | 46.0 | 48.0 | 52.3 | 2024-08-17 |
| gemini-1.5-flash-001 | 65.0 | 34.0 | 31.0 | 60.0 | 45.7 | 2024-08-17 |
| Qwen-VL-Chat | 37.0 | 20.0 | 5.0 | 20.0 | 20.6 | 2024-06-06 |


IDM-Multiple (Accuracy %)


| Model | Mind2Web | AITW | VWA | AgentStudio | Total | Date |
|---|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 18.0 | 8.0 | 7.0 | 22.2 | 12.5 | 2024-08-17 |
| gpt-4o-2024-05-13 | 13.0 | 8.0 | 2.0 | 20.0 | 9.3 | 2024-06-06 |
| gemini-1.5-pro-001 | 0.0 | 0.0 | 1.0 | 2.2 | 0.6 | 2024-08-17 |
| Qwen-VL-Chat | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2024-06-06 |
| gemini-1.5-flash-001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2024-08-17 |


IDM-Multiple (Edit Distance)


| Model | Mind2Web | AITW | VWA | AgentStudio | Total | Date |
|---|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 2.0 | 2.1 | 2.9 | 1.6 | 2.3 | 2024-08-17 |
| gpt-4o-2024-05-13 | 2.1 | 2.2 | 3.5 | 2.0 | 2.5 | 2024-06-06 |
| gemini-1.5-pro-001 | 6.0 | 4.4 | 7.0 | 3.8 | 5.5 | 2024-08-17 |
| Qwen-VL-Chat | 5.1 | 15.4 | 5.8 | 6.3 | 8.4 | 2024-06-06 |
| gemini-1.5-flash-001 | 294.5 | 7.2 | 7.2 | 7.8 | 90.6 | 2024-08-17 |
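The edit-distance numbers above compare predicted and ground-truth action sequences (lower is better). A minimal Levenshtein-distance sketch over action strings is shown below; how individual actions are normalized and matched in the benchmark may differ, so treat this as a conceptual illustration.

```python
# Minimal Levenshtein (edit) distance between a predicted and a ground-truth
# action sequence. How individual actions are normalized and matched in the
# benchmark may differ -- see the paper for details.
def edit_distance(pred, gold):
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance(["click(btn)", "type('hi')"],
                    ["click(btn)", "type('hi')", "press('enter')"]))  # 1
```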


Data Sample of IDM-Single



Data Sample of IDM-Multiple


CriticBench Leaderboard

The ability to self-evaluate and learn from environment interactions is one of the core abilities of agents. However, few existing benchmarks focus on measuring whether computer agents can judge if a trajectory is successful.
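As an illustration of the "observation-action pairs" setting below, a model can be prompted roughly as follows to output a success/failure verdict. The prompt and message layout are assumptions for illustration, not the exact protocol used in the benchmark.

```python
# Illustrative critic query: given the task instruction and the trajectory's
# (screenshot, action) pairs, ask for a success/failure verdict. The prompt
# and message layout are assumptions, not the benchmark's exact protocol.
def build_critic_messages(instruction, steps):
    """steps: list of (screenshot_block, action_str) pairs, where
    screenshot_block is an image content block for a chat-style API."""
    content = [{"type": "text",
                "text": (f"Task: {instruction}\n"
                         "Judge whether the trajectory below completed the task. "
                         "Answer 'success' or 'failure'.")}]
    for i, (screenshot_block, action) in enumerate(steps):
        content.append({"type": "text", "text": f"Step {i}: action = {action}"})
        content.append(screenshot_block)
    return [{"role": "user", "content": content}]
```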

With Observation-Action Pairs (Accuracy %)


| Model | Web | Desktop | Mobile | Total | Date |
|---|---|---|---|---|---|
| gemini-1.5-pro-001 | 75.3 | 88.9 | 70.0 | 76.7 | 2024-08-17 |
| gemini-1.5-flash-001 | 72.3 | 83.9 | 72.7 | 74.8 | 2024-08-17 |
| claude-3-5-sonnet-20240620 | 72.2 | 100.0 | 61.9 | 75.9 | 2024-08-17 |
| gpt-4o-2024-05-13 | 69.1 | 93.1 | 68.2 | 73.6 | 2024-06-06 |
| Qwen-VL-Chat | 51.7 | 48.7 | 49.0 | 50.2 | 2024-06-06 |


With Observations Only (Accuracy %)


| Model | Web | Desktop | Mobile | Total | Date |
|---|---|---|---|---|---|
| gemini-1.5-pro-001 | 68.8 | 89.7 | 65.5 | 72.5 | 2024-08-17 |
| gemini-1.5-flash-001 | 70.2 | 80.8 | 70.0 | 72.1 | 2024-08-17 |
| claude-3-5-sonnet-20240620 | 67.4 | 96.0 | 63.3 | 71.4 | 2024-08-17 |
| gpt-4o-2024-05-13 | 65.2 | 92.3 | 66.7 | 70.6 | 2024-06-06 |
| Qwen-VL-Chat | 53.1 | 59.2 | 51.0 | 53.4 | 2024-06-06 |


Data Sample

We collect trajectories from both existing environments, such as AITW, Mind2Web, and VisualWebArena, and AgentStudio's real-world environments, resulting in diverse trajectories from both humans and agents across web, desktop, and mobile environments. Since most human trajectories are successful, we balance the dataset by labeling partial trajectories as failure cases.

AgentStudio Environment

AgentStudio provides an interactive, realistic, and lightweight environment with generic observation and action spaces, enabling agents to interact with arbitrary software. The observation space incorporates multiple modalities, ranging from screen recordings (videos) and screenshots (images) to code execution results (text). Agents can act through human-computer interfaces (e.g., keyboard-mouse operations) to control third-party applications, and perform function calling to interact with APIs. These features expand the task space to massively open-domain and real-world tasks typically performed by humans. The interactive nature of online environments allows agents to learn through trial and error, which is enhanced by the language feedback on failure reasons provided by our environment.
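The interaction loop can be pictured roughly as follows. Every class and method name in this sketch is a hypothetical stand-in; the actual interfaces are defined in the AgentStudio repository.

```python
# Schematic agent-environment loop for the observation/action spaces described
# above. All names here are hypothetical stand-ins, not AgentStudio's API.
class DummyEnv:
    """Stand-in environment: screenshot/text observations, GUI or API actions."""
    def reset(self):
        return {"screenshot": None, "text": "terminal ready"}
    def step(self, action):
        obs = {"screenshot": None, "text": f"executed {action}"}
        done, feedback = True, "task evaluated as successful"  # language feedback
        return obs, done, feedback

class DummyAgent:
    def act(self, obs):
        return "click(x=100, y=200)"  # could also be a function/API call

env, agent = DummyEnv(), DummyAgent()
obs, done = env.reset(), False
while not done:
    action = agent.act(obs)
    obs, done, feedback = env.step(action)
print("feedback:", feedback)
```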

Comparisons with existing work:

AgentStudio Tools


Online Benchmark GUI

The figure below shows the process of running our online benchmark using the AgentStudio toolkits. You can select tasks, start executing them with agents, and evaluate agent performance afterwards. You can also easily create your own agents, tasks, and evaluators with our toolkits.


GroundUI Annotator

Here is an example of recording single-step GUI grounding data on macOS.


Trajectory Recorder/Editor

Here is an example video of recording and editing video trajectories with action labels.

Citation

If you find the data or code useful, please consider citing us:

@article{zheng2024agentstudio,
  title={AgentStudio: A Toolkit for Building General Virtual Agents},
  author={Longtao Zheng and Zhiyuan Huang and Zhenghai Xue and Xinrun Wang and Bo An and Shuicheng Yan},
  journal={arXiv preprint arXiv:2403.17918},
  year={2024}
}

Website template from SWE-bench