Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive human annotation, making them unsustainable at scale. To address this challenge, we propose AgentTrek, a scalable data synthesis pipeline that generates high-quality GUI agent trajectories by leveraging web tutorials. Our method automatically gathers tutorial-like texts from the internet, transforms them into task goals with step-by-step instructions, and employs a vision-language model (VLM) agent to simulate their execution in a real digital environment. A VLM-based evaluator verifies the correctness of the generated trajectories. We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over current models. Moreover, our approach is more cost-efficient than traditional human annotation methods. This work underscores the potential of guided replay with web tutorials as a viable strategy for large-scale GUI agent training, paving the way for more capable and autonomous digital agents.
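As a rough illustration of the tutorial-to-task transformation described above, the sketch below shows how an LLM could be prompted to rewrite a raw tutorial into a task goal plus step-by-step instructions. The prompt wording and the `chat` interface are illustrative assumptions, not the actual AgentTrek prompts.

```python
# Hedged sketch of the "tutorial -> task" transformation: an LLM is prompted to
# rewrite a raw tutorial into a goal plus step-by-step instructions.
# The prompt and the `chat` callable are illustrative placeholders.
import json

PARSE_PROMPT = """You will be given the text of a web tutorial.
Rewrite it as JSON with two fields:
  "goal": a single-sentence task goal,
  "steps": a list of concrete step-by-step instructions.
Tutorial:
{tutorial}
"""

def parse_tutorial(tutorial_text: str, chat) -> tuple[str, list[str]]:
    """`chat` is any callable that sends a prompt to an LLM and returns its reply."""
    reply = chat(PARSE_PROMPT.format(tutorial=tutorial_text))
    data = json.loads(reply)            # assumes the model returns valid JSON
    return data["goal"], data["steps"]
```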
Overview of the AgentTrek Pipeline:
AgentTrek is a large-scale multimodal agent trajectory dataset collected from web tutorials. The dataset contains two types of trajectories: vision-based trajectories built around screenshots, and text-based trajectories built around textual observations such as HTML and AXTree snapshots.
With our AgentTrek pipeline, we generate large-scale trajectory data that excels in three areas. First, the dataset offers extensive diversity, covering multiple domains and task types, and the internet-sourced tutorials provide detailed guidance that improves task execution: in our experiments, agents following these detailed instructions achieved a 230% performance increase. Second, the data is gathered from real-world web environments, avoiding the limitations of simulated ones. Starting from RedPajama, we filtered and processed 23,430 tutorials, producing 10,398 successful trajectories across 127 websites.
Third, the data is comprehensive, capturing both high- and low-level task details, including DOM/HTML structures, AXTree snapshots, video recordings, and screenshots. This rich observation data improves the agent's performance on long-horizon tasks, and at a per-trajectory cost of just $0.551, our pipeline offers an efficient, scalable alternative to manual data collection.
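As a rough picture of what one recorded step could contain, the dataclass sketch below mirrors the observation types listed above (DOM/HTML, AXTree, screenshots, video); the field names are our own illustrative assumptions, not the released schema.

```python
# Hedged sketch of a per-step record; field names are illustrative, not the
# actual AgentTrek release schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StepRecord:
    instruction: str                 # low-level instruction guiding this step
    action: str                      # executed action, e.g. "click(element_id)"
    html: str                        # DOM/HTML snapshot before the action
    axtree: str                      # accessibility-tree (AXTree) snapshot
    screenshot_path: str             # saved screenshot image for this step
    video_offset_s: Optional[float] = None  # offset into the trajectory video

@dataclass
class TrajectoryRecord:
    task_goal: str                   # high-level goal distilled from the tutorial
    steps: list[StepRecord] = field(default_factory=list)
    success: bool = False            # verdict from the VLM-based evaluator
```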
All trajectories are automatically collected and filtered by our pipeline, ensuring high quality and diversity. The dataset will be released soon to facilitate research in GUI agents.
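For a mental model of this collection-and-filtering loop, here is a minimal sketch of guided replay that reuses the `StepRecord`/`TrajectoryRecord` sketch above; the `parse_tutorial`, `env`, `vlm_agent`, and `vlm_judge` interfaces are hypothetical placeholders, not the AgentTrek codebase.

```python
# Hedged sketch of the guided-replay loop: a VLM agent follows tutorial steps in
# a live browser environment, and a VLM judge keeps only verified trajectories.
# All interfaces passed in here are hypothetical placeholders.
def synthesize_trajectories(tutorials, parse_tutorial, env, vlm_agent, vlm_judge):
    kept = []
    for text in tutorials:
        # parse_tutorial: raw text -> (goal, step-by-step instructions),
        # e.g. the parsing sketch above with the LLM client bound in.
        goal, instructions = parse_tutorial(text)
        record = TrajectoryRecord(task_goal=goal)
        obs = env.reset(goal)                            # open the relevant website
        for instruction in instructions:
            action = vlm_agent.act(obs, goal, instruction)   # grounded GUI action
            obs = env.step(action)                       # execute in the real environment
            record.steps.append(StepRecord(
                instruction=instruction,
                action=action,
                html=obs["html"],
                axtree=obs["axtree"],
                screenshot_path=obs["screenshot_path"],
            ))
        record.success = vlm_judge.verify(goal, record)  # VLM evaluator filters failures
        if record.success:
            kept.append(record)
    return kept
```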
AgentTrek collects large-scale multimodal agent trajectories from the internet. We fine-tune a VLM on the vision-based trajectory data to obtain a pure vision-based agent, which we evaluate on Mind2Web and ScreenSpot, and we fine-tune an LLM on the text-based trajectory data to obtain a pure text-based agent, which we evaluate on WebArena.
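To make the two training setups concrete, below is a hedged sketch of turning one recorded step into supervised fine-tuning samples: the vision-based agent sees the screenshot, while the text-based agent sees the AXTree observation. The chat-message schema is generic and chosen for illustration, and the `step` fields follow the `StepRecord` sketch above rather than the released data format.

```python
# Hedged sketch: turning a recorded step into SFT samples for the two agents.
# The message schema is generic and illustrative, not the exact training format.
def to_vision_sample(step, task_goal):
    """Vision-based agent: screenshot observation -> next action."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "path": step.screenshot_path},
                {"type": "text",
                 "text": f"Goal: {task_goal}\nPredict the next GUI action."},
            ]},
            {"role": "assistant", "content": step.action},
        ]
    }

def to_text_sample(step, task_goal):
    """Text-based agent: AXTree observation -> next action."""
    return {
        "messages": [
            {"role": "user",
             "content": (f"Goal: {task_goal}\n"
                         f"Observation (AXTree):\n{step.axtree}\n"
                         "Predict the next action.")},
            {"role": "assistant", "content": step.action},
        ]
    }
```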
Results of text-based agents on WebArena (task success rate, %):

| Model | Score |
|---|---|
| CodeLlama-7B-Instruct | 0.00 |
| Llama3-chat-8B | 3.32 |
| Qwen2.5-7B-Instruct | 3.80 |
| Llama3-chat-70B | 7.02 |
| GPT-4o | 13.10 |
| GPT-4 | 14.41 |
| Synatra-CodeLlama-7B | 6.28 |
| AutoWebGLM (OOD SFT) | 8.50 |
| AutoWebGLM (In-domain RFT) | 18.20 |
| Qwen2.5-7B-Instruct w/ AgentTrek | 10.46 |
| Qwen2.5-32B-Instruct w/ AgentTrek | 16.26 |
Performance comparison across different methods and evaluation settings on Mind2Web. 'H', 'I', 'AT', and 'M2W' stand for HTML, Image, AgentTrek, and Mind2Web, respectively.
| Obs | Model | Method | Cross-Task Ele.Acc | Cross-Task Op.F1 | Cross-Task Step SR | Cross-Website Ele.Acc | Cross-Website Op.F1 | Cross-Website Step SR | Cross-Domain Ele.Acc | Cross-Domain Op.F1 | Cross-Domain Step SR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HTML | GPT-3.5 | Choice | 19.4 | 59.2 | 16.8 | 14.9 | 56.5 | 14.1 | 25.2 | 57.9 | 24.1 |
| HTML | GPT-4 | Choice | 40.8 | 63.1 | 32.3 | 30.2 | 61.0 | 27.0 | 35.4 | 61.9 | 29.7 |
| H + I | GPT-4 | Choice | 46.4 | 73.4 | 40.2 | 38.0 | 67.8 | 32.4 | 42.4 | 69.3 | 36.8 |
| H + I | GPT-4 | SoM | 29.6 | - | 20.3 | 20.1 | - | 13.9 | 27.0 | - | 23.7 |
| Image | Qwen2-VL + AT | Vision | 45.5 | 84.9 | 40.9 | 40.8 | 82.8 | 35.1 | 48.6 | 84.1 | 42.1 |
| Image | Qwen2-VL + M2W | Vision | 54.8 | 89.5 | 50.9 | 52.9 | 83.9 | 44.9 | 51.8 | 86.8 | 47.7 |
| Image | Qwen2-VL + AT + M2W | Vision | 60.8 | 88.9 | 55.7 | 57.6 | 88.1 | 51.4 | 56.0 | 87.5 | 52.6 |
Grounding accuracy (%) on ScreenSpot:

| Model | Text | Icon/Widget | Average |
|---|---|---|---|
| GPT-4 | 9.2 | 8.8 | 9.0 |
| GPT-4o | 12.2 | 7.8 | 10.1 |
| Qwen2-VL | 35.2 | 25.7 | 30.7 |
| SeeClick | 55.7 | 32.5 | 44.7 |
| CogAgent | 70.4 | 28.6 | 50.7 |
| GPT-4 + OmniParser | 81.3 | 51.0 | 67.0 |
| Qwen2-VL-7B w/ AgentTrek | 81.7 | 51.5 | 67.4 |
@article{xu2024agenttrek,
  title={AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials},
  author={Yiheng Xu and Dunjie Lu and Zhennan Shen and Junli Wang and Zekun Wang and Yuchen Mao and Caiming Xiong and Tao Yu},
  year={2024},
  eprint={2412.09605},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.09605}
}