AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

1University of Hong Kong 2Salesforce Research
*Equal contribution

Abstract

Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive human annotation, making them unsustainable at scale. To address this challenge, we propose AgentTrek, a scalable data synthesis pipeline that generates high-quality GUI agent trajectories by leveraging web tutorials. Our method automatically gathers tutorial-like texts from the internet, transforms them into task goals with step-by-step instructions, and employs a vision-language model (VLM) agent to simulate their execution in a real digital environment. A VLM-based evaluator ensures the correctness of the generated trajectories. We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over existing models. Moreover, our approach is more cost-efficient than traditional human annotation methods. This work underscores the potential of guided replay with web tutorials as a viable strategy for large-scale GUI agent training, paving the way for more capable and autonomous digital agents.

Overview of the Pipeline

Overview of the AgentTrek Pipeline:

  1. Automatic Tutorial Collection from the Internet: Tutorial-related data is extracted and filtered from internet sources using heuristic methods and a FastText classifier. An LLM then processes the filtered text, transforming it into structured tutorials.
  2. Trajectory Data Collection via Guided Replay: A VLM agent interacts with a real digital environment under the guidance of these tutorials, while high-quality trajectory data, including observations, actions, and reasoning, is collected. A second VLM acts as an evaluator, judging each replay to further improve the quality of the synthetic dataset.
  3. Training and Fine-tuning with Replay Data: The collected trajectory data is used to train and fine-tune GUI agent models, which are evaluated on standard agent benchmarks and show significant improvements.
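As a rough illustration of the heuristic pre-filter in step 1 (applied before the FastText classifier), a cheap keyword-and-structure check might look like the sketch below. The cue list and thresholds are illustrative assumptions, not the paper's actual filter:

```python
import re

# Illustrative cue phrases; the real pipeline's heuristics are not specified here.
TUTORIAL_CUES = ("step 1", "click", "navigate to", "how to", "select", "enter")

def looks_like_tutorial(text: str, min_cues: int = 2, min_steps: int = 3) -> bool:
    """Cheap heuristic pre-filter run before the FastText classifier."""
    lowered = text.lower()
    cue_hits = sum(cue in lowered for cue in TUTORIAL_CUES)
    # Count enumerated steps such as "1." or "2)" at line starts.
    steps = len(re.findall(r"(?m)^\s*\d+[.)]", text))
    return cue_hits >= min_cues and steps >= min_steps

doc = "How to export a CSV:\n1. Click File\n2. Select Export\n3. Click Save"
print(looks_like_tutorial(doc))  # True under these illustrative thresholds
```

Documents that pass this coarse gate would then be scored by the trained FastText model, and an LLM would restructure the survivors into task goals with step-by-step instructions.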

AgentTrek Dataset

AgentTrek is a large-scale multimodal agent trajectory dataset collected from web tutorials. The dataset contains two types of trajectories:

  • Text-based Trajectories: We collect 100K text-based trajectories from web tutorials, which contain step-by-step instructions and corresponding HTML observations. These trajectories are used to train pure text-based agents.
  • Vision-based Trajectories: We collect 50K vision-based trajectories by executing the text-based trajectories in real digital environments. Each trajectory contains a sequence of screenshots and corresponding actions. These trajectories are used to train pure vision-based agents.
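To make the two trajectory types concrete, the sketch below shows one plausible record layout for a replayed trajectory. The field names are assumptions for exposition, not the released dataset's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    screenshot_path: str  # vision-based observation for this step
    axtree: str           # accessibility-tree / HTML snapshot (text-based observation)
    action: str           # executed action, e.g. "click [id=42]"
    reasoning: str        # agent's intermediate reasoning for this step

@dataclass
class Trajectory:
    task_goal: str
    source_url: str                     # tutorial that guided the replay
    steps: list[Step] = field(default_factory=list)
    success: bool = False               # verdict from the VLM evaluator

traj = Trajectory(task_goal="Export a sheet as CSV",
                  source_url="https://example.com/tutorial")
traj.steps.append(Step("shot_0.png", "<axtree>", "click [id=7]", "Open the File menu"))
traj.success = True
print(len(traj.steps), traj.success)  # 1 True
```

A text-based trajectory would keep only the `axtree`/HTML observations, while a vision-based one relies on the screenshots collected during replay.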

With our AgentTrek pipeline, we generate large-scale trajectory data that excels in three areas. First, the dataset offers extensive diversity, covering multiple domains and task types, and the internet-sourced tutorials guide task execution effectively: our experiments show a 230% performance increase when agents follow detailed instructions. Second, the data is gathered from real-world web environments rather than simulations. Starting from RedPajama, we filtered and processed 23,430 tutorials, producing 10,398 successful trajectories across 127 websites.

Dataset Statistics and Examples

Third, the data is comprehensive, capturing both high- and low-level task details, including DOM/HTML structures, AXTree snapshots, video recordings, and screenshots. This rich signal improves agent performance on long-horizon tasks, and at a per-trajectory cost of just $0.551, our pipeline offers an efficient, scalable alternative to human annotation.
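As a back-of-envelope check on scale, the quoted per-trajectory cost and the 10,398 successful trajectories mentioned above imply the following total pipeline cost (figures from the text; the multiplication is ours):

```python
# Back-of-envelope cost estimate using the figures quoted in the text.
per_trajectory_usd = 0.551
successful_trajectories = 10_398

total_usd = per_trajectory_usd * successful_trajectories
print(f"${total_usd:,.2f}")  # roughly $5,729 for the full set
```

For comparison, human annotation of multi-step GUI trajectories typically costs orders of magnitude more per example.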

Dataset Examples and Analysis

All trajectories are automatically collected and filtered by our pipeline, ensuring high quality and diversity. The dataset will be released soon to facilitate research in GUI agents.

Experiments

AgentTrek collects large-scale multimodal agent trajectories from the internet. We fine-tune a VLM on the vision-based trajectory data to obtain a pure vision-based agent, which we evaluate on Mind2Web and ScreenSpot. We also fine-tune an LLM on the text-based trajectory data to obtain a pure text-based agent, which we evaluate on WebArena.

WebArena

| Model | Score |
|---|---|
| CodeLlama-7B-Instruct | 0.00 |
| LLaMa3-chat-8B | 3.32 |
| Qwen2.5-7B-Instruct | 3.80 |
| LLama3-chat-70B | 7.02 |
| GPT-4o | 13.10 |
| GPT-4 | 14.41 |
| Synatra-CodeLlama-7B | 6.28 |
| AutoWebGLM (OOD SFT) | 8.50 |
| AutoWebGLM (In-domain RFT) | 18.20 |
| Qwen2.5-7B-Instruct w/ AgentTrek | 10.46 |
| Qwen2.5-32B-Instruct w/ AgentTrek | 16.26 |

Mind2Web

Performance comparison across methods and evaluation settings. 'H', 'I', 'AT', and 'M2W' stand for HTML, Image, AgentTrek, and Mind2Web, respectively.

| Obs | Model | Method | Cross-Task Ele.Acc | Cross-Task Op.F1 | Cross-Task Step SR | Cross-Website Ele.Acc | Cross-Website Op.F1 | Cross-Website Step SR | Cross-Domain Ele.Acc | Cross-Domain Op.F1 | Cross-Domain Step SR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HTML | GPT-3.5 | Choice | 19.4 | 59.2 | 16.8 | 14.9 | 56.5 | 14.1 | 25.2 | 57.9 | 24.1 |
| HTML | GPT-4 | Choice | 40.8 | 63.1 | 32.3 | 30.2 | 61.0 | 27.0 | 35.4 | 61.9 | 29.7 |
| H + I | GPT-4 | Choice | 46.4 | 73.4 | 40.2 | 38.0 | 67.8 | 32.4 | 42.4 | 69.3 | 36.8 |
| H + I | GPT-4 | SoM | 29.6 | - | 20.3 | 20.1 | - | 13.9 | 27.0 | - | 23.7 |
| Image | Qwen2-VL + AT | Vision | 45.5 | 84.9 | 40.9 | 40.8 | 82.8 | 35.1 | 48.6 | 84.1 | 42.1 |
| Image | Qwen2-VL + M2W | Vision | 54.8 | 89.5 | 50.9 | 52.9 | 83.9 | 44.9 | 51.8 | 86.8 | 47.7 |
| Image | Qwen2-VL + AT + M2W | Vision | 60.8 | 88.9 | 55.7 | 57.6 | 88.1 | 51.4 | 56.0 | 87.5 | 52.6 |

Grounding Performance on ScreenSpot Web

| Model | Text | Icon/Widget | Average |
|---|---|---|---|
| GPT-4 | 9.2 | 8.8 | 9.0 |
| GPT-4o | 12.2 | 7.8 | 10.1 |
| Qwen2-VL | 35.2 | 25.7 | 30.7 |
| SeeClick | 55.7 | 32.5 | 44.7 |
| CogAgent | 70.4 | 28.6 | 50.7 |
| GPT-4 + OmniParser | 81.3 | 51.0 | 67.0 |
| Qwen2-VL-7B w/ AgentTrek | 81.7 | 51.5 | 67.4 |

BibTeX

@article{xu2024agenttrek,
  author        = {Yiheng Xu and Dunjie Lu and Zhennan Shen and Junli Wang and Zekun Wang and Yuchen Mao and Caiming Xiong and Tao Yu},
  title         = {AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials},
  year          = {2024},
  eprint        = {2412.09605},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2412.09605}
}