InquireMobile

Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning

Abstract

Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce InquireBench, a comprehensive benchmark specifically designed to evaluate mobile agents' capabilities in safe interaction and proactive inquiry with users, encompassing 5 categories and 22 sub-categories, on which most existing VLM-based agents demonstrate near-zero performance. In this paper, we aim to develop an interactive system that actively seeks human confirmation at critical decision points. To achieve this, we propose InquireMobile, a novel model inspired by reinforcement learning, featuring a two-stage training strategy and an interactive pre-action reasoning mechanism. Our model achieves a 46.8% improvement in inquiry success rate and the best overall task success rate among existing baselines on InquireBench. We will open-source all datasets, models, and evaluation code to facilitate development in both academia and industry.

Why InquireMobile?


An example of a high-stakes scenario, irreversible file deletion, that requires human confirmation before execution. In practice, situations requiring human assistance are widespread.

Benchmark Statistics


Distribution of our InquireBench dataset. The top three frequent apps are listed for each category.

Data Collection


Data collection pipeline of our InquireBench. We employ a random-walk approach to trigger potential inquiry scenarios in which the agent should seek human assistance.

Training Pipeline


Our training framework consists of two stages:

Stage 1: Cold Start using SFT

Format fine-tuning via supervised fine-tuning (SFT) to establish the model's structured output format.

Stage 2: RL using GRPO

Inquiry enhancement via GRPO training with verifiable rewards to improve the agent's interactive capabilities.
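The page does not spell out the verifiable reward used in GRPO training. A minimal sketch of what such a reward could look like, assuming it combines a format term (structured reasoning output) with a correctness term that rewards inquiring exactly when human confirmation is needed; the `<think>`/`<inquire>` tag names and the 0.5/1.0 weights are illustrative assumptions, not taken from the paper:

```python
def verifiable_reward(response: str, requires_inquiry: bool) -> float:
    """Hypothetical GRPO reward: format term + inquiry-correctness term.

    Assumes the agent wraps pre-action reasoning in <think>...</think> and
    signals a question to the user with <inquire>...</inquire>; both tag
    names are assumptions for illustration.
    """
    # Format term: reasoning must be present and well-formed.
    format_ok = "<think>" in response and "</think>" in response
    format_reward = 0.5 if format_ok else 0.0

    # Correctness term: inquire exactly when the scenario requires it,
    # and act autonomously otherwise.
    inquired = "<inquire>" in response and "</inquire>" in response
    accuracy_reward = 1.0 if inquired == requires_inquiry else 0.0

    return format_reward + accuracy_reward
```

Because both terms are rule-checkable, the reward needs no learned judge, which is what makes it "verifiable" in the GRPO sense.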

Main Results


Main results on InquireBench. ISR denotes the inquiry success rate and SR denotes the task success rate.
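To make the two metrics concrete, here is a small sketch of how ISR and SR could be computed over evaluation episodes; the field names and the convention that ISR is measured only over episodes that actually require an inquiry are assumptions about the evaluation setup:

```python
def compute_metrics(episodes: list[dict]) -> tuple[float, float]:
    """Compute inquiry success rate (ISR) and task success rate (SR).

    Each episode is a dict with boolean fields (names are assumptions):
      - 'needs_inquiry':       the scenario requires asking the user
      - 'inquired_correctly':  the agent asked at the right step
      - 'task_completed':      the full task finished successfully
    """
    inquiry_eps = [e for e in episodes if e["needs_inquiry"]]
    isr = sum(e["inquired_correctly"] for e in inquiry_eps) / len(inquiry_eps)
    sr = sum(e["task_completed"] for e in episodes) / len(episodes)
    return isr, sr
```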

Case Study


Comparison of InquireMobile and Qwen2.5-VL-3B-Instruct

    Task: "One of my socks is torn"

  • Qwen2.5-VL-3B-Instruct: Failed at the second step.
  • InquireMobile: Successfully inquired and then completed the whole task.