Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

1The Chinese University of Hong Kong, 2AWS Agentic AI, 3Amazon Web Services, 4AWS Robotics
Teaser Image

We introduce Talk2Move, a text-guided scene editing model for object-level geometric transformation. It focuses on object translation, rotation, and resizing, and achieves superior results over current state-of-the-art image editing models.

Abstract

We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations (such as translating, rotating, or resizing objects) due to scarce paired supervision and the limitations of pixel-level optimization. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial-reward-guided model aligns geometric transformations with the linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that directly evaluate displacement, rotation, and scaling behaviors, enabling interpretable and coherent transformations.

Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.

Video

Problem Formulation

We study text-guided geometric transformation for object-level scene editing, where the goal is to modify an object’s position, orientation, and scale in an image according to a text instruction, while keeping the scene static and unedited regions consistent. We focus on three fundamental spatial operations: object translation, rotation, and resizing, which together span the core dimensions of geometric object transformation. To regularize the instruction–action space, we use a set of predefined transformation templates for translation, rotation, and resizing. Each template specifies a target object and a spatial relation (e.g., “move the mug to the left”), providing a consistent basis for supervision and evaluation.
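To make the template idea concrete, here is a minimal sketch of how such instruction templates might look. The template strings, slot names, and the `build_instruction` helper are illustrative assumptions, not the paper's actual template set.

```python
# Hypothetical transformation templates for the three spatial operations.
# Slot names (object, direction, angle, factor) are illustrative assumptions.
TEMPLATES = {
    "translate": "move the {object} to the {direction}",
    "rotate": "rotate the {object} {angle} degrees {direction}",
    "resize": "make the {object} {factor} times {size_change}",
}

def build_instruction(op: str, **slots) -> str:
    """Fill a predefined template with a target object and spatial relation."""
    return TEMPLATES[op].format(**slots)

print(build_instruction("translate", object="mug", direction="left"))
# prints "move the mug to the left"
```

Fixing the instruction space to a small set of templates like this gives each training example an unambiguous geometric target, which is what makes consistent supervision and evaluation possible.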

System Pipeline

Pipeline Image

Talk2Move implements a GRPO-style reinforcement learning pipeline tailored for flow-based image editing. Starting from an initial noise sample, stochastic perturbations are injected at each diffusion step to generate diverse sampling trajectories. Spatially grounded rewards from specialist models, which explicitly evaluate object-level geometric changes, are then used to compute group-relative advantages for policy gradient updates.
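The group-relative advantage at the heart of GRPO can be sketched as follows: each rollout's reward is normalized against the statistics of its own group, so no learned value function is needed. This is a generic GRPO-style sketch under standard assumptions, not the paper's exact implementation.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages for one group of rollouts.

    Each rollout's spatial reward is centered and scaled by the group's
    mean and standard deviation; the epsilon guards against a group where
    all rollouts received identical rewards.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy example: three rollouts of the same edit scored by a spatial reward.
adv = group_relative_advantages([0.2, 0.5, 0.8])
```

Rollouts that move the object closer to the instructed pose than their group peers get positive advantages and are reinforced; below-average rollouts are suppressed, without any paired ground-truth edits.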

To improve training efficiency, we propose Step-wise Active Sampling, which selectively optimizes the most informative diffusion steps and takes shortcut steps for early exit during rollout sampling. This strategy significantly reduces computational cost while preserving effective learning.

Active Sampling Image

Comparison of sampling strategies: (a) full GRPO sampling (FlowGRPO, DanceGRPO), (b) sliding-window optimization (MixGRPO, FlowGRPO-fast), and (c) our step-wise active sampling that focuses on informative steps and skips redundant ones for faster training.
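The step-selection idea can be sketched as a simple top-k filter: score every diffusion step by some informativeness measure and optimize only the highest-scoring ones, skipping the rest. The variance-based scoring here is an assumption for illustration; the paper's actual criterion may differ.

```python
import numpy as np

def select_active_steps(step_scores, k):
    """Pick the k most informative diffusion steps.

    `step_scores` is one informativeness score per diffusion step, e.g.
    the variance of rewards across rollouts at that step (an assumed
    criterion). Unselected steps are skipped or taken as shortcut steps.
    """
    top_k = np.argsort(step_scores)[::-1][:k]
    return sorted(top_k.tolist())

# Toy example: 4 diffusion steps, keep the 2 with the highest scores.
active = select_active_steps([0.1, 0.9, 0.3, 0.7], k=2)
# active == [1, 3]
```

Compared with full sampling (optimizing every step) or a sliding window (optimizing a contiguous chunk), this selects an arbitrary subset, so compute is concentrated on the stages where the policy gradient signal is strongest.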

Results

Quantitative Comparison

Quantitative results on object translation, rotation, and resizing against state-of-the-art image editing models. We report editing accuracy and errors on the three subtasks. For each task, we provide results on both synthetic and real images to showcase the generalization ability of Talk2Move.

Quantitative Synthetic Image
Quantitative Real Image

Ablation: SFT vs. RL

  • With sufficient data (800 images), SFT provides a strong initialization, while RL further improves both accuracy and translation distance.
  • Under limited data (~10%), SFT fails to achieve meaningful gains, whereas RL maintains performance comparable to the full-data setting.
Ablation: SFT vs RL

Efficiency Analysis

We analyze the efficiency of active step sampling under the translation task. By reducing the number of sampled diffusion steps, our method significantly accelerates training while maintaining editing correctness.

  • Reduces total training time by 49% compared to full sampling
  • Achieves an additional 14% speedup over sliding-window sampling
  • Maintains comparable translation distance and accuracy
Efficiency Analysis: Active Step Sampling

Qualitative Comparison

Qualitative results on object translation, rotation, and resizing against state-of-the-art image editing models. For each task, we provide one real-image editing result (sourced from OpenImagesV6) and one synthetic-image editing result to showcase the generalization ability of Talk2Move.

Qualitative Image
More qualitative results for object translation
Supplementary Qualitative Image Translation
More qualitative results for object rotation
Supplementary Qualitative Image Rotation
More qualitative results for object resizing
Supplementary Qualitative Image Resize

BibTeX

@misc{tan2026talk2movereinforcementlearningtextinstructed,
      title={Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes}, 
      author={Jing Tan and Zhaoyang Zhang and Yantao Shen and Jiarui Cai and Shuo Yang and Jiajun Wu and Wei Xia and Zhuowen Tu and Stefano Soatto},
      year={2026},
      eprint={2601.02356},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.02356}, 
}