Long-horizon Task Planning Based on Multi-modal Diffusion Policy

LUO Jiayuan; LIU Zeyang; LAN Xuguang

doi:10.13973/j.cnki.robot.250192

LUO Jiayuan, LIU Zeyang, LAN Xuguang. Long-horizon Task Planning Based on Multi-modal Diffusion Policy[J]. ROBOT, 2025, 47(4): 548-558. DOI: 10.13973/j.cnki.robot.250192

Abstract

Long-horizon robotic manipulation poses three challenges: the action sequences available for offline skill learning are diverse, the mapping between natural language instructions and long-horizon task semantics is complex, and the information density is high. To address these challenges, a long-horizon task planning algorithm based on a multi-modal diffusion policy (named MMDPP) is proposed to improve the task completion rate and robustness in complex environments. The method uses a large vision-language model to transform natural language tasks into structured task elements, introduces a multi-modal fusion module that models the low-dimensional state, image observations and task semantics in a unified way, and uses selective channels to reduce gradient conflict and cross-interference between gradients. On this basis, a conditional diffusion generation model is constructed to directly output structurally consistent and task-aligned action sequences, realizing end-to-end policy planning from language input to action prediction. In the MuJoCo-Kitchen-Image kitchen environment (a self-constructed dataset), MMDPP significantly outperforms the baseline method in long-horizon task success rate; in the Robosuite-Kitchen environment, it surpasses SiMPL by 2.4%; and on a UR5 physical robot platform it achieves an 80% success rate in table-top rearrangement tasks, demonstrating good accuracy and real-world adaptability in manipulation tasks. The proposed method significantly enhances the adaptability of action policy learning to task changes in long-horizon tasks, providing an effective paradigm for long-horizon robot planning based on diffusion modeling.
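The abstract does not give the network architecture or the sampler, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a standard DDPM-style reverse process and a simple sigmoid-gated concatenation as a stand-in for the "selective channels". All names (`fuse`, `make_schedule`, `toy_denoiser`, `sample_actions`), the feature sizes and the noise schedule are hypothetical, and the denoiser is a placeholder where a trained noise-prediction network would go.

```python
import numpy as np


def fuse(state, image_feat, task_feat, gate_logits):
    """Scale each modality by a sigmoid gate, then concatenate.

    Hypothetical stand-in for the fusion module with selective channels;
    the real module is not specified in the abstract.
    """
    gates = 1.0 / (1.0 + np.exp(-np.asarray(gate_logits, dtype=float)))
    parts = [g * np.asarray(x, dtype=float)
             for g, x in zip(gates, (state, image_feat, task_feat))]
    return np.concatenate(parts)


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption) and the derived alpha terms."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_denoiser(x, t, cond):
    """Placeholder for a learned noise predictor eps(x, t, cond)."""
    return 0.1 * x + 0.01 * cond.mean()


def sample_actions(cond, horizon, action_dim, num_steps=50, seed=0,
                   denoiser=toy_denoiser):
    """Draw one action sequence by running the DDPM reverse process."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal((horizon, action_dim))  # start from pure noise
    for t in range(num_steps - 1, -1, -1):
        eps = denoiser(x, t, cond)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = mean + np.sqrt(betas[t]) * noise
    return x


if __name__ == "__main__":
    cond = fuse(np.ones(4), np.ones(8), np.ones(3), gate_logits=np.zeros(3))
    actions = sample_actions(cond, horizon=16, action_dim=7)
    print(cond.shape, actions.shape)
```

In this sketch the conditioning vector is fixed for the whole sampling run, so the entire action sequence is denoised jointly under one task description. That mirrors the abstract's description of outputting a full action sequence from language input, under the assumptions stated above.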
