Unsupervised Skill Policy Learning Based on Goal-Conditioned Reinforcement Learning

Abstract: Mutual-information-based unsupervised skill learning algorithms couple an agent's exploration process with its skill learning process, which can lead to limited exploration in complex environments and long-horizon decision-making tasks. This paper systematically explains the causes of this limitation through theoretical analysis and experimental validation, and proposes an unsupervised skill policy learning method based on goal-conditioned reinforcement learning. First, exploration is decoupled from skill learning by taking the goal space to be the primitive skill space; generalization across goals improves the learning efficiency of the goal-conditioned policy, thereby optimizing the exploration process and yielding an initial exploration policy. The policy is then fine-tuned with a Go-Explore-style interaction scheme to further improve skill policy learning. In addition, new quality evaluation metrics are constructed from state-space coverage and skill consistency to comprehensively assess the overall performance of skill policies. Experiments on four classic maze maps show that the proposed two-stage method effectively overcomes the exploration limitation and accelerates skill learning, achieving an average improvement of 72.1% over existing methods under the new metrics.
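The two-stage scheme is only summarized in the abstract. Purely as a toy illustration, and not the paper's implementation, the sketch below pretrains a goal-conditioned tabular Q-policy on a small open grid by sampling goals over the state space, then fine-tunes it with a Go-Explore-style loop that restarts rollouts from previously reached states. The grid environment, hyperparameters, and the tabular simplification are all assumptions made for illustration.

```python
# Hypothetical sketch: goal-conditioned pretraining followed by Go-Explore-style
# fine-tuning on a toy grid. Environment, names, and hyperparameters are
# illustrative assumptions, not the paper's implementation.
import random
from collections import defaultdict

SIZE = 7                      # 7x7 open grid, no walls (toy stand-in for a maze map)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def step(state, action):
    """Deterministic grid transition clipped to the map boundary."""
    x, y = state
    dx, dy = ACTIONS[action]
    return (min(max(x + dx, 0), SIZE - 1), min(max(y + dy, 0), SIZE - 1))

# Q[(state, goal)][action]: a tabular stand-in for a goal-conditioned policy.
# (A neural goal-conditioned policy would also generalize across nearby goals;
# the tabular toy does not.)
Q = defaultdict(lambda: [0.0] * len(ACTIONS))

def act(state, goal, eps):
    if random.random() < eps:
        return random.randrange(len(ACTIONS))
    values = Q[(state, goal)]
    return values.index(max(values))

def rollout(start, goal, eps, horizon=30, alpha=0.5, gamma=0.95):
    """Run one goal-conditioned episode and update Q along the way."""
    state, visited = start, [start]
    for _ in range(horizon):
        a = act(state, goal, eps)
        nxt = step(state, a)
        reward = 1.0 if nxt == goal else 0.0          # sparse goal-reaching reward
        target = reward + gamma * max(Q[(nxt, goal)]) * (reward == 0.0)
        Q[(state, goal)][a] += alpha * (target - Q[(state, goal)][a])
        state = nxt
        visited.append(state)
        if reward > 0.0:
            break
    return visited

# Stage 1: exploration via goal-conditioned learning, goals sampled over the state space.
archive = {(0, 0)}                                    # states reached so far
for _ in range(3000):
    goal = (random.randrange(SIZE), random.randrange(SIZE))
    archive.update(rollout((0, 0), goal, eps=0.3))

# Stage 2: Go-Explore-style fine-tuning -- "return" to an already reached state
# (here simply by restarting the rollout there), then explore and refine from it.
for _ in range(1000):
    start = random.choice(sorted(archive))
    goal = (random.randrange(SIZE), random.randrange(SIZE))
    archive.update(rollout(start, goal, eps=0.1))

print(f"states reached: {len(archive)} / {SIZE * SIZE}")
```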

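The abstract also introduces evaluation metrics built from state-space coverage and skill consistency without giving their definitions. The snippet below shows one plausible way such quantities could be computed from discretized skill rollouts; the specific definitions and the product-based combination are assumptions, not the paper's metric.

```python
# Hypothetical coverage/consistency score for skill rollouts on a discretized
# maze; the product-based combination is an assumption, not the paper's metric.
import numpy as np

def coverage(trajectories, num_cells):
    """Fraction of discrete maze cells visited by any skill's trajectories."""
    visited = {cell for traj in trajectories for cell in traj}
    return len(visited) / num_cells

def skill_consistency(final_cells_per_skill):
    """Mean frequency of each skill's most common terminal cell across rollouts."""
    scores = []
    for cells in final_cells_per_skill:
        _, counts = np.unique(cells, return_counts=True)
        scores.append(counts.max() / len(cells))
    return float(np.mean(scores))

# Toy usage: 2 skills, 3 rollouts each, cells indexed 0..24 on a 5x5 maze.
rollouts = {
    0: [[0, 1, 2, 7], [0, 1, 2, 7], [0, 1, 6, 7]],
    1: [[0, 5, 10, 15], [0, 5, 10, 15], [0, 5, 10, 11]],
}
all_trajs = [t for trajs in rollouts.values() for t in trajs]
final_cells = [[t[-1] for t in trajs] for trajs in rollouts.values()]
score = coverage(all_trajs, num_cells=25) * skill_consistency(final_cells)
print(f"coverage-consistency score: {score:.3f}")
```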
     
