Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning

ICML 2025

1Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, Republic of Korea; 2Department of Artificial Intelligence, Sungkyunkwan University, Suwon, Republic of Korea

Abstract

Existing offline hierarchical reinforcement learning (HRL) methods rely on high-level policy learning to generate subgoal sequences. However, their efficiency degrades as task horizons increase, and they lack effective strategies for stitching useful state transitions across different trajectories. We propose Graph-Assisted Stitching (GAS), a novel framework that formulates subgoal selection as a graph search problem rather than learning an explicit high-level policy. By embedding states into a Temporal Distance Representation (TDR) space, GAS clusters semantically similar states from different trajectories into unified graph nodes, enabling efficient transition stitching. A shortest-path algorithm is then applied to select subgoal sequences within the graph, while a low-level policy learns to reach the subgoals. To improve graph quality, we introduce the Temporal Efficiency (TE) metric, which filters out noisy or inefficient transition states, significantly enhancing task performance. GAS outperforms prior offline HRL methods across locomotion, navigation, and manipulation tasks. Notably, in the most stitching-critical task, it achieves a score of 88.3, dramatically surpassing the previous state-of-the-art score of 1.0. Our source code is available at: https://github.com/qortmdgh4141/GAS.


Why is Trajectory Stitching Crucial in Offline HRL?

Trajectory Stitching illustration

The GAS Framework

Overview

Overview illustration
(1) A TDR is pretrained from an offline dataset. (2) A graph is constructed by selecting only high-TE states based on the TE metric. (3) A TD-based subgoal-conditioned low-level policy is trained using all states. (4) The graph is utilized for task planning and subgoal selection, while action execution is performed by the low-level policy.
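
Below is a minimal sketch of step (4), subgoal selection by shortest-path search over the graph. It assumes node centers are stored as TDR-space embeddings \( \psi(s) \) (introduced in Key Ideas below) and that edges are weighted by temporal distance; `networkx` and the helper function are illustrative choices, not the released implementation.

    import numpy as np
    import networkx as nx

    def select_subgoal(node_embs, edges, h_cur, h_goal):
        """Return the TDR embedding of the next subgoal: the first node on the
        shortest path from the node nearest the current state to the node nearest
        the goal.  node_embs: (N, d) array of node centers; edges: (i, j, weight) list."""
        G = nx.Graph()
        G.add_nodes_from(range(len(node_embs)))
        for i, j, w in edges:                            # weight = temporal distance between centers
            G.add_edge(i, j, weight=w)
        cur = int(np.argmin(np.linalg.norm(node_embs - h_cur, axis=1)))
        goal = int(np.argmin(np.linalg.norm(node_embs - h_goal, axis=1)))
        path = nx.shortest_path(G, source=cur, target=goal, weight="weight")
        nxt = path[1] if len(path) > 1 else path[0]      # already at the goal node
        return node_embs[nxt]

At each decision point, the selected node embedding is converted into a direction vector and handed to the low-level policy, which executes primitive actions toward that subgoal.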

Key Ideas

Temporal Distance Representation (TDR)

TDR illustration
  • TDR \( \psi \) embeds states into a latent space \( \mathcal{H} \), where the Euclidean distance between any two points corresponds to the minimum number of steps required to transition from one state to another in the raw state space \( \mathcal{S} \).
  • This representation preserves global temporal structure and is learned by minimizing the expectile-regression objective (sketched below): \(\mathbb{E}_{(s,s',g)\sim\mathcal{D}} \bigl[\ell_\tau^2\!\bigl(-\mathbf{1}\{s\neq g\} -\gamma\,\lVert\psi(s') -\psi(g)\rVert_2+\lVert\psi(s)-\psi(g)\rVert_2\bigr)\bigr]\)
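
A minimal PyTorch sketch of this loss, assuming `psi` is a state encoder (e.g., a small MLP) and batches of \( (s, s', g) \) triples are sampled from the offline dataset. The discount `gamma`, the expectile `tau`, and the encoder sizes are illustrative, and a target copy of `psi` for the \( s' \) term is omitted for brevity.

    import torch

    def tdr_loss(psi, s, s_next, g, gamma=0.99, tau=0.9):
        """Expectile regression on the TD error of V(s, g) = -||psi(s) - psi(g)||_2."""
        d_cur = torch.norm(psi(s) - psi(g), dim=-1)        # ||psi(s)  - psi(g)||
        d_next = torch.norm(psi(s_next) - psi(g), dim=-1)  # ||psi(s') - psi(g)|| (target copy of psi in practice)
        reward = -(s != g).any(dim=-1).float()             # -1{s != g}
        td_err = reward - gamma * d_next + d_cur           # argument of the expectile loss above
        weight = torch.abs(tau - (td_err < 0).float())     # asymmetric expectile weighting
        return (weight * td_err ** 2).mean()

    # Example encoder (sizes are illustrative):
    # psi = torch.nn.Sequential(torch.nn.Linear(obs_dim, 256), torch.nn.ReLU(), torch.nn.Linear(256, 32))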

Temporal Efficiency (TE)

TE illustration
  • TE measures the directional alignment between the actual and optimal transitions over a fixed temporal distance.
  • Specifically, given a state \( s_{\text{cur}} \), the optimal future state \( s_{\text{opt}} \) is the state within the same trajectory whose temporal distance from \( s_{\text{cur}} \) is \( H_{\text{TD}} \), while the actually reached state \( s_{\text{reached}} \) is the state observed \( H_{\text{TD}} \) time steps after \( s_{\text{cur}} \).
  • TE is then computed via cosine similarity in TDR space: \(\theta_{\text{TE}} = \cos\bigl(\psi(s_{\text{opt}})-\psi(s_{\text{cur}}),\,\psi(s_{\text{reached}})-\psi(s_{\text{cur}})\bigr)\)
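
A minimal NumPy sketch of this computation; it assumes the three TDR embeddings are already available and adds a small epsilon to guard against zero-length displacements.

    import numpy as np

    def temporal_efficiency(h_cur, h_opt, h_reached, eps=1e-8):
        """Cosine similarity between the optimal and actually reached displacements
        from the current state, both measured in TDR space."""
        v_opt = h_opt - h_cur
        v_reached = h_reached - h_cur
        return float(np.dot(v_opt, v_reached) /
                     (np.linalg.norm(v_opt) * np.linalg.norm(v_reached) + eps))

States whose TE falls below a chosen threshold are treated as noisy or inefficient transitions and excluded from graph construction, as described in the overview.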

TD-aware Graph Construction

TD-aware Graph Construction illustration
  • GAS clusters states in the TDR space at regular temporal distance intervals \( H_{\text{TD}} \), grouping semantically similar states from different trajectories.
  • Each cluster center becomes a graph node, and edges are added between nodes if their temporal distance is below \( H_{\text{TD}} \), enabling stitching across disconnected trajectories.
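
A minimal sketch of this construction, assuming TDR embeddings of the TE-filtered states are given. The greedy incremental clustering and its coverage radius of \( H_{\text{TD}}/2 \) are illustrative assumptions standing in for the paper's exact procedure; only the node-and-edge structure mirrors the description above.

    import numpy as np

    def build_td_graph(embs, h_td):
        """Cluster TDR embeddings into node centers, then connect centers whose
        temporal (Euclidean) distance is below h_td.  embs: (N, d) array."""
        radius = h_td / 2.0                                # assumed per-cluster coverage radius
        centers = []
        for h in embs:                                     # greedy, order-dependent clustering
            if not centers or min(np.linalg.norm(h - c) for c in centers) > radius:
                centers.append(h)
        centers = np.stack(centers)
        edges = []
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                w = float(np.linalg.norm(centers[i] - centers[j]))
                if w < h_td:                               # edge: nodes reachable within h_td
                    edges.append((i, j, w))
        return centers, edges

The returned centers and edges can be passed directly to the shortest-path subgoal selection sketched in the overview above.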

TD-aware Subgoal Sampling

TD-aware subgoal sampling illustration
  • During low-level policy training, the subgoal \( s_{\text{sub}} \) for a state \( s_t \) is sampled at a fixed temporal distance \( H_{\text{TD}} \) ahead within the same trajectory.
  • The selected subgoal is converted into a direction vector \( \vec{h}_{\text{dir}} = \operatorname{dir}(\psi(s_t), \psi(s_{\text{sub}})) \) that conditions the low-level policy, which is trained to maximize (a sketch of this update follows below): \(\displaystyle \mathbb{E}_{\mathcal{D}} \Bigl[ Q\!\bigl(s_t,\mu^\pi(s_t,\vec{h}_{\text{dir}}),\vec{h}_{\text{dir}}\bigr) +\, \alpha\,\log \pi\!\bigl(a_t \mid s_t,\vec{h}_{\text{dir}}\bigr) \Bigr] \)
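
A minimal PyTorch sketch of the direction vector and a DDPG+BC-style update matching the objective above. It assumes \( \operatorname{dir}(\cdot,\cdot) \) is the unit-normalized displacement in TDR space and that the actor outputs a Gaussian (mean, log-std) head; both are illustrative assumptions, as is the coefficient `alpha`.

    import math
    import torch

    def direction(h_cur, h_sub, eps=1e-8):
        """Assumed form of dir(., .): unit displacement toward the subgoal in TDR space."""
        d = h_sub - h_cur
        return d / (torch.norm(d, dim=-1, keepdim=True) + eps)

    def low_level_loss(q_net, actor, s, a, h_dir, alpha=0.1):
        """DDPG+BC-style update: maximize Q at the actor's mean action plus a
        behavior-cloning log-likelihood term (returned as a loss to minimize)."""
        mean, log_std = actor(s, h_dir)                    # assumed Gaussian policy head
        q = q_net(s, mean, h_dir)                          # Q(s_t, mu_pi(s_t, h_dir), h_dir)
        log_prob = (-0.5 * ((a - mean) / log_std.exp()) ** 2
                    - log_std - 0.5 * math.log(2.0 * math.pi)).sum(dim=-1)
        return -(q + alpha * log_prob).mean()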

Experiments

Datasets

Datasets illustration

Results on state-based environments

main_table_(state) illustration

Results on pixel-based environments

main_table_(pixel) illustration

Performance Highlights


Subgoal Visualizations

Subgoal Visualizations illustration

BibTeX

    @inproceedings{gas_baek2025,
      title={Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning},
      author={Seungho Baek and Taegeon Park and Jongchan Park and Seungjun Oh and Yusung Kim},
      booktitle={International Conference on Machine Learning (ICML)},
      year={2025}
    }