Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning

ICML 2025

1Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, Republic of Korea; 2Department of Artificial Intelligence, Sungkyunkwan University, Suwon, Republic of Korea

Abstract

Existing offline hierarchical reinforcement learning (HRL) methods rely on high-level policy learning to generate subgoal sequences. However, their efficiency degrades as task horizons increase, and they lack effective strategies for stitching useful state transitions across different trajectories. We propose Graph-Assisted Stitching (GAS), a novel framework that formulates subgoal selection as a graph search problem rather than learning an explicit high-level policy. By embedding states into a Temporal Distance Representation (TDR) space, GAS clusters semantically similar states from different trajectories into unified graph nodes, enabling efficient transition stitching. A shortest-path algorithm is then applied to select subgoal sequences within the graph, while a low-level policy learns to reach the subgoals. To improve graph quality, we introduce the Temporal Efficiency (TE) metric, which filters out noisy or inefficient transition states, significantly enhancing task performance. GAS outperforms prior offline HRL methods across locomotion, navigation, and manipulation tasks. Notably, in the most stitching-critical task, it achieves a score of 88.3, dramatically surpassing the previous state-of-the-art score of 1.0. Our source code is available at: https://github.com/qortmdgh4141/GAS.


Why is Trajectory Stitching Crucial in Offline HRL?

Trajectory Stitching illustration

The GAS Framework

Overview

Overview illustration
(1) A TDR is pretrained from an offline dataset. (2) A graph is constructed by selecting only high-TE states based on the TE metric. (3) A TD-based subgoal-conditioned low-level policy is trained using all states. (4) The graph is utilized for task planning and subgoal selection, while action execution is performed by the low-level policy.
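
Below is a minimal sketch of step (4), subgoal selection by shortest-path search over the graph. It assumes node centers are stored as TDR-space embeddings \( \psi(s) \) (introduced in Key Ideas below) and that edges are weighted by temporal distance; `networkx` and the helper function are illustrative choices, not the released implementation.

    import numpy as np
    import networkx as nx

    def select_subgoal(node_embs, edges, h_cur, h_goal):
        """Return the TDR embedding of the next subgoal: the first node on the
        shortest path from the node nearest the current state to the node nearest
        the goal.  node_embs: (N, d) array of node centers; edges: (i, j, weight) list."""
        G = nx.Graph()
        G.add_nodes_from(range(len(node_embs)))
        for i, j, w in edges:                            # weight = temporal distance between centers
            G.add_edge(i, j, weight=w)
        cur = int(np.argmin(np.linalg.norm(node_embs - h_cur, axis=1)))
        goal = int(np.argmin(np.linalg.norm(node_embs - h_goal, axis=1)))
        path = nx.shortest_path(G, source=cur, target=goal, weight="weight")
        nxt = path[1] if len(path) > 1 else path[0]      # already at the goal node
        return node_embs[nxt]

At each decision point, the selected node embedding is converted into a direction vector and handed to the low-level policy, which executes primitive actions toward that subgoal.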

Key Ideas

Temporal Distance Representation (TDR)

TDR illustration
  • TDR \( \psi \) embeds states into a latent space \( \mathcal{H} \), where the Euclidean distance between any two points corresponds to the minimum number of steps required to transition from one state to another in the raw state space \( \mathcal{S} \).
  • This representation preserves global temporal structure and is learned by minimizing the expectile-regression objective (sketched below): \(\mathbb{E}_{(s,s',g)\sim\mathcal{D}} \bigl[\ell_\tau^2\!\bigl(-\mathbf{1}\{s\neq g\} -\gamma\,\lVert\psi(s') -\psi(g)\rVert_2+\lVert\psi(s)-\psi(g)\rVert_2\bigr)\bigr]\)
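
A minimal PyTorch sketch of this loss, assuming `psi` is a state encoder (e.g., a small MLP) and batches of \( (s, s', g) \) triples are sampled from the offline dataset. The discount `gamma`, the expectile `tau`, and the encoder sizes are illustrative, and a target copy of `psi` for the \( s' \) term is omitted for brevity.

    import torch

    def tdr_loss(psi, s, s_next, g, gamma=0.99, tau=0.9):
        """Expectile regression on the TD error of V(s, g) = -||psi(s) - psi(g)||_2."""
        d_cur = torch.norm(psi(s) - psi(g), dim=-1)        # ||psi(s)  - psi(g)||
        d_next = torch.norm(psi(s_next) - psi(g), dim=-1)  # ||psi(s') - psi(g)|| (target copy of psi in practice)
        reward = -(s != g).any(dim=-1).float()             # -1{s != g}
        td_err = reward - gamma * d_next + d_cur           # argument of the expectile loss above
        weight = torch.abs(tau - (td_err < 0).float())     # asymmetric expectile weighting
        return (weight * td_err ** 2).mean()

    # Example encoder (sizes are illustrative):
    # psi = torch.nn.Sequential(torch.nn.Linear(obs_dim, 256), torch.nn.ReLU(), torch.nn.Linear(256, 32))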

Temporal Efficiency (TE)

TE illustration
  • TE measures the directional alignment between the actual and optimal transitions over a fixed temporal distance.
  • Specifically, given a state \( s_{\text{cur}} \), the optimal future state \( s_{\text{opt}} \) is the state within the same trajectory whose temporal distance from \( s_{\text{cur}} \) is \( H_{\text{TD}} \), while the actually reached state \( s_{\text{reached}} \) is the state observed \( H_{\text{TD}} \) time steps after \( s_{\text{cur}} \).
  • TE is then computed via cosine similarity in TDR space: \(\theta_{\text{TE}} = \cos\bigl(\psi(s_{\text{opt}})-\psi(s_{\text{cur}}),\,\psi(s_{\text{reached}})-\psi(s_{\text{cur}})\bigr)\)
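
A minimal NumPy sketch of this computation; it assumes the three TDR embeddings are already available and adds a small epsilon to guard against zero-length displacements.

    import numpy as np

    def temporal_efficiency(h_cur, h_opt, h_reached, eps=1e-8):
        """Cosine similarity between the optimal and actually reached displacements
        from the current state, both measured in TDR space."""
        v_opt = h_opt - h_cur
        v_reached = h_reached - h_cur
        return float(np.dot(v_opt, v_reached) /
                     (np.linalg.norm(v_opt) * np.linalg.norm(v_reached) + eps))

States whose TE falls below a chosen threshold are treated as noisy or inefficient transitions and excluded from graph construction, as described in the overview.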

TD-aware Graph Construction

TD-aware Graph Construction illustration
  • GAS clusters states in the TDR space at regular temporal distance intervals \( H_{\text{TD}} \), grouping semantically similar states from different trajectories.
  • Each cluster center becomes a graph node, and edges are added between nodes if their temporal distance is below \( H_{\text{TD}} \), enabling stitching across disconnected trajectories.
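
A minimal sketch of this construction, assuming TDR embeddings of the TE-filtered states are given. The greedy incremental clustering and its coverage radius of \( H_{\text{TD}}/2 \) are illustrative assumptions standing in for the paper's exact procedure; only the node-and-edge structure mirrors the description above.

    import numpy as np

    def build_td_graph(embs, h_td):
        """Cluster TDR embeddings into node centers, then connect centers whose
        temporal (Euclidean) distance is below h_td.  embs: (N, d) array."""
        radius = h_td / 2.0                                # assumed per-cluster coverage radius
        centers = []
        for h in embs:                                     # greedy, order-dependent clustering
            if not centers or min(np.linalg.norm(h - c) for c in centers) > radius:
                centers.append(h)
        centers = np.stack(centers)
        edges = []
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                w = float(np.linalg.norm(centers[i] - centers[j]))
                if w < h_td:                               # edge: nodes reachable within h_td
                    edges.append((i, j, w))
        return centers, edges

The returned centers and edges can be passed directly to the shortest-path subgoal selection sketched in the overview above.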

TD-aware Subgoal Sampling

TD-aware subgoal sampling illustration
  • During low-level policy training, the subgoal \( s_{\text{sub}} \) for a state \( s_t \) is sampled at a fixed temporal distance \( H_{\text{TD}} \) ahead within the same trajectory.
  • The selected subgoal is converted into a direction vector \( \vec{h}_{\text{dir}} = \operatorname{dir}(\psi(s_t), \psi(s_{\text{sub}})) \) that conditions the low-level policy, which is trained to maximize (a sketch of this update follows below): \(\displaystyle \mathbb{E}_{\mathcal{D}} \Bigl[ Q\!\bigl(s_t,\mu^\pi(s_t,\vec{h}_{\text{dir}}),\vec{h}_{\text{dir}}\bigr) +\, \alpha\,\log \pi\!\bigl(a_t \mid s_t,\vec{h}_{\text{dir}}\bigr) \Bigr] \)
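
A minimal PyTorch sketch of the direction vector and a DDPG+BC-style update matching the objective above. It assumes \( \operatorname{dir}(\cdot,\cdot) \) is the unit-normalized displacement in TDR space and that the actor outputs a Gaussian (mean, log-std) head; both are illustrative assumptions, as is the coefficient `alpha`.

    import math
    import torch

    def direction(h_cur, h_sub, eps=1e-8):
        """Assumed form of dir(., .): unit displacement toward the subgoal in TDR space."""
        d = h_sub - h_cur
        return d / (torch.norm(d, dim=-1, keepdim=True) + eps)

    def low_level_loss(q_net, actor, s, a, h_dir, alpha=0.1):
        """DDPG+BC-style update: maximize Q at the actor's mean action plus a
        behavior-cloning log-likelihood term (returned as a loss to minimize)."""
        mean, log_std = actor(s, h_dir)                    # assumed Gaussian policy head
        q = q_net(s, mean, h_dir)                          # Q(s_t, mu_pi(s_t, h_dir), h_dir)
        log_prob = (-0.5 * ((a - mean) / log_std.exp()) ** 2
                    - log_std - 0.5 * math.log(2.0 * math.pi)).sum(dim=-1)
        return -(q + alpha * log_prob).mean()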

Experiments

Datasets

Datasets illustration

Results on state-based environments

main_table_(state) illustration

Results on pixel-based environments

main_table_(pixel) illustration

Performance Highlights


Subgoal Visualizations

Subgoal Visualizations illustration

BibTeX

    @inproceedings{gas_baek2025,
      title={Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning},
      author={Seungho Baek and Taegeon Park and Jongchan Park and Seungjun Oh and Yusung Kim},
      booktitle={International Conference on Machine Learning (ICML)},
      year={2025}
    }