Prologue

Alright, so Motion Planning Networks (MPNet) and Motion Planning Transformers (MPT) were actually a topic assigned to me for exploration (maybe I’ll share more about this in the future), not really my own motivation. I researched them anyway, and I do believe I have interesting findings for this post.

But first, the mandatory stuff: reading the papers, extracting insights, good old literature review. In this post, I will be covering the MPNet paper: Motion Planning Networks, Qureshi et al., 2019.

I will link the Motion Planning Transformers post here (coming soon).

I also recently set up Obsidian to try out this paper-review workflow. This post was originally drafted in Obsidian, so the formatting might be a bit different from my usual posts.


Table of Contents


1. Recap

Motion planning is one of the main pillars of modern robotics. Such algorithms, however, grow exponentially in complexity as the dimensionality of the problem increases.

The team proposed a neural-network-based framework called MPNet to tackle the problem. They tested it on various 2D and 3D environments, everything from a simple point-mass robot in a 2D random forest to a 7-DOF Baxter robot.

Baxter robot figure 1

MPNet is essentially a contractive autoencoder (ENet) feeding into a dense deep neural network (PNet). They train these two nets separately on generated synthetic workspaces: ENet with a reconstruction loss, PNet with an MSE loss.

For the Baxter robot case, interestingly, they trained the entire ENet -> PNet pipeline end-to-end with MSE loss and a shallower PNet. For ENet, they also ditched the decoder; it’s a 4-layer DNN, and it’s kinda weird seeing a 4-layer DNN feeding into a 9-layer DNN, but oh well.

It did work out well for them; they state MPNet’s computation time is at least 20x faster than traditional sampling-based motion planners (SMPs).

MPNet doodle

My personal take is that this paper feels kinda shaky. It feels like they ran into some sort of roadblock with the Baxter robot setup (let’s refer to this as Baxter MPNet) and the deadline was closing in or something.

They don’t really justify changing the architecture and training strategy, and their results for the Baxter case are just a few lines at the end of the results section, instead of a more thorough A/B/n comparison like they did with their base MPNet.

Another thing I found intriguing is their novel bidirectional stochastic algorithm used as a heuristic for MPNet. The authors state this makes the path generation heuristic “greedy and fast”.


2. Problem Definition

Let $X \subset \mathbb{R}^d$ be the given state space, where $d \in \mathbb{N}$ is 2 for $\mathbb{R}^2$ (2D space) and 3 for $\mathbb{R}^3$ (3D space).

The Baxter robot operates in $SE(3)$ space (the special Euclidean group), consisting of position $(x, y, z)$ and rotation $(\text{roll, pitch, yaw})$. The one extra DOF in Baxter’s 7 DOF beyond those six is redundancy, allowing it even more freedom in the space, especially when obstacles are involved.

The obstacle and obstacle-free state spaces are defined as $X_{\text{obs}} \subset X$ and $X_{\text{free}} = X\setminus X_{\text{obs}}$, respectively.

  • Init state: $x_{\text{init}} \in X_{\text{free}}$
  • Goal region: $X_{\text{goal}} \subset X_{\text{free}}$
  • Solution path: $\tau = \{x_0, x_1, \ldots, x_T\}$ where $x_0 = x_{\text{init}}$, $x_T \in X_{\text{goal}}$, and $x_i \in X_{\text{free}}$ for all $i \in [0, T-1]$

3. Model Architecture

General structure: ENet and PNet.

Two phases: (A) offline training and (B) online path generation.

3.1. Base MPNet

MPNet doodle

ENet is a Contractive Autoencoder (CAE), where $\theta^e, \theta^d$ are the parameters of the encoder and decoder, respectively.

The encoding function $\text{ENet}(x_{\text{obs}})$ has 3 linear layers and 1 output layer; each linear layer is followed by a Parametric Rectified Linear Unit ($PReLU$). The input is the obstacle point cloud, a vector of size $N_{\text{pc}} \times d_{w}$, where $N_{\text{pc}}$ is the number of points and $d_{w} \in \mathbb{N}$ is the dimension of the workspace. The encoder output is $Z \in \mathbb{R}^m$, where $m \in \mathbb{N}$ is a user-chosen parameter.

PReLU function

ReLU vs. PReLU. Source: Wang et al., 2022

The decoder simply mirrors the encoder (the same layers in reverse order).

ENet is trained with reconstruction loss:

$$L_{\text{AE}}(\theta^e, \theta^d) = \frac{1}{N_{\text{obs}}} \sum_{\boldsymbol{x} \in D_{\text{obs}}} \|\boldsymbol{x} - \hat{\boldsymbol{x}}\|^2 + \lambda \sum_{ij} (\theta_{ij}^e)^2$$

where $\lambda$ is a penalty coefficient of the user’s choosing.
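To make this concrete, here is a minimal PyTorch-style sketch of how I picture ENet. The class name, layer widths, and latent size are placeholders I picked, not the paper’s exact values.

import torch
import torch.nn as nn

class ENet(nn.Module):
    # Contractive autoencoder: 3 hidden linear layers + an output layer, PReLU activations.
    # Widths and latent_dim are illustrative placeholders.
    def __init__(self, input_dim=2800, latent_dim=28):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.PReLU(),
            nn.Linear(512, 256), nn.PReLU(),
            nn.Linear(256, 128), nn.PReLU(),
            nn.Linear(128, latent_dim),            # output layer -> Z
        )
        self.decoder = nn.Sequential(              # mirrors the encoder
            nn.Linear(latent_dim, 128), nn.PReLU(),
            nn.Linear(128, 256), nn.PReLU(),
            nn.Linear(256, 512), nn.PReLU(),
            nn.Linear(512, input_dim),
        )

    def forward(self, x_obs):                      # x_obs: flattened obstacle point cloud
        z = self.encoder(x_obs)
        return z, self.decoder(z)

def cae_loss(x, x_hat, encoder, lam=1e-3):
    # reconstruction term + lambda * sum of squared encoder weights, as in the loss above
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()
    penalty = sum((w ** 2).sum() for name, w in encoder.named_parameters() if "weight" in name)
    return recon + lam * penalty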

After training ENet, the decoder is stripped away, leaving only the encoder, which embeds the obstacle point cloud into $Z$.

ENet is frozen after training; PNet is then trained separately. PNet is a 12-layer feed-forward deep neural network (DNN) parameterized by $\theta$.

Each linear layer is followed by the Parametric Rectified Linear Unit ($PReLU$). Dropout with probability $p = 0.5$ is applied to all hidden layers in PNet, and it stays active during both training and inference.

Given $Z’=\text{concat}(Z, x_{t}, x_{T})$, PNet predicts the next state $\hat{x}_{t+1} \in X_{\text{free}}$. PNet is trained with an MSE loss against feasible, near-optimal expert paths $\tau^*$ (from a human demonstrator or a classical planner):

$$L_{\text{PNet}}(\boldsymbol{\theta}) = \frac{1}{N_p} \sum_{j}^{\hat{N}} \sum_{i=0}^{T-1} \|\hat{\boldsymbol{x}}_{j,i+1} - \boldsymbol{x}_{j,i+1}\|^2$$
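And a matching sketch of PNet plus one MSE training step on an expert transition; the hidden width, layer bookkeeping, and optimizer usage are my own assumptions, not lifted from the paper.

import torch
import torch.nn as nn

class PNet(nn.Module):
    # Feed-forward planner network; note there is deliberately no eval-mode switch,
    # since dropout is meant to stay active at inference too.
    def __init__(self, state_dim=2, latent_dim=28, hidden=512, n_hidden=11, p=0.5):
        super().__init__()
        layers, in_dim = [], latent_dim + 2 * state_dim   # Z' = concat(Z, x_t, x_T)
        for _ in range(n_hidden):
            layers += [nn.Linear(in_dim, hidden), nn.PReLU(), nn.Dropout(p)]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, state_dim))       # predicted x_{t+1}
        self.net = nn.Sequential(*layers)

    def forward(self, z, x_t, x_goal):
        return self.net(torch.cat([z, x_t, x_goal], dim=-1))

def train_step(pnet, optimizer, z, x_t, x_goal, x_next_expert):
    # one gradient step of the MSE loss against an expert (e.g. RRT*) transition
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(pnet(z, x_t, x_goal), x_next_expert)
    loss.backward()
    optimizer.step()
    return loss.item()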

3.2. Baxter MPNet

MPNet doodle

For Baxter MPNet, ENet is just an encoder; in this case it’s a plain DNN, so we essentially have two DNNs in this MPNet.

What’s more, PNet in Baxter MPNet is also shallower than in Base MPNet: 9 layers instead of 12.

They trained Baxter MPNet end-to-end with MSE loss.

Other than that, everything else is relatively similar to Base MPNet: still $PReLU$ activations, the same Dropout, the same concatenation with $Z’$, and so on.
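Here is a rough sketch of what that end-to-end step could look like: a single optimizer over both networks, so the MSE gradient flows through PNet back into the encoder-only ENet. It reuses the PNet class from the sketch above; the layer widths, latent size, and optimizer choice are all placeholders of mine.

import torch
import torch.nn as nn

# Encoder-only ENet for Baxter: 4 linear layers, no decoder (widths are placeholders).
enet = nn.Sequential(
    nn.Linear(4096, 512), nn.PReLU(),
    nn.Linear(512, 256), nn.PReLU(),
    nn.Linear(256, 128), nn.PReLU(),
    nn.Linear(128, 64),
)
pnet = PNet(state_dim=7, latent_dim=64, n_hidden=8)   # 8 hidden + 1 output = the shallower 9-layer PNet

optimizer = torch.optim.Adam(list(enet.parameters()) + list(pnet.parameters()))

# Dummy batch, just to show the shapes (Baxter states are 7-DOF joint configurations).
x_obs = torch.randn(32, 4096)
x_t, x_goal, x_next_expert = torch.randn(32, 7), torch.randn(32, 7), torch.randn(32, 7)

z = enet(x_obs)
loss = nn.functional.mse_loss(pnet(z, x_t, x_goal), x_next_expert)
optimizer.zero_grad()
loss.backward()        # gradients reach ENet through PNet: end-to-end training
optimizer.step()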


4. Online Path Generation Algorithm

The online phase of MPNet performs motion planning by combining two learned components:

  • Enet encodes the obstacle point cloud into a latent obstacle representation.
  • Pnet incrementally predicts intermediate states between a start and goal configuration. (autoregressive-like)

The planner operates in three stages:

  1. Generate a coarse path using a bidirectional neural expansion strategy.
  2. Simplify the path through lazy contraction.
  3. Repair invalid path segments using recursive replanning.

algorithm MPNet

The overall procedure is outlined below.


4.1. Overall Planning Pipeline

function MPNet(x_start, x_goal, x_obs):

    # Encode obstacle space
    z = Enet(x_obs)

    # Generate coarse path
    path = NeuralPlanner(x_start, x_goal, z)

    if path is None:
        return None   # coarse planning failed

    # Simplify path
    path = LazyStateContraction(path)

    # Validate full trajectory
    if IsFeasible(path):
        return path

    # Repair invalid segments
    repaired_path = Replan(path, z)

    if repaired_path is None:
        return None

    repaired_path = LazyStateContraction(repaired_path)

    if IsFeasible(repaired_path):
        return repaired_path

    return None   # no feasible path found

4.2. Obstacle Encoding

The encoder network transforms the obstacle point cloud into a compact latent representation used throughout planning.

z = Enet(x_obs)

Where:

  • x_obs is the obstacle point cloud
  • z is the latent obstacle embedding

This embedding remains fixed during the online planning phase.


4.3. Bidirectional Neural Path Generation

The neural planner incrementally expands two trajectories, one from the start state, one from the goal state.

At each iteration, one side predicts a new state using Pnet, then attempts to directly connect to the opposite frontier.

function NeuralPlanner(x_start, x_goal, z):

    path_a = [x_start]
    path_b = [x_goal]

    for iteration in range(max_iterations):

        # Predict next state
        x_new = Pnet(z, path_a[-1], path_b[-1])

        path_a.append(x_new)

        # Attempt direct connection
        if SteerTo(path_a[-1], path_b[-1]):

            return Concatenate(path_a, path_b)

        # Alternate expansion direction
        path_a, path_b = path_b, path_a   # tuple swap; a Swap() helper would not rebind these names

    return None

4.3.1. Steering Function

The steering procedure checks whether two states can be connected through a collision-free linear interpolation.

function SteerTo(x1, x2):

    # Walk from x1 to x2 in small increments (the step count here is my choice, not the paper's).
    steps = 20
    for k in range(steps + 1):
        δ = k / steps

        x = (1 - δ) * x1 + δ * x2

        if InCollision(x):
            return False

    return True

4.3.2. Lazy State Contraction

The generated neural path may contain unnecessary intermediate states. Lazy state contraction removes redundant states by attempting direct shortcuts.

function LazyStateContraction(path):

    contracted = [path[0]]

    i = 0

    while i < len(path) - 1:

        j = len(path) - 1

        while j > i + 1:

            if SteerTo(path[i], path[j]):
                break

            j -= 1

        contracted.append(path[j])

        i = j

    return contracted

This procedure essentially acts like a lightweight path smoothing stage.


4.3.3. Path Feasibility Checking

A path is considered feasible if all consecutive state pairs are collision-free.

function IsFeasible(path):

    for i in range(len(path) - 1):

        if not SteerTo(path[i], path[i + 1]):
            return False

    return True

4.4. Recursive Replanning

If the coarse path contains invalid segments, replanning is applied locally between non-connectable states.

function Replan(path, z):

    repaired = [path[0]]   # keep the first state; append segment endpoints as they are validated

    for i in range(len(path) - 1):

        x_a = path[i]
        x_b = path[i + 1]

        # Segment already valid
        if SteerTo(x_a, x_b):

            repaired.append(x_b)
            continue

        # Generate local repair path
        local_path = LocalReplanner(x_a, x_b, z)

        if local_path is None:
            return None

        repaired.extend(local_path[1:])   # skip the duplicated x_a

    return repaired

The replanner can operate in two modes:

  • Neural replanning: Recursively applies the neural planner on invalid local segments.
  • Hybrid replanning: Falls back to a classical planner such as RRT* when neural replanning fails.

4.4.1. Neural Replanning

Neural replanning recursively refines invalid path segments.

function NeuralReplanner(x_start, x_goal, z, depth):

    if depth > max_depth:
        return None

    path = NeuralPlanner(x_start, x_goal, z)

    if path is None:
        return None

    if IsFeasible(path):
        return path

    return Replan(path, z)   # a full implementation would thread depth + 1 into the local replanner

4.4.2. Hybrid Replanning

Hybrid replanning integrates classical motion planning into the repair stage.

function HybridReplanner(x_start, x_goal, z):

    path = NeuralReplanner(x_start, x_goal, z, depth=0)

    if path is not None and IsFeasible(path):
        return path

    # Fallback to classical planner
    return RRTStar(x_start, x_goal)

4.5. Discussion

4.5.1. Stochasticity Through Dropout

Dropout is applied during both training and inference.

x_next = Pnet(z, x_current, x_goal)

During execution, random hidden units are disabled, causing the planner to behave stochastically rather than deterministically. Consequently, repeated planning attempts can generate different trajectories for the same start and goal states. This is particularly useful during recursive replanning, where alternative paths may recover from previous failures.
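In PyTorch terms, one simple way to get this behaviour is to never switch PNet into eval mode during planning (or to flip only its Dropout modules back to train mode). A small sketch, reusing the hypothetical pnet, z, x_current, and x_goal tensors from the sketches above:

pnet.train()   # keeps nn.Dropout active during online planning
# Alternative: pnet.eval(), then re-enable just the dropout modules:
# for m in pnet.modules():
#     if isinstance(m, nn.Dropout):
#         m.train()

with torch.no_grad():   # no gradients needed online
    samples = [pnet(z, x_current, x_goal) for _ in range(5)]
# Repeated calls generally return different next states because the dropout masks differ.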


4.5.2. Completeness

The neural planner alone does not provide completeness guarantees. Completeness instead depends on the replanning strategy. Pure neural replanning remains heuristic, while hybrid replanning inherits the probabilistic completeness of the underlying classical planner, such as RRT*.


4.5.3. Computational Complexity

The neural components operate with approximately constant-time inference complexity.

Enet              : O(1)
Pnet              : O(1)
Neural planning   : O(1)
Neural replanning : O(1)

(A forward pass through a fixed-size neural network is treated as $O(1)$.)

If hybrid replanning invokes a classical planner such as RRT*, the worst-case complexity becomes:

RRT* replanning : O(n log n)

In practice, they find that classical replanning is only applied to small invalid path segments rather than the full trajectory.


5. Experiments & Results

5.1. Experimental Setup

The authors implemented the proposed framework in PyTorch, while the benchmark planners, Informed-RRT* and BIT*, were evaluated using Python and OMPL implementations. For the Baxter experiments, they integrated the system with ROS and MoveIt!, and compared their method against the C++ implementation of BIT*. Their experiments were conducted on a machine equipped with an Intel Core i7 CPU, 32GB RAM, and an NVIDIA GTX 1080 GPU.

Their evaluation focuses on comparing MPNet with Neural Replanning (MPNet: NR) and Hybrid Replanning (MPNet: HR) against classical sampling-based planners across both low-dimensional and high-dimensional planning tasks.


5.2. Base Case(s)

5.2.1. Dataset Generation

The authors evaluated their method across four planning environments: simple 2D (s2D), complex 2D (c2D), complex 3D (c3D), and rigid-body planning.

For each environment category, they generated 110 workspaces. Within every workspace, they collected 5000 collision-free near-optimal trajectories using RRT*. Their training dataset consisted of 100 workspaces with 4000 trajectories per workspace.

They constructed two separate evaluation datasets. The first, referred to as $X_\text{obs:seen}$, reused previously observed workspaces while introducing unseen start and goal configurations. The second, $X_\text{obs:unseen}$, consisted entirely of unseen workspaces and unseen planning queries.


5.2.2. Quantitative Results

The authors compared the computational time of MPNet: NR and MPNet: HR against Informed-RRT* and BIT*. Their reported metric corresponds to the time required to compute an initial feasible trajectory.

Mean computation time (seconds, ± std) to find an initial feasible path:

| Environment | Test case | MPNet (NR) | MPNet (HR) | Informed-RRT* | BIT* | BIT* / MPNet (NR) mean-time ratio |
|---|---|---|---|---|---|---|
| Simple 2D | Seen Xobs | 0.11 ± 0.037 | 0.19 ± 0.14 | 5.36 ± 0.34 | 2.71 ± 1.72 | 24.64 |
| Simple 2D | Unseen Xobs | 0.11 ± 0.038 | 0.34 ± 0.21 | 5.39 ± 0.18 | 2.63 ± 0.75 | 23.91 |
| Complex 2D | Seen Xobs | 0.17 ± 0.058 | 0.61 ± 0.35 | 6.18 ± 1.63 | 3.77 ± 1.62 | 22.17 |
| Complex 2D | Unseen Xobs | 0.18 ± 0.27 | 0.68 ± 0.41 | 6.31 ± 0.85 | 4.12 ± 1.99 | 22.89 |
| Complex 3D | Seen Xobs | 0.48 ± 0.10 | 0.34 ± 0.14 | 14.92 ± 5.39 | 8.57 ± 4.65 | 17.85 |
| Complex 3D | Unseen Xobs | 0.44 ± 0.107 | 0.55 ± 0.22 | 15.54 ± 2.25 | 8.86 ± 3.83 | 20.14 |
| Rigid | Seen Xobs | 0.32 ± 0.28 | 1.92 ± 1.30 | 30.25 ± 27.59 | 11.10 ± 5.59 | 34.69 |
| Rigid | Unseen Xobs | 0.33 ± 0.13 | 1.98 ± 1.85 | 30.38 ± 12.34 | 11.91 ± 5.34 | 36.09 |

Across all environment categories, their method maintained relatively stable planning times, staying around or below one second regardless of planning dimensionality. In contrast, the execution time of Informed-RRT* and BIT* increased as the environments became more complex, particularly in rigid-body planning tasks.

They also observed that MPNet generalized beyond the environments used during training. Similar execution behavior was reported on both the $X_\text{obs:seen}$ and $X_\text{obs:unseen}$ datasets, suggesting that their learned planner retained performance in previously unseen obstacle configurations.

They do not, however, claim generalization across modalities, for example training in 2D and then planning in 3D.

Examples of trajectories generated by MPNet and RRT* are shown below.

MPNet vs RRT star


5.3. Baxter Case

5.3.1. Dataset Generation

For the Baxter experiments, the authors created ten simulated manipulation environments. In each environment, they collected 900 trajectories for training and 100 trajectories for testing.

They obtained obstacle point clouds using a Kinect depth camera together with PCL and ROS integration. Their planning task focused on generating collision-free trajectories for the 7-DOF Baxter manipulator between target joint configurations.

Baxter robot figure 1


5.3.2. Quantitative Results

In the Baxter experiments, the authors compared MPNet against BIT* in a high-dimensional manipulation setting. Their method achieved approximately one-second planning time on average while maintaining a relatively high success rate across the test environments. BIT*, although capable of generating feasible trajectories, required substantially longer execution time, particularly when constrained to path quality comparable to MPNet.

Their results suggest that the neural planner remained computationally stable even in high-dimensional manipulation tasks, whereas the execution time of classical sampling-based planning varied more significantly.


6. Final Thoughts

I’m gonna put my full thoughts in the Motion Planning Transformers post here (coming soon). I already have some quick personal notes in 1. Recap.