Title: Bootstrap Motion Forecasting With Self-Consistent Constraints
URL Source: https://arxiv.org/html/2204.05859
Published Time: Tue, 28 Nov 2023 02:12:38 GMT
Maosheng Ye¹, Jiamiao Xu², Xunnong Xu², Tengfei Wang¹, Tongyi Cao², Qifeng Chen¹
¹The Hong Kong University of Science and Technology  ²DeepRoute.AI
Abstract
We present a novel framework to bootstrap Motion forecastIng with Self-consistent Constraints (MISC). The motion forecasting task aims at predicting future trajectories of vehicles by incorporating spatial and temporal information from the past. A key design of MISC is the proposed Dual Consistency Constraints that regularize the predicted trajectories under spatial and temporal perturbation during training. Also, to model the multi-modality in motion forecasting, we design a novel self-ensembling scheme to obtain accurate teacher targets to enforce the self-constraints with multi-modality supervision. With explicit constraints from multiple teacher targets, we observe a clear improvement in the prediction performance. Extensive experiments on the Argoverse motion forecasting benchmark and Waymo Open Motion dataset show that MISC significantly outperforms the state-of-the-art methods. As the proposed strategies are general and can be easily incorporated into other motion forecasting approaches, we also demonstrate that our proposed scheme consistently improves the prediction performance of several existing methods.
1 Introduction
Motion forecasting has been a crucial task for self-driving vehicles that aims at predicting the future trajectories of agents (e.g., cars, pedestrians) involved in traffic. The predicted trajectories can further help self-driving vehicles plan their future actions and avoid potential accidents. Since the future is not deterministic, motion forecasting is intrinsically a multi-modal problem with substantial uncertainties. This implies that an ideal motion forecasting method should produce a distribution over future trajectories, or at least several of the most likely ones.
Due to the inherent uncertainty, motion forecasting remains challenging and unsolved. Recently, researchers have proposed different architectures based on various representations to encode the kinematic states and context information from the HDMap in order to generate feasible multi-modal trajectories[2, 7, 15, 18, 28, 29, 36, 51, 55, 56, 58]. These methods follow a traditional static training pipeline, where the frames of each scenario are split into historical frames (input) and future frames (ground truth) in a fixed pattern. In real-world applications, however, prediction is a streaming task: the current state becomes a historical state as time passes, and historical states are kept in a queue buffer from which successive predictions are made. Temporal consistency thus becomes a crucial requirement for downstream tasks in terms of fault and noise tolerance. To tackle this issue, trajectory stitching is widely applied in traditional planning algorithms[13] to ensure stability along the temporal horizon. However, as the trajectory stitching operation is non-differentiable, it cannot be easily incorporated into learning-based models. Though deep-learning-based models show unprecedented motion prediction performance compared with traditional counterparts, they do not explicitly consider temporal consistency, leading to unstable behaviors in downstream tasks such as planning.
Inspired by these phenomena, we raise a question: can we explicitly enforce consistency when training a deep motion prediction model? On the one hand, the predicted trajectories should be consistent given successive inputs along the temporal horizon, namely temporal consistency. On the other hand, the predicted trajectories should be stable and robust against small spatial noise or disturbance, namely spatial consistency. In this work, we propose a self-supervised scheme, named MISC, to enforce consistency constraints in both the spatial and temporal domains, namely Dual Consistency Constraints. Dual Consistency Constraints can be viewed as an inner-model consistency and significantly improve the quality and robustness of motion forecasting, without the need for extra data.
On top of the inner-model consistency, we also exploit intra-model consistency. Multi-modality is another core characteristic of the motion prediction task. Existing datasets[9, 49] only provide a single ground-truth trajectory for each scenario, which cannot cover multi-choice situations such as junction scenarios. Most methods adopt winner-takes-all (WTA)[26] or its variants[4, 35] to alleviate this situation. However, WTA tends to produce confused predictions when two trajectories are very close. In contrast, our method addresses the multi-modality problem by using more robust teacher targets obtained from self-ensembling, which leverages intra-model consistency. Multiple teacher targets can be viewed as a special kind of intra-model distillation that alleviates the multi-modality problem. Our contributions are summarized as follows,
- •We propose self-consistent constraints at both the intra-model and inner-model levels.
- •For the inner-model consistency, Dual Consistency Constraints are proposed to enforce temporal and spatial consistency in our model, which is shown to be a general and effective way to improve the overall performance in motion forecasting.
- •For the intra-model consistency constraints, a self-ensembling constraint is explicitly exploited to enforce self-consistency with teacher targets, which provides multi-modality supervision for training.
- •Extensive experiments on the Argoverse[9] motion forecasting benchmark and Waymo Open Motion dataset[12] show that the proposed approach achieves state-of-the-art performance.
2 Related Work
Motion Forecasting. Traditional methods[22, 43, 54, 62] for motion forecasting mainly utilize HDMap information for prior estimation and the Kalman filter[23] for motion state prediction. With the recent progress of deep learning on big data, more and more works have been proposed to exploit the potential of data mining in motion forecasting. Early efforts[2, 7, 11, 15, 21, 28, 29, 44, 46, 55, 56, 61] explore different representations, including rasterized images, graphs, point clouds, and transformers, to generate features for the task and predict the final output trajectories by regression or post-processing sampling. Most of these works focus on finding more effective and compact ways of feature extraction on the surrounding environment (HDMap information) and agent interactions. Based on these representations, other approaches[6, 33, 46, 56, 57, 58] try to incorporate the prior knowledge of traditional methods, taking predefined candidate trajectories from sampling or clustering strategies as anchor trajectories. To some extent, these candidate trajectories can provide better guidance and goal coverage for trajectory regression due to straightforward HDMap encoding. Nevertheless, this extra dependency makes the stability of models highly dependent on the quality of the trajectory proposals. Goal-guided approaches[16, 18, 17] are therefore introduced to optimize goals in an end-to-end manner, paired with sampling strategies that generate the final trajectory for better coverage.
Consistency Regularization. Consistency regularization has been fully studied in semi-supervised and self-supervised learning. Temporally related works[53, 27, 60] apply pairwise matching to minimize the alignment difference through optical flow or correspondence matching to achieve temporal smoothness. Other works[1, 14, 38, 42, 52, 45] apply consistency constraints to predictions from the same input with different transformations in order to obtain perturbation-invariant representations. [8, 3] reverse the temporal order or mask some information and generate pairwise consistency between these predicted trajectories. [48] introduced consistency by examining the gap between agent-centric and scene-centric settings.
Multi-hypothesis Learning. The motion forecasting task is inherently multi-modal due to future uncertainties and the difficulty of acquiring accurate ground-truth labels. WTA[19, 47] in multiple-choice learning and its variants[32, 41] incorporate better distribution estimation to improve training convergence, thus allowing more multi-modality. Some anchor-based methods[4, 7, 39, 56] introduce pre-defined anchors based on kinematics or road graph topology to provide guidance. However, these methods only allow one target per training stage. Other methods[4, 18] try to generate multiple targets for supervision with heavy handcrafted optimization. We propose a Teacher-Target-Constraints approach that provides more precise trajectory teacher labels by leveraging the power of self-ensembling[25, 59]. Multiple targets are explicitly provided to each agent to better model the multi-modality.
3 Approach
The overall architecture of MISC comprises three parts. 1) We first utilize a joint spatial and temporal learning framework, TPCN[55], to extract pointwise features. Based on these features, we decouple trajectory prediction into a two-stage regression task: the first stage performs goal prediction and completes the trajectory under goal position guidance, and the second stage takes the output of the first stage as anchor trajectories for refinement. 2) To enhance the spatial and temporal consistency of MISC, we introduce Dual Consistency Constraints at the inner-model level, which regularize the predictions from a streaming-task view. 3) We leverage self-ensembling to generate more precise teacher targets, providing intra-model-level self-consistent Teacher-Target Constraints (Sec.3.3).
3.1 Architecture
Figure 1: The overall architecture. We utilize TPCN[55] as a feature extraction backbone to model the spatial and temporal relationship among agents and map information. A goal prediction header is then used to regress the possible goal candidates; with the goal position, we apply trajectory completion to obtain full trajectories; finally, the trajectories are refined based on the output of the trajectory completion module as anchor trajectories.
Recently, TPCN[55] has gained popularity in this task due to its flexibility for joint spatial-temporal learning and its scalability to adopt more techniques from point cloud learning. Considering its limitation in representing future uncertainty, we extend TPCN in a two-stage manner through goal position prediction for more accurate waypoint prediction, and use this as our baseline. The pipeline is shown in Fig.1.
Feature Extraction: TPCN utilizes dual-representation point cloud learning techniques with multi-interval temporal learning to model the spatial and temporal relationship. All the historical trajectories of input agents and map information are represented pointwise as $\{\mathbf{p}_{1},\mathbf{p}_{2},\ldots,\mathbf{p}_{N}\}$, where $\mathbf{p}_{i}$ is the $i$-th of $N$ points in total; these then go through the multi-representation learning framework to generate pointwise features $\mathcal{P}\in R^{N\times C}$, where $C$ is the channel number.
Goal Prediction: With the pointwise features from the backbone, we also adopt the popular goal-based ideas[16, 18, 58] to find the optimal planning policy. Specifically, we first gather all corresponding pointwise agent features and sum over them to get the agent instance feature $\phi\in R^{1\times C}$. To generate $K$ goal position predictions $G=\{G^{k}:(g_{x}^{k},g_{y}^{k}) \mid 1\leq k\leq K\}$, we use a simple MLP layer: $G=MLP(\phi)$. Instead of relying on heavy sampling strategies like previous goal-based methods, our method avoids generating extra proposals, which may lead to a large computation overhead.
Trajectory Completion: With the predicted goal positions, we need to complete each trajectory conditioned on these goals. We propose a simple trajectory completion module to generate $K$ full trajectories $\{\tau_{reg}^{k} \mid 1\leq k\leq K\}$ with a single MLP layer as follows:

$$\tau_{reg}^{k}=\{(x_{1}^{k},y_{1}^{k}),(x_{2}^{k},y_{2}^{k}),\dots,(x_{T}^{k},y_{T}^{k})\},\quad(1)$$

$$\tau_{reg}^{k}=MLP(concat(\phi,G^{k})).\quad(2)$$
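As a concrete sketch of the goal-prediction and trajectory-completion heads (Eqs. (1)-(2)), the snippet below pools pointwise agent features into $\phi$, maps it to $K$ goals, and completes one trajectory per goal. The dimensions, random weights, and the `mlp` helper are illustrative stand-ins for the trained modules, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, T = 64, 6, 30  # feature channels, number of modes, future horizon

def mlp(x, w1, w2):
    """Two-layer ReLU MLP standing in for the paper's MLP heads."""
    return np.maximum(x @ w1, 0.0) @ w2

# Pointwise agent features P (N x C) are sum-pooled into the instance feature phi.
P = rng.normal(size=(128, C))
phi = P.sum(axis=0, keepdims=True)                       # (1, C)

# Goal head (G = MLP(phi)): phi -> K goal positions (g_x, g_y).
w1g, w2g = rng.normal(size=(C, C)), rng.normal(size=(C, 2 * K))
G = mlp(phi, w1g, w2g).reshape(K, 2)                     # (K, 2)

# Completion head (Eq. 2): concat(phi, G^k) -> a full T-step trajectory.
w1c, w2c = rng.normal(size=(C + 2, C)), rng.normal(size=(C, 2 * T))
tau_reg = np.stack([
    mlp(np.concatenate([phi[0], G[k]])[None, :], w1c, w2c).reshape(T, 2)
    for k in range(K)
])                                                       # (K, T, 2)
```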
Trajectory Refinement: Inspired by Faster-RCNN[40] and Cascade-RCNN[5], we use the output trajectories from the Trajectory Completion module as anchor trajectories to refine trajectories and predict the probability of each trajectory. In particular, the input of the trajectory refinement module is the whole trajectory together with the agent's historical waypoints $\tau_{history}$. With a residual block followed by linear layers $Reg$ and $Cls$ respectively, we regress the delta offset to the first-stage outputs, $\Delta_{\tau_{reg}}=Reg(\tau_{reg},\tau_{history})$, and the corresponding scores $\tau_{cls}=\{c^{k} \mid 1\leq k\leq K\}$, where $\tau_{cls}=Cls(\tau_{reg},\tau_{history})$. The final output trajectories are $\tau_{reg'}=\Delta_{\tau_{reg}}+\tau_{reg}$.
3.2 Dual Consistency Constraints
Consistency regularization has proved to be an effective self-constraint that improves robustness against disturbances. We thus propose inner-model-level Dual Consistency Constraints in both the spatial and temporal domains to align predicted trajectories for continuity and stability.
3.2.1 Temporal Consistency
Figure 2: The overall idea of the temporal consistency. In the training stage, we first generate output prediction trajectory points as normal for each given scenario. Then we slide the input by a step, reflecting the streaming nature of the task, to generate the consecutive output trajectory points. The proposed temporal consistency requires the overlap between these two outputs to be consistent.
In motion forecasting, since each scenario contains multiple successive frames within a fixed temporal chunk, it is reasonable to assume that any two overlapping chunks of input data with a small time shift should produce consistent results. The motion forecasting task aims to predict $K$ possible trajectories with $T$ time steps for one scenario, given $M$ frames of historical information. Suppose the information at each history frame is $I_{i}$, where $1\leq i\leq M$, and the $k$-th output future trajectory is $\{(x^{k}_{i},y^{k}_{i}) \mid M<i\leq M+T\}$. We first apply a time step shift $s$ to the input for temporal consistency.

The input history frames then become $\{I_{i} \mid 1+s\leq i\leq M+s\}$, and we apply the same network to the shifted history with the surrounding HDMap information to generate the $k$-th output trajectory $\{(x^{\prime k}_{i},y^{\prime k}_{i}) \mid M+s<i\leq M+s+T\}$. When $s$ is small, the driving intention or behavior stays stable for a short period. Since both trajectories have $T-s$ overlapping waypoints, they should be as close as possible and share consensus. Thus, we can construct self-constraints for a single scenario input due to the streaming property of the input data. Fig.2 demonstrates the overall idea of the temporal consistency constraint.
Trajectory Matching: Since we predict $K$ future trajectories to deal with the multi-modality, it is crucial to consider the matching relationship between the original predictions and the time-shifted predictions when applying the temporal consistency alignment. For a matching problem, the similarity metric and the matching strategy are the two key factors. Several measures can quantify the difference between trajectories, such as Average Displacement Error (ADE) and Final Displacement Error (FDE). We utilize FDE as the criterion, since the final-position error partially reflects the similarity with less bias from averaging than ADE.
Matching Strategy: There are roughly four matching schemes: forward matching, backward matching, bidirectional matching, and Hungarian matching. Forward matching takes one trajectory in the current frame and finds its corresponding trajectory in the next frame with the least cost or maximum similarity. Backward matching is the reverse of forward matching. Further, bidirectional matching combines forward and backward matching, which considers the dual relationship. Hungarian matching is an optimal matching solution based on linear assignment. Forward and backward matching consider only the one-way situation, which is sensitive to noise and unstable; Hungarian matching places high demands on the choice of cost function. Based on these observations, we choose bidirectional matching as our strategy. We also show its advantages over the other approaches in Sec.4.3. After obtaining the optimal matching pairs $\{(m_{k},n_{k}) \mid 1\leq k\leq K\}$, we compute the consistency constraint with a simple smooth $L_{1}$ loss[40] $\mathcal{L}_{Huber}$:
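The bidirectional strategy can be sketched as follows: build an FDE cost matrix from trajectory endpoints, run forward and backward nearest-neighbor matching, and keep only the mutually agreed pairs. The toy trajectories with well-separated endpoints are hypothetical inputs.

```python
import numpy as np

def bidirectional_match(traj_a, traj_b):
    """traj_a, traj_b: (K, T, 2) arrays. Returns mutually matched (i, j) pairs."""
    # FDE cost matrix: distance between trajectory endpoints.
    cost = np.linalg.norm(traj_a[:, -1, None, :] - traj_b[None, :, -1, :], axis=-1)
    fwd = cost.argmin(axis=1)   # forward: best column for each row
    bwd = cost.argmin(axis=0)   # backward: best row for each column
    # Keep only pairs on which both directions agree.
    return [(i, int(fwd[i])) for i in range(len(traj_a)) if bwd[fwd[i]] == i]

# Toy predictions: K = 4 trajectories with well-separated endpoints.
K, T = 4, 30
a = np.zeros((K, T, 2))
a[:, -1, 0] = np.arange(K) * 10.0   # endpoints at x = 0, 10, 20, 30
b = a[::-1] + 0.01                  # same set, reversed order, slightly jittered
pairs = bidirectional_match(a, b)
```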
$$\mathcal{L}_{temp}=\sum_{k=1}^{K}\sum_{t=s+1}^{T}\mathcal{L}_{Huber}\big((x_{t}^{m_{k}},y_{t}^{m_{k}}),(x^{\prime\, n_{k}}_{t-s},y^{\prime\, n_{k}}_{t-s})\big).\quad(3)$$
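A minimal sketch of the temporal-consistency loss in Eq. (3), applying an elementwise Huber (smooth-$L_1$) penalty over the $T-s$ overlapping waypoints of each matched pair; the straight-line toy trajectories and the `huber` helper are illustrative assumptions.

```python
import numpy as np

def huber(err, delta=1.0):
    """Elementwise Huber (smooth-L1) loss."""
    abs_err = np.abs(err)
    return np.where(abs_err < delta, 0.5 * abs_err**2, delta * (abs_err - 0.5 * delta))

def temporal_consistency_loss(traj, traj_shift, pairs, s):
    """traj, traj_shift: (K, T, 2); pairs: matched (m_k, n_k); s: time shift.
    Penalizes disagreement on the T - s waypoints both windows predict."""
    T = traj.shape[1]
    loss = 0.0
    for m, n in pairs:
        diff = traj[m, s:, :] - traj_shift[n, : T - s, :]
        loss += huber(diff).sum()
    return loss

# Toy case: a straight path predicted from two windows shifted by s steps.
T, s = 30, 2
base = np.linspace(0.0, 10.0, T)[:, None] * np.ones((1, 2))
traj = base[None]                                              # original window, K = 1
shifted = np.concatenate([base[s:], np.zeros((s, 2))])[None]   # shifted window
l_temp = temporal_consistency_loss(traj, shifted, [(0, 0)], s)
```

Since the two windows agree exactly on their shared waypoints in this toy case, the loss is zero; in training, the penalty pulls the two predictions toward such agreement.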
3.2.2 Spatial Consistency
Since our MISC is a two-stage framework whose second stage mainly aims at trajectory refinement, it is more convenient to add spatial perturbation in the second stage at less computational cost. First, we apply a spatial transformation function $Z$, including flipping and random noise, to the trajectories from the first stage. The refinement module processes these augmented inputs and generates the offsets to the ground truth and the classification scores. Under small spatial perturbation and disturbance, we assume that the outputs of the network should be self-consistent, meaning that the outputs have strong stability and tolerance to noise. Compared with data augmentation, this is an explicit regularization. The spatial consistency constraint $\mathcal{L}_{spa}$ is then defined as follows:
$$\mathcal{L}_{spa}=\mathcal{L}_{Huber}\big(\Delta_{\tau_{reg}},\,Z^{-1}(Reg(Z(\tau_{reg},\tau_{history})))\big).\quad(4)$$
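Eq. (4) can be sketched as below, with a left-right flip as the transformation $Z$ (its own inverse) and a hypothetical `reg` stand-in for the learned refinement head. Because this stand-in happens to be exactly flip-equivariant, the consistency loss is zero; a real network would incur a nonzero penalty that the constraint drives down.

```python
import numpy as np

def huber(err, delta=1.0):
    """Elementwise Huber (smooth-L1) loss."""
    abs_err = np.abs(err)
    return np.where(abs_err < delta, 0.5 * abs_err**2, delta * (abs_err - 0.5 * delta))

def flip_x(traj):
    """Z: mirror trajectories about the y-axis; this Z is its own inverse."""
    out = traj.copy()
    out[..., 0] *= -1.0
    return out

def reg(traj):
    """Hypothetical stand-in for the refinement head (flip-equivariant)."""
    return 0.1 * traj

tau = np.random.default_rng(2).normal(size=(6, 30, 2))
delta_plain = reg(tau)                          # Reg(tau)
delta_consist = flip_x(reg(flip_x(tau)))        # Z^{-1}(Reg(Z(tau)))
l_spa = huber(delta_plain - delta_consist).sum()
```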
Then the total loss for the Dual Consistency Constraints module is $\mathcal{L}_{cons}=\mathcal{L}_{spa}+\mathcal{L}_{temp}$.
3.3 Teacher-Target Constraints
Teacher-Target Constraints enforce intra-model consistency, which not only leverages the power of knowledge distillation but also helps alleviate the multi-modality supervision problem. Existing datasets[9, 49] provide only a single ground-truth trajectory for the target agent to be predicted in each scenario. To encourage the multi-modality of models, the winner-takes-all (WTA) strategy is commonly used to prevent the model from collapsing into a single mode. However, the WTA training strategy suffers from instability associated with network initialization. Some approaches[4, 35] introduce robust estimation methods to select better hypotheses; to some extent, these methods only implicitly model the multi-modality. Others[4, 58] generate several possible future trajectories based on a kinematics model and road graph topology. DenseTNT[18] only uses teacher labels to predict goal sets through a hill-climbing algorithm. These optimization methods tend to impose strict constraints and handcrafted prior knowledge, resulting in inaccurate teacher targets and inferior performance. In contrast, our approach generates more accurate teacher targets that provide explicit multi-modality supervision through self-ensembling, leveraging the power of semi-supervised learning and knowledge distillation.
Figure 3: The overall procedure for the teacher-target generation. We obtain multiple predictions from the outputs of different models for the target agents in each scenario; then we apply the K-means clustering algorithm to ensemble the trajectories.
Teacher-Target Generation. The key part of our approach lies in generating more accurate teacher labels for each agent. It is straightforward to apply model ensembling techniques[20, 24, 50] to obtain more powerful predictions. Compared with previous works[4, 7, 58], we do not rely on handcrafted anchor trajectory sampling, which is based on inaccurate prior knowledge such as motion estimation. Meanwhile, soft targets from ensembling can better finetune the predictions and reduce the gradient variance for better training convergence. As suggested in prior works[10, 37], the prediction error decreases under ensembling once the models are diverse enough. Therefore, we apply the k-means algorithm[31] to the predicted trajectories collected from different training runs (for example, launched with different random seeds or optimized with different learning rates) of MISC without Teacher-Target Constraints, generating $J$ trajectories with corresponding scores for each scenario. Fig.3 shows the overall process of our approach. Together with the original ground-truth label, we formulate $J+1$ target trajectories as follows:
$$\tau_{conf}=\{c_{0},c_{1},\dots,c_{J}\},\quad(5)$$

$$\tau_{tgt}^{j}=\{(x_{1}^{tgt_{j}},y_{1}^{tgt_{j}}),(x_{2}^{tgt_{j}},y_{2}^{tgt_{j}}),\dots,(x_{T}^{tgt_{j}},y_{T}^{tgt_{j}})\},\quad(6)$$
where $\tau_{tgt}^{j}$ is the $j$-th target trajectory with score $c_{j}$, among the $J+1$ target trajectories. To simplify the notation, $\tau_{tgt}^{0}$ is the ground-truth trajectory, with $c_{0}$ set to $1$.
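The teacher-target generation step can be sketched with a tiny k-means over flattened ensemble trajectories. The clustering below and the use of cluster sizes as confidence scores are simplified assumptions; the paper's exact score computation is not reproduced here.

```python
import numpy as np

def kmeans_trajectories(trajs, J, iters=20, seed=0):
    """Cluster (N, T, 2) ensemble predictions into J teacher trajectories."""
    N, T, _ = trajs.shape
    flat = trajs.reshape(N, -1)
    rng = np.random.default_rng(seed)
    centers = flat[rng.choice(N, size=J, replace=False)]
    for _ in range(iters):
        # Assign each trajectory to its nearest center, then recompute means.
        d = np.linalg.norm(flat[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(J):
            if (assign == j).any():
                centers[j] = flat[assign == j].mean(axis=0)
    # Crude stand-in for teacher confidences: relative cluster sizes.
    conf = np.bincount(assign, minlength=J) / N
    return centers.reshape(J, T, 2), conf

# Two clearly separated behaviour modes (going straight vs. turning) as toy input.
T = 30
ramp = np.linspace(0.0, 10.0, T)[:, None]
straight = np.tile(ramp * [1.0, 0.0], (8, 1, 1))
turn = np.tile(ramp * [0.0, 1.0], (8, 1, 1))
targets, conf = kmeans_trajectories(np.concatenate([straight, turn]), J=2)
```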
3.4 Learning
The total supervision of our MISC can be decoupled into several parts, as described in the previous sections. For the regression and classification losses, we loop over all $J+1$ possible targets $\tau_{tgt}$. For each target $\tau_{tgt}^{j}$ with confidence $c_j$, we apply the WTA strategy described in Sec. 3.3. Suppose the $k^{*}$-th trajectory from the trajectory refinement output $\tau_{reg'}$ is the best trajectory, i.e., the one with the maximum similarity to target $\tau_{tgt}^{j}$; the classification loss and regression loss are then defined as:
$$\mathcal{L}_{cls}^{j} = \frac{1}{K}\sum_{k=1}^{K} c_j \, \mathcal{L}_{Huber}(c^{k}, c^{k^{*}}), \tag{7}$$

$$\mathcal{L}_{reg}^{j} = \frac{1}{T}\sum_{t=1}^{T} c_j \, \mathcal{L}_{Huber}\big((x_t^{k^{*}}, y_t^{k^{*}}), (x_t^{tgt_j}, y_t^{tgt_j})\big). \tag{8}$$
For the classification loss, we adopt the displacement prediction idea from TPCN [55] to alleviate the hard-assignment phenomenon; the predicted displacement is converted into a probability distribution over modes with the standard softmin function. Since we have both trajectory completion and refinement modules, the regression loss is $\mathcal{L}_{reg} = \sum_{j=0}^{J}(\mathcal{L}_{reg}^{j} + \mathcal{L}_{\Delta reg}^{j})$, where $\mathcal{L}_{\Delta reg}^{j}$ is the regression loss of the refinement module. The final loss is $\mathcal{L} = \mathcal{L}_{reg} + \mathcal{L}_{cls} + \mathcal{L}_{cons}$.
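The per-target WTA assignment of Eqs. (7)-(8) can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's actual implementation: the similarity metric (endpoint distance), the Huber threshold, and the exact shape of the softmin targets are assumptions for the example.

```python
import numpy as np

def huber(x, y, delta=1.0):
    """Elementwise Huber loss between predictions x and targets y."""
    d = np.abs(x - y)
    return np.where(d <= delta, 0.5 * d ** 2, delta * (d - 0.5 * delta))

def softmin_scores(fde):
    """Turn per-mode final displacements into soft score targets (softmin)."""
    z = np.exp(-(fde - fde.min()))
    return z / z.sum()

def wta_losses(pred_traj, pred_conf, tgt_traj, tgt_conf):
    """One target's contribution to Eqs. (7)-(8).
    pred_traj: (K, T, 2) predicted modes; pred_conf: (K,) predicted scores;
    tgt_traj: (T, 2) one teacher/ground-truth target; tgt_conf: its weight c_j."""
    # Winner-takes-all: pick the mode closest to the target at the endpoint.
    fde = np.linalg.norm(pred_traj[:, -1] - tgt_traj[-1], axis=-1)
    k_star = int(np.argmin(fde))
    # Soft classification targets derived from displacement via softmin.
    c_star = softmin_scores(fde)
    l_cls = tgt_conf * huber(pred_conf, c_star).mean()
    l_reg = tgt_conf * huber(pred_traj[k_star], tgt_traj).sum(-1).mean()
    return l_cls, l_reg
```

Summing `l_cls` and `l_reg` over the $J+1$ targets (with $c_0 = 1$ for the ground truth) gives the supervision described above.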
4 Experiments
We conduct experiments on the Argoverse dataset [9], one of the largest publicly available motion forecasting datasets, and compare our MISC with other state-of-the-art methods. Furthermore, we provide ablation studies to evaluate the effectiveness and generalization ability of each proposed module and examine key hyperparameter choices.
4.1 Experimental Setup
Dataset. Argoverse [9] provides more than 300K scenarios with rich HD map information. For each scenario, objects are divided into three types: agent, AV, and others, where “agent” is the object to be predicted. Each scenario contains 50 frames sampled at 10 Hz, so the time interval between successive frames is 0.1 s. The dataset is split into training, validation, and test sets with 205,942, 39,472, and 78,143 sequences, respectively. The Waymo Open Motion Dataset (WOD) contains multiple types of agents, including vehicles, pedestrians, and cyclists; it provides more than 100,000 segments covering over 1,500 km of roadway.
Metrics. We use the standard evaluation metrics ADE and FDE. ADE is the average displacement error between the ground-truth and predicted trajectories over all time steps; FDE is the displacement error at the last time step. We predict $K$ candidate trajectories for each scenario and compute the metrics against the ground-truth labels: minADE and minFDE are the minimum ADE and FDE over the top $K$ predictions. We also report the miss rate (MR), the percentage of scenarios whose best-predicted trajectory has an FDE exceeding a threshold (2 m). Brier-minFDE is minFDE plus $(1-p)^{2}$, where $p$ is the probability of the corresponding trajectory. Metrics for $K=1$ and $K=6$ are used in our experiments. Note that Brier-minFDE$_6$ is the ranking metric.
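For concreteness, these metrics can be computed per scenario as below. This is a hedged sketch: the convention of selecting the best mode by endpoint error (as Argoverse does) and the 2 m miss threshold are taken from the definitions above; MR would be averaged over scenarios.

```python
import numpy as np

def forecasting_metrics(preds, probs, gt, miss_threshold=2.0):
    """preds: (K, T, 2) candidate trajectories, probs: (K,) confidences,
    gt: (T, 2) ground truth. Returns minADE, minFDE, miss flag (0/1),
    and Brier-minFDE for a single scenario."""
    ade = np.linalg.norm(preds - gt, axis=-1).mean(axis=-1)   # (K,)
    fde = np.linalg.norm(preds[:, -1] - gt[-1], axis=-1)      # (K,)
    best = int(np.argmin(fde))        # best mode chosen by endpoint error
    min_ade = float(ade[best])
    min_fde = float(fde[best])
    miss = float(min_fde > miss_threshold)   # averaged over scenarios -> MR
    brier_min_fde = min_fde + (1.0 - float(probs[best])) ** 2
    return min_ade, min_fde, miss, brier_min_fde
```

Note how Brier-minFDE couples displacement with the confidence of the chosen mode: a perfect trajectory with probability 0.8 still incurs a penalty of $(1-0.8)^2 = 0.04$.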
Experimental Details. During training we apply data augmentation, including random flipping with probability 0.5 and global random scaling with a ratio sampled from $[0.8, 1.25]$. For the model settings, the time shift $s$ for the temporal consistency constraint is set to 1. We adopt $K=6$ to generate 6 trajectories and use $J=6$ teacher targets for each scenario. Furthermore, we choose bidirectional matching for the temporal consistency constraint. Due to computation-resource limits, we use 10 models for ensembling. Further training details are included in the supplementary materials.
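The augmentation described above can be sketched as follows. Which axis is mirrored by the flip is an assumption for the example (the paper does not specify it in this section); the probabilities and scale range match the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(points, flip_prob=0.5, scale_range=(0.8, 1.25)):
    """Random flip and global uniform scaling applied to all scene
    coordinates (trajectories and map points alike).
    points: (..., 2) array of xy coordinates."""
    out = points.astype(float).copy()
    if rng.random() < flip_prob:
        out[..., 1] *= -1.0           # mirror the scene (assumed y-axis flip)
    out *= rng.uniform(*scale_range)  # global scaling ratio in [0.8, 1.25]
    return out
```

Because the same transform is applied to every coordinate in the scene, relative geometry between agents and the map is preserved.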
Table 1: Detailed results of our MISC and other top-performing approaches on the Argoverse test set. b-FDE$_6$ abbreviates Brier-minFDE$_6$.
4.2 Experimental Results
4.2.1 Results on Argoverse Dataset
Argoverse Leaderboard Results. We provide detailed quantitative results of our MISC and public state-of-the-art methods on the Argoverse test set in Tab. 1. Compared with previous methods, MISC improves all evaluation metrics except MR$_6$ by a large margin. Furthermore, since the proposed modules are all general training components, other existing motion forecasting models can also benefit greatly from these strategies.
Figure 4: The past trajectory is in yellow, the predicted trajectory in green, and the ground truth in red. The top row shows results without consistency; the bottom row shows results with consistency.
Qualitative Results. We also present some qualitative results on the Argoverse validation set in Fig.4. Compared with results without consistency, the Dual Consistency Constraints improve both the quality and smoothness of the predicted trajectories significantly, resulting in more feasible and stable results despite the input noise.
4.2.2 Results on Waymo Open Motion Dataset
We provide quantitative results on the validation set of the Waymo Open Motion Dataset prediction task [12] in Tab. 2. Compared with KEMP [30] and SceneTransformer [36], we achieve very promising results with comparable improvements, demonstrating the effectiveness of our approach. Additional ablation studies on WOD are provided in the supplementary materials.
Table 2: Quantitative results on the validation set of the Waymo Open dataset motion prediction task.
4.3 Ablation Studies
Table 3: Ablation study results of the proposed modules. “Goal” refers to trajectory completion with goal prediction, “Ref.” to the trajectory refinement module, and “Temp.” to temporal consistency. TTC refers to the Teacher-Target Constraints during training.
Table 4: Ablation study on the matching factor for temporal consistency. In this experiment, we remove the Teacher-Target Constraints to study the effect fairly.
Table 5: Ablation study results on the teacher target number $J$.
Table 6: Ablation study of consistency constraints and Teacher-Target Constraints applied to different state-of-the-art methods on the Argoverse validation set. Performance for methods without constraints is taken from the corresponding papers or our reproduction.
Component Study. As shown in Tab. 3, we conduct an ablation study of MISC on the Argoverse validation set to evaluate the effectiveness of each proposed component. We adopt TPCN [55] as the baseline (first row of Tab. 3) and add the proposed components progressively. The architecture modifications from goal-set prediction and the trajectory refinement module yield promising improvements of about 2%. The Dual Consistency Constraints bring the largest improvements, more than 5% across all evaluation metrics. In particular, for minFDE$_1$, temporal consistency reduces the error by 20 cm, indicating that the temporal constraints improve both final-position and trajectory-probability prediction. Compared with temporal consistency, spatial consistency has a smaller effect, since we only enforce this constraint in the trajectory refinement stage. Finally, the Teacher-Target Constraints significantly increase performance, manifesting their effectiveness in helping training convergence.
Temporal Consistency Factors. We study the factors in the matching problem, including the similarity metric and the matching strategy. As shown in Tab. 4, both Hungarian and bidirectional matching show advantages over single-direction matching. Although Hungarian matching ensures a one-to-one matching relationship, it is sensitive to the similarity metric and to numerical precision, both of which are unstable in the early training stage. In contrast, bidirectional matching with the FDE similarity metric achieves nearly the best results across all evaluation metrics. We also conduct experiments to find the best time-shift value $s$ for the temporal consistency; details can be found in the supplement.
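Bidirectional matching can be read as mutual-nearest-neighbour matching between the two prediction sets. The sketch below is an illustration under that reading, using endpoint (FDE-style) distance as the similarity metric; the paper's exact implementation may differ.

```python
import numpy as np

def bidirectional_match(traj_a, traj_b):
    """Mutual-nearest-neighbour matching between two sets of K predicted
    trajectories (e.g. predictions from inputs shifted by s frames), using
    endpoint (FDE-style) distance as similarity.
    traj_a, traj_b: (K, T, 2). Returns matched (i, j) index pairs."""
    dist = np.linalg.norm(
        traj_a[:, -1][:, None] - traj_b[:, -1][None], axis=-1)  # (K, K)
    a2b = dist.argmin(axis=1)   # best b-mode for each a-mode
    b2a = dist.argmin(axis=0)   # best a-mode for each b-mode
    # Keep a pair only if the choice is mutual; unmatched modes are dropped,
    # which avoids the unstable one-to-one assignment of Hungarian matching.
    return [(i, int(a2b[i])) for i in range(len(a2b)) if b2a[a2b[i]] == i]
```

Unlike Hungarian matching, a pair survives only when both directions agree, so a noisy mode early in training simply goes unmatched instead of forcing a bad assignment.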
Figure 5: The L2 distance as a function of the time shift $s$.
Reduction in Temporal Inconsistency. We use the average L2 distance over all predicted trajectory waypoints to measure temporal consistency. As shown in Fig. 5, the model without temporal consistency exhibits large inconsistency even when the time shift $s$ is small, which may lead to unstable behavior in downstream tasks such as planning. With the temporal consistency constraints, the L2 distance divergence improves significantly, demonstrating the effectiveness of our method.
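The inconsistency measure amounts to comparing the overlapping waypoints of two predictions made $s$ frames apart. A minimal sketch, assuming the shifted prediction covers the same horizon length and shares $T-s$ waypoints with the earlier one:

```python
import numpy as np

def temporal_inconsistency(traj_t, traj_ts, s):
    """Average L2 distance between the overlapping waypoints of the
    trajectory predicted at frame t and the one predicted at frame t+s.
    traj_t, traj_ts: (T, 2) predictions; s: time shift in frames."""
    overlap = traj_t[s:]  # waypoints the two predictions both cover
    return float(np.linalg.norm(
        overlap - traj_ts[: len(overlap)], axis=-1).mean())
```

A perfectly temporally consistent predictor would score 0 under this measure regardless of $s$.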
Number of Teacher Targets. As shown in Tab. 5, more teacher targets bring better performance: compared with $J=1$, 6 teacher targets bring nearly an extra 1% improvement. However, the marginal improvement then decreases significantly, so we finally choose $J=6$.
4.4 Generalization Capability
To verify the generalization capability of the Dual Consistency Constraints and Teacher-Target Constraints, we also apply them to different models with state-of-the-art performance to show that they work as plug-in training schemes.
Consistency Component. As shown in Tab. 6, our Dual Consistency Constraints effectively improve model performance during training regardless of the underlying representation, with a noticeable improvement of over 5% on every metric, especially minFDE.
Teacher Target. The Teacher-Target Constraints are another general training technique that can be widely used in other frameworks. In Tab. 6, we also verify their effectiveness on other public methods: adding Teacher-Target Constraints yields nearly over 3% improvement on all metrics. For DenseTNT [18], we replace its original handcrafted optimization for teacher goal targets with our self-ensembling teacher targets. This strategy brings an over 5% performance increase, demonstrating that the self-ensembling teacher targets are of higher quality than handcrafted optimization and estimation.
5 Conclusion
In this work, we propose MISC, an effective framework for the motion forecasting task. We impose inner-model dual consistency regularization in both the spatial and temporal domains to leverage the potential of self-supervision, which has been overlooked by previous efforts. In addition, we explicitly model multi-modality by providing supervision and constraints with powerful self-ensembling techniques from an intra-model perspective. Experimental results on the Argoverse motion forecasting dataset and the Waymo dataset show the effectiveness of our approach and its generalization capability to other methods.
References
- [1] Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. Advances in neural information processing systems, 27:3365–3373, 2014.
- [2] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. In Robotics: Science and Systems XV, 2019.
- [3] Prarthana Bhattacharyya, Chengjie Huang, and Krzysztof Czarnecki. Ssl-lanes: Self-supervised learning for motion forecasting in autonomous driving. arXiv preprint arXiv:2206.14116, 2022.
- [4] Antonia Breuer, Quy Le Xuan, Jan-Aike Termöhlen, Silviu Homoceanu, and Tim Fingscheidt. Quo vadis? meaningful multiple trajectory hypotheses prediction in autonomous driving. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pages 637–644. IEEE, 2021.
- [5] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.
- [6] Sergio Casas, Wenjie Luo, and Raquel Urtasun. Intentnet: Learning to predict intention from raw sensor data. In Conference on Robot Learning, pages 947–956, 2018.
- [7] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449, 2019.
- [8] Titas Chakraborty, Akshay Bhagat, and Henggang Cui. Improving motion forecasting for autonomous driving with the cycle consistency loss. arXiv preprint arXiv:2211.00149, 2022.
- [9] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8748–8757, 2019.
- [10] Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
- [11] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
- [12] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R Qi, Yin Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9710–9719, 2021.
- [13] Haoyang Fan, Fan Zhu, Changchun Liu, Liangliang Zhang, Li Zhuang, Dong Li, Weicheng Zhu, Jiangtao Hu, Hongye Li, and Qi Kong. Baidu apollo em motion planner. arXiv preprint arXiv:1807.08048, 2018.
- [14] Peter Földiák. Learning invariance from transformation sequences. Neural computation, 3(2):194–200, 1991.
- [15] Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, and Cordelia Schmid. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11525–11533, 2020.
- [16] Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien Moutarde. Home: Heatmap output for future motion estimation. arXiv preprint arXiv:2105.10968, 2021.
- [17] Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien Moutarde. Gohome: Graph-oriented heatmap output for future motion estimation. In 2022 International Conference on Robotics and Automation (ICRA), pages 9107–9114. IEEE, 2022.
- [18] Junru Gu, Chen Sun, and Hang Zhao. Densetnt: End-to-end trajectory prediction from dense goal sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15303–15312, 2021.
- [19] Abner Guzman-Rivera, Dhruv Batra, and Pushmeet Kohli. Multiple choice learning: Learning to produce multiple structured outputs. Advances in neural information processing systems, 25, 2012.
- [20] Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11873–11882, 2020.
- [21] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
- [22] Adam Houenou, Philippe Bonnifait, Véronique Cherfaoui, and Wen Yao. Vehicle trajectory prediction based on motion model and maneuver recognition. In 2013 IEEE/RSJ international conference on intelligent robots and systems, pages 4363–4369. IEEE, 2013.
- [23] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. J. Basic Eng, 82(1):35–45, 1960.
- [24] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
- [25] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896, 2013.
- [26] Stefan Lee, Senthil Purushwalkam Shiva Prakash, Michael Cogswell, Viresh Ranjan, David Crandall, and Dhruv Batra. Stochastic multiple choice learning for training diverse deep ensembles. In Advances in Neural Information Processing Systems, pages 2119–2127, 2016.
- [27] Chenyang Lei, Yazhou Xing, and Qifeng Chen. Blind video temporal consistency via deep video prior. Advances in Neural Information Processing Systems, 33, 2020.
- [28] Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, and Raquel Urtasun. Learning lane graph representations for motion forecasting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 541–556, 2020.
- [29] Yicheng Liu, Jinghuai Zhang, Liangji Fang, Qinhong Jiang, and Bolei Zhou. Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7577–7586, 2021.
- [30] Qiujing Lu, Weiqiao Han, Jeffrey Ling, Minfa Wang, Haoyu Chen, Balakrishnan Varadarajan, and Paul Covington. Kemp: Keyframe-based hierarchical end-to-end deep model for long-term trajectory prediction. arXiv preprint arXiv:2205.04624, 2022.
- [31] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
- [32] Osama Makansi, Eddy Ilg, Ozgun Cicek, and Thomas Brox. Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7144–7153, 2019.
- [33] Karttikeya Mangalam, Harshayu Girase, Shreyas Agarwal, Kuan-Hui Lee, Ehsan Adeli, Jitendra Malik, and Adrien Gaidon. It is not the journey but the destination: Endpoint conditioned trajectory prediction. arXiv preprint arXiv:2004.02025, 2020.
- [34] Jean Mercat, Thomas Gilles, Nicole El Zoghby, Guillaume Sandou, Dominique Beauvois, and Guillermo Pita Gil. Multi-head attention for multi-modal joint vehicle motion forecasting. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9638–9644. IEEE, 2020.
- [35] Sriram Narayanan, Ramin Moslemi, Francesco Pittaluga, Buyu Liu, and Manmohan Chandraker. Divide-and-conquer for lane-aware diverse trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15799–15808, 2021.
- [36] Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zhengdong Zhang, Hao-Tien Lewis Chiang, Jeffrey Ling, Rebecca Roelofs, Alex Bewley, Chenxi Liu, Ashish Venugopal, et al. Scene transformer: A unified multi-task model for behavior prediction and planning. arXiv preprint arXiv:2106.08417, 2021.
- [37] David Opitz and Richard Maclin. Popular ensemble methods: An empirical study. Journal of artificial intelligence research, 11:169–198, 1999.
- [38] Hao Ouyang, Tengfei Wang, and Qifeng Chen. Internal video inpainting by implicit long-range propagation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14579–14588, October 2021.
- [39] Tung Phan-Minh, Elena Corina Grigore, Freddy A Boulton, Oscar Beijbom, and Eric M Wolff. Covernet: Multimodal behavior prediction using trajectory sets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14074–14083, 2020.
- [40] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: towards real-time object detection with region proposal networks. In International Conference on Neural Information Processing Systems, 2015.
- [41] Christian Rupprecht, Iro Laina, Robert DiPietro, Maximilian Baust, Federico Tombari, Nassir Navab, and Gregory D Hager. Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In Proceedings of the IEEE International Conference on Computer Vision, pages 3591–3600, 2017.
- [42] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in neural information processing systems, 29:1163–1171, 2016.
- [43] Jens Schulz, Constantin Hubmann, Julian Löchner, and Darius Burschka. Interaction-aware probabilistic behavior prediction in urban environments. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3999–4006. IEEE, 2018.
- [44] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine, 30(3):83–98, 2013.
- [45] Patrice Simard, Bernard Victorri, Yann LeCun, and John Denker. Tangent prop-a formalism for specifying selected invariances in an adaptive network. Advances in neural information processing systems, 4, 1991.
- [46] Haoran Song, Di Luan, Wenchao Ding, Michael Y Wang, and Qifeng Chen. Learning to predict vehicle trajectories with model-based planning. In Conference on Robot Learning, pages 1035–1045. PMLR, 2021.
- [47] NN Sriram, Gourav Kumar, Abhay Singh, M Siva Karthik, Saket Saurav, Brojeshwar Bhowrnick, and K Madhava Krishna. A hierarchical network for diverse trajectory proposals. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 689–694. IEEE, 2019.
- [48] DiJia Andy Su, Bertrand Douillard, Rami Al-Rfou, Cheol Park, and Benjamin Sapp. Narrowing the coordinate-frame gap in behavior prediction models: Distillation for efficient and accurate scene-centric motion forecasting. In 2022 International Conference on Robotics and Automation (ICRA), pages 653–659. IEEE, 2022.
- [49] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.
- [50] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
- [51] Balakrishnan Varadarajan, Ahmed Hefny, Avikalp Srivastava, Khaled S Refaat, Nigamaa Nayakanti, Andre Cornman, Kan Chen, Bertrand Douillard, Chi Pang Lam, Dragomir Anguelov, et al. Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction. arXiv preprint arXiv:2111.14973, 2021.
- [52] Tengfei Wang, Jiaxin Xie, Wenxiu Sun, Qiong Yan, and Qifeng Chen. Dual-camera super-resolution with aligned attention modules. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2001–2010, October 2021.
- [53] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2566–2576, 2019.
- [54] Guotao Xie, Hongbo Gao, Lijun Qian, Bin Huang, Keqiang Li, and Jianqiang Wang. Vehicle trajectory prediction by integrating physics-and maneuver-based approaches using interactive multiple models. IEEE Transactions on Industrial Electronics, 65(7):5999–6008, 2017.
- [55] Maosheng Ye, Tongyi Cao, and Qifeng Chen. Tpcn: Temporal point cloud networks for motion forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11318–11327, 2021.
- [56] Wenyuan Zeng, Ming Liang, Renjie Liao, and Raquel Urtasun. Lanercnn: Distributed representations for graph-centric motion forecasting. arXiv preprint arXiv:2101.06653, 2021.
- [57] Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8660–8669, 2019.
- [58] Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, et al. Tnt: Target-driven trajectory prediction. arXiv preprint arXiv:2008.08294, 2020.
- [59] Wu Zheng, Weiliang Tang, Li Jiang, and Chi-Wing Fu. Se-ssd: Self-ensembling single-stage object detector from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14494–14503, 2021.
- [60] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1851–1858, 2017.
- [61] Zikang Zhou, Luyao Ye, Jianping Wang, Kui Wu, and Kejie Lu. Hivt: Hierarchical vector transformer for multi-agent motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8823–8833, 2022.
- [62] Julius Ziegler, Philipp Bender, Markus Schreiber, Henning Lategahn, Tobias Strauss, Christoph Stiller, Thao Dang, Uwe Franke, Nils Appenrodt, Christoph G Keller, et al. Making bertha drive—an autonomous journey on a historic route. IEEE Intelligent transportation systems magazine, 6(2):8–20, 2014.