Abstract

A multimodal deep learning model was developed to predict human driving behavior over a short time horizon. The training dataset was recorded specifically for this project.

Video 1 - Demo of the human driving behavior prediction.

More details about this work are provided in Pousseur and Victorino, IEEE ITSC 2022 (see References).

Introduction

Driving intention is represented as a sequence of vehicle states. Let $I$ be:

$$ I = \{x_0, x_1, ..., x_n\} \quad \text{Eq (1)} $$

Here, $x_i = (v_i, w_i)$ denotes the state at time $i$, where $v_i$ is the linear velocity and $w_i$ is the angular velocity.

Fig. 1 - Goal of the project.

The prediction model, $H_\Theta$, aims to produce:

$$ H_\Theta(X) = \{\hat{y}_{t+1},...,\hat{y}_{t+k}\} \quad \text{Eq (2)} $$

where

  • $ \Theta $ are the model parameters, $ X $ is the input vector at time $ t $
  • $ \hat{y}_{t+i} $ is the predicted state
  • $ (v_{t+i}, w_{t+i}) $ are the predicted velocities
  • $ dt $ is the acquisition interval in seconds.
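To make the indexing in Eq (2) concrete, a minimal sketch of the time offsets covered by the predicted states, assuming a hypothetical horizon of $k = 10$ steps and $dt = 0.1$ s:

```python
# Hypothetical values: k = 10 predicted steps at dt = 0.1 s
dt = 0.1
k = 10
# The predicted state at index t+i corresponds to time t + i * dt, i = 1..k
offsets = [round(i * dt, 2) for i in range(1, k + 1)]
print(offsets[-1])  # total prediction horizon in seconds
```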

Data and Model

Data Categories

Input data are grouped into three categories:

  • Vehicle state: velocities, accelerations, and other dynamics.
  • Environment state: map and Lidar data.
  • Control state: steering angle, pedal positions.

Fig. 2 - Data input.

Driving Scenarios

Behavior varies across scenarios such as highways, city streets, and roundabouts. The dataset includes both structured (lane-marked roads) and unstructured (urban intersections, roundabouts) environments to prevent bias.

Fig. 3 - Scenario categories.

Temporal Aspects

No manual labeling was required: the future velocities recorded in each sequence served directly as output targets. The acquisition interval is $dt = 100\,\mathrm{ms}$ (a 10 Hz rate). A sliding window over past states captures temporal dependencies, and data augmentation (time warping, resampling) improves generalization.
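The self-supervised setup above can be sketched as a sliding-window slicer that pairs each past window with the future states that follow it; `make_windows`, the window lengths, and the toy data are illustrative assumptions, not the project's actual pipeline:

```python
import numpy as np

def make_windows(states, past, future):
    """Slice a recorded state sequence into (input window, target window) pairs.

    states: array of shape (T, 2) with rows (v_i, w_i).
    past:   number of past states fed to the model.
    future: number of future states used as targets (k in Eq 2) -- no manual labels.
    """
    X, Y = [], []
    for t in range(past, len(states) - future + 1):
        X.append(states[t - past:t])    # past window ending just before time t
        Y.append(states[t:t + future])  # the future states become the targets
    return np.stack(X), np.stack(Y)

# Toy sequence: 100 states sampled at 10 Hz (hypothetical lengths)
demo = np.random.rand(100, 2)
X, Y = make_windows(demo, past=20, future=10)
```

Each recorded drive thus yields one training pair per time step, which is what makes a dedicated labeling pass unnecessary.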

Multi-Modal Model

The model consists of:

  1. Input model: compresses each modality into a latent vector.
  2. Final model: concatenates latent vectors and applies a recurrent network.

Fig. 4 - Handling unbalanced data representation.

The input model uses:

  • CNNs for image-based features (camera, Lidar projections).
  • Fully connected layers for numerical data.

The final model uses GRUs for sequence prediction.

Fig. 5 - Overall model architecture.
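The two-stage structure described above (per-modality encoders whose latent vectors are concatenated and fed to a GRU) can be sketched in PyTorch. All layer sizes, the single image modality, the 8-dimensional numerical input, and the 10-step horizon are assumptions for illustration, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

class MultiModalPredictor(nn.Module):
    """Sketch: CNN + FC encoders -> concatenated latent -> GRU -> k future states."""

    def __init__(self, img_channels=1, num_dim=8, latent=32, k=10):
        super().__init__()
        self.k = k
        # Input model, image branch: camera / Lidar projections
        self.cnn = nn.Sequential(
            nn.Conv2d(img_channels, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, latent),
        )
        # Input model, numerical branch: vehicle and control state
        self.mlp = nn.Sequential(nn.Linear(num_dim, latent), nn.ReLU())
        # Final model: recurrent network over the fused latent sequence
        self.gru = nn.GRU(2 * latent, 64, batch_first=True)
        self.head = nn.Linear(64, k * 2)  # k predicted (v, w) pairs

    def forward(self, imgs, nums):
        # imgs: (B, T, C, H, W); nums: (B, T, num_dim)
        B, T = imgs.shape[:2]
        z_img = self.cnn(imgs.flatten(0, 1)).view(B, T, -1)
        z_num = self.mlp(nums)
        z = torch.cat([z_img, z_num], dim=-1)  # concatenated latent vectors
        _, h = self.gru(z)                     # summary of the past window
        return self.head(h[-1]).view(B, self.k, 2)  # (B, k, 2) predicted states
```

Compressing each modality to the same latent size before concatenation keeps any single input (e.g. a large image) from dominating the fused representation.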

Results

Test Characteristics

| Test Name       | Roundabouts | Intersections | Speed Limit | Distance | Time Record | Lanes |
|-----------------|-------------|---------------|-------------|----------|-------------|-------|
| roundabout      | 6           | 1             | 70 km/h     | 4 km     | 378 s       | 2     |
| city            | 4           | 7             | 50 km/h     | 4 km     | 519 s       | 1     |
| speed (1 lane)  | 2           | 0             | 70 km/h     | 2 km     | 116 s       | 1     |
| speed (2 lanes) | 0           | 0             | 90 km/h     | 2 km     | 84 s        | 2     |

The following projections compare predicted and actual trajectories for different scenarios.

Fig. 6.1 - Test 1 Lane
Fig. 6.2 - Test 2 Lanes
Fig. 6.3 - Test City
Fig. 6.4 - Test Roundabout
Fig. 7 - Mean errors.

The error function is:

$$ loss(y_{predict},y_{true}) = w_{v} \cdot \sum_{i=0}^{n} | y_{predict,v,i} - y_{true,v,i} | + w_{w} \cdot \sum_{i=0}^{n} | y_{predict,w,i} - y_{true,w,i} | \quad \text{Eq (3)} $$

Where:

  • $w_v$ is the weight for linear velocity error.
  • $w_w$ is the weight for angular velocity error.
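Eq (3) can be written out directly as a weighted L1 loss over the two velocity channels; the function name and the demo weights are illustrative assumptions:

```python
import numpy as np

def weighted_l1_loss(y_pred, y_true, w_v=1.0, w_w=1.0):
    """Eq (3): weighted sum of absolute errors on linear (v) and angular (w) velocity.

    y_pred, y_true: arrays of shape (n, 2) with columns (v, w).
    w_v, w_w: hypothetical weights balancing the two error terms.
    """
    err = np.abs(y_pred - y_true)
    return w_v * err[:, 0].sum() + w_w * err[:, 1].sum()

# Toy example: two predicted steps
y_true = np.array([[1.0, 0.1], [1.2, 0.0]])
y_pred = np.array([[0.9, 0.2], [1.0, 0.1]])
loss = weighted_l1_loss(y_pred, y_true, w_v=1.0, w_w=2.0)  # 0.3 + 2 * 0.2 = 0.7
```

Because angular velocities are typically much smaller in magnitude than linear ones, the weights let the two terms contribute comparably to the gradient.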

Example: City Test

Fig. 8.1 - Predicted linear velocity
Fig. 8.2 - Predicted angular velocity in city scenario

Sensitivity Analysis

The impact of each input was measured by neutralizing it and observing the change in error. This shows which inputs most influence predictions.
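The neutralize-and-measure procedure can be sketched as a loop over the input dictionary; `sensitivity`, the zero neutral value, and the toy error function are assumptions standing in for the real model evaluation:

```python
import numpy as np

def sensitivity(model_error, inputs, neutral_value=0.0):
    """For each input, replace it with a neutral value and record the error change.

    model_error: callable mapping a dict of input arrays to a scalar error (stand-in
                 for evaluating the trained model on the test set).
    inputs: dict of input name -> array.
    """
    baseline = model_error(inputs)
    deltas = {}
    for name in inputs:
        blinded = dict(inputs)
        blinded[name] = np.full_like(inputs[name], neutral_value)  # neutralize one input
        deltas[name] = model_error(blinded) - baseline
    return deltas

# Toy stand-in model: error grows when an input the "model" relies on is removed
def toy_error(inp):
    pred = inp["a"].mean() + inp["b"].mean()
    return abs(pred - 3.0)  # hypothetical target

deltas = sensitivity(toy_error, {"a": 2 * np.ones(4), "b": np.ones(4)})
# A larger error increase means the model relies more on that input
```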

Fig. 9 - Sensitivity analysis motivation.
Fig. 10 - Effect of blind spots.
Fig. 11 - Sensitivity analysis results.

References

  • Prediction of human driving behavior using deep learning: a recurrent learning structure
    Hugo Pousseur, Alessandro Correa Victorino
    IEEE ITSC 2022