Nexar Dashcam Crash Prediction Challenge - Overview and Baseline Notebook
Table of Contents
(Kaggle) 🚗💥 Nexar Dashcam Crash Prediction Challenge (Concluded May 4th, 2025)
(Arxiv) Nexar Dashcam Collision Prediction Dataset and Challenge
Task and Dataset Summary
Given 1280x720@30FPS dashcam footage, the model must raise a warning as soon as a risk of collision or near miss becomes detectable. This amounts to predicting a risk signal from sequences of frames, a task also known as TTC (time-to-collision) estimation.
The training set contains 1,500 videos, each roughly 40 seconds long. Positive cases, which make up half of the set, are annotated with a warning timestamp (the moment the risk becomes detectable) and an event timestamp (the time of the collision or near miss), and include the full footage from before to after the event. The other half is normal driving footage.
Test videos are much shorter than the training videos and generally do not include footage after the event, only the moments right before an event occurs, if any. The test data is therefore about detecting collision risk and issuing a warning as early as possible. There are 1,344 test videos; at the time of writing, their labels are not available (check the Kaggle and Nexar sources).
The task involves image understanding, temporal modeling, and possibly latent reasoning, making it a genuinely complex computer vision application.
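For orientation, here is a minimal sketch of peeking at the training metadata; the file path and column names are the ones used later in the baseline notebook and should be treated as assumptions about the Kaggle dataset layout.
import pandas as pd
# Quick look at the training metadata (path as used later in the baseline notebook).
train_meta = pd.read_csv('/kaggle/input/nexar-collision-prediction/train.csv')
# Expected columns: 'id' (video file name), 'time_of_alert' (warning timestamp),
# 'time_of_event' (collision/near-miss timestamp) and 'target' (1 = positive case).
print(train_meta.columns.tolist())
print(train_meta['target'].value_counts())  # roughly balanced positive vs. normal driving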
Notebook Reviews
Baseline: InceptionV3 + Farneback & Ensemble BiLSTM
(Kaggle) 💥Nexar DCP Challenge - Baseline 💡
# Core Imports
import numpy as np
import pandas as pd
import cv2
import os
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import joblib
from joblib import Parallel, delayed
from scipy.stats import uniform, randint
import warnings
warnings.filterwarnings('ignore')
import keras_tuner as kt
# Deep Learning
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, LearningRateScheduler
from tensorflow.keras.applications.inception_v3 import preprocess_input
from tensorflow.keras.applications import EfficientNetB0
# Machine Learning
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import RandomizedSearchCV
Enhanced Frame Sampling & Feature Extraction
def extract_critical_frames(video_path, alert_time, event_time, num_frames=8, sampling_interval=30):
"""Optimized frame extraction without redundant capture"""
cap = cv2.VideoCapture(video_path)
frames = []
frame_count = 0
while True:
ret, frame = cap.read()
if not ret:
break
if frame_count % sampling_interval == 0:
frame = cv2.resize(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB), (224, 224))
frames.append(frame)
if len(frames) >= num_frames:
break
frame_count += 1
cap.release()
return np.array(frames[:num_frames]) # Return exactly num_frames
def calculate_optical_flow(frames):
"""Calculate dense optical flow between consecutive frames"""
flows = []
prev_gray = cv2.cvtColor(frames[0], cv2.COLOR_RGB2GRAY)
for frame in frames[1:]:
gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
flows.append(np.linalg.norm(flow, axis=2))
prev_gray = gray
return np.array(flows)
extract_critical_frames downsamples the video at video_path and returns up to num_frames 224x224 RGB images, resized using bilinear interpolation (OpenCV's default). alert_time and event_time don't seem to be in use and could be placeholders for further improvements.
calculate_optical_flow uses the magnitude of the optical flow maps returned by the Farneback method (pseudo: Farneback(prev_frame, current_frame) -> size [H, W, 2]). It returns num_frames - 1 magnitude maps of shape [H, W], stacked into an array of shape [num_frames - 1, H, W].
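Since alert_time and event_time are currently unused, one possible refinement is to let alert_time drive the sampling and seek to the frames leading up to the warning. The sketch below is not part of the baseline; the constant-30-FPS assumption and the function name extract_frames_around_alert are mine, and normal-driving videos (alert_time of 0) would still need the original sampling as a fallback.
import cv2
import numpy as np
def extract_frames_around_alert(video_path, alert_time, num_frames=8, fps=30):
    """Hypothetical variant: sample frames ending at the alert timestamp."""
    cap = cv2.VideoCapture(video_path)
    alert_frame = int(alert_time * fps)  # frame index of the alert, assuming constant 30 FPS
    # Evenly spaced frame indices covering the seconds leading up to the alert.
    indices = np.linspace(max(alert_frame - fps * num_frames, 0), alert_frame, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(cv2.resize(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB), (224, 224)))
    cap.release()
    return np.array(frames)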
Hybrid Feature Engineering
# Initialize feature extractor
base_model = InceptionV3(weights='imagenet', include_top=False, pooling='avg')
cnn_feature_dim = base_model.output_shape[-1]
def get_hybrid_features(video_path, alert_time, event_time):
"""Optimized feature extraction with proper resource handling"""
# Use optimized frame extraction
frames = extract_critical_frames(
video_path,
alert_time,
event_time,
num_frames=8, # Reduced from 16
sampling_interval=30 # Process 1 frame per second (30fps video)
)
if len(frames) == 0:
return np.zeros(1280 + 1) # EfficientNetB0 features + flow feature
# Batch process spatial features
spatial_features = base_model.predict(
preprocess_input(frames.astype('float32')),
batch_size=32, # Process 32 frames at once
verbose=0
)
# Simplified temporal feature
flow_feature = 0.0
if len(frames) > 1:
prev_gray = cv2.cvtColor(frames[0], cv2.COLOR_RGB2GRAY)
next_gray = cv2.cvtColor(frames[-1], cv2.COLOR_RGB2GRAY)
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
flow_feature = np.mean(np.linalg.norm(flow, axis=2))
return np.concatenate([
np.mean(spatial_features, axis=0),
[flow_feature]
])
The extractor is InceptionV3 (Arxiv) without the classification head; base_model.output_shape is (None, 2048), so cnn_feature_dim is 2048. The mention of EfficientNetB0 (paper btw) seems to be a mistake, as the actual CNN extractor is InceptionV3; the line return np.zeros(1280 + 1) should be return np.zeros(cnn_feature_dim + 1).
The flow feature here is the average flow magnitude between only the first and last frame, a single scalar value, which does not make much sense beyond a baseline example.
The InceptionV3 feature vectors are also averaged across frames in np.mean(spatial_features, axis=0), so get_hybrid_features returns a vector of shape (2049,).
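A cheap way to get more out of the motion signal, sketched below, is to reuse calculate_optical_flow from the previous section and summarize the flow magnitude of every consecutive frame pair instead of a single first-vs-last scalar; the function name and the chosen statistics are illustrative assumptions, not the baseline's code.
import numpy as np
def flow_statistics(frames):
    """Hypothetical richer motion summary using the per-pair flow magnitude maps."""
    magnitudes = calculate_optical_flow(frames)   # shape (num_frames - 1, H, W)
    per_pair_mean = magnitudes.mean(axis=(1, 2))  # one scalar per consecutive frame pair
    return np.concatenate([
        per_pair_mean,                                # how motion evolves over time
        [per_pair_mean.max(), per_pair_mean.std()],   # peak and variability of motion
    ])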
Temporal Model Architecture
def build_temporal_model(hp):
model = Sequential([
# Tune the number of LSTM units
Bidirectional(LSTM(
units=hp.Int("lstm_units", min_value=32, max_value=256, step=32),
input_shape=(1, X_train_scaled.shape[1])
)),
# Fully connected layer (Dense)
Dense(
units=hp.Int("dense_units", min_value=16, max_value=128, step=16),
activation="relu"
),
# Dropout for regularization
Dropout(hp.Float("dropout", min_value=0.2, max_value=0.5, step=0.1)),
# Output layer
Dense(1, activation="sigmoid")
])
# Compile the model
model.compile(
loss="binary_crossentropy",
optimizer=tf.keras.optimizers.Adam(
hp.Choice("learning_rate", values=[1e-2, 1e-3, 1e-4])
),
metrics=["accuracy", tf.keras.metrics.AUC()]
)
return model
A BiLSTM with tunable hyperparameters, a relatively small model. The input_shape=(1, X_train_scaled.shape[1]) indicates only a single timestep per video (the whole clip collapsed into one feature vector), so the recurrent layer has no real sequence to model -> not ideal.
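If one wanted the BiLSTM to see an actual sequence, the per-frame InceptionV3 features could be kept instead of averaged. A minimal sketch, assuming base_model, preprocess_input and extract_critical_frames from above are in scope (the function name and zero-padding strategy are mine, not the baseline's):
def get_frame_sequence_features(video_path, num_frames=8):
    """Hypothetical alternative: one InceptionV3 vector per frame, no averaging."""
    frames = extract_critical_frames(video_path, None, None, num_frames=num_frames)
    if len(frames) < num_frames:
        # Zero-pad short videos so every sample has the same number of timesteps.
        pad = np.zeros((num_frames - len(frames), 224, 224, 3), dtype='float32')
        frames = np.concatenate([frames.astype('float32'), pad]) if len(frames) else pad
    features = base_model.predict(preprocess_input(frames.astype('float32')), verbose=0)
    return features  # shape (num_frames, 2048), usable as real LSTM timesteps
The model's input_shape would then become (num_frames, 2048) instead of (1, 2049), and the recurrent layer would see real timesteps.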
# Load and preprocess data
train_df = pd.read_csv('/kaggle/input/nexar-collision-prediction/train.csv')
train_df['id'] = train_df['id'].apply(lambda x: f"{int(float(x)):05d}")
train_df.fillna({'time_of_alert': 0, 'time_of_event': 0}, inplace=True)
# Feature extraction
print("Extracting hybrid features...")
features = []
for _, row in tqdm(train_df.iterrows(), total=len(train_df), desc="Hybrid features extracted:"):
video_path = f"/kaggle/input/nexar-collision-prediction/train/{row['id']}.mp4"
features.append(get_hybrid_features(
video_path, row['time_of_alert'], row['time_of_event']
))
X = np.array(features)
y = train_df['target'].values
# Train-test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y)
# Temporal model training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
# Reshape for LSTM [samples, timesteps, features]
X_train_3d = X_train_scaled.reshape((X_train_scaled.shape[0], 1, X_train_scaled.shape[1]))
X_val_3d = X_val_scaled.reshape((X_val_scaled.shape[0], 1, X_val_scaled.shape[1]))
tf.keras.mixed_precision.set_global_policy('mixed_float16')
As mentioned, get_hybrid_features returns a vector of shape (2049,), so each video is exactly one vector, scaled with StandardScaler() (fit on the training split only, which avoids data leakage). The data is reshaped to 3D for the LSTM, but each video is still just one vector since the timestep dimension is 1.
tf.keras.mixed_precision.set_global_policy('mixed_float16') keeps the models' weights in float32 while computations and activations run in float16 (faster and more memory efficient, but it may affect numerical precision and is mainly beneficial on GPUs with float16 support).
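One caveat with the mixed-precision policy: TensorFlow's guidance is to keep the final layer's output in float32 for numerical stability, which the baseline's Dense(1, activation="sigmoid") does not do explicitly. A minimal sketch of the adjustment one could make inside build_temporal_model (an assumption, not what the notebook does):
from tensorflow.keras.layers import Dense
# Under mixed_float16, force the sigmoid output back to float32 for numerical stability.
output_layer = Dense(1, activation="sigmoid", dtype="float32")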
Best Model from HP Tuning
# Initialize the tuner
tuner = kt.Hyperband(
build_temporal_model,
objective="val_accuracy",
max_epochs=20,
factor=3,
directory="kt_logs",
project_name="temporal_model_tuning"
)
# Perform hyperparameter search
tuner.search(
X_train_3d, y_train,
validation_data=(X_val_3d, y_val),
epochs=20,
batch_size=64,
callbacks=[tf.keras.callbacks.EarlyStopping(patience=5)]
)
# Retrieve the best model
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
best_model = tuner.get_best_models(num_models=1)[0]
best_model.summary()
# temporal_model = build_temporal_model((1, X_train_scaled.shape[1]))
history = best_model.fit(
X_train_3d, y_train,
validation_data=(X_val_3d, y_val),
epochs=30, # Reduced epochs
batch_size=64, # Increased batch size
callbacks=[
EarlyStopping(patience=3, restore_best_weights=True),
LearningRateScheduler(lambda epoch: 0.001 * (0.95 ** epoch))
]
)
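A small aside on the retraining step above: get_best_models returns a model that is already (partially) trained during the search, so fitting it again continues from those weights. keras_tuner also allows rebuilding a fresh, untrained model from the best hyperparameters; a sketch under the same variable names as above:
# Rebuild an untrained model with the best hyperparameters and train it from scratch.
fresh_model = tuner.hypermodel.build(best_hps)
history = fresh_model.fit(
    X_train_3d, y_train,
    validation_data=(X_val_3d, y_val),
    epochs=30,
    batch_size=64,
    callbacks=[EarlyStopping(patience=3, restore_best_weights=True)]
)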
Ensemble Modeling
# Generate temporal model predictions
train_probs = best_model.predict(X_train_3d).flatten()
val_probs = best_model.predict(X_val_3d).flatten()
# Create ensemble dataset
X_train_ensemble = np.column_stack([train_probs, X_train_scaled])
X_val_ensemble = np.column_stack([val_probs, X_val_scaled])
# Define hyperparameter search space
param_dist = {
'device': ['cuda'],
'tree_method': ['hist'],
'learning_rate': uniform(0.01, 0.3),
'max_depth': randint(3, 10),
'subsample': uniform(0.6, 0.4), # 0.6-1.0
'colsample_bytree': uniform(0.6, 0.4),
'gamma': uniform(0, 0.5),
'reg_alpha': uniform(0, 1),
'reg_lambda': uniform(0, 1),
'n_estimators': randint(100, 500)
}
# Create Bayesian-optimized search
optimizer = RandomizedSearchCV(
estimator=XGBClassifier(
objective='binary:logistic',
eval_metric='auc',
use_label_encoder=False
# tree_method='gpu_hist' # Enable GPU acceleration
),
param_distributions=param_dist,
n_iter=50, # Number of parameter combinations
scoring='roc_auc',
cv=3,
n_jobs=-1,
verbose=2
)
# Run optimization
optimizer.fit(X_train_ensemble, y_train)
# Best model evaluation
best_xgb = optimizer.best_estimator_
best_xgb.fit(X_train_ensemble, y_train,
eval_set=[(X_val_ensemble, y_val)],
early_stopping_rounds=20,
verbose=False)
ensemble_val_probs = best_xgb.predict_proba(X_val_ensemble)[:, 1]
print(f"Optimized AUC: {roc_auc_score(y_val, ensemble_val_probs):.4f}")
print("Best parameters:", optimizer.best_params_)
Ensemble learning: the best BiLSTM's predicted probabilities are stacked alongside the scaled features and fed to an XGBoost classifier. The hyperparameter search is fairly exhaustive.
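For readers unfamiliar with scipy's parameterization: uniform(loc, scale) samples from [loc, loc + scale], which is what the inline comments in param_dist rely on. A quick illustrative check:
from scipy.stats import uniform
# uniform(0.6, 0.4) covers 0.6-1.0 and uniform(0.01, 0.3) covers 0.01-0.31.
samples = uniform(0.6, 0.4).rvs(size=1000, random_state=0)
print(samples.min() >= 0.6, samples.max() <= 1.0)  # True True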
Inference & Submission
# Process test set
test_df = pd.read_csv('/kaggle/input/nexar-collision-prediction/test.csv')
test_df['id'] = test_df['id'].apply(lambda x: f"{int(float(x)):05d}")
test_features = []
for _, row in tqdm(test_df.iterrows(), desc="Processing Test Videos"):
video_path = f"/kaggle/input/nexar-collision-prediction/test/{row['id']}.mp4"
test_features.append(get_hybrid_features(video_path, 0, 0)) # No event times in test
X_test = scaler.transform(np.array(test_features))
X_test_3d = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
# Generate predictions
temporal_probs = best_model.predict(X_test_3d).flatten()
X_test_ensemble = np.column_stack([temporal_probs, X_test])
final_probs = best_xgb.predict_proba(X_test_ensemble)[:, 1]
# Create submission
submission = pd.DataFrame({
'id': test_df['id'],
'score': final_probs
})
submission.to_csv('submission.csv', index=False)
print("\nSubmission Summary:")
print(submission.describe())
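Before uploading, a few sanity checks on the submission file can save a failed submission; the column names follow the code above and the expected row count of 1,344 comes from the test-set size mentioned earlier.
# Basic sanity checks on the submission DataFrame built above.
assert list(submission.columns) == ['id', 'score']
assert submission['score'].between(0, 1).all()  # scores should be probabilities
assert len(submission) == len(test_df)          # one row per test video (1,344 expected)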
Conclusion
The baseline proposes a very basic and unoptimized approach to the task. Key takeaways include:
- Image understanding: Low. InceptionV3 extracts per-frame features, but the entire video is aggregated into a single vector.
- Motion modeling: Low. Farneback optical flow is computed, but only a single averaged scalar describes the motion between the first and last frame.
- Model architecture: Inspirational. An LSTM is a reasonable choice for modeling long- and short-term dependencies in sequences such as dashcam footage, though it is barely utilized here since the input has only one timestep.
The notebook provides a solid baseline and a source of inspiration for the TTC (time-to-collision) estimation task, leaving plenty of room for improvement.