Abstract

Visible camera-based semantic segmentation and semantic forecasting are important perception tasks in autonomous driving. In semantic segmentation, the current frame’s pixel-level labels are estimated from the current visible frame. In semantic forecasting, the future frame’s pixel-level labels are predicted from the current and past visible frames and their pixel-level labels. While state-of-the-art accuracy has been reported for both tasks, they are limited by the visible camera’s susceptibility to varying illumination, adverse weather, sunlight and headlight glare, etc. In this work, we propose to address these limitations using deep sensor fusion of visible and thermal cameras. The proposed sensor fusion framework performs both semantic forecasting and optimal semantic segmentation within a multistep iterative framework. In the first, or forecasting, step, the framework predicts the semantic map for the next frame. The predicted semantic map is updated in the second step, when the next visible-thermal frame pair is observed. The updated semantic map is considered the optimal semantic map for the given visible-thermal frame. Semantic map forecasting and updating are performed iteratively over time. The estimated semantic maps contain pedestrian behavior, free space, and pedestrian crossing labels, with pedestrian behavior categorized by spatial, motion, and dynamic orientation information. The proposed framework is validated on the public KAIST dataset. A detailed comparative analysis and ablation study are performed using pixel-level classification and intersection-over-union (IoU) error metrics. The results show that the proposed framework can not only accurately forecast the semantic segmentation maps but also accurately update them.
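The forecast-then-update iteration described above can be summarized by the following minimal sketch. This is an illustrative assumption rather than the authors' implementation: the callables `forecast_net` and `update_net`, the `run_sequence` helper, and the input format (pairs of visible and thermal frames) are hypothetical placeholders for the fusion networks described in the abstract.

```python
def run_sequence(frames, forecast_net, update_net):
    """Iterate the two-step framework over a visible-thermal sequence.

    frames: iterable of (visible, thermal) image tensors.
    forecast_net: hypothetical network that forecasts the next semantic map
                  from past observations and semantic maps (step 1).
    update_net: hypothetical network that refines a forecast once the new
                visible-thermal pair is observed (step 2).
    """
    history = []          # past (visible, thermal, semantic_map) triples
    predicted_map = None  # forecast for the upcoming frame

    for visible, thermal in frames:
        # Step 2 (update): once the new visible-thermal pair arrives, refine
        # the forecast; the result is treated as the optimal semantic map for
        # this frame. For the first frame there is no forecast to refine.
        semantic_map = update_net(visible, thermal, predicted_map)
        history.append((visible, thermal, semantic_map))

        # Step 1 (forecast): predict the semantic map of the next frame from
        # the current and past frames and their semantic maps.
        predicted_map = forecast_net(history)

    return history
```

In this sketch the forecast and update alternate indefinitely, so the framework produces a prediction one frame ahead at every time step while also maintaining a refined semantic map for the current frame.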
