Visible-camera-based semantic segmentation and semantic forecasting are important perception tasks in autonomous driving. In semantic segmentation, the current frame's pixel-level labels are estimated from the current visible frame. In semantic forecasting, the future frame's pixel-level labels are predicted from the current and past visible frames and their pixel-level labels. Although both tasks report state-of-the-art accuracy, they are limited by the visible camera's susceptibility to varying illumination, adverse weather conditions, and sunlight or headlight glare. In this work, we address these limitations through deep sensor fusion of the visible and the thermal camera. The proposed sensor fusion framework performs both semantic forecasting and optimal semantic segmentation within a multistep iterative scheme. In the first (forecasting) step, the framework predicts the semantic map for the next frame. In the second step, the predicted semantic map is updated once the next visible and thermal frames are observed; the updated map is taken as the optimal semantic map for that visible-thermal frame pair. Semantic map forecasting and updating are iterated over time. The estimated semantic maps contain pedestrian behavior, free space, and pedestrian crossing labels, with pedestrian behavior categorized by spatial, motion, and dynamic orientation information. The proposed framework is validated on the public KAIST dataset. A detailed comparative analysis and ablation study are performed using pixel-level classification and intersection-over-union (IoU) error metrics. The results show that the proposed framework can not only accurately forecast the semantic segmentation map but also accurately update it.
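The iterative forecast-then-update loop described above can be illustrated with a minimal sketch. All function names (`forecast`, `update`) and the persistence-based placeholder logic here are hypothetical stand-ins for the learned fusion networks, shown only to clarify the control flow of the two-step scheme:

```python
import numpy as np

def forecast(past_maps):
    """Hypothetical forecasting step: predict the next semantic map
    from past semantic maps (here, a naive persistence model)."""
    return past_maps[-1].copy()

def update(predicted_map, visible_frame, thermal_frame):
    """Hypothetical update step: refine the predicted map using the
    newly observed visible-thermal frame pair. A real implementation
    would fuse the two modalities with a learned network; this
    placeholder returns the prediction unchanged."""
    return predicted_map

H, W, T = 4, 4, 3  # toy resolution and number of time steps
past_maps = [np.zeros((H, W), dtype=int)]  # initial semantic map

for t in range(T):
    visible = np.random.rand(H, W, 3)  # stand-in visible frame
    thermal = np.random.rand(H, W)     # stand-in thermal frame
    predicted = forecast(past_maps)                 # step 1: forecast
    optimal = update(predicted, visible, thermal)   # step 2: update
    past_maps.append(optimal)  # feed back for the next iteration

print(len(past_maps))  # prints 4: initial map plus T updated maps
```

The loop mirrors the predict-update structure of a recursive filter: each updated (optimal) semantic map becomes part of the history used to forecast the next one.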