Hybrid CNN Transformer Parallel Architecture for Robust Multi-Class Disaster Image Classification

Authors

  • Rajeev Kumar, Chaitanya Yadav, Ujjwal Suri, Bilal Ahmed

DOI:

https://doi.org/10.64882/ijrt.v14.i2.1277

Keywords:

Hybrid CNN + Transformer, Disaster Image Classification, Parallel Architecture, Vision Transformer, Multi-Class Classification, Deep Learning, Humanitarian AI, Early Warning Systems, Attention Mechanism, TensorFlow

Abstract

Rapid and accurate classification of disaster imagery is critical for emergency management and for directing resources to humanitarian operations. This article presents a complete deep learning framework that combines Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in a dual-branch architecture, running both branches simultaneously, to classify disaster images into 12 fine-grained categories: earthquake damage, urban fire and wildfire damage, floods, landslides, drought, human casualties, general infrastructure destruction, and four non-damage control categories. Three hybrid CNN + Transformer architectures are systematically trained and benchmarked: (1) CNN -> Transformer (Sequential), (2) Transformer -> CNN (Hierarchical), and (3) the proposed CNN + Transformer (Parallel). Unlike sequential and hierarchical designs, which risk losing information at inter-branch bottlenecks, the parallel design extracts local spatial and global contextual representations in separate branches whose outputs are then aggregated for final classification. The proposed parallel model achieves the highest test accuracy of 71%, a macro-averaged precision of 71%, a macro-averaged F1-score of 70%, and a ROC-AUC of 0.81, outperforming the pure CNN and standalone Transformer baselines as well as the hierarchical model (68% accuracy, AUC = 0.79). The framework is implemented in TensorFlow/Keras and complemented by a simple Tkinter-based GUI that lets non-technical users, such as field workers and humanitarian responders, train models, load pre-trained weights, and run real-time image inference. The framework can also be deployed at the edge via TensorFlow Lite and ONNX for low-latency inference in disaster-prone, poorly connected areas.
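The parallel dual-branch design described above can be sketched in Keras as follows. This is a minimal illustrative sketch, not the authors' implementation: the input size, patch size, layer widths, and fusion head are all assumptions, and only the overall topology (a CNN branch and a ViT-style branch processing the same input in parallel, with their pooled features concatenated before the 12-way classifier) follows the abstract.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 12   # 8 damage + 4 non-damage categories, per the abstract
IMG_SIZE = 72      # illustrative input size; the paper's actual size is not stated
PATCH = 6          # assumed ViT patch size

def cnn_branch(x):
    # Local spatial features via stacked convolutions.
    for filters in (32, 64):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D()(x)
    return layers.GlobalAveragePooling2D()(x)

def transformer_branch(x):
    # Global context via a minimal ViT-style encoder block.
    patches = layers.Conv2D(64, PATCH, strides=PATCH)(x)  # patch embedding
    seq = layers.Reshape((-1, 64))(patches)
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=16)(seq, seq)
    seq = layers.LayerNormalization()(seq + attn)         # residual + norm
    ffn = layers.Dense(128, activation="relu")(seq)
    ffn = layers.Dense(64)(ffn)
    seq = layers.LayerNormalization()(seq + ffn)          # residual + norm
    return layers.GlobalAveragePooling1D()(seq)

# Both branches see the same input in parallel; features are fused by concatenation.
inputs = layers.Input((IMG_SIZE, IMG_SIZE, 3))
fused = layers.Concatenate()([cnn_branch(inputs), transformer_branch(inputs)])
fused = layers.Dropout(0.3)(layers.Dense(128, activation="relu")(fused))
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(fused)
model = models.Model(inputs, outputs)
```

Because neither branch feeds the other, the fusion point is the only place the two representations meet, which is how this topology avoids the inter-branch bottleneck of the sequential and hierarchical variants.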
To the authors' knowledge, this article is among the first to carry academic deep learning research into operational humanitarian AI, offering a scalable, deployable, and context-aware solution for disaster response systems.
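The edge-deployment path via TensorFlow Lite mentioned in the abstract can be sketched as below. The placeholder model here stands in for the trained hybrid classifier (it is not the paper's architecture); the conversion and post-training-quantization steps are standard TensorFlow Lite usage.

```python
import tensorflow as tf

# Placeholder model standing in for the trained hybrid disaster classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Input((224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(12, activation="softmax"),
])

# Convert to a flat TFLite buffer suitable for low-latency on-device inference.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_bytes = converter.convert()
```

The resulting bytes can be written to a `.tflite` file and loaded with `tf.lite.Interpreter` on devices in poorly connected areas; an analogous path exists through ONNX export for other runtimes.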

References

IPCC, "Climate Change 2022: Impacts, Adaptation and Vulnerability," Cambridge University Press, 2022.

A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," arXiv:2010.11929, 2020. https://doi.org/10.48550/arXiv.2010.11929

A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NeurIPS, 2012, pp. 1097–1105.

K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE CVPR, 2016, pp. 770–778. https://doi.org/10.1109/CVPR.2016.90

H. Touvron et al., "Training data-efficient image transformers & distillation through attention," arXiv:2012.12877, 2020.

A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," in Proc. ICLR, 2021.

Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," arXiv:2103.14030, 2021. https://doi.org/10.48550/arXiv.2103.14030

L. Chen et al., "Wildfire detection from multispectral satellite imagery using CNN," Remote Sensing, vol. 12, no. 15, 2020.

Z. Liu et al., "Real-time urban fire detection using Faster R-CNN on CCTV footage," IEEE Access, vol. 9, 2021.

Z. Zhao et al., "U-Net landslide segmentation from LiDAR and satellite imagery," ISPRS J. Photogramm. Remote Sens., vol. 157, 2019.

S. Islam et al., "Flood mapping from Sentinel-1 SAR imagery using SegNet," Int. J. Remote Sens., 2022.

A. Kumar et al., "CNN-based drought stress monitoring via NDVI time-series analysis," Comput. Electron. Agric., 2021.

C.-F. Chen, Q. Fan, and R. Panda, "CrossViT: Cross-attention multi-scale vision transformer for image classification," arXiv:2103.14899, 2021.

H. Zhang, C. Wang, Y. Du, J. Liu, and P. Wang, "TransCNN: Hybrid attention-guided CNN for image classification," arXiv:2104.00954, 2021.

A. Vaswani et al., "Attention is all you need," in Proc. NeurIPS, vol. 30, pp. 5998–6008, 2017. https://doi.org/10.48550/arXiv.1706.03762

M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for CNNs," in Proc. ICML, pp. 6105–6114, 2019.

O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. MICCAI, 2015.

K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2015.

J. Redmon et al., "You only look once: Unified real-time object detection," in Proc. IEEE CVPR, 2016, pp. 779–788.

D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.

R. R. Selvaraju et al., "Grad-CAM: Visual explanations from deep networks," in Proc. IEEE ICCV, 2017, pp. 618–626.

N. Srivastava et al., "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.

S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training," arXiv:1502.03167, 2015.

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

J. Deng et al., "ImageNet: A large-scale hierarchical image database," in Proc. IEEE CVPR, 2009.

How to Cite

Rajeev Kumar, Chaitanya Yadav, Ujjwal Suri, Bilal Ahmed. (2026). Hybrid CNN Transformer Parallel Architecture for Robust Multi-Class Disaster Image Classification. International Journal of Research & Technology, 14(2), 438–460. https://doi.org/10.64882/ijrt.v14.i2.1277

Section

Original Research Articles
