Article | Open Access

CrossTrans-Surv: An Artificial Intelligence-Based Multimodal Cross-Attention Transformer for Smart Surveillance and Human Activity Recognition

S R V Prasad Reddy

Abstract


Human movement detection refers to classifying and interpreting human behavior from the available sensor data, and it has numerous real-world applications. In residential surveillance, human movement tracking can monitor senior citizens' behavioral patterns and quickly identify risky events such as falls; it can also assist an automated navigation system in analyzing and forecasting walking patterns. Notably, such a system remains resilient under changing conditions such as weather or lighting, where purely camera-based approaches falter. This study presents CrossTrans-Surv, an AI-based cross-attention transformer framework for multimodal sensor fusion in smart surveillance and human activity recognition systems. Drawing inspiration from STAR-Transformer, CrossTrans-Surv integrates asynchronous visual (RGB), infrared/thermal, and LiDAR modalities via cross-attention layers that learn shared representations across the different data types. Pairs of multispectral images provide complementary information that increases the robustness and reliability of recognition in real-world applications. In contrast to earlier CNN-based studies, our network uses the Transformer approach to integrate global contextual information and learn long-range dependencies during feature extraction. We then feed the Transformer with RGB frames and body-part heatmaps at multiple temporal and spatial resolutions. We employ fewer attention layers in the skeleton stream, since the skeleton heatmaps are already salient features compared with the raw RGB frames. In addition to its performance advantages, our methodology offers interpretability through attention maps and scalability through a modular design, making it suitable for real-world AI-powered surveillance applications.
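The following is a minimal, illustrative PyTorch sketch of the cross-attention fusion idea summarized above: one modality's tokens query another modality's tokens, and the skeleton-heatmap stream uses fewer attention layers than the RGB stream. All module names, dimensions, layer counts, and the two-stream layout here are assumptions for illustration only, not the authors' released implementation.

```python
# Hypothetical sketch of cross-attention multimodal fusion (not the paper's code).
import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    """One cross-attention block: query tokens attend to another modality's tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_out = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, query_tokens: torch.Tensor, context_tokens: torch.Tensor) -> torch.Tensor:
        q = self.norm_q(query_tokens)
        kv = self.norm_kv(context_tokens)
        fused, _ = self.attn(q, kv, kv)           # cross-attention: Q from one modality, K/V from another
        x = query_tokens + fused                  # residual connection
        return x + self.ffn(self.norm_out(x))     # feed-forward with residual


class CrossTransSurvSketch(nn.Module):
    """Two-stream fusion: a deeper RGB stream and a shallower skeleton-heatmap stream."""

    def __init__(self, dim: int = 256, rgb_layers: int = 4, heatmap_layers: int = 2, num_classes: int = 60):
        super().__init__()
        # RGB tokens attend to thermal and LiDAR tokens over several layers.
        self.rgb_stream = nn.ModuleList(CrossModalFusionBlock(dim) for _ in range(rgb_layers))
        # Skeleton heatmaps are already salient, so this stream uses fewer attention layers.
        self.heatmap_stream = nn.ModuleList(CrossModalFusionBlock(dim) for _ in range(heatmap_layers))
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, rgb, thermal, lidar, heatmap):
        # Each input: (batch, tokens, dim) token sequences from per-modality encoders (not shown).
        context = torch.cat([thermal, lidar], dim=1)
        for block in self.rgb_stream:
            rgb = block(rgb, context)
        for block in self.heatmap_stream:
            heatmap = block(heatmap, rgb)          # heatmap tokens attend to the fused RGB tokens
        pooled = torch.cat([rgb.mean(dim=1), heatmap.mean(dim=1)], dim=-1)
        return self.classifier(pooled)


if __name__ == "__main__":
    model = CrossTransSurvSketch()
    b, n, d = 2, 16, 256
    logits = model(torch.randn(b, n, d), torch.randn(b, n, d),
                   torch.randn(b, n, d), torch.randn(b, n, d))
    print(logits.shape)  # torch.Size([2, 60])
```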


Keywords


CNN, CrossTrans-Surv, LiDAR, Artificial Intelligence, Heatmaps, Transformers

DOI: https://doi.org/10.52088/ijesty.v5i4.1241



Copyright (c) 2025 S R V Prasad Reddy

International Journal of Engineering, Science, and Information Technology (IJESTY) eISSN 2775-2674