Article Open Access

Factory-Grade Diagnostic Automation for GeForce and Data Centre GPUs

Karan Lulla, Reena Chandra, Kishore Ranjan

Abstract


The growing deployment of Graphics Processing Units (GPUs) across data centers, AI workloads, and cryptocurrency mining operations has increased the need for scalable, accurate, real-time diagnostics in hardware quality assurance (QA). Traditional factory QA processes are manual and time-consuming, and they adapt poorly to subtle performance degradation. This study proposes an automated diagnostic pipeline that uses publicly available, telemetry-like GPU data (hashrate, power draw, and efficiency metrics) to simulate factory-grade fault detection. Using the Kaggle “GPU Performance and Hashrate” dataset, we implement a machine learning framework that combines XGBoost for anomaly classification with Long Short-Term Memory (LSTM) neural networks for temporal efficiency forecasting. Anomalies are labeled heuristically: GPUs in the bottom 10% of the efficiency distribution are flagged as simulated faults. The XGBoost model achieves perfect accuracy on the test set, with its predictions interpreted through SHAP values, while the LSTM model captures degradation trends with low training loss, as illustrated by forecast visualizations. The framework is implemented in Google Colab for accessibility and reproducibility. Diagnostic outputs include efficiency analysis, prediction overlays, and automated GPU health reports. Comparative results show higher efficiency variance in GeForce GPUs than in data center models, whose performance is more stable, highlighting differences between hardware classes. Although the study has limitations, such as its reliance on simulated labels and static time windows, it demonstrates the feasibility of scalable, ML-driven diagnostics using real-world data. The approach applies directly to early fault detection, GPU fleet management, and embedded QA systems in both production and deployment environments.
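The heuristic labeling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the field names `hashrate_mhs` and `power_w`, the definition of efficiency as hashrate per watt, and the helper `label_low_efficiency` are all assumptions made for the example.

```python
def label_low_efficiency(records, fraction=0.10):
    """Flag records whose efficiency lies in the bottom `fraction`.

    Efficiency is assumed here to be hashrate divided by power draw;
    the field names are hypothetical, not taken from the dataset.
    """
    efficiencies = [r["hashrate_mhs"] / r["power_w"] for r in records]
    # Number of records to flag: at least one, about `fraction` of the total.
    cutoff = max(1, int(len(efficiencies) * fraction))
    threshold = sorted(efficiencies)[cutoff - 1]
    # Ties at the threshold are all flagged, so the count may exceed `cutoff`.
    return [
        {**r, "efficiency": e, "anomaly": e <= threshold}
        for r, e in zip(records, efficiencies)
    ]


# Synthetic sample: ten GPUs at 100 W with steadily increasing hashrate.
sample = [
    {"name": f"gpu{i}", "hashrate_mhs": 10.0 + 5 * i, "power_w": 100.0}
    for i in range(10)
]
labeled = label_low_efficiency(sample)
flagged = [r["name"] for r in labeled if r["anomaly"]]
print(flagged)
```

The resulting `anomaly` flag plays the role of the simulated fault label on which a classifier such as XGBoost would then be trained.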


Keywords


GPU Diagnostics, Machine Learning, Embedded Systems, GeForce, Data Centre GPUs

DOI: https://doi.org/10.52088/ijesty.v5i3.1089


Copyright (c) 2025 Karan Lulla, Reena Chandra, Kishore Ranjan

International Journal of Engineering, Science, and Information Technology (IJESTY) eISSN 2775-2674