Benchmarking Techniques for Real-Time Evaluation of LLMs in Production Systems

Reena Chandra, Rishab Bansal, Karan Lulla

Abstract


Large language models (LLMs) must perform reliably and efficiently in today's AI-driven applications such as chatbots, copilots, and search systems. Traditional benchmarking concentrates on linguistic understanding and task accuracy, while factors critical to production, such as latency, memory consumption, and runtime optimisation, are largely ignored. This study proposes a benchmarking framework that evaluates LLMs along four critical dimensions: throughput (tokens processed per second), accuracy, peak memory usage, and efficiency. Using the Open LLM Performance dataset, 350 open-source models across a range of families and parameter sizes were examined with standardised tools and methods. The results indicate that mid-scale models such as TinyStories-33M and OPT-19M are well suited to practical use because they sustain high token throughput with a modest memory footprint. ONNX Runtime consumes less memory than PyTorch, and LLM.fp4 quantisation substantially increases throughput without a significant loss in accuracy. Visualisations and rankings are provided to guide production model selection. Using the framework, AI engineers, MLOps teams, and system architects can identify models that can be built, deployed, scaled, and operated within budget. The framework improves LLM assessment by relating technical measures to the practical constraints of real systems, enabling better-informed decisions for production deployments.
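The abstract's four metrics map naturally onto a small measurement loop. The sketch below shows one way such numbers could be collected with PyTorch and Hugging Face transformers; it is not the authors' harness, and the model id, prompt list, generation length, and efficiency formula are all illustrative assumptions.

```python
# A minimal sketch of the measurement described in the abstract: token
# throughput, peak memory, and a derived efficiency score for one model.
# NOT the authors' released harness; model id, prompts, generation length,
# and the efficiency formula are illustrative assumptions.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "roneneldan/TinyStories-33M"   # assumed Hugging Face id
PROMPTS = ["Once upon a time", "The tiny robot said"]  # placeholder prompts

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to(device).eval()

if device == "cuda":
    torch.cuda.reset_peak_memory_stats()  # so max_memory_allocated() is fresh

new_tokens = 0
start = time.perf_counter()
with torch.no_grad():
    for prompt in PROMPTS:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        new_tokens += output.shape[-1] - inputs["input_ids"].shape[-1]
elapsed = time.perf_counter() - start

throughput = new_tokens / elapsed  # tokens generated per second
print(f"throughput: {throughput:.1f} tok/s over {new_tokens} tokens")

if device == "cuda":
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    # One plausible efficiency definition (the paper's exact formula is not
    # reproduced on this page): throughput per MB of peak memory.
    print(f"peak memory: {peak_mb:.0f} MB, "
          f"efficiency: {throughput / peak_mb:.3f} tok/s per MB")
```

Under the same assumptions, the quantisation comparison could be approximated by reloading the model with transformers' `BitsAndBytesConfig(load_in_4bit=True)` (bitsandbytes defaults its 4-bit type to FP4) and re-running the identical loop.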


Keywords


Large Language Models, Real-Time Inference, Benchmarking, Throughput, Production Deployment


DOI: https://doi.org/10.52088/ijesty.v5i3.955



Copyright (c) 2025 Reena Chandra, Rishab Bansal, Karan Lulla

International Journal of Engineering, Science, and Information Technology (IJESTY) eISSN 2775-2674