Performance Analysis Algorithm Classification and Regression Trees and Naive Bayes Based Particle Swarm Optimization for Credit Card Transaction Fraud Detection

With the advancement of technology, credit cards have become a popular tool for both in-person and online transactions, owing to their ease of use and seamless integration with banking systems. However, as credit card use has grown, fraud cases have also risen, causing financial losses for both cardholders and banks. Effective and efficient detection of credit card transaction fraud has therefore become a top priority. Machine learning algorithms are one of the techniques that can be employed to detect fraud in credit card transactions. The purpose of this research is to measure the performance of the Classification and Regression Trees (CART) algorithm, Naive Bayes, and their combinations with Particle Swarm Optimization (PSO) in detecting fraud in credit card transaction histories, and to identify the best of these methods. The dataset consists of 568,630 records with the attributes id, V1-V28, amount, and class. The results are as follows. The Naive Bayes algorithm achieved an accuracy of 93.15%, precision of 94%, recall of 93%, and AUC of 0.99. The CART algorithm achieved an accuracy of 99.96%, precision and recall of 100%, and AUC of 1.00. Naive Bayes combined with PSO achieved an accuracy of 98.50%, precision and recall of 98%, and AUC of 1.00. Lastly, CART combined with PSO reached an accuracy of 99.97%, precision and recall of 100%, and AUC of 1.00. It can be concluded that the best method in the tests conducted is Classification and Regression Trees combined with Particle Swarm Optimization.


Introduction
In today's rapidly advancing digital era, the use of credit cards has increased substantially due to their ease of use and seamless integration with banking systems. A credit card is a payment method, provided by financial institutions such as banks, that replaces cash in transactions. It offers its holder a credit facility: at the due date, the holder can pay a specified minimum amount, and the remaining balance is treated as credit [1], [2]. However, as credit card usage increases, so does exposure to crime, including cybercrime. The crime most likely to occur during credit card transactions is fraud. Fraud is a deception in credit card transactions, carried out for personal gain, that can harm customers as well as banks or companies [3]. For banking operations, Chapter 1, Article 1, of the Financial Services Authority Regulation of the Republic of Indonesia Number 39/POJK 03/2019 concerning the Implementation of Anti-Fraud Strategies for Commercial Banks states that fraud is a deviation or intentional negligence by an individual against the bank, its customers, or other parties to deceive, mislead, or manipulate anything that occurs within the bank's environment or while using banking facilities, resulting in losses for the bank, its customers, or other parties. Fraud continues to increase every year and has become a serious problem. Despite the various authentication methods in use, credit card fraud remains difficult to prevent. Fraud not only causes significant financial losses for cardholders and banks, but also affects public trust in the use of credit cards [4].
Therefore, effective detection methods are needed to combat and prevent credit card transaction fraud. One way to identify fraud is to process existing customer data in order to find regularities, patterns, or relationships in a large amount of data using mathematical and statistical pattern recognition methods, commonly known as data mining [5]. Data mining classification can be carried out with machine learning algorithms. Machine learning focuses on developing systems that can learn and make decisions on their own without needing to be programmed repeatedly by humans [6]. Commonly used algorithms include Random Forest, Naive Bayes, Classification and Regression Trees (CART), Support Vector Machine (SVM), and others. Machine learning is widely used in fraud detection, marketing targeting, performance prediction, manufacturing, medical diagnostics, and more [7].
In the research titled "Optimization of the Decision Tree Classification Algorithm (CART) with the Bagging Method for Web Phishing Detection," conducted by Ria Ester and Sartika Lina Mulani Sitio in 2024, an accuracy of 96.61% was achieved, which increased to 97.74% after optimization with bagging [8]. In the research titled "Prediction System for Cervical Cancer Using CART, Naive Bayes, and k-NN," conducted by Tutus Praningki and Indra Budi in 2017, the accuracy of Naive Bayes was 94.44%, while CART and k-NN achieved 88.89% and 85.04%, respectively [9]. In the study titled "Email Spam Detection using Naive Bayes and Particle Swarm Optimization," conducted by Nandan Parmar et al. in 2020, the accuracy of Naive Bayes was 87.75% and increased to 95.50% after optimization with PSO [10]. Finally, in the research titled "Performance Analysis of CART and Naive Bayes Algorithms Based on Particle Swarm Optimization (PSO) for Cooperative Credit Feasibility Classification," conducted by Eko Arif Riyanto et al. in 2021, the accuracy of Naive Bayes was 75%, which increased to 96.43% after being combined with PSO, while CART achieved an accuracy of 78.57%, which rose to 92.86% after being combined with PSO [11].
These previous studies show that each algorithm yields a different accuracy depending on the study. The author therefore intends to conduct research on a different object, namely fraud in credit card transaction histories, and to assess the performance of each algorithm on this task, taking as reference the studies that reported the highest accuracy, whether before or after optimization with other techniques. This research thus addresses Classification and Regression Trees (CART) and Naive Bayes combined with Particle Swarm Optimization for detecting fraud in credit card transactions.

Data Collection
The research data is a public dataset that can be found by searching for the keywords "credit card fraud detection" on Kaggle.com, or downloaded from the following link: https://www.kaggle.com/datasets/zeesolver/credit-card. The data consists of 568,630 credit card transactions with 31 attributes and 2 classes (0 and 1), which indicate whether a transaction is legitimate or fraudulent. The table below shows the research dataset used.

Data Preparation
Table 1 presents part of the dataset used. The preparation stage involves cleansing, which means cleaning the data by searching for and correcting (or removing) incomplete, missing, incorrectly formatted, or duplicate data [12], [13]. Data cleansing can be performed by checking for missing or empty values; if any are found, the values can be filled in or the affected records removed.
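The cleansing step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the author's actual preprocessing code; the function name `clean_rows` and the sample values are hypothetical, and only the column names id, V1, amount, and class come from the dataset description.

```python
def clean_rows(rows, required_fields):
    """Drop rows with missing or empty required fields and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # A value counts as missing if the key is absent, None, or an empty string.
        if any(row.get(f) in (None, "") for f in required_fields):
            continue
        key = tuple(row[f] for f in required_fields)
        if key in seen:  # exact duplicate of a row that was already kept
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical sample rows (illustrative values only, not taken from the dataset).
rows = [
    {"id": 0, "V1": -0.26, "amount": 17982.10, "class": 0},
    {"id": 1, "V1": 0.98, "amount": None, "class": 0},      # missing amount
    {"id": 2, "V1": -1.11, "amount": 2513.54, "class": 1},
    {"id": 2, "V1": -1.11, "amount": 2513.54, "class": 1},  # duplicate row
]
cleaned = clean_rows(rows, ["id", "V1", "amount", "class"])
print(len(cleaned))  # 2 rows remain
```

Here rows with missing values are removed rather than filled in; the text allows either choice, and imputing values would be an equally valid alternative.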

Split Data
The data was divided into two parts: training data and testing data. The training data constitutes 70% of all data, amounting to 398,041 data points, while the testing data makes up the remaining 30%, which is 170,589 data points. The training data allows each model to learn the patterns and features in the data, while the testing data is used to generate predictions from each trained model. After that, class imbalance was checked and handled using the SMOTE method.
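As a rough illustration of this stage, the sketch below performs a shuffled 70/30 split and generates synthetic minority samples by interpolating between a sample and one of its nearest neighbours, which is the core idea of SMOTE. It is a simplified standard-library sketch under stated assumptions: the helper names `train_test_split` and `smote_like`, the seed, and the toy points are hypothetical, and the paper does not state which implementation or parameters were actually used.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the other minority samples by squared Euclidean distance to base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Split sizes for the 568,630 records described in the paper.
n_total = 568_630
n_test = round(n_total * 0.3)
print(n_total - n_test, n_test)  # 398041 170589

train, test = train_test_split(list(range(10)))
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy points
new_points = smote_like(minority, n_new=3)
```

The printed sizes match the 398,041 training and 170,589 testing data points reported in the text.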

Algorithm Model
This research utilizes two algorithms, namely Classification and Regression Trees (CART) and Naive Bayes. Additionally, this study employs their combination with the Particle Swarm Optimization (PSO) method.
1. Classification and Regression Trees (CART): CART is a nonparametric classification method. It repeatedly partitions the data set into new nodes. The goal of the CART algorithm is to obtain groups of data with accurate classification characteristics.
2. Naive Bayes: a classification method based on Bayes' theorem that calculates the probability of a target class from the probabilities of the existing features [14]. The goal of this algorithm is to find the best way to compare a portion of new data with a set of classifications in various problems [15].
3. Particle Swarm Optimization (PSO): one way to improve accuracy, by calculating the best weights for each data set and producing an average swarm for the entire sample. The ultimate goal of PSO is to find the optimal solution in the search space, which can be the minimum or maximum value of the objective function being optimized [16], [17].
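The repeated partitioning performed by CART can be illustrated with a single split on one feature. The sketch below assumes the Gini impurity as the splitting criterion, which is the usual choice for CART classification but is not specified in the paper; the function names and the toy values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split x <= threshold."""
    best = (None, float("inf"))
    # Splitting at the maximum value would leave the right node empty, so skip it.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical values of one feature and their class labels.
feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
label = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(feature, label)
print(threshold, impurity)  # 3.0 0.0
```

A full tree applies the same search to every feature at every node and recurses into the two resulting child nodes until a stopping rule is met.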

Evaluation Model Algorithm
Model evaluation is the process of assessing the performance of a machine learning model to understand how well the model makes predictions on the given data. The final stage of this research uses a confusion matrix to calculate the accuracy, precision, recall, and AUC (Area Under the Curve) values. The purpose of the confusion matrix is to show how the classification model performs on the testing dataset, for which the actual values are known, allowing us to see how the algorithm operates [18]. The following is the confusion matrix in Table 2.

                     Predicted Legitimate    Predicted Fraudulent
True Legitimate      TP                      FN
True Fraudulent      FP                      TN

Figure 1 shows the predictions of the CART algorithm compared with the original dataset of the study. Only the top 5 and the bottom 6 data points are displayed because of the large amount of data.

Figure 3 shows that the selection performed by PSO produced 17 relevant attributes for the CART algorithm out of 31 attributes. The selected attributes are id, V1, V4, V8, V9, V10, V12, V13, V15, V16, V17, V18, V23, V24, V26, V27, and amount. Figure 4 shows that PSO selected only 6 attributes deemed relevant for the Naive Bayes algorithm in the same dataset: id, V4, V6, V17, V26, and amount.

The prediction results of each algorithm are also displayed in the form of a confusion matrix. Below is the confusion matrix table for each algorithm.

For CART combined with PSO (Table 5), there are 85,264 True Positives (TP), that is, transactions that are actually legitimate and have been correctly classified as legitimate. There are 30 False Negatives (FN), where transactions that are actually legitimate (positive class) are classified as fraudulent; in this case, the model makes a mistake by treating legitimate transactions as fraud. There are 29 False Positives (FP), where transactions that are actually fraudulent (negative class) have been classified as legitimate. Finally, there are 85,266 True Negatives (TN), that is, transactions that are actually fraudulent and have been correctly classified as fraudulent.

Table 6. Confusion Matrix Algorithm Naive Bayes + PSO
The table above shows 83,616 True Positives (TP), that is, transactions that are actually legitimate and have been correctly classified as legitimate. There are 1,678 False Negatives (FN), where transactions that are actually legitimate (positive class) are classified as fraudulent; in this case, the model makes a mistake by treating legitimate transactions as fraud. There are 888 False Positives (FP), where transactions that are actually fraudulent (negative class) have been classified as legitimate. Finally, there are 84,407 True Negatives (TN), that is, transactions that are actually fraudulent and have been correctly classified as fraudulent.
The following is the performance calculation of the algorithms used in this research. The performance results are also presented in the form of graphs and ROC curves.
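To make the AUC values concrete, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function name `auc_score`, the label encoding (1 for the positive class), and the toy scores are assumptions made for illustration; the paper does not describe how its AUC values were computed.

```python
def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    example has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted scores (illustrative only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(y_true, scores)
print(auc)  # 0.75
```

The pairwise loop is quadratic in the number of examples, so for a test set of the size used here a sort-based computation would be preferable in practice.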

Analyze the Test Result
Based on the tests conducted, the performance of the Naive Bayes algorithm in detecting fraud falls into the good category, with an accuracy of 93.15%, and becomes more accurate when combined with Particle Swarm Optimization. PSO is highly effective in improving the test accuracy of Naive Bayes, raising it by 5.35 percentage points to 98.50%. This indicates that PSO is well suited to being combined with Naive Bayes. The CART algorithm also performs very well, achieving an accuracy of 99.96%, which is nearly perfect. This demonstrates that CART on its own is already highly effective in detecting fraud, and it improves further when combined with PSO: CART combined with PSO yields an accuracy of 99.97%, and although the increase is only 0.01 percentage points, the combination does provide an improvement. Previous research also shows that PSO has a significant impact on improving accuracy. PSO identifies the most relevant features for the model, so that the model is trained only on the most informative features, which reduces the risk of overfitting and improves overall accuracy. The dataset used in this research is high-dimensional, meaning it has many features, and PSO can handle such data by selecting the most relevant features to obtain better performance. The use of PSO with the CART and Naive Bayes algorithms therefore increases accuracy compared with using CART and Naive Bayes individually, substantially so in the case of Naive Bayes. Based on this, combining the algorithms with PSO is an appropriate way to enhance accuracy, and from the four tests conducted, it can be concluded that the CART algorithm combined with PSO is the best method for detecting fraud in large credit card transaction datasets.
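The feature selection role of PSO described above can be sketched with a binary PSO, in which each particle is a 0/1 mask over the features and the fitness of a mask is the accuracy of a classifier trained on the selected features only. This is an illustrative sketch, not the configuration used in the paper: the nearest-centroid classifier is a deliberately simple stand-in for CART or Naive Bayes, and the function names, swarm size, inertia and acceleration coefficients, seeds, and toy data are all assumptions.

```python
import math
import random


def centroid_accuracy(X, y, mask):
    """Stand-in fitness: training accuracy of a nearest-centroid classifier
    that only sees the features j where mask[j] == 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    centroids = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, label in zip(X, y):
        pred = min(centroids, key=lambda c: sum(
            (x[j] - m) ** 2 for j, m in zip(cols, centroids[c])))
        correct += pred == label
    return correct / len(y)


def binary_pso(X, y, n_particles=8, n_iter=20, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO: real-valued velocities, 0/1 positions sampled through a
    sigmoid of the velocity. Returns the best mask found and its fitness."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    pos = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(n_particles)]
    vel = [[0.0] * n_feat for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [centroid_accuracy(X, y, p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(n_feat):
                # Pull each bit towards the particle's own best and the swarm's best.
                vel[i][j] = (w * vel[i][j]
                             + c1 * rng.random() * (pbest[i][j] - pos[i][j])
                             + c2 * rng.random() * (gbest[j] - pos[i][j]))
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))
                pos[i][j] = 1 if rng.random() < prob else 0
            fit = centroid_accuracy(X, y, pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
            if fit > gbest_fit:
                gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Toy data: feature 0 separates the classes, features 1 and 2 are noise.
data_rng = random.Random(1)
X = [[(0.0 if i < 10 else 5.0) + data_rng.random(), data_rng.random(),
      data_rng.random()] for i in range(20)]
y = [0] * 10 + [1] * 10
mask, fit = binary_pso(X, y)
print(mask, fit)
```

In this sketch the fitness is measured on the training data for brevity; a more careful setup would score each mask on held-out data so that the selected subset is not tuned to the training set.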

Conclusion
Based on the tests of the Classification and Regression Trees (CART) algorithm and Naive Bayes, each combined with Particle Swarm Optimization, for detecting credit card transaction fraud, the performance testing was carried out in several stages. The first is the data analysis stage, which involves understanding the data to identify issues within it. The second stage is data preparation, where data cleaning is performed to check for missing values, followed by data balancing. The third stage divides the data, allocating 70% for training and 30% for testing, where the testing data is used to assess the performance of each algorithm. The next stage uses a confusion matrix to calculate the accuracy, precision, and recall of the predictions and to determine the best algorithm for detecting credit card transaction fraud. The test results show that the accuracy of the Naive Bayes algorithm is 93.15%, while the Classification and Regression Trees (CART) algorithm has an accuracy of 99.96%. The accuracy of Naive Bayes combined with Particle Swarm Optimization (PSO) is 98.50%, and the accuracy of CART combined with PSO is 99.97%. Thus, the best model for detecting credit card transaction fraud is the CART algorithm combined with Particle Swarm Optimization (PSO).

TP (True Positive): the number of correct predictions, where what is predicted as the legitimate class is indeed the legitimate (positive) class.
TN (True Negative): the number of correct predictions, where what is predicted as the fraudulent class is indeed the fraudulent (negative) class.
FP (False Positive): the number of incorrect predictions, where what is predicted as the legitimate class is actually the fraudulent class.
FN (False Negative): the number of incorrect predictions, where what is predicted as the fraudulent class is actually the legitimate class.

The performance of each algorithm is calculated by finding the accuracy, precision, recall, and AUC values.
1. Accuracy: the ratio of correct predictions to all predictions, indicating how accurate the prediction results are. The following equation can be used to find the accuracy value.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

2. Precision: the match between the information requested and the response provided, or equivalently the ratio of relevant items selected to all items selected. To calculate the precision value, the following equation can be used.

$$\text{Precision} = \frac{TP}{TP + FP}$$

3. Recall: the ratio of relevant items selected to the total number of relevant items available. To calculate the recall value, the following equation can be used.

$$\text{Recall} = \frac{TP}{TP + FN}$$

4. AUC value: used to measure the performance of predictive models in binary classification. The AUC is the area under the ROC curve, which lies within the unit square, so its value is always between 0 and 1; it indicates how well the predictive model separates the two classes.
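As a check on the definitions above, the following sketch computes accuracy, precision, and recall from the TP, FN, FP, and TN counts that the text reports for each of the four confusion matrices. The function name `metrics` and the dictionary layout are illustrative assumptions; the counts themselves are the ones given in the paper, with the legitimate class treated as positive.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, and recall with 'legitimate' as the positive class."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts reported in the text for each confusion matrix: (TP, FN, FP, TN).
reported = {
    "CART": (85264, 30, 33, 85262),
    "Naive Bayes": (75153, 10141, 1545, 83750),
    "CART + PSO": (85264, 30, 29, 85266),
    "Naive Bayes + PSO": (83616, 1678, 888, 84407),
}
results = {name: metrics(*counts) for name, counts in reported.items()}
for name, m in results.items():
    print(f"{name}: accuracy = {m['accuracy']:.2%}")
```

The printed accuracies are 99.96%, 93.15%, 99.97%, and 98.50%, matching the reported values. The precision and recall returned here refer to the legitimate class only, so they need not coincide with the rounded precision and recall figures in the paper, which may have been averaged over both classes.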

Fig 1. Comparison of original data and the prediction results of the CART algorithm.

Fig 2. Comparison of the original data and the prediction results of the Naive Bayes algorithm, displayed for 20 data points.
Figure 2 shows the predictions of the Naive Bayes algorithm compared with the original dataset of the study. Only the top 20 data points are displayed.

Fig 3. Results of feature selection using PSO for the CART algorithm.

Fig 4. Results of feature selection using PSO for the Naive Bayes algorithm.

Fig 5. Comparison chart of algorithm performance.
Figure 5 is a comparison graph of the performance of each algorithm. The graph shows that the CART algorithm combined with PSO outperforms the other algorithms, although it differs by only 0.01 percentage points from the CART algorithm without PSO.

Table 1. Credit Card Transaction Dataset

Table 3. Confusion Matrix Algorithm CART
The data tested consists of 170,589 data points, which are the testing data. The table shows 85,264 True Positives (TP), that is, transactions that are actually legitimate and have been correctly classified as legitimate. There are 30 False Negatives (FN), where transactions that are actually legitimate (positive class) have been classified as fraudulent, meaning the model made an error by treating legitimate transactions as fraud. There are 33 False Positives (FP), where transactions that are actually fraudulent (negative class) have been classified as legitimate. Finally, there are 85,262 True Negatives (TN), that is, transactions that are actually fraudulent and have been correctly classified as fraudulent.

Table 4. Confusion Matrix Algorithm Naive Bayes
The table shows 75,153 True Positives (TP), that is, transactions that are actually legitimate and have been correctly classified as legitimate. There are 10,141 False Negatives (FN), where transactions that are actually legitimate (positive class) are classified as fraudulent, meaning the model makes a mistake by treating legitimate transactions as fraud. There are 1,545 False Positives (FP), where transactions that are actually fraudulent (negative class) have been classified as legitimate. Finally, there are 83,750 True Negatives (TN), that is, transactions that are actually fraudulent and have been correctly classified as fraudulent.

Table 5. Confusion Matrix Algorithm CART + PSO

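Table 7. Results of the algorithm performance calculation

Algorithm            Accuracy    Precision    Recall    AUC
Naive Bayes          93.15%      94%          93%       0.99
CART                 99.96%      100%         100%      1.00
Naive Bayes + PSO    98.50%      98%          98%       1.00
CART + PSO           99.97%      100%         100%      1.00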