INTERNATIONAL CONFERENCE FOR YOUNG RESEARCHERS IN ECONOMICS & BUSINESS 2020 (ICYREB 2020)

DEVELOPING A VOTING-ENSEMBLE MODEL FOR BANK MARKETING PREDICTION
(BUILDING AN OPTIMAL MODEL TO FORECAST BANK MARKETING STRATEGY)

Pham Thi Phuong Trang, MA
University of Technology and Education - The University of Danang
trangpham3112@gmail.com

Abstract

Marketing links a bank's activities to the market and contributes to creating the bank's competitive position. Therefore, an intelligent model that can accurately predict the outcome of bank marketing campaigns has always been of great interest to investors and financial analysts. The objective of this paper is to propose a voting model to classify a bank marketing dataset. The voting model is constructed from three well-known individual artificial intelligence classification models: Support Vector Machine (SVM), Naive Bayes (NB) and Decision Tree (DT). Analytical results showed that the voting model was superior to the other comparative models on the bank marketing dataset. In particular, SVM-NB-DT was the best voting model, achieving the highest accuracy of 89.957%. Moreover, other voting models such as SVM-NB and NB-DT also performed well, with accuracies of 88.975% and 88.931%, respectively. Therefore, the voting-ensemble model is considered a suitable tool for predicting bank marketing data.

Keywords: Bank marketing, Decision Tree, Naive Bayes, Support Vector Machine, voting model.
1. Introduction

Bank marketing plays a vital role in establishing beneficial relationships among stakeholders. It can be considered as the bank's efforts to satisfy customer needs and realize profit goals. Therefore, the bank needs to orient the operations of its departments and its entire staff towards creating, maintaining and developing relationships with customers, the factor that determines its survival.

Accordingly, a wide range of methods have been proposed to improve bank marketing campaigns. For example, Hu used data mining (DM) to analyze retail bank customer attrition (Hu 2005), and Li et al. applied DM to credit card customer segmentation and target marketing (Li, Wu et al. 2011). Moreover, several works used a classification DM approach to build a predictive model that labels a data item into one of two classes (Abbas 2015, Asif 2018). Several DM algorithms can be used for classifying marketing contacts, each with its own purposes and capabilities. Popular DM techniques include Naive Bayes (NB), Decision Trees (DT) and Support Vector Machines (SVM).

NB, DT and SVM are well-known individual classification models and have been applied in many works. For example, an SVM-based classification model has been used to assess soil quality (Liu, Wang et al. 2016), and relevance vector regression (RVR) and the SVM have been used to predict the Rock Mass Rating of tunnel host rocks (Gholami, Rasouli et al. 2013). Ren et al. proposed a novel Naive Bayes classification algorithm for uncertain data (Ren, Lee et al. 2009). Moreover, many researchers have used the Naive Bayes model to classify web documents (Ren, Lee et al. 2009) and text documents (Zhang and Gao 2011).
Given the widespread application of SVM, NB and DT to classification problems, this study builds models based on SVM, NB and DT to improve bank marketing prediction. To the author's knowledge, no study has developed a voting-ensemble model to predict bank marketing outcomes. For this reason, the author uses NB, DT and SVM as baseline models to build a voting ensemble that combines multiple prediction models (or learning classifiers) to improve on the performance of the single (or baseline) models. The goal of the work is to propose the best model for bank marketing prediction. If the outcome of a bank marketing campaign is forecast correctly, a financial institution can find the best strategies to improve its next marketing campaign. To evaluate predictive performance, the models are compared in terms of accuracy, precision and sensitivity.

The remainder of this paper is organized as follows. Section 2 describes SVM, DT, NB, the voting method and the predictive evaluation methods. The collection and details of the bank marketing dataset, together with the analytical results, are presented in Section 3. Finally, conclusions are given in Section 4.

2. Methodology

2.1. Individual baseline models

2.1.1. Support vector machine

Introduced by Vapnik (1995), the SVM performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories.
The "support vectors" are the training sample points at the edge of the segment, while "machine" refers to the related algorithms in the field of machine learning (Zhang, Yang et al. 2015). The best hyperplane for an SVM is the one with the largest margin between the two classes, where the margin is the maximal width of the slab parallel to the hyperplane that has no interior data points. Figure 1 shows a basic structure of the binary support vector machines.

2.1.2. Naive Bayes

A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks. The classifier is based on Bayes' theorem. Despite its simplicity, Naive Bayes often outperforms more sophisticated classification methods (Langley, Iba et al. 1992). The Naive Bayesian classifier relies on two important assumptions. First, this simple scheme posits that the instances in each class can be summarized by a single probabilistic description, and that these descriptions are sufficient to distinguish the classes from one another. Second, it assumes that the attributes are conditionally independent given the class. Many researchers have found that this independence assumption does not hold in all cases, and alternative methods have been proposed to increase performance.

2.1.3. Decision Tree

This algorithm repeatedly splits the data set according to a criterion that maximizes the separation of the data, resulting in a tree-like structure (Zhang and Gao 2011). The most common criterion is information gain: at each split, the decrease in entropy due to the split is maximized. A major disadvantage of decision trees comes from the greedy construction process. At each step, the combination of single best variable and optimal split point is selected, whereas a multi-step lookahead that considers combinations of variables may obtain different results.

2.1.4. Voting method

Voting is a method by which a group makes a collective decision or expresses an opinion. In classification, voting is a method of combining multiple classifiers (Kittler, Hatef et al. 1998, Kuncheva 2007). The reasons for combining classifiers are efficiency and accuracy.

In this study, the author built four ensemble classifiers, each consisting of two or three different individual classifiers, giving seven models in total together with the three baselines. The two-classifier ensembles were SVM-NB, SVM-DT and NB-DT. The three-classifier ensemble was SVM-NB-DT.

2.2. Evaluation measures

The main performance measure used to assess the predictions of the proposed system was accuracy, the most commonly used index for judging classification results. Accuracy can be defined as the degree of uncertainty in a measurement with respect to an absolute standard. The predictive accuracy of a classification algorithm is calculated as follows, where tp, tn, fp and fn denote the numbers of true positives, true negatives, false positives and false negatives, respectively.

\[ \text{Accuracy} = \frac{tp + tn}{tp + fp + tn + fn} \tag{1} \]

Two extended versions of accuracy are precision and sensitivity. Precision measures the reproducibility of a measurement, whereas sensitivity, also called recall, measures the completeness. Precision in Eq. (2) is defined as the number of true positives as a proportion of the total number of true positives and false positives provided by the classifier. Sensitivity in Eq. (3) is the number of correctly classified positive examples divided by the number of positive examples in the data. In identifying positive labels, sensitivity is useful for estimating the effectiveness of a classifier.

\[ \text{Precision} = \frac{tp}{tp + fp} \tag{2} \]

\[ \text{Sensitivity} = \frac{tp}{tp + fn} \tag{3} \]

3. Numerical example: forecasting bank marketing

3.1. Data preparation

The dataset was obtained from the University of California, Irvine (UCI) machine learning repository website (http://archive.ics.uci.edu/ml). It includes a total of 17 features, including the response variable. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). The dataset describes a marketing campaign of a financial institution and can be analyzed to find strategies for improving the bank's future marketing campaigns. Table 1 describes the bank marketing dataset.

Table 1. Description of bank marketing dataset

Variable    Description
age         numeric, age of client
job         categorical, type of job (admin, unknown, unemployed, management, housemaid, ...)
marital     categorical, marital status (married, divorced, single; here "divorced" covers both divorced and widowed)
education   categorical (unknown, secondary, primary and tertiary)
default     binary, customer credit is in default (yes, no)
balance     numeric, average yearly balance (in euros)
housing     binary, status of housing loan (yes, no)
loan        binary, client's personal loan (yes, no)
contact     categorical, contact communication type (unknown, telephone, cellular)
day         numeric, last contact day of the month (1-31)
month       categorical, last contact month of the year
duration    numeric, last contact duration (in seconds)
campaign    numeric, number of contacts performed during this campaign
pdays       numeric, number of days that passed after the client was last contacted in a previous campaign
previous    numeric, number of contacts made before this campaign
poutcome    categorical, outcome of the previous marketing campaign
y           binary, output variable (desired target): whether the client subscribed to a term deposit or not
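Several categorical variables in Table 1 carry an "unknown" level, so records may need filtering before modelling. The following is a minimal sketch, not the author's actual preprocessing code: it assumes each record is a Python dict keyed by the Table 1 variable names, and the sample rows, the `select_and_clean` helper and its `missing` parameter are all hypothetical.

```python
# Hypothetical sample rows using a few of the Table 1 variable names.
ROWS = [
    {"age": 41, "job": "management", "education": "tertiary",
     "contact": "cellular", "y": "yes"},
    {"age": 35, "job": "unknown", "education": "secondary",
     "contact": "telephone", "y": "no"},
    {"age": 52, "job": "housemaid", "education": "unknown",
     "contact": "unknown", "y": "no"},
    {"age": 29, "job": "admin", "education": "primary",
     "contact": "unknown", "y": "no"},
]


def select_and_clean(rows, columns, missing=("unknown", "", None)):
    """Keep only `columns`, then drop rows where any kept value is missing.

    Treating "unknown", empty strings and None as missing is an assumption
    made for this sketch.
    """
    cleaned = []
    for row in rows:
        kept = {name: row.get(name) for name in columns}
        if any(value in missing for value in kept.values()):
            continue  # discard records with a missing/unknown value
        cleaned.append(kept)
    return cleaned


# Only the selected columns are checked, so an "unknown" value in a
# column that is not selected (here: contact) does not remove the row.
clean = select_and_clean(ROWS, ["age", "job", "education", "y"])
print(len(clean), "of", len(ROWS), "rows kept")
```

Selecting columns before filtering matters here: a row is discarded only when one of the attributes actually used for modelling is unknown.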
Cleaning the original data is an important first stage in the data mining process, because the modelling results depend on the accuracy of the data. Following Moro, Cortez et al. (2014), this study selected 9 of the 16 attributes that affect a bank marketing campaign (age, job, marital, education, default, housing, loan, duration and campaign) together with the response variable. After that, the author removed missing and unknown values to ensure data quality. The final data in this study therefore contains 43194 instances.

3.2. Analytical results

The performance of the models was evaluated in terms of accuracy, precision and sensitivity. Accuracy is the most commonly used index; high values of accuracy indicate favorable performance and vice versa. Table 2 compares the performance of the individual models (SVM, NB and DT) and the voting models when predicting the bank marketing data. Among the individual models, DT achieved the highest accuracy, precision and sensitivity (88.536%, 87.823% and 86.802%, respectively). Table 2 also shows that the SVM-NB-DT model yielded the best performance of all models, with 89.957% accuracy, 89.087% precision and 87.478% sensitivity. The study aimed to demonstrate the wide applicability of the SVM, NB and DT models to classification problems. In addition, the author combined these models through a voting strategy to create the model that performs best on the studied dataset. The combination of the three individual models achieved better performance than any of the individual models. Fig. 1 shows the performance comparison of all models in terms of accuracy, precision and sensitivity.

Table 2. Prediction performance comparison

Model        Accuracy (%)   Precision (%)   Sensitivity (%)
SVM          88.375         87.402          78.135
NB           88.308         87.811          86.024
DT           88.536         87.823          86.802
SVM-NB       88.975         88.611          78.106
SVM-DT       88.465         88.332          79.414
NB-DT        88.931         88.902          87.124
SVM-NB-DT    89.957         89.087          87.478

Fig. 1. Performance comparison of all models
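To make the voting scheme of Section 2.1.4 and the measures of Section 2.2 concrete, the sketch below combines the label predictions of three classifiers by majority (hard) voting and then computes accuracy, precision and sensitivity. The paper does not state whether hard or soft voting was used, nor how ties in two-classifier ensembles were resolved, so both choices here are assumptions. The toy labels, the function names and the `tie_break` parameter are hypothetical and are not taken from the study.

```python
from collections import Counter


def majority_vote(predictions, tie_break=0):
    """Hard voting: `predictions` is a list of per-model label lists.

    With two models a tie is possible; falling back to `tie_break`
    is an assumption of this sketch.
    """
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            combined.append(tie_break)
        else:
            combined.append(counts[0][0])
    return combined


def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, precision, sensitivity) from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, sensitivity


# Invented toy labels (1 = subscribed, 0 = not subscribed).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
svm = [1, 0, 0, 1, 0, 1, 1, 1]
nb = [1, 0, 1, 0, 0, 0, 1, 1]
dt = [0, 0, 1, 1, 1, 0, 1, 0]

voted = majority_vote([svm, nb, dt])
for name, pred in [("SVM", svm), ("NB", nb), ("DT", dt), ("SVM-NB-DT", voted)]:
    print(name, evaluate(y_true, pred))
```

On these invented labels the vote corrects errors that the individual models make on different samples, which is the intuition behind combining classifiers. It cannot correct a sample on which two of the three models are wrong.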
4. Conclusion

In this paper, the author proposed voting models that combine three individual models (SVM, NB and DT) to classify a bank marketing dataset. Three performance measures (accuracy, precision and sensitivity) were used to compare the predictive performance of the examined models. Overall, the voting-based SVM-NB-DT was the best model for this dataset. In particular, it achieved the highest values of accuracy, precision and sensitivity, with 89.957%, 89.087% and 87.478%, respectively. In future work, the author hopes to create more effective voting-ensemble models that can be applied to more datasets.

REFERENCES

Abbas, S. (2015). "Deposit subscribe Prediction using Data Mining Techniques based Real Marketing Dataset." International Journal of Computer Applications 110: 1-7.
Asif, M. (2018). Predicting the Success of Bank Telemarketing using various Classification Algorithms.
Gholami, R., V. Rasouli and A. Alimoradi (2013). "Improved RMR Rock Mass Classification Using Artificial Intelligence Algorithms." Rock Mechanics and Rock Engineering 46(5): 1199-1209.
Hu, X. (2005). "A Data Mining Approach for Retailing Bank Customer Attrition Analysis." Applied Intelligence 22(1): 47-60.
Kittler, J., M. Hatef, R. P. W. Duin and J. Matas (1998). "On combining classifiers." IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3): 226-239.
Kuncheva, L. I. (2007). Combining Pattern Classifiers: Methods and Algorithms.
Langley, P., W. Iba and K. Thompson (1992). "An analysis of Bayesian classifiers." Proceedings of the Tenth National Conference on Artificial Intelligence, 223-228.
Li, W., X. Wu, Y. Sun and Q. Zhang (2011). Credit Card Customer Segmentation and Target Marketing Based on Data Mining.
Liu, Y., H. Wang, H. Zhang and K. Liber (2016). "A comprehensive support vector machine-based classification model for soil quality assessment." Soil and Tillage Research 155: 19-26.
Moro, S., P. Cortez and P. Rita (2014). "A Data-Driven Approach to Predict the Success of Bank Telemarketing." Decision Support Systems 62.
Ren, J., S. D. Lee, X. Chen, B. Kao, R. Cheng and D. Cheung (2009). Naive Bayes Classification of Uncertain Data. 2009 Ninth IEEE International Conference on Data Mining.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York, Springer-Verlag.
Zhang, H., F. Yang, Y. Li and H. Li (2015). "Predicting profitability of listed construction companies based on principal component analysis and support vector machine: Evidence from China." Automation in Construction 53: 22-28.
Zhang, W. and F. Gao (2011). "An Improvement to Naive Bayes for Text Classification." Procedia Engineering 15: 2160-2164.