Credit scoring based on semi-supervised generalized additive logistic regression
FANG Kuangnan1,2, CHEN Zilan3
1. Department of Statistics, School of Economics, Xiamen University, Xiamen 361000, China; 2. The MOE Key Laboratory of Econometrics (Xiamen University), Xiamen 361000, China; 3. Student Affairs Office, Xiamen University, Xiamen 361000, China
Abstract: Traditional credit scoring models rely mainly on supervised learning. In real lending problems, however, labeled samples are costly and slow to obtain, whereas unlabeled samples are abundant. To make full use of the information in unlabeled samples, this paper proposes a credit scoring model based on semi-supervised generalized additive (SSGA) logistic regression. The model uses both labeled and unlabeled samples, and it estimates the model parameters and selects the significant variables at the same time. Simulation experiments show that the proposed model performs significantly better than the supervised model in extrapolation prediction and variable selection. Finally, the model is applied to assessing the default risk of personal credit loans.