Journal of Guangxi Normal University (Natural Science Edition), 2020, Vol. 38, Issue 2: 72-80. DOI: 10.16088/j.issn.1001-6600.2020.02.008
DUAN Huajuan (段化娟)1,2, WEI Yongqing (尉永清)2,3*, LIU Peiyu (刘培玉)1,2, ZHOU Peng (周鹏)1,2
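Abstract: When handling imbalanced datasets, to reduce the impact of class overlap on classification performance and to avoid both the overfitting caused by over-sampling and the information loss caused by under-sampling, this paper proposes UAMDT (multi-decision tree based on under-sampling and attribute selection). The method first processes the data by combining Tomek link under-sampling with ensemble under-sampling, obtaining multiple balanced subsets. It then builds a single decision tree on each balanced subset, using a hybrid attribute measure that combines information gain and the Gini index as the attribute selection criterion to choose the optimal splitting attribute for the root node of each tree. Finally, the single decision trees are integrated into a multi-decision tree. Experiments on 10 imbalanced datasets, evaluated with multiple metrics, verify the effectiveness and feasibility of the proposed algorithm.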
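The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract leaves several details open, so the following are assumptions made only for illustration: every function name is hypothetical; only the majority-class member of each Tomek link is removed; the hybrid measure is taken as the plain sum of information gain and Gini decrease (the abstract does not give the formula); each "single decision tree" is reduced to a one-split stump; and the trees are combined by majority vote.

```python
import math
import random
from collections import Counter

def tomek_links(X, y):
    # A Tomek link is a pair of mutual nearest neighbours with different labels.
    nn = [min((j for j in range(len(X)) if j != i),
              key=lambda j: math.dist(X[i], X[j])) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def drop_majority_in_links(X, y, majority):
    # ASSUMPTION: remove only the majority-class member of each link.
    drop = {k for pair in tomek_links(X, y) for k in pair if y[k] == majority}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]

def balanced_subsets(X, y, majority, n_subsets, seed=0):
    # Ensemble under-sampling: each subset keeps every minority sample
    # plus an equally sized random draw from the majority class.
    rng = random.Random(seed)
    maj = [k for k, label in enumerate(y) if label == majority]
    mino = [k for k, label in enumerate(y) if label != majority]
    size = min(len(maj), len(mino))
    subsets = []
    for _ in range(n_subsets):
        idx = rng.sample(maj, size) + mino
        subsets.append(([X[k] for k in idx], [y[k] for k in idx]))
    return subsets

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hybrid_score(y, left, right):
    # ASSUMPTION: hybrid measure = information gain + Gini decrease.
    n = len(y)
    def decrease(impurity):
        return (impurity(y) - len(left) / n * impurity(left)
                - len(right) / n * impurity(right))
    return decrease(entropy) + decrease(gini)

def fit_stump(X, y):
    # One-split stand-in for a single decision tree: choose the
    # (feature, threshold) pair with the highest hybrid score.
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X})[:-1]:
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = hybrid_score(y, left, right)
            if best is None or score > best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]

def predict(stumps, row):
    # ASSUMPTION: the ensemble is combined by simple majority vote.
    votes = Counter(l if row[f] <= t else r for f, t, l, r in stumps)
    return votes.most_common(1)[0][0]
```

A typical call sequence would be `drop_majority_in_links`, then `balanced_subsets`, then `fit_stump` on each subset, with `predict` voting over the resulting stumps. Drawing the majority samples independently for each subset is what lets the ensemble see most of the majority class even though each individual tree sees only a balanced fraction of it.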