Journal of Guangxi Normal University (Natural Science Edition) ›› 2020, Vol. 38 ›› Issue (2): 72-80. doi: 10.16088/j.issn.1001-6600.2020.02.008

• CCIR2019

An Improved Multi-decision Tree Algorithm for Imbalanced Classification

DUAN Huajuan1,2, WEI Yongqing2,3*, LIU Peiyu1,2, ZHOU Peng1,2

  1. School of Information Science and Engineering, Shandong Normal University, Jinan, Shandong 250358, China;
  2. Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Jinan, Shandong 250358, China;
  3. Basic Education Department, Shandong Police College, Jinan, Shandong 250014, China
  • Received: 2019-10-10 Published: 2020-04-02
  • Corresponding author: WEI Yongqing (born 1963), female, from Jinan, Shandong; professor at Shandong Police College. E-mail: weiyongqing@sdpc.edu.cn
  • Funding:
    National Social Science Fund of China (19BYY076); Shandong Provincial Social Science Planning Project (18CXWJ01)

Abstract: To reduce the impact of class overlap when classifying imbalanced datasets, while avoiding both the over-fitting caused by over-sampling and the information loss caused by under-sampling, this paper proposes UAMDT, a multi-decision tree method based on under-sampling and attribute selection. First, Tomek link under-sampling is combined with ensemble under-sampling to process the data and obtain multiple balanced subsets. Next, a single decision tree is built on each balanced subset, using a hybrid attribute measure that combines information gain and the Gini index as the attribute selection criterion to choose the optimal split attribute for the root node of each tree. Finally, the single decision trees are integrated into a multi-decision tree. Experiments with multiple evaluation metrics on 10 imbalanced datasets verify the effectiveness and feasibility of the proposed algorithm.

Key words: imbalanced data, multi-decision tree, Tomek link under-sampling, ensemble under-sampling, attribute selection
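
The sampling stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice to drop only the majority-class member of each Tomek link, and the random draw of majority samples for each balanced subset are all assumptions made here for illustration.

```python
import math
import random

def nearest_neighbor(i, X):
    """Index of the point closest to X[i] (Euclidean), excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def remove_tomek_links(X, y, majority_label):
    """Drop the majority-class member of every Tomek link.

    A Tomek link is a pair of opposite-class points that are each other's
    nearest neighbor. Removing only the majority member is an assumption
    of this sketch.
    """
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:
            drop.add(i if y[i] == majority_label else j)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

def balanced_subsets(X, y, majority_label, n_subsets, seed=0):
    """Pair all minority samples with equally sized random majority draws.

    Assumes the majority class still has at least as many samples as the
    minority class after cleaning.
    """
    rng = random.Random(seed)
    maj = [i for i, label in enumerate(y) if label == majority_label]
    mino = [i for i, label in enumerate(y) if label != majority_label]
    subsets = []
    for _ in range(n_subsets):
        idx = rng.sample(maj, len(mino)) + mino
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets

# Toy data (hypothetical): label 0 is the majority class, label 1 the minority.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
     (5.0, 5.0), (5.2, 5.0), (8.0, 8.0), (8.0, 8.5)]
y = [0, 0, 0, 0, 0, 1, 1, 1]
X_clean, y_clean = remove_tomek_links(X, y, majority_label=0)
subsets = balanced_subsets(X_clean, y_clean, majority_label=0, n_subsets=3)
```

In this toy data the majority point (5.0, 5.0) and the minority point (5.2, 5.0) form a Tomek link, so the majority point is removed before the balanced subsets are drawn. Each subset then keeps every minority sample, which is how ensemble under-sampling limits the information loss of a single under-sampled set.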

CLC number:

  • TP391
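
The attribute selection and integration steps named in the abstract can be sketched as follows. The abstract does not give the formula of the hybrid attribute measure or the rule used to integrate the trees, so the weighted sum in `hybrid_score`, its `alpha` parameter, and the plurality vote in `majority_vote` are hypothetical stand-ins, not the paper's definitions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_reduction(values, labels, impurity):
    """Parent impurity minus the weighted impurity of the children
    obtained by splitting on each distinct attribute value."""
    n = len(labels)
    groups = {}
    for v, label in zip(values, labels):
        groups.setdefault(v, []).append(label)
    children = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - children

def hybrid_score(values, labels, alpha=0.5):
    """HYPOTHETICAL hybrid measure: weighted sum of information gain and
    Gini reduction. The paper's actual formula is not stated in the abstract."""
    return (alpha * split_reduction(values, labels, entropy)
            + (1 - alpha) * split_reduction(values, labels, gini))

def best_root_attribute(rows, labels):
    """Index of the categorical attribute with the highest hybrid score."""
    n_attrs = len(rows[0])
    return max(range(n_attrs),
               key=lambda a: hybrid_score([r[a] for r in rows], labels))

def majority_vote(predictions):
    """Combine the per-tree predictions for one sample by plurality
    (an assumed integration rule)."""
    return Counter(predictions).most_common(1)[0][0]

# Toy balanced subset (hypothetical): attribute 0 separates the classes,
# attribute 1 carries no information.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [0, 0, 1, 1]
root = best_root_attribute(rows, labels)
```

On this toy subset, attribute 0 yields an information gain of 1 bit and a Gini reduction of 0.5, while attribute 1 yields zero for both, so attribute 0 would be chosen for the root. Repeating this on each balanced subset gives one tree per subset, whose predictions are then combined.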
Copyright © Editorial Office of Journal of Guangxi Normal University (Natural Science Edition)