Journal of Guangxi Normal University (Natural Science Edition) ›› 2019, Vol. 37 ›› Issue (1): 142-148. doi: 10.16088/j.issn.1001-6600.2019.01.016

• Special Column of the 24th China Conference on Information Retrieval •

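Quantity Optimization of Virtual Sample Generation with Two Kinds of Upper Bound Conditions

LIN Yue1,2, LIU Tingzhang2*, WANG Zhehe1

  1. College of Science, Hainan Tropical Ocean University, Sanya, Hainan 572022, China;
  2. College of Automation, Shanghai University, Shanghai 200072, China
  • Received: 2018-06-18 Online: 2019-01-20 Published: 2019-01-08
  • Corresponding author: LIU Tingzhang (born 1967), male, from Yuanping, Shanxi; professor and doctoral supervisor at Shanghai University. E-mail: liutzhcom@163.com
  • Funding:
    National Natural Science Foundation of China (61273190); Natural Science Foundation of Hainan Province (117150)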

Abstract: For small-sample data sets, virtual sample generation (VSG) has been shown to effectively improve the performance of machine learning algorithms. However, there is no clear conclusion about the optimal number of virtual samples to generate. First, given an upper bound on the standard variance of the training samples, information entropy theory is used to study the optimal number of virtual samples. Second, taking into account the noise introduced by the virtual samples, a general probability model and an analysis method for the optimal number of virtual samples are established at a given confidence level (0.95). Finally, a small-sample data set is built from historical fault-monitoring data recorded in 2016 at a substation in Huzhou, Zhejiang, and four related virtual sample generation experiments are designed. The results show that both rules for the optimal number of virtual samples are effective and that the prediction accuracy of the corresponding machine learning models is improved.

Key words: small sample, machine learning, virtual sample, information entropy, confidence level

CLC number: TP181