Journal of Guangxi Normal University (Natural Science Edition), 2022, Vol. 40, Issue 3: 185-193. doi: 10.16088/j.issn.1001-6600.2021071801


Data-driven Method for Automatic Machine Learning Pipeline Generation

CHEN Gaojian, WANG Jing*, LI Qianwen, YUAN Yunjing, CAO Jiachen   

  1. Beijing Key Laboratory on Integration and Analysis of Large-Scale Stream Data, North China University of Technology, Beijing 100144, China
  • Received: 2021-07-18; Revised: 2021-09-09; Online: 2022-05-25; Published: 2022-05-27

Abstract: Automatic machine learning (AutoML) is an important problem at the forefront of machine learning research. AutoML tools compose machine learning primitives into pipelines according to the dataset and the task requirements, so that domain users can carry out data analysis without specialist machine learning knowledge. However, current AutoML tools generally suffer from long running times and low precision. This paper proposes a data-driven method for automatic machine learning pipeline generation based on dataset similarity and reinforcement learning. The method uses historical knowledge from similar datasets to guide pipeline generation. Experimental results show that the proposed method reduces the time cost to the minute level while also improving pipeline performance.

Key words: automatic machine learning, dataset similarity, MCTS, reinforcement learning

CLC Number: TP181
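
The abstract's idea of letting similar historical datasets guide pipeline generation can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the meta-features (rows, columns, classes), the distance-based similarity measure, the `HISTORY` records, and the function names are all hypothetical. The sketch ranks historical datasets by similarity to a new dataset and turns the primitives of their best pipelines into prior weights that a search procedure (for example MCTS) could use to bias its choices.

```python
import math
from collections import defaultdict

# Hypothetical history: per dataset, a meta-feature vector
# [n_rows, n_columns, n_classes], its best known pipeline, and that pipeline's score.
HISTORY = {
    "iris_like": {"meta": [150.0, 4.0, 3.0],
                  "pipeline": ["StandardScaler", "SVC"], "score": 0.96},
    "wine_like": {"meta": [178.0, 13.0, 3.0],
                  "pipeline": ["StandardScaler", "RandomForest"], "score": 0.97},
    "big_sparse": {"meta": [50000.0, 300.0, 2.0],
                   "pipeline": ["PCA", "LogisticRegression"], "score": 0.88},
}


def similarity(a, b):
    """Similarity in (0, 1] derived from Euclidean distance on log-scaled meta-features."""
    dist = math.sqrt(sum((math.log1p(x) - math.log1p(y)) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)


def nearest_datasets(target_meta, history, k=2):
    """Return the k most similar historical datasets as (name, similarity) pairs."""
    scored = [(name, similarity(target_meta, rec["meta"])) for name, rec in history.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def primitive_priors(target_meta, history, k=2):
    """Weight each primitive by similarity * historical score, normalized to sum to 1."""
    weights = defaultdict(float)
    for name, sim in nearest_datasets(target_meta, history, k):
        rec = history[name]
        for primitive in rec["pipeline"]:
            weights[primitive] += sim * rec["score"]
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


if __name__ == "__main__":
    new_dataset_meta = [160.0, 5.0, 3.0]
    print(nearest_datasets(new_dataset_meta, HISTORY))
    print(primitive_priors(new_dataset_meta, HISTORY))
```

In this sketch the priors only bias which primitives are tried first; the search itself, and any reinforcement learning component, would still have to evaluate candidate pipelines on the new dataset.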