Journal of Guangxi Normal University (Natural Science Edition) ›› 2022, Vol. 40 ›› Issue (3): 132-140. doi: 10.16088/j.issn.1001-6600.2021071201

• Research Article •

Study on Phishing Website Detection Based on Improved Stacking Strategy

HU Qiang, LIU Qian, ZHOU Hangxia*

  1. College of Information Engineering, China Jiliang University, Hangzhou Zhejiang 310018, China
  • Received: 2021-07-12 Revised: 2021-08-03 Online: 2022-05-25 Published: 2022-05-27
  • Corresponding author: ZHOU Hangxia (born 1963), female, from Hangzhou, Zhejiang; professor at China Jiliang University. E-mail: zhx@cjlu.edu.cn
  • Funding:
    Open Project of the Key Laboratory of the Ministry of Public Security (2021DSJSYS004); Zhejiang Provincial Basic Public Welfare Research Program (LGF18F020017)

Abstract: To address the low accuracy, high computational cost, and delayed detection of most current phishing website detection techniques, this paper proposes a detection method based on an improved Stacking strategy. The method combines several base learners with strong classification performance into a single high-performance model through Stacking, and feeds both the first-level input features and the first-level predictions into the second level as its input features. This exploits the high accuracy and fast speed of the individual models and further improves overall performance. Experimental results show that, compared with traditional machine learning techniques for phishing website detection, the ensemble learning algorithm performs better on multiple metrics on a dataset on the order of 100,000 samples, reaching a precision of 97.82% and an F1 score of 97.54%, and can therefore detect phishing websites effectively.

Key words: phishing website, base learner, Stacking algorithm, feature extraction, ensemble learning
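
The second-level input described in the abstract (first-level input features concatenated with first-level predictions) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the base learners are toy one-feature threshold rules, the second-level learner is a small logistic regression, and the data are synthetic. None of these are the paper's actual models, features, or dataset, and the printed scores say nothing about the paper's results.

```python
import math
import random
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], int]


def make_stump_fitter(feature: int):
    """Fitter for a one-feature threshold rule (toy stand-in for a base learner)."""
    def fit(X: List[List[float]], y: List[int]) -> Model:
        best_err, best_t, best_sign = len(y) + 1, 0.0, 1
        for t in sorted({row[feature] for row in X}):
            for sign in (1, -1):
                err = sum(int(sign * (row[feature] - t) >= 0) != label
                          for row, label in zip(X, y))
                if err < best_err:
                    best_err, best_t, best_sign = err, t, sign
        return lambda x: int(best_sign * (x[feature] - best_t) >= 0)
    return fit


def fit_logistic(X, y, lr: float = 0.1, epochs: int = 200) -> Model:
    """Logistic regression trained by plain SGD (toy second-level learner)."""
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the bias term

    def score(x):
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

    for _ in range(epochs):
        for x, label in zip(X, y):
            z = max(-30.0, min(30.0, score(x)))  # clamp to keep exp() safe
            grad = 1.0 / (1.0 + math.exp(-z)) - label
            w[0] -= lr * grad
            for i, xi in enumerate(x):
                w[i + 1] -= lr * grad * xi
    return lambda x: int(score(x) >= 0.0)


def out_of_fold(X, y, fitters, k: int = 5):
    """First-level predictions for each row, made by models that never saw it."""
    oof = [[0.0] * len(fitters) for _ in X]
    for fold in range(k):
        held_out = set(range(fold, len(X), k))
        X_tr = [x for i, x in enumerate(X) if i not in held_out]
        y_tr = [t for i, t in enumerate(y) if i not in held_out]
        for j, fit in enumerate(fitters):
            model = fit(X_tr, y_tr)
            for i in held_out:
                oof[i][j] = float(model(X[i]))
    return oof


def augment(X, preds):
    """Second-level input: original features followed by first-level predictions."""
    return [list(x) + list(p) for x, p in zip(X, preds)]


rng = random.Random(0)


def sample(n: int):
    """Synthetic two-feature data; the label is 1 when the features sum above 1."""
    X = [[rng.random(), rng.random()] for _ in range(n)]
    return X, [int(x0 + x1 > 1.0) for x0, x1 in X]


X_train, y_train = sample(200)
X_test, y_test = sample(100)

fitters = [make_stump_fitter(0), make_stump_fitter(1)]
meta_train = augment(X_train, out_of_fold(X_train, y_train, fitters))
meta_model = fit_logistic(meta_train, y_train)

# At prediction time the base learners are refit on all training data.
base_models = [fit(X_train, y_train) for fit in fitters]
meta_test = augment(X_test, [[float(m(x)) for m in base_models] for x in X_test])
y_pred = [meta_model(row) for row in meta_test]

tp = sum(p == 1 and t == 1 for p, t in zip(y_pred, y_test))
fp = sum(p == 1 and t == 0 for p, t in zip(y_pred, y_test))
fn = sum(p == 0 and t == 1 for p, t in zip(y_pred, y_test))
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"precision={precision:.3f} f1={f1:.3f}")
```

Out-of-fold predictions are used for the second-level training set so that the second-level learner is not trained on predictions the base learners made for their own training rows. In scikit-learn, `StackingClassifier(..., passthrough=True)` performs the same concatenation of original features and base-learner predictions, which may be a more practical starting point with real base learners.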

CLC number:

  • TP393.08