Study on Phishing Website Detection Based on Improved Stacking Strategy

doi:10.16088/j.issn.1001-6600.2021071201

Journal of Guangxi Normal University (Natural Science Edition), 2022, Vol. 40, Issue 3: 132-140.

HU Qiang, LIU Qian, ZHOU Hangxia^*

College of Information Engineering, China Jiliang University, Hangzhou, Zhejiang 310018, China

Received: 2021-07-12 Revised: 2021-08-03 Online: 2022-05-25 Published: 2022-05-27
Corresponding author: ZHOU Hangxia (b. 1963), female, from Hangzhou, Zhejiang; professor at China Jiliang University. E-mail: zhx@cjlu.edu.cn
Funding:
Open Project of the Key Laboratory of the Ministry of Public Security (2021DSJSYS004); Zhejiang Provincial Basic Public Welfare Research Program (LGF18F020017)

Abstract

Abstract: Most existing phishing website detection techniques suffer from low accuracy, high consumption of computing resources, and delayed detection. To address these problems, this paper proposes a phishing website detection method based on an improved Stacking strategy. The method integrates multiple base learners with strong classification performance into a single high-performance model through Stacking, and feeds both the first-level input features and the first-level predictions to the second level as its input features. This makes full use of the accuracy and speed of the individual models and further improves the performance of the combined model. Experimental results show that, compared with traditional machine-learning phishing website detection techniques, the ensemble learning algorithm performs better on multiple metrics on a dataset on the order of 100,000 samples, reaching a precision of 97.82% and an F₁ score of 97.54%, and can therefore detect phishing websites effectively.

Key words: phishing website, base learner, Stacking algorithm, feature extraction, ensemble learning
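The core idea in the abstract, passing the first-level input features to the second level together with the first-level predictions, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the base learners, the meta-learner, or the feature set, so the random forest, gradient boosting, and logistic regression models below, the synthetic data from make_classification, and all hyperparameters are assumptions chosen only for illustration. scikit-learn's StackingClassifier with passthrough=True is used here as one readily available way to obtain this kind of second-level input.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for website features (assumption: not the paper's dataset).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# First level: hypothetical base learners, picked for illustration only.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gbdt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# passthrough=True gives the second-level learner the original features
# alongside the cross-validated first-level predictions.
model = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
    passthrough=True,
)
model.fit(X_train, y_train)

# Evaluate with the two metrics reported in the abstract.
y_pred = model.predict(X_test)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Width of the second-level input: original features plus prediction columns.
meta_width = model.final_estimator_.n_features_in_
print(f"precision={precision:.4f}  F1={f1:.4f}  meta-features={meta_width}")
```

In this binary setup each base learner contributes one probability column, so the second-level input is as wide as the original feature vector plus one column per base learner. The scores printed on synthetic data say nothing about the figures reported in the paper.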

CLC number: TP393.08

HU Qiang, LIU Qian, ZHOU Hangxia. Study on Phishing Website Detection Based on Improved Stacking Strategy[J]. Journal of Guangxi Normal University (Natural Science Edition), 2022, 40(3): 132-140.

