Journal of Guangxi Normal University (Natural Science Edition), 2015, Vol. 33, Issue 2: 42-48. doi: 10.16088/j.issn.1001-6600.2015.02.007
WANG Xiao-yan (王晓艳)1, WANG Zhen-zhen (王珍珍)2
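Abstract: Web search engines built on full-text indexing suffer from low retrieval relevance. To address this problem, this paper proposes a keyword extraction method for Chinese web pages based on query log analysis. The method selects keywords using users' judgments of the relevance between web pages and query terms. To quantify these judgments, three indicators are proposed: dwell time per unit of page length, inverse click-through rate, and a rank compensation factor, which are then combined by weighting. Special handling is also applied to query string word segmentation, synonym identification and polysemous word disambiguation, and key phrase assembly. Experimental results show that the extracted keywords have high precision, and that overall performance exceeds that of the TF-IDF and SVM methods. The method achieves satisfactory keyword extraction results.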
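The abstract names the three indicators but does not give their formulas, so the following is only a minimal sketch of how such a weighted combination might look. Every definition in it is an assumption made for illustration, not the authors' method: dwell time per unit length is taken as seconds divided by page length in characters, inverse click-through rate is modeled by analogy with IDF (a query term that leads to clicks on fewer distinct pages scores higher), the rank compensation factor is taken as `log2(1 + rank)` so that clicks on lower-ranked results earn more credit, and the weights `(0.5, 0.3, 0.2)` are arbitrary. The names `Click`, `score_terms`, and `top_keywords` are hypothetical.

```python
import math
from collections import defaultdict
from typing import NamedTuple


class Click(NamedTuple):
    """One query-log record (hypothetical schema)."""
    term: str             # segmented query term
    page: str             # identifier of the clicked page
    rank: int             # 1-based position of the page in the result list
    dwell_seconds: float  # time the user stayed on the page
    page_length: int      # page length in characters


def score_terms(clicks, weights=(0.5, 0.3, 0.2)):
    """Return {page: {term: score}} from a weighted sum of three indicators.

    All three indicator formulas below are assumptions for illustration.
    """
    groups = defaultdict(list)
    pages_per_term = defaultdict(set)
    all_pages = set()
    for c in clicks:
        groups[(c.page, c.term)].append(c)
        pages_per_term[c.term].add(c.page)
        all_pages.add(c.page)

    raw = {}
    for (page, term), cs in groups.items():
        # Assumed: mean dwell time per character of page content.
        dwell = sum(c.dwell_seconds / c.page_length for c in cs) / len(cs)
        # Assumed: IDF-like measure over the pages clicked for this term.
        inverse_click = math.log(1 + len(all_pages) / len(pages_per_term[term]))
        # Assumed: extra credit for clicks on lower-ranked results.
        rank_comp = sum(math.log2(1 + c.rank) for c in cs) / len(cs)
        raw[(page, term)] = (dwell, inverse_click, rank_comp)

    if not raw:
        return {}

    # Scale each indicator by its maximum so the weights are comparable.
    maxima = [max(v[i] for v in raw.values()) or 1.0 for i in range(3)]
    scores = defaultdict(dict)
    for (page, term), values in raw.items():
        scores[page][term] = sum(
            w * v / m for w, v, m in zip(weights, values, maxima)
        )
    return scores


def top_keywords(scores, page, k=3):
    """Return up to k terms for a page, highest score first."""
    ranked = sorted(scores.get(page, {}).items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]
```

In this sketch, a term that users searched for and then read at length on a given page ranks above a term that produced only brief visits. The synonym handling, disambiguation, and phrase assembly steps mentioned in the abstract are not modeled here.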