Journal of Guangxi Normal University (Natural Science Edition) ›› 2015, Vol. 33 ›› Issue (2): 42-48. doi: 10.16088/j.issn.1001-6600.2015.02.007

Chinese Page Keyword Extraction Method Based on Query Log Analysis

WANG Xiao-yan1, WANG Zhen-zhen2

  1. Concord College, Fujian Normal University, Fuzhou, Fujian 350117, China;
  2. School of Economics, Fujian Normal University, Fuzhou, Fujian 350108, China
  • Received: 2015-01-14 Online: 2015-02-10 Published: 2018-09-20
  • Corresponding author: WANG Zhen-zhen (born 1981), female, from Quanzhou, Fujian; lecturer at Fujian Normal University, Ph.D. E-mail: 2936712539@qq.com
  • Funding:
    National Social Science Foundation of China (14CJL001)

Abstract: Web search engines built on full-text indexing tend to return results of low relevance. To address this problem, this paper proposes a keyword extraction method for Chinese web pages based on query log analysis. The method selects keywords according to users' judgments of the relevance between a page and the query words. To quantify these judgments, three indicators are proposed: residence time per unit length, inverted click rate, and rank compensation factor. The three are then combined by comprehensive weighting. Query string segmentation, synonym recognition, polysemy disambiguation, and keyphrase matching also receive special treatment. Experimental results show that the extracted keywords have high precision, and that the overall performance is better than that of the TF-IDF and SVM methods. The proposed method achieves satisfactory keyword extraction results.

Key words: query log, keyword extraction, keyphrase matching, synonym recognition, polysemy disambiguation
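
The abstract names three indicators and a weighted combination but does not give their formulas. The following Python sketch shows one way such a scheme could be wired together. Every definition in it is an assumption made for illustration, not the authors' formulation: residence time per unit length is taken as mean dwell time divided by page length, inverted click rate is modelled by analogy with inverse document frequency, the rank compensation factor grows logarithmically with result position, and the weights, the `Click` record, and all function names are hypothetical.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Click:
    term: str             # one segmented query term that led to the click
    dwell_seconds: float  # time spent on the page after the click
    rank: int             # 1-based position of the page in the result list


def normalize(values):
    """Scale a dict of non-negative scores so the largest value is 1."""
    top = max(values.values(), default=0.0)
    return {k: (v / top if top > 0 else 0.0) for k, v in values.items()}


def score_terms(clicks, page_length, pages_clicked_per_term, total_pages,
                weights=(0.5, 0.3, 0.2)):
    """Score each query term as a keyword candidate for one page."""
    by_term = defaultdict(list)
    for c in clicks:
        by_term[c.term].append(c)

    dwell, icr, rank_comp = {}, {}, {}
    for term, group in by_term.items():
        # Assumed: mean dwell time divided by page length (seconds per character).
        dwell[term] = sum(c.dwell_seconds for c in group) / len(group) / page_length
        # Assumed, IDF-like: terms that lead to clicks on few pages score higher.
        icr[term] = math.log(total_pages / pages_clicked_per_term[term])
        # Assumed: a click on a low-ranked result counts for more.
        rank_comp[term] = sum(math.log2(1 + c.rank) for c in group) / len(group)

    # Put the three indicators on a common 0..1 scale before weighting.
    dwell, icr, rank_comp = normalize(dwell), normalize(icr), normalize(rank_comp)
    w1, w2, w3 = weights
    return {t: w1 * dwell[t] + w2 * icr[t] + w3 * rank_comp[t] for t in by_term}


def extract_keywords(scores, k=3):
    """Return the k highest-scoring terms."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Illustrative data only; not taken from the paper.
clicks = [Click("keyword", 60, 5), Click("keyword", 40, 3), Click("news", 5, 1)]
scores = score_terms(clicks, page_length=1000,
                     pages_clicked_per_term={"keyword": 2, "news": 50},
                     total_pages=100)
top_terms = extract_keywords(scores, k=1)
```

Normalizing each indicator before weighting keeps any one of them from dominating simply because of its units. The segmentation, synonym, and disambiguation steps mentioned in the abstract are outside the scope of this sketch.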

CLC number:

  • G356.6