Journal of Guangxi Normal University (Natural Science Edition), 2011, Vol. 29, Issue (1): 157-161.

Method of Chinese Expert Entity Homepage Recognition (中文专家实体主页识别方法研究)

LI Li-na (李丽娜)1, YU Zheng-tao (余正涛)1,2, WANG Ya-sheng (王亚盛)1, MAO Cun-li (毛存礼)1,2, GUO Jian-yi (郭剑毅)1,2

  1. College of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650051, China;
  2. Institute of Intelligent Information Processing, Computer Technology Application Key Laboratory of Yunnan Province, Kunming, Yunnan 650051, China
  • Received: 2010-12-29 Online: 2007-03-25 Published: 2018-11-16
  • Corresponding author: YU Zheng-tao (born 1970), male, from Qujing, Yunnan; professor at Kunming University of Science and Technology, Ph.D. E-mail: ztyu@hotmail.com
  • Supported by:
    National Natural Science Foundation of China (60863011); Key Project of the Natural Science Foundation of Yunnan Province (2008CC023); Yunnan Province Reserve Talent Fund for Young and Middle-Aged Academic and Technical Leaders (2007PY01-11)

Abstract: Expert entity homepage recognition is an important component of expert search. This paper proposes a machine learning method based on J48 to classify and recognize the homepages of Chinese expert entities. First, 2113 Chinese expert entities and their corresponding homepages are collected manually; expert entity features related to links and webpage content are defined according to the characteristics of Chinese expert entities, and these features are extracted to form a training data set. Then, different learning algorithms are applied to pages described by different feature sets in order to find the most effective classification features and homepage recognition algorithm. Finally, the different features and algorithms are tested. The experimental results show that the J48 algorithm, combined with both link and webpage content features, performs best, reaching a homepage recognition accuracy of 81.05%.

Key words: Chinese expert entity, homepage recognition, link feature, webpage feature, J48
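The pipeline the abstract describes (extract link and content features from candidate pages, then let a decision-tree learner pick the most discriminative ones) can be illustrated with a minimal sketch. J48 is Weka's implementation of C4.5, and C4.5 ranks candidate splits by gain ratio, so the sketch computes that criterion only; it is not a full tree learner and not the authors' code. The feature names, cue words, URLs, and sample pages below are all hypothetical, since the abstract does not list the actual feature definitions.

```python
import math
from collections import Counter
from urllib.parse import urlparse

# Hypothetical cue words; the paper's real content features are not given in the abstract.
CONTENT_CUES = ("教授", "研究方向", "简历", "professor", "research interests")


def extract_features(url, title, text, expert_name):
    """Map one candidate page to a dict of boolean link and content features."""
    path = urlparse(url).path
    depth = len([segment for segment in path.split("/") if segment])
    lowered = text.lower()
    return {
        # Link features (assumed for illustration).
        "url_has_tilde": "~" in path,
        "url_shallow": depth <= 2,
        # Content features (assumed for illustration).
        "name_in_title": expert_name in title,
        "has_cue_word": any(cue in lowered for cue in CONTENT_CUES),
    }


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """Information gain of `feature` divided by its split information (C4.5 criterion)."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = -sum(len(g) / total * math.log2(len(g) / total) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0


# Toy, made-up pages: (url, title, text, expert name, is_homepage).
pages = [
    ("http://cs.example.edu/~zhang/", "张三 教授 主页", "张三 教授 研究方向 机器学习", "张三", True),
    ("http://cs.example.edu/faculty/zhang.html", "张三", "张三 简历", "张三", True),
    ("http://news.example.com/2010/12/29/a/b.html", "学术报告通知", "张三 教授 将作报告", "张三", False),
    ("http://forum.example.com/t/1/2/3", "讨论", "关于 张三 的 论文", "张三", False),
]
rows = [extract_features(url, title, text, name) for url, title, text, name, _ in pages]
labels = [is_homepage for *_, is_homepage in pages]

# Rank features by gain ratio, as a C4.5-style learner would when choosing a split.
ranking = sorted(rows[0], key=lambda f: gain_ratio(rows, labels, f), reverse=True)
print(ranking)
```

On real data the same feature dictionaries could be exported to ARFF and passed to Weka's J48, or to any other decision-tree learner, to compare link-only, content-only, and combined feature sets as the abstract describes.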

CLC number:

  • TP391.3