Journal of Guangxi Normal University (Natural Science Edition) ›› 2011, Vol. 29 ›› Issue (1): 92-97.


A Study of Semi-supervised Clustering Based on Feature Weighting

LI Jia, WANG Ming-wen, HE Shi-zhu, KE Li

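  1. College of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022, China
  • Received: 2010-12-14 Published: 2018-11-16
  • Corresponding author: WANG Ming-wen (born 1964), male, from Nankang, Jiangxi; professor and doctoral supervisor at Jiangxi Normal University. E-mail: mwwang@jxnu.edu.cn
  • Funding:
    National Natural Science Foundation of China (60963014); Natural Science Foundation of Jiangxi Province (2008GZS0052)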

Semi-supervised Clustering with Feature Weighting

LI Jia, WANG Ming-wen, HE Shi-zhu, KE Li   

  1. College of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022, China
  • Received: 2010-12-14 Published: 2018-11-16

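Abstract: Current semi-supervised clustering methods do not cluster well, especially when the number of classes that have label information is smaller than the number of classes in the whole data set. Building on existing semi-supervised clustering techniques, this paper uses feature weighting to raise the similarity among documents of the same class and thereby obtain better clustering results. To verify the effectiveness of this idea, experiments were conducted not only on monolingual data sets but also on a Chinese-English bilingual data set in which the class labels are available only for Chinese or only for English documents. The experimental results show that the method performs well.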

Keywords: partial label information, feature weighting, multilingual, semi-supervised clustering

Abstract: Semi-supervised clustering has emerged in recent years as a new research direction in machine learning and an important branch of data mining, and it has gradually become a useful tool in many areas. However, current semi-supervised clustering methods achieve poor clustering accuracy, especially when the labeled information covers fewer classes than the entire data set contains. Building on existing semi-supervised clustering techniques, this paper uses feature weighting to increase the similarity among documents of the same cluster and thereby obtain better clustering results. To verify the validity of this idea, experiments are carried out not only on monolingual data sets but also on a Chinese-English bilingual data set in which the labeled documents are only in Chinese or only in English. The experimental results show that the method performs well.

Key words: partial label information, feature weighting, multi-language, semi-supervised clustering

CLC number (Chinese Library Classification):

  • TP181
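
The abstract describes the idea only at a high level: use the available labels to weight features so that documents of the same class become more similar, then cluster, even when some classes have no labeled documents at all. This page does not give the paper's weighting formula or clustering algorithm, so the following is only a minimal sketch under stated assumptions. The function names `feature_weights` and `seeded_kmeans`, the variance-ratio weighting, the `alpha` parameter, and the toy data are all hypothetical choices made for illustration, and the clustering step is a plain seeded k-means style procedure, not necessarily the one used in the paper.

```python
import numpy as np

def feature_weights(X, labels, alpha=1.0):
    """Hypothetical weighting (an assumption, not the paper's formula).

    Features whose labeled class means differ strongly from the global
    mean, relative to their spread inside the labeled classes, get a
    weight close to 1 + alpha; uninformative features stay close to 1.
    labels[i] is a class id for labeled rows and -1 for unlabeled rows.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels[labels >= 0])
    means = np.array([X[labels == c].mean(axis=0) for c in classes])
    between = ((means - X.mean(axis=0)) ** 2).mean(axis=0)
    within = np.mean([X[labels == c].var(axis=0) for c in classes], axis=0)
    return 1.0 + alpha * between / (between + within + 1e-12)

def seeded_kmeans(X, labels, k, weights=None, n_iter=20):
    """k-means on weighted features, seeded from the labeled classes.

    Assumes at least one labeled class. Classes without labels get their
    initial center from the point farthest from all existing centers.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    Xw = X if weights is None else X * weights
    classes = np.unique(labels[labels >= 0])
    centers = [Xw[labels == c].mean(axis=0) for c in classes]
    while len(centers) < k:
        d = np.min([((Xw - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Xw[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Xw[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):  # keep the old center if a cluster empties
                centers[j] = Xw[assign == j].mean(axis=0)
    return assign

# Toy data (invented): three groups of "documents"; the third feature is noise.
X = np.array([[5, 0, 0], [5, 0, 1], [4, 0, 0],
              [0, 5, 1], [0, 5, 0], [0, 4, 1],
              [0, 0, 0], [0, 0, 1], [1, 1, 0]])
# Only two of the three classes have labeled documents; -1 means unlabeled.
labels = np.array([0, 0, -1, 1, 1, -1, -1, -1, -1])

w = feature_weights(X, labels)
assign = seeded_kmeans(X, labels, k=3, weights=w)
print(w, assign)
```

In this sketch the labels influence the result in two places: they set the feature weights and they seed the initial centers. The third cluster, which has no labeled documents, is discovered from the unlabeled data alone, mirroring the partial-label setting the abstract describes.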