More Efficient Clustering Algorithm Over Uncertain Data

Journal of Guangxi Normal University (Natural Science Edition), 2011, Vol. 29, Issue 2: 161-166.

LI Yun-fei, WANG Li-zhen, ZHOU Li-hua

School of Information Science and Engineering, Yunnan University, Kunming, Yunnan 650091, China

Received: 2011-05-08 Published: 2018-11-19
Corresponding author: WANG Li-zhen (born 1962), female, native of Lijiang, Yunnan; professor at Yunnan University, Ph.D. E-mail: lzhwang2005@126.com
Funding:
National Natural Science Foundation of China (61063008); Research Fund of the Yunnan Provincial Department of Education (09Y0048); Scientific Research Fund of Yunnan University (2009F29Q)

Abstract

Abstract: Clustering uncertain data is an important research topic in data mining, with far-reaching real-world applications. This paper reviews the uk-means algorithm for clustering uncertain data and its improved variant, ck-means. Because ck-means must compute the distance between every cluster and the centroid of every object, its efficiency remains poor when the sample is large. The proposed kd-means algorithm computes distances from each object to only a subset of the centroids, which substantially improves on the efficiency of ck-means. The improvement is based on a kd-tree index, and extensive experiments demonstrate its effectiveness.

Key words: kd-tree, ck-means algorithm, expected centroid, candidate set, pruning
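
The abstract does not spell out the kd-means procedure, so the following is only a rough sketch of the general idea suggested by the key words (expected centroid, candidate set, pruning), not the authors' algorithm. It relies on the identity behind ck-means, E||X - c||^2 = ||E[X] - c||^2 + (a term independent of c), so each uncertain object can be assigned by its expected centroid alone. The pruning rule used here (min/max distance to a node's bounding box) is a stand-in chosen for illustration, and every name in the code (`expected_centroid`, `Node`, `kd_assign`, `leaf_size`) is hypothetical.

```python
def expected_centroid(samples):
    """Probability-weighted mean of one uncertain object's (point, prob) samples."""
    total = sum(p for _, p in samples)
    dim = len(samples[0][0])
    return tuple(sum(pt[i] * p for pt, p in samples) / total for i in range(dim))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def min_sq_dist(c, lo, hi):
    """Smallest squared distance from c to any point of the box [lo, hi]."""
    return sum(max(lo_i - x, 0.0, x - hi_i) ** 2 for x, lo_i, hi_i in zip(c, lo, hi))


def max_sq_dist(c, lo, hi):
    """Largest squared distance from c to any point of the box [lo, hi]."""
    return sum(max(abs(x - lo_i), abs(x - hi_i)) ** 2 for x, lo_i, hi_i in zip(c, lo, hi))


class Node:
    """kd-tree node over (object_id, expected_centroid) pairs (assumed layout)."""

    def __init__(self, items, depth=0, leaf_size=4):
        dim = len(items[0][1])
        self.lo = tuple(min(c[i] for _, c in items) for i in range(dim))
        self.hi = tuple(max(c[i] for _, c in items) for i in range(dim))
        self.items, self.children = [], []
        if len(items) > leaf_size:
            axis = depth % dim  # cycle through the axes, split at the median
            items = sorted(items, key=lambda it: it[1][axis])
            mid = len(items) // 2
            self.children = [Node(items[:mid], depth + 1, leaf_size),
                             Node(items[mid:], depth + 1, leaf_size)]
        else:
            self.items = items


def _assign(node, reps, candidates, labels, stats):
    # A representative whose minimum distance to the box exceeds the smallest
    # maximum distance among the candidates cannot be nearest to any point inside.
    best_max = min(max_sq_dist(reps[j], node.lo, node.hi) for j in candidates)
    candidates = [j for j in candidates
                  if min_sq_dist(reps[j], node.lo, node.hi) <= best_max]
    for child in node.children:
        _assign(child, reps, candidates, labels, stats)
    for obj_id, c in node.items:
        stats["distances"] += len(candidates)  # exact distances actually computed
        labels[obj_id] = min(candidates, key=lambda j: sq_dist(c, reps[j]))


def kd_assign(objects, reps, leaf_size=4):
    """Assign each uncertain object to its nearest cluster representative."""
    items = [(i, expected_centroid(o)) for i, o in enumerate(objects)]
    labels, stats = [None] * len(objects), {"distances": 0}
    root = Node(items, leaf_size=leaf_size)
    _assign(root, reps, list(range(len(reps))), labels, stats)
    return labels, stats["distances"]
```

Calling `kd_assign(objects, reps)` returns the cluster label of each object together with the number of exact object-to-representative distances evaluated, which can be compared against the `len(objects) * len(reps)` evaluations of a full scan. This covers a single assignment step only; a full clustering loop would also recompute the representatives after each pass.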

CLC number: TP391

LI Yun-fei, WANG Li-zhen, ZHOU Li-hua. More Efficient Clustering Algorithm Over Uncertain Data[J]. Journal of Guangxi Normal University (Natural Science Edition), 2011, 29(2): 161-166.

