Content Extraction of Web Page Based on Extended Label Tree

Journal of Guangxi Normal University (Natural Science Edition) ›› 2011, Vol. 29 ›› Issue (1): 133-137.

XIA Tian^1,2

1. Key Laboratory of Data Engineering and Knowledge Engineering, MOE, Beijing 100872, China;
2. College of Information Resource Management, Renmin University of China, Beijing 100872, China

Received: 2010-12-29 Published: 2018-11-16
Corresponding author: XIA Tian (born 1978), male, from Linqu, Shandong; lecturer at Renmin University of China, Ph.D. E-mail: iamxiatian@gmail.com
Funding:
National Natural Science Foundation of China project (09CTQ027); Key Project of Science and Technology Research, Ministry of Education (109005); Renmin University of China Scientific Research Fund project (22382078)

Abstract

This paper proposes a web page content extraction method based on an extended label tree. Constructing the extended label tree cleans the page and completes the auxiliary information needed for extraction, and each node is assigned coordinates that locate its position in the tree. The text nodes that make up the body content serve as markers of the content region: the set of neighboring text nodes with the maximum text coverage is selected and then revised to form the final content region. A neighbor-first traversal algorithm then locates the title node and extracts additional attributes. Experimental results show that the method achieves high-precision extraction on common article pages and has good adaptability.

Key words: Web page content extraction, extended label tree, neighbor-first traversal
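
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and the following are assumptions made only for illustration: the names `Node`, `TreeBuilder`, `parse_text_nodes`, `tree_distance`, `content_region` and `nearest_heading`, the use of child-index tuples as node coordinates, the `max_gap` threshold for deciding that two text nodes are neighbors, and a backward nearest-first scan standing in for the neighbor-first traversal. The paper's revision step for the content region and its extraction of additional attributes are not modeled.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript"}  # cleaning: text inside these is dropped
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements never closed
HEADING_TAGS = {"h1", "h2", "h3"}


class Node:
    """Tree node; coord is the tuple of child indices on the path from the root."""

    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []
        self.coord = ()
        if parent is not None:
            self.coord = parent.coord + (len(parent.children),)
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    """Builds the tree and records non-empty text nodes in document order."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.skip_depth = 0
        self.text_nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth and tag not in VOID_TAGS:
            self.current = Node(tag, self.current)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag not in VOID_TAGS:
            node = self.current
            while node is not self.root and node.tag != tag:
                node = node.parent
            if node is not self.root:  # a stray closing tag is ignored
                self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.text_nodes.append(Node("#text", self.current, text))


def parse_text_nodes(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.text_nodes


def tree_distance(a, b):
    """Edges between two coordinates, measured through their common ancestor."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return len(a) + len(b) - 2 * common


def content_region(text_nodes, max_gap=4):
    """Group consecutive text nodes within max_gap edges of each other and
    return the group that covers the most text."""
    groups = []
    for node in text_nodes:
        if groups and tree_distance(groups[-1][-1].coord, node.coord) <= max_gap:
            groups[-1].append(node)
        else:
            groups.append([node])
    return max(groups, key=lambda g: sum(len(n.text) for n in g), default=[])


def nearest_heading(text_nodes, region):
    """Scan backward from the region, nearest node first, for heading text."""
    if not region:
        return None
    start = text_nodes.index(region[0])
    for node in reversed(text_nodes[:start]):
        if node.parent.tag in HEADING_TAGS:
            return node
    return None
```

For a page whose body sits in two sibling `<p>` elements, with an `<h1>` in a separate preceding block, `content_region(parse_text_nodes(page))` returns the two paragraph text nodes and `nearest_heading` returns the `<h1>` text node. The grouping is sensitive to `max_gap`, so a real page set would likely need the threshold tuned or replaced by the paper's own neighbor criterion.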

CLC number: TP391.3

XIA Tian. Content Extraction of Web Page Based on Extended Label Tree[J]. Journal of Guangxi Normal University (Natural Science Edition), 2011, 29(1): 133-137.
