Journal of Guangxi Normal University (Natural Science Edition), 2011, Vol. 29, Issue (1): 133-137.

Content Extraction of Web Page Based on Extended Label Tree

XIA Tian1,2   

  1. Key Laboratory of Data Engineering and Knowledge Engineering, MOE, Beijing 100872, China;
  2. College of Information Resource Management, Renmin University of China, Beijing 100872, China
  • Received: 2010-12-29  Published: 2018-11-16

Abstract: A Web page content extraction method based on an extended label tree is proposed. While the extended label tree is being constructed, the page is cleaned, auxiliary information needed for extraction is attached, and position coordinates are assigned to the nodes. Text nodes are treated as identifiers of the content region: the set of neighboring text nodes with maximum coverage is selected and then revised to form the final content region. A neighbor first traversal algorithm is then used to locate the title node and extract additional properties. Experimental results show that the proposed method achieves high precision on common article pages and adapts well to different pages.
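
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. The abstract does not say how the tree is stored, how coverage is measured, or how the region is revised, so the choices below are assumptions made for illustration: a text node's position is its sequence index plus its tag path, a region is the set of text nodes sharing the nearest block-level ancestor, coverage is the total character count, and the title is the nearest preceding heading. The revision step is omitted. All names (`TextNodeCollector`, `region_of`, `extract`, `PAGE`) are hypothetical.

```python
from html.parser import HTMLParser

SKIP = {"head", "script", "style", "noscript", "iframe"}  # noise dropped during construction
VOID = {"br", "img", "hr", "meta", "link", "input"}        # tags that never get a closing tag
BLOCK = {"div", "td", "article", "section", "body"}        # assumed region containers
HEADINGS = {"h1", "h2", "h3"}                              # assumed title candidates


class TextNodeCollector(HTMLParser):
    """Collects text nodes together with a position: sequence index and tag path."""

    def __init__(self):
        super().__init__()
        self.path = []    # stack of (tag, unique id) for the currently open elements
        self.nodes = []   # one dict per retained text node
        self.counter = 0  # source of unique element ids

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.counter += 1
        self.path.append((tag, self.counter))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched closers are ignored.
        for i in range(len(self.path) - 1, -1, -1):
            if self.path[i][0] == tag:
                del self.path[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or any(tag in SKIP for tag, _ in self.path):
            return  # page cleaning: whitespace and noise subtrees are discarded
        self.nodes.append({"index": len(self.nodes), "path": list(self.path), "text": text})


def region_of(node):
    """Id of the nearest block-level ancestor, used as the region key (0 if none)."""
    for tag, uid in reversed(node["path"]):
        if tag in BLOCK:
            return uid
    return 0


def extract(html):
    """Return (title, body) for an HTML string, or (None, '') if no text is found."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    nodes = parser.nodes
    if not nodes:
        return None, ""
    # Coverage of a region = total characters of the text nodes it contains.
    coverage = {}
    for node in nodes:
        key = region_of(node)
        coverage[key] = coverage.get(key, 0) + len(node["text"])
    best = max(coverage, key=coverage.get)
    body = [node for node in nodes if region_of(node) == best]
    # Nearest-first search: walk backwards from the start of the content region.
    title = None
    for node in reversed(nodes[: body[0]["index"]]):
        if any(tag in HEADINGS for tag, _ in node["path"]):
            title = node["text"]
            break
    return title, "\n".join(node["text"] for node in body)


PAGE = """
<html><head><title>Site</title><style>p {color: red}</style></head>
<body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<h1>Extended label trees</h1>
<div id="main">
<p>Text nodes mark the content region of a page.</p>
<p>The region with the largest text coverage is kept as the body.</p>
</div>
<div id="foot">Copyright</div>
<script>var x = 1;</script>
</body></html>
"""

if __name__ == "__main__":
    title, body = extract(PAGE)
    print(title)
    print(body)
```

On the sample page, the navigation links, footer, and script are left out of the body because their regions cover less text than the main `div`. A real page would need the revision step and a more careful coverage measure than this sketch provides.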

Key words: Web page content extraction, extended label tree, neighbor first traversal

CLC Number: TP391.3