Journal of Guangxi Normal University(Natural Science Edition) ›› 2011, Vol. 29 ›› Issue (1): 138-142.

Automatic Web News Content Extraction Based on CRFs

ZHANG Chun-yuan   

  1. College of Information Science and Technology, Hainan University, Haikou, Hainan 570228, China
  • Received: 2010-12-29 Published: 2018-11-16

Abstract: Most previous work on Web information extraction makes little use of the associations among Web page blocks. To address this problem, this paper proposes an automatic Web news content extraction approach based on conditional random fields (CRFs). First, it parses a target news page into a DOM tree. After eliminating invalid nodes, pruning subtrees, and deleting single nodes in the tree, it uses heuristic rules to segment the DOM tree into blocks and converts these blocks into a data sequence. Then, it defines feature functions to extract each block's own state features and the category transition features of neighboring blocks. Finally, by labeling the data sequence with CRFs, it identifies each block's category and extracts the page's content. Experimental results indicate that this approach is precise and adaptable for Web news content extraction, and that incorporating associations among page blocks can improve Web news content extraction.
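The labeling step can be illustrated with a minimal sketch of linear-chain decoding over a block sequence. The abstract does not give the paper's feature functions, weights, or label set, so everything named here is an assumption for illustration: the two labels `content` and `noise`, the block attributes `text_len` and `link_ratio`, the thresholds, and the hand-set weights (a real CRF would learn the weights from labeled pages). The sketch only shows how state features and transition features between neighboring blocks combine in Viterbi decoding.

```python
from typing import Dict, List

# Hypothetical label set; the paper's actual block categories are not listed in the abstract.
LABELS = ["content", "noise"]

def state_score(block: Dict[str, float], label: str) -> float:
    """State features: score one block's own properties for a label.

    The features, thresholds, and weights are illustrative assumptions, not learned values.
    """
    long_text = 1.0 if block["text_len"] >= 80 else 0.0
    link_heavy = 1.0 if block["link_ratio"] >= 0.5 else 0.0
    if label == "content":
        return 2.0 * long_text - 2.0 * link_heavy
    return -1.0 * long_text + 2.0 * link_heavy

# Transition features: neighboring blocks tend to share a category (assumed weights).
TRANSITION = {
    ("content", "content"): 1.0,
    ("noise", "noise"): 1.0,
    ("content", "noise"): -0.5,
    ("noise", "content"): -0.5,
}

def viterbi(blocks: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring label sequence for the block sequence."""
    if not blocks:
        return []
    # best[t][y] = score of the best path that ends at block t with label y
    best = [{y: state_score(blocks[0], y) for y in LABELS}]
    back = []  # back[t - 1][y] = best previous label when block t has label y
    for block in blocks[1:]:
        scores, pointers = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANSITION[(p, y)])
            scores[y] = best[-1][prev] + TRANSITION[(prev, y)] + state_score(block, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# A made-up page: navigation bar, paragraph, short paragraph, paragraph, footer.
blocks = [
    {"text_len": 20, "link_ratio": 0.9},
    {"text_len": 300, "link_ratio": 0.05},
    {"text_len": 40, "link_ratio": 0.0},
    {"text_len": 250, "link_ratio": 0.1},
    {"text_len": 30, "link_ratio": 0.8},
]
labels = viterbi(blocks)
print(labels)  # ['noise', 'content', 'content', 'content', 'noise']
```

In this toy input the third block scores the same for both labels on its own state features, and the transition scores from its neighbors are what label it as content. This is the kind of inter-block association the abstract refers to.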

Key words: Web information extraction, conditional random fields, Web page segmentation

CLC Number: TP391