Journal of Guangxi Normal University (Natural Science Edition) ›› 2011, Vol. 29 ›› Issue (1): 167-172.


Extraction of Web Mathematical Formulas Based on Nutch

CUI Lin-wei, SU Wei, GUO Wei, LI Lian

  1. College of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu 730000, China
  • Received: 2010-12-22 Published: 2018-11-16
  • Corresponding author: SU Wei (born 1977), male, from Baoding, Hebei; lecturer at Lanzhou University, Ph.D. E-mail: suwei@lzu.edu.cn
  • Funding:
    National Natural Science Foundation of China (61003139, 60903102); Fundamental Research Funds for the Central Universities, Lanzhou University (lzujbky-2010-90)

Abstract: This paper studies methods for recognizing and extracting mathematical formulas in a formula-based mathematics search engine. It summarizes the characteristic features of formulas in MathML, OpenMath, LaTeX, and infix formats as they appear in Web pages, and proposes a recognition and extraction method based on these features and heuristic rules. Experiments demonstrate the feasibility and accuracy of the method.

Key words: search engine, crawler, formula search, mathematical formulas, MathML, OpenMath
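
The abstract does not list the concrete features or heuristic rules, so the following is only a minimal sketch of what feature-based recognition might look like, not the authors' method. It assumes MathML formulas appear inside `<math>` elements, OpenMath objects inside `<OMOBJ>` elements, and LaTeX formulas between common delimiters (`$...$`, `$$...$$`, `\(...\)`, `\[...\]`). The names `PATTERNS` and `extract_formulas` are hypothetical. Infix detection is omitted because it needs heuristics that the abstract does not describe, and the sketch works on a plain HTML string rather than inside a Nutch plugin.

```python
import re

# Illustrative patterns only (assumptions, not the rules used in the paper).
PATTERNS = {
    # MathML: a <math> element, optionally namespace-prefixed (e.g. <m:math>).
    "MathML": re.compile(
        r"<(?:\w+:)?math\b.*?</(?:\w+:)?math>", re.DOTALL | re.IGNORECASE
    ),
    # OpenMath: an <OMOBJ> element, optionally namespace-prefixed.
    "OpenMath": re.compile(
        r"<(?:\w+:)?OMOBJ\b.*?</(?:\w+:)?OMOBJ>", re.DOTALL
    ),
    # LaTeX: $$...$$, \[...\], \(...\), or single-dollar inline math.
    "LaTeX": re.compile(
        r"\$\$.+?\$\$|\\\[.+?\\\]|\\\(.+?\\\)|(?<!\$)\$[^$\n]+?\$(?!\$)",
        re.DOTALL,
    ),
}


def extract_formulas(html):
    """Return (format, matched_text) pairs for every formula-like span in html."""
    found = []
    for fmt, pattern in PATTERNS.items():
        for match in pattern.finditer(html):
            found.append((fmt, match.group(0)))
    return found


if __name__ == "__main__":
    page = (
        '<p><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math></p>'
        "<p>Inline $x^2 + 1$ here</p>"
        "<OMOBJ><OMI>1</OMI></OMOBJ>"
    )
    for fmt, text in extract_formulas(page):
        print(fmt, text)
```

Regular expressions keep the sketch short, but a real crawler would more likely walk the parsed DOM and apply additional rules to reject false positives such as currency amounts written with dollar signs.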

CLC number: 

  • TP391.3