Design and Implementation of the wrapper generation system for Web-based Information Extraction

Author: Shao-Chen Lui (呂紹誠)

Publish Year: 2001-07

Update by: March 30, 2025

Abstract

Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Since building wrappers by hand is tedious and error-prone, research in this field emphasizes the automatic generation of wrappers that can extract particular information from semi-structured Web documents. Previous work, such as WIEN, STALKER, and SoftMealy, aims to learn extraction rules from user-provided training examples, using labeled training pages and grammar induction to generate the rules automatically. However, this approach still requires human intervention to provide the training examples. In this paper, we propose IEPAD, a system that automatically discovers extraction rules from Web pages. IEPAD includes three components: an extraction rule generator, which accepts an input Web page; a graphical user interface, called the rule viewer, which shows the record patterns discovered; and an extractor module, which extracts the desired information from similar Web pages according to the extraction rule chosen by the user. The system automatically identifies record boundaries by pattern mining and multiple sequence alignment. Furthermore, attribute values can be extracted by multi-level extraction. This new approach to IE takes less human effort than other approaches and involves no content-dependent heuristics. Experimental results show that the constructed extraction rules achieve a 97 percent extraction rate over fourteen popular search engines.
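To make the idea of finding record boundaries by pattern mining more concrete, the following Python sketch looks for a repeated tag pattern in a page and uses it to split the page into records. It is only an illustration under stated assumptions, not the IEPAD implementation: the abstract does not specify the mining algorithm, so the sketch substitutes brute-force enumeration of token n-grams, requires exact repeats (it performs no multiple sequence alignment), and treats each text token inside a record as one attribute value. All names (`TagTokenizer`, `tokenize`, `occurrences`, `find_record_pattern`, `extract_records`) are hypothetical.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Flatten a page into tag tokens; every non-blank text run becomes 'TEXT'."""

    def __init__(self):
        super().__init__()
        self.tokens = []  # e.g. "<li>", "TEXT", "</li>"
        self.texts = []   # parallel list: the text for "TEXT" tokens, else None

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")
        self.texts.append(None)

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")
        self.texts.append(None)

    def handle_data(self, data):
        if data.strip():
            self.tokens.append("TEXT")
            self.texts.append(data.strip())


def tokenize(html):
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens, parser.texts


def occurrences(tokens, pattern):
    """Start offsets of non-overlapping exact matches, scanning left to right."""
    starts, i, n = [], 0, len(pattern)
    while i + n <= len(tokens):
        if tuple(tokens[i:i + n]) == pattern:
            starts.append(i)
            i += n
        else:
            i += 1
    return starts


def find_record_pattern(tokens, min_len=3, min_count=2):
    """Pick the repeated token pattern that covers the most tokens.

    Assumption: brute-force n-gram enumeration stands in for the mining step.
    """
    candidates = set()
    for n in range(min_len, len(tokens) // min_count + 1):
        for i in range(len(tokens) - n + 1):
            candidates.add(tuple(tokens[i:i + n]))
    scored = []
    for pattern in candidates:
        count = len(occurrences(tokens, pattern))
        if count >= min_count:
            # Rank by coverage, then by length; the pattern itself breaks ties.
            scored.append((len(pattern) * count, len(pattern), pattern))
    return max(scored)[2] if scored else None


def extract_records(tokens, texts, pattern):
    """Split the page at each pattern match and return the text values per record."""
    n = len(pattern)
    return [
        [t for t in texts[start:start + n] if t is not None]
        for start in occurrences(tokens, pattern)
    ]


if __name__ == "__main__":
    page = "<ul><li><b>A</b> one</li><li><b>B</b> two</li><li><b>C</b> three</li></ul>"
    tokens, texts = tokenize(page)
    pattern = find_record_pattern(tokens)
    print(pattern)
    print(extract_records(tokens, texts, pattern))
```

Because the sketch matches patterns exactly, a record with a missing or extra tag would not be recognized; tolerating such variation is the role the abstract assigns to multiple sequence alignment.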