Semi-structured Information Extraction Applying Automatic Pattern Discovery
Author: C.-H. Chang, S.-C. Lui, Y.-C. Wu
Publish Year: 2000-12-06
Update by: March 30, 2025
Abstract
Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work in wrapper induction, such as WIEN, Stalker, and Softmealy, aims to solve this problem by applying machine learning to generate extractors automatically. However, this approach still requires human intervention to provide training examples. A second line of research therefore tries to save this human effort. For example, Embley et al. and Chang et al. present different approaches to record boundary identification in a single Web page without any training examples. Embley's work relies on the intra-page structure constructed by HTML tags (the parse tree), while Chang's work is motivated by the repeated patterns formed by multiple aligned records. This paper extends Chang's work to IE and discusses the issues that arise when applying pattern discovery to record identification, including the encoding schemes for HTML and the ranking criteria used to select the patterns that mark record boundaries.
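To make the idea of repeated patterns concrete, the following is a minimal Python sketch, not the method of the paper. It assumes a simple tag-level encoding (each tag becomes one token, any text between tags becomes `TEXT`), enumerates repeated token sequences by brute force, and ranks them by a hypothetical coverage score (pattern length times the number of non-overlapping occurrences). The names `encode`, `repeated_patterns`, and `rank` are illustrative; the encoding schemes and ranking criteria actually studied are not specified in this abstract.

```python
import re
from collections import defaultdict

# Matches an opening or closing HTML tag and captures "/" (if any) and the tag name.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def encode(html):
    """Assumed encoding: one token per tag, 'TEXT' for non-blank text between tags."""
    tokens = []
    pos = 0
    for m in TAG_RE.finditer(html):
        if html[pos:m.start()].strip():
            tokens.append("TEXT")
        tokens.append("<" + m.group(1) + m.group(2).upper() + ">")
        pos = m.end()
    if html[pos:].strip():
        tokens.append("TEXT")
    return tokens


def repeated_patterns(tokens, min_len=2, min_count=2):
    """Brute-force search: map each token n-gram to its non-overlapping occurrence count."""
    positions = defaultdict(list)
    n = len(tokens)
    for length in range(min_len, n // min_count + 1):
        for i in range(n - length + 1):
            positions[tuple(tokens[i:i + length])].append(i)

    result = {}
    for pattern, starts in positions.items():
        count, next_free = 0, 0
        for s in starts:  # starts are already in increasing order
            if s >= next_free:  # greedy selection of non-overlapping matches
                count += 1
                next_free = s + len(pattern)
        if count >= min_count:
            result[pattern] = count
    return result


def rank(patterns):
    """Hypothetical criterion: sort by coverage (length * count), longest first on ties."""
    return sorted(
        patterns.items(),
        key=lambda pc: (len(pc[0]) * pc[1], len(pc[0])),
        reverse=True,
    )


if __name__ == "__main__":
    page = (
        "<ul><li><b>A</b> one</li><li><b>B</b> two</li>"
        "<li><b>C</b> three</li></ul>"
    )
    best, count = rank(repeated_patterns(encode(page)))[0]
    print(count, " ".join(best))
```

On this toy page the top-ranked pattern is the token sequence of one list item, so its occurrences mark candidate record boundaries. The brute-force enumeration is quadratic in the page length and is used here only for clarity.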