Sequential Pattern Mining for Web Extraction Rule Generalization

Author: C.-H. Chang

Publication Date: 2002-07-14

Updated: March 30, 2025

Abstract

Information extraction (IE) is an important problem in information integration, with broad applications, and it is an attractive application area for machine learning. The core of the problem is learning extraction rules from given input. This paper extends IEPAD, a pattern discovery approach, to the rapid generation of information extractors that extract structured data from semi-structured Web documents. IEPAD was proposed to automate wrapper generation from a multiple-record Web page without user-labeled examples. In this paper, we consider another situation, in which multiple Web pages are available but each input page contains only one record (called a singular page). To solve this problem, we propose a hierarchical multiple string alignment approach that generates extraction rules from multiple singular pages. In addition, the same method can be applied to IEPAD for finer feature extraction.
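
To illustrate the general idea of aligning several singular pages into one extraction rule, here is a minimal sketch. It is not the paper's hierarchical algorithm: it folds pages together with plain pairwise longest-common-subsequence alignment, and the names `tokenize`, `align_pair`, `generalize`, `extract`, the `<*>` slot marker, and the sample pages are all hypothetical, introduced only for this example. Tokens shared by every page are treated as template, and positions where pages disagree collapse into a slot that holds the data to extract.

```python
import re
from functools import reduce

SLOT = "<*>"  # hypothetical wildcard marking a position where pages differ


def tokenize(page):
    """Split a page into HTML tags and text runs (a deliberately crude encoding)."""
    parts = re.findall(r"<[^>]+>|[^<]+", page)
    return [p.strip() for p in parts if p.strip()]


def align_pair(a, b):
    """Generalize two token sequences: keep their LCS, mark the rest as slots."""
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        else:
            # Adjacent mismatches collapse into a single slot.
            if not out or out[-1] != SLOT:
                out.append(SLOT)
            if lcs[i + 1][j] >= lcs[i][j + 1]:
                i += 1
            else:
                j += 1
    # Leftover tokens on either side also become (at most) one trailing slot.
    if (i < n or j < m) and (not out or out[-1] != SLOT):
        out.append(SLOT)
    return out


def generalize(pages):
    """Fold pairwise alignment over all pages (progressive, not hierarchical)."""
    return reduce(align_pair, (tokenize(p) for p in pages))


def extract(pattern, tokens):
    """Greedily match a pattern against tokens; return slot contents or None."""
    fields, j = [], 0
    for k, p in enumerate(pattern):
        if p != SLOT:
            if j >= len(tokens) or tokens[j] != p:
                return None
            j += 1
        else:
            nxt = pattern[k + 1] if k + 1 < len(pattern) else None
            start = j
            while j < len(tokens) and tokens[j] != nxt:
                j += 1
            fields.append(tokens[start:j])
    return fields if j == len(tokens) else None


# Three made-up singular pages, each holding exactly one record.
pages = [
    "<html><b>Title</b><i>Dune</i></html>",
    "<html><b>Title</b><i>Emma</i></html>",
    "<html><b>Title</b><i>Ulysses</i></html>",
]
pattern = generalize(pages)
record = extract(pattern, tokenize("<html><b>Title</b><i>Persuasion</i></html>"))
```

With these sample pages, `pattern` keeps the shared tags and the constant label `Title`, and the single slot lands where the record value varies. The greedy matcher in `extract` is a simplification that assumes the token following a slot does not also occur inside the slot's content.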