“OnLine Extraction Rule Analysis for Semi-structured Documents” (Information Extraction Methods for Semi-structured Documents)
Authors: C.-H. Chang, S.-C. Kuo
Publish Year: 2005
Updated: March 31, 2025
Abstract
Information extraction (IE) from semi-structured Web documents plays an important role for a variety of information agents. Over the past decade, researchers have developed a rich family of generic IE techniques based on supervised approaches that learn extraction rules from user-labelled training examples. However, annotating training data can be expensive when many data sources need to be extracted. In this article, we introduce annotation-free IE using pattern mining and string alignment techniques. We describe OLERA, a semi-supervised IE system that produces extraction rules by aligning similar contents of multiple input records and presents the result in a spreadsheet-like table. Users therefore do not need to annotate the input documents; they only need to specify the schema for the extracted data after the extraction pattern is discovered. Another advantage is that this approach works not only for multi-record Web pages (to which some unsupervised IE approaches are limited) but also for single-record Web pages.
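The idea of aligning similar records into a spreadsheet-like table can be illustrated with a minimal sketch. This is not OLERA's actual algorithm: it assumes the records are already tokenized, and it uses Python's standard difflib.SequenceMatcher as a stand-in for the string alignment technique the abstract mentions. The names align_records and variable_columns, the gap marker, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def align_records(a, b, gap="-"):
    """Align two token lists column by column, padding the shorter side with `gap`."""
    row_a, row_b = [], []
    # autojunk=False disables difflib's popularity heuristic for long inputs.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        seg_a, seg_b = a[i1:i2], b[j1:j2]
        width = max(len(seg_a), len(seg_b))
        row_a.extend(seg_a + [gap] * (width - len(seg_a)))
        row_b.extend(seg_b + [gap] * (width - len(seg_b)))
    return row_a, row_b


def variable_columns(row_a, row_b):
    """Return the column indices where the aligned rows differ (candidate data slots)."""
    return [i for i, (x, y) in enumerate(zip(row_a, row_b)) if x != y]


# Two hypothetical records from the same page template, already tokenized.
rec1 = ["<b>", "Title A", "</b>", "<i>", "2004", "</i>"]
rec2 = ["<b>", "Title B", "</b>", "<br>", "<i>", "2005", "</i>"]

row_a, row_b = align_records(rec1, rec2)
for row in (row_a, row_b):
    print(" | ".join(row))
print("variable columns:", variable_columns(row_a, row_b))
```

Columns that stay constant across rows correspond to template text, while columns that vary are candidates for extracted fields. In a system like the one described, the user would then name those columns rather than label the input documents.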