OLERA: A semi-supervised approach for Web data extraction with visual support
Authors: C.-H. Chang, S.-C. Kuo
Publish Year: 2004-12
Update by: March 31, 2025
Abstract
Information extraction (IE) from semi-structured Web documents plays an important role for a variety of information agents. Over the past few years, researchers have developed a rich family of generic IE techniques based on supervised approaches, which learn extraction rules from user-labelled training examples. However, annotating training data can be expensive when thousands of data sources need to be wrapped. In this article, we introduce OLERA, a semi-supervised IE system that produces extraction rules without detailed annotation of the training documents. Instead, a rough segment that contains everything that needs to be extracted in one record is given as an example. OLERA is designed with visualization support such that the discovered records are displayed in a spreadsheet-like table for schema assignment. The experiments show that OLERA performs well for program-generated Web pages with very few training pages and little user intervention.