Automatic Extraction of Information Blocks Using PAT Trees
Authors: C.-H. Chang, C.-N. Hsu
Publication Year: 1999
Updated: March 30, 2025
Abstract
Information extraction from semi-structured Web documents is a critical issue for software agents on the Internet. Previous work in wrapper induction aims to solve this problem by applying machine learning to generate extractors automatically, but this approach still requires human intervention to provide training examples. In this paper, we present a novel approach that extracts information blocks without training examples, using a data structure called a PAT tree. PAT trees allow the system to efficiently recognize repeated patterns in a semi-structured Web page. From these repeated patterns, information blocks can be located using domain-independent selection criteria. The entire system runs automatically without any human intervention. Experimental results show that our approach performs well, with a recall rate near 90 percent on a wide range of output pages from popular search engines.
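The idea of finding repeated patterns in a page can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not build a PAT tree: as a simpler stand-in, it sorts the suffixes of a token sequence and compares neighbors, which exposes the same repeated substrings far less efficiently. The tokenization scheme, the function names tokenize and repeated_patterns, and the min_len parameter are all assumptions made for illustration.

```python
import re


def tokenize(html):
    """Reduce a page to a token sequence: one token per tag, 'TEXT' per text run.

    Assumed encoding for illustration; the paper's own encoding may differ.
    """
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        if tag:
            tokens.append(tag.lower())
        elif text.strip():
            tokens.append("TEXT")
    return tokens


def repeated_patterns(tokens, min_len=2):
    """Map each repeated token pattern to the sorted list of its start positions.

    Sorted suffixes stand in for the PAT tree: suffixes that share a prefix
    end up next to each other, so every shared prefix of two neighbors is a
    repeated pattern.
    """
    n = len(tokens)
    order = sorted(range(n), key=lambda i: tokens[i:])

    candidates = set()
    for a, b in zip(order, order[1:]):
        # Length of the common prefix of two neighboring suffixes.
        length = 0
        while (a + length < n and b + length < n
               and tokens[a + length] == tokens[b + length]):
            length += 1
        if length >= min_len:
            candidates.add(tuple(tokens[a:a + length]))

    # Collect every occurrence of each candidate with a direct scan.
    result = {}
    for pattern in candidates:
        k = len(pattern)
        result[pattern] = [i for i in range(n - k + 1)
                           if tuple(tokens[i:i + k]) == pattern]
    return result


page = "<ul><li><a>A</a></li><li><a>B</a></li><li><a>C</a></li></ul>"
patterns = repeated_patterns(tokenize(page))
record = ("<li>", "<a>", "TEXT", "</a>", "</li>")
print(patterns[record])
```

On this toy page the pattern that corresponds to one list item occurs three times at regular intervals. Regularity of that kind is one signal a selection step could use to choose the information block, although the specific criteria used in the paper are not reproduced here.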