基於頁面層級之快速網頁資料擷取與綱要驗證
Efficient Web Data Extraction via Page-Level Schema Induction & Verification
Author: 陳天盛
Publish Year: 2014-07
Update by: March 31, 2025
Abstract
The problem of automatically extracting data from web pages has been studied for more than ten years. However, existing research remains limited by the high structural complexity of web pages. In addition, the need to extract data from large numbers of web pages makes this a challenging task for researchers.
Web data extraction can be classified into two categories based on the extraction target: record-level tasks and page-level tasks. Although the data extracted by page-level approaches is more complete than that extracted by record-level approaches, very little research focuses on the page-level task because of its difficulty and complexity, and both effectiveness and efficiency still leave much to be desired. Moreover, previous page-level systems focus on achieving unsupervised training and pay less attention to how data is extracted from testing pages by matching them against a wrapper.
In this paper, we propose a learning-based architecture for page-level extraction systems. Given a large number of web pages for data extraction, the system uses part of the input pages to train the schema, and then extracts data from the remaining input pages through wrapper verification. In our experiments, our system outperforms other page-level extraction systems in terms of schema accuracy and extraction efficiency on multi-record pages. Overall, its extraction efficiency is dozens of times higher than that of state-of-the-art unsupervised approaches, which extract data page by page without a learning scheme (wrapper verification).
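The train-then-verify flow described above can be illustrated with a small sketch. This is not the thesis's algorithm: it is a toy stand-in that assumes each page has already been flattened into (tag path, text) pairs, and that a schema is simply the ordered list of tag paths with a flag marking positions whose text varies across training pages. The names `induce_schema`, `verify_and_extract`, and `extract_all` are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumption: a page is a flat list of (tag path, text) pairs in document order.
Page = List[Tuple[str, str]]
# Assumption: a schema is a list of (tag path, is_data_slot) pairs.
Schema = List[Tuple[str, bool]]


def induce_schema(training_pages: List[Page]) -> Schema:
    """Toy schema induction: positions whose text varies are data slots."""
    paths = [path for path, _ in training_pages[0]]
    for page in training_pages[1:]:
        if [path for path, _ in page] != paths:
            raise ValueError("toy inducer needs identically structured pages")
    schema = []
    for i, path in enumerate(paths):
        texts = {page[i][1] for page in training_pages}
        # Constant text is treated as template; varying text as data.
        schema.append((path, len(texts) > 1))
    return schema


def verify_and_extract(schema: Schema, page: Page) -> Optional[List[str]]:
    """Return the data-slot texts if the page matches the schema, else None."""
    if len(page) != len(schema):
        return None
    for (path, _), (schema_path, _) in zip(page, schema):
        if path != schema_path:
            return None
    return [text for (_, text), (_, is_data) in zip(page, schema) if is_data]


def extract_all(pages: List[Page], train_count: int) -> List[Optional[List[str]]]:
    """Train on the first train_count pages, then verify and extract the rest."""
    schema = induce_schema(pages[:train_count])
    return [verify_and_extract(schema, page) for page in pages[train_count:]]


if __name__ == "__main__":
    pages = [
        [("html/body/h1", "Catalog"), ("html/body/div/span", "Book A"), ("html/body/div/b", "$10")],
        [("html/body/h1", "Catalog"), ("html/body/div/span", "Book B"), ("html/body/div/b", "$12")],
        [("html/body/h1", "Catalog"), ("html/body/div/span", "Book C"), ("html/body/div/b", "$7")],
        [("html/body/h1", "Catalog"), ("html/body/p", "Not found")],
    ]
    print(extract_all(pages, train_count=2))
```

In this sketch the expensive step (schema induction) runs once on the training subset, while each remaining page needs only a linear match against the schema. A page that fails verification yields `None`; a real system would presumably fall back to retraining or a slower extractor at that point, which the sketch does not model.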