Toward Efficient Unsupervised Web Data Extraction: From Unsupervised to Self-Trained Wrappers

Author: Naufal Said

Publish Year: 2020-07

Update by: March 27, 2025

Abstract

Web data extraction is a key component of many business intelligence tasks, such as data transformation, exchange, analysis, and interpretation. Many approaches have been proposed for wrapper induction, whether manual, supervised, or unsupervised. However, most research focuses on extraction effectiveness, and little attention has been paid to extraction efficiency. In this thesis, we argue that wrapper generation for unsupervised web data extraction is as important as supervised wrapper induction, because the generated wrappers can work more efficiently without sophisticated analysis. We therefore treat unsupervised data extraction as an oracle machine that generates annotated training examples, and we consider two methods of wrapper generation: schema-guided finite-state machine (FSM) approaches and data-driven machine learning (ML) approaches. The experimental results show that the FSM wrapper performs well even with less training data, while the ML-based models are more efficient during testing but require more training pages to achieve the same effectiveness. Furthermore, FSM wrappers can work as a filter to reduce the number of training pages and advance the learning curve for ML-based wrappers.
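
The self-training idea described above can be illustrated with a minimal Python sketch. It is not the FSM or ML wrapper evaluated in the thesis: the page model (a list of tag-path and text pairs), the function names `oracle_annotate`, `train_wrapper`, and `apply_wrapper`, and the content heuristics inside the oracle are all assumptions made for illustration. The oracle stands in for the unsupervised extractor, and the learned wrapper is a simple path-to-label lookup that needs no content analysis at extraction time.

```python
from collections import Counter, defaultdict

# Assumption: a page is modeled as a list of (tag_path, text) leaf nodes.

def oracle_annotate(page):
    """Hypothetical stand-in for the unsupervised extractor.

    It labels each node using content heuristics, playing the role of
    the oracle that produces annotated training examples.
    """
    labeled = []
    for path, text in page:
        if text.startswith("$"):
            label = "price"
        elif text.istitle():
            label = "title"
        else:
            label = "other"
        labeled.append((path, text, label))
    return labeled

def train_wrapper(pages):
    """Learn a path -> label lookup by majority vote over oracle labels."""
    votes = defaultdict(Counter)
    for page in pages:
        for path, _text, label in oracle_annotate(page):
            votes[path][label] += 1
    return {path: counts.most_common(1)[0][0] for path, counts in votes.items()}

def apply_wrapper(wrapper, page):
    """Extract with a dictionary lookup only; unseen paths map to 'other'."""
    return [(wrapper.get(path, "other"), text) for path, text in page]

training_pages = [
    [("html/body/div/h2", "Blue Widget"),
     ("html/body/div/span", "$9.99"),
     ("html/body/p", "free shipping today")],
    [("html/body/div/h2", "Red Gadget"),
     ("html/body/div/span", "$24.50"),
     ("html/body/p", "limited stock")],
]

wrapper = train_wrapper(training_pages)

# The content heuristics would miss both nodes on this page,
# but the learned wrapper labels them from structure alone.
new_page = [("html/body/div/h2", "green gizmo"),
            ("html/body/div/span", "12 EUR")]
extracted = apply_wrapper(wrapper, new_page)
```

In this toy setting the wrapper replaces per-page analysis with a single lookup per node, which is the efficiency argument in miniature. A real wrapper would need a richer page representation and a principled way to handle noisy oracle labels.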