應用動態編碼於多頁面網頁之記錄邊界偵測與資訊擷取

Exploiting Dynamic Encoding and Multiple Pages for Record Boundary Detection and Data Extraction

Author: 陳明權

Publish Year: 2014-07

Update by: March 31, 2025

Abstract

Record boundary detection plays an important role in wrapper induction, and its quality directly affects the precision of alignment and extraction. Previous approaches usually focus on calculating the similarity between blocks or measuring tree similarity within a single page. In this paper, we analyze multiple pages that are generated by the same website. By exploring the common parts and the different parts of these pages, we can overcome the weaknesses of single-page approaches. Because the computational load increases as more pages are processed, the proposed approach focuses only on the leaf nodes of the DOM tree, which account for about 30 percent of all nodes. We propose dynamic encoding, which abstracts leaf nodes and emphasizes the regularity of every data record. With dynamic encoding, we reduce the number of repeated patterns discovered. Finally, we propose the idea of a landmark, which is located in the data record, and detect record boundaries by segmenting the DOM tree. In the experiments, we evaluate the efficiency of our approach and compare its effectiveness with that of other systems.
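The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: the abstract does not specify how dynamic encoding or landmark selection work, so the sketch assumes a plain tag-path encoding of leaf nodes and a landmark code chosen by hand. All names (`LeafCollector`, `collect_leaves`, `common_leaves`, `encode_leaves`, `segment_by_landmark`) and the two sample pages are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not pushed on the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text leaf."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.leaves = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def collect_leaves(html):
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves


def common_leaves(pages):
    """Leaves shared by every page; assumed here to be template content."""
    return set.intersection(*(set(collect_leaves(p)) for p in pages))


def encode_leaves(leaves):
    """Assumption: one integer code per distinct tag path (a static stand-in
    for the dynamic encoding named in the abstract)."""
    codes = {}
    sequence = [codes.setdefault(path, len(codes)) for path, _ in leaves]
    return sequence, codes


def segment_by_landmark(sequence, landmark):
    """Start a new segment each time the landmark code appears."""
    records, current = [], []
    for code in sequence:
        if code == landmark and current:
            records.append(current)
            current = []
        current.append(code)
    if current:
        records.append(current)
    return records


# Two hypothetical pages generated from the same template.
page1 = ("<div><h1>Results</h1><ul><li><b>A</b><i>1</i></li>"
         "<li><b>B</b><i>2</i></li></ul></div>")
page2 = "<div><h1>Results</h1><ul><li><b>C</b><i>3</i></li></ul></div>"

template = common_leaves([page1, page2])
data_leaves = [leaf for leaf in collect_leaves(page1) if leaf not in template]
sequence, codes = encode_leaves(data_leaves)
records = segment_by_landmark(sequence, codes["div/ul/li/b"])
print(records)
```

In this sketch, comparing the two pages removes the shared heading before encoding, so the remaining leaf sequence consists only of record content and splits cleanly at each occurrence of the hand-picked landmark path.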