Unsupervised Full Schema Induction for Template Web Pages Using Dynamic Encoding and a Divide-and-Conquer Alignment Algorithm
Author: 陳燕琴
Publish Year: 2019-07
Update by: March 27, 2025
Abstract
Automatic data extraction from template pages is an essential task for data integration and analysis. Most research focuses on data extraction from list pages. The problem of data alignment for singleton pages, which contain detailed information about a single item, is less addressed and more challenging. In the first work, we propose a novel Divide-and-Conquer Alignment algorithm (DCA) that works on leaf nodes from the DOM trees of singleton pages. The idea is to detect mandatory templates via the longest increasing subsequence of the landmark equivalence class leaf nodes and to recursively apply the same procedure to each segment delimited by the mandatory templates. DCA aligns each segment efficiently and handles multi-order attribute-value pairs effectively with a two-pass procedure. On selected items, DCA outperforms TEX and WEIR by 2% and 12%, respectively. The improvement is more pronounced in full-schema evaluation: on 26 websites from TEX and ExAlg, DCA achieves an F1 measure of 0.95 versus 0.63 for TEX. In the second work, we propose Divide-and-Conquer Alignment with Dynamic Encoding (DCADE), an unsupervised full-schema web data extraction method that works on either multiple list pages or singleton pages sharing the same template. We define the Content Equivalence Class and the Typeset Equivalence Class based on leaf node content. We then combine HTML attributes (id and class) in the paths to obtain several levels of encoding, so that the proposed algorithm can align leaf nodes by exploring patterns at levels ranging from specific to general. We conducted experiments on 49 real-world websites used in TEX and ExAlg. The proposed DCADE achieved a 0.962 F1 measure for non-recordset data extraction (FD) and a 0.936 F1 measure for recordset data extraction (FS), outperforming other page-level web data extraction methods, i.e., DCA (FD=0.660), TEX (FD=0.454 and FS=0.549), RoadRunner (FD=0.396 and FS=0.330), and UWIDE (FD=0.260 and FS=0.081).
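
The landmark step described above can be illustrated with a small Python sketch. It is not the thesis implementation: it assumes each page is already flattened into a list of leaf-node strings, and it approximates a "landmark" as a string that occurs exactly once in each of two pages, which is a simplification of the equivalence classes the abstract refers to. The function names (`longest_increasing_subsequence`, `mandatory_landmarks`, `split_by_landmarks`) are hypothetical, and only one level of division is shown rather than the full recursion.

```python
from bisect import bisect_left
from collections import Counter


def longest_increasing_subsequence(values):
    """Return the indices of one strictly increasing subsequence of maximal length."""
    tails = []        # tails[k]: index of the smallest tail value of a run of length k + 1
    tail_values = []  # values[tails[k]], kept sorted so bisect can be used
    prev = [-1] * len(values)  # back-pointers for reconstructing the subsequence
    for i, v in enumerate(values):
        k = bisect_left(tail_values, v)
        if k > 0:
            prev[i] = tails[k - 1]
        if k == len(tails):
            tails.append(i)
            tail_values.append(v)
        else:
            tails[k] = i
            tail_values[k] = v
    result = []
    i = tails[-1] if tails else -1
    while i != -1:
        result.append(i)
        i = prev[i]
    return result[::-1]


def mandatory_landmarks(page_a, page_b):
    """Landmarks that keep the same relative order in both pages.

    Assumption: a landmark is a leaf string occurring exactly once in each page.
    """
    count_a, count_b = Counter(page_a), Counter(page_b)
    candidates = [t for t in page_a if count_a[t] == 1 and count_b[t] == 1]
    positions_b = [page_b.index(t) for t in candidates]
    keep = longest_increasing_subsequence(positions_b)
    return [candidates[i] for i in keep]


def split_by_landmarks(page, landmarks):
    """Divide a page into the segments that lie before, between, and after landmarks."""
    marks = set(landmarks)
    segments, current = [], []
    for token in page:
        if token in marks:
            segments.append(current)
            current = []
        else:
            current.append(token)
    segments.append(current)
    return segments
```

With two toy pages in which the string `"Ad"` appears at different relative positions, the longest increasing subsequence drops `"Ad"` and keeps the consistently ordered labels as dividers. Each resulting segment pair could then be processed again in the same way, which is the divide-and-conquer part.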