Automatic Web Data API Creation via Cross-Lingual Neural Pagination Recognition
Author: Chia-Hui Chang, Cheng-Ju Wu, Tzu-Ping Lin
Publish Year: 2022-07-01
Update by: March 26, 2025
摘要
Information extraction, transformation and loading (abbreviated as ETL) tools are important for big data analysis and value-added applications on the Web. Typical Web scraping systems such as “Dexi.io” or “Import.io” allow users to specify where to fetch the page and what information or data to be extracted from the page. Although these commercial services already provide a graphical user interface to guide the system to the target pages for each data source, such systems are not scalable because users have to create crawlers one by one. In this paper we consider the problem of pagination recognition, which aims to automate the process of finding similar pages by locating the next page link and the list of page links from any starting URL. We propose a neural sequence model that will label each clickable link in a page as either “NEXT”, “PAGE” or “Other”, where the first two could guide the system to find similar pages of the seed URL. To have multilingual support, we have exploited the attribute contents in the links as well as Language-Agnostic SEntence Representations (LASER) for anchor text embedding. The experimental results show that the proposed model, achieves an average of micro 0.834 and macro 0.818 F1 score on pagination recognition. In terms of practical deployment, we are able to automatically create 1,060 (MDR) and 153 (DCADE) data APIs from 392 event source pages within 62 min.
