Semi-supervised Sequence Labeling for Named Entity Extraction based on Tri-Training: Case Study on Chinese Person Name Extraction

Authors: C.-L. Chou, C.-H. Chang, Shin-Yi Wu

Publish Date: 2014-08-24

Updated: March 31, 2025

Abstract

Named entity extraction is a fundamental task for many knowledge engineering applications. Existing studies rely on annotated training data, which is quite expensive when used to obtain large data sets, limiting the effectiveness of recognition. In this research, we propose an automatic labeling procedure to prepare training data from structured resources which contain known named entities. While this automatically labeled training data may contain noise, a self-testing procedure may be used as a follow-up to remove low-confidence annotation and increase the extraction performance with less training data. In addition to the preparation of labeled training data, we also employed semi-supervised learning to utilize large unlabeled training data. By modifying tri-training for sequence labeling and deriving the proper initialization, we can further improve entity extraction. In the task of Chinese personal name extraction with 364,685 sentences (8,672 news articles) and 54,449 (11,856 distinct) person names, an F-measure of 90.4% can be achieved.
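
The automatic labeling step described above can be illustrated with a minimal sketch. It assumes a character-level B/I/O tag scheme and a simple longest-match lookup against a set of known names; the function name `auto_label`, the tag scheme, and the matching strategy are illustrative assumptions, not details taken from the paper.

```python
def auto_label(sentence, known_names):
    """Tag each character of `sentence` as B, I, or O by dictionary lookup.

    Hypothetical helper: the B/I/O scheme and longest-match rule are
    assumptions made for illustration, not the paper's exact procedure.
    """
    tags = ["O"] * len(sentence)
    # Try longer names first so a full name wins over a shorter prefix.
    names = sorted((n for n in known_names if n), key=len, reverse=True)
    i = 0
    while i < len(sentence):
        for name in names:
            if sentence.startswith(name, i):
                tags[i] = "B"  # first character of the matched name
                for j in range(i + 1, i + len(name)):
                    tags[j] = "I"  # remaining characters of the name
                i += len(name)
                break
        else:
            i += 1  # no known name starts at this position
    return tags


if __name__ == "__main__":
    sentence = "馬英九今天出席會議"
    for char, tag in zip(sentence, auto_label(sentence, {"馬英九"})):
        print(char, tag)
```

Labels produced this way are noisy: a known name can be ambiguous in context, and names missing from the resource are tagged O. That noise is what the self-testing and tri-training steps are meant to reduce.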