A Framework for Training Web Named Entity Recognition Models Based on Semi-supervised Learning

Author: 周建龍

Published: 2018-07

Updated: March 27, 2025

Abstract

Named entity recognition (NER) is an important task in natural language understanding because it extracts the key entities (e.g., person, organization, location, date, and number) and objects (e.g., product, song, movie, and activity names) mentioned in texts. These entities are essential to numerous text applications, such as those used for analyzing public opinion on social networks, and to the interfaces used to conduct interactive conversations and provide intelligent customer services. However, existing natural language processing (NLP) tools (such as the Stanford named entity recognizer) recognize only general named entities or require annotated training examples and feature engineering for supervised model construction. Since not all languages or entities have public NER support, constructing a framework for NER model training is essential for low-resource language or entity information extraction (IE). Building a customized NER model often requires a significant amount of time to prepare, annotate, and evaluate the training/testing data, as well as language-dependent feature engineering. Existing studies rely on annotated training data; however, obtaining large datasets is quite expensive, which limits recognition effectiveness. In this thesis, we examine the problem of developing a framework that prepares a training corpus from the web with known entities for custom NER model training via semi-supervised learning. We consider the effectiveness and efficiency problems of automatic labeling and language-independent feature mining to prepare and annotate the training data. The major challenge of automatic labeling lies in the choice of labeling strategies to avoid false positive and false negative examples caused by short and long seeds, and to avoid long labeling times caused by a large corpus and many seed entities.
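To illustrate the automatic labeling idea, the following is a minimal sketch of tagging sentences with seed entities in the BIO scheme. The seed set, sentences, and longest-match heuristic here are illustrative assumptions, not the thesis's actual labeling strategies; preferring the longest match is one simple way to reduce the false positives that short seeds embedded in longer names would cause.

```python
def auto_label(tokens, seeds):
    """Tag tokens with B/I/O by longest-match against a seed entity set.

    A hypothetical helper for distant-supervision-style labeling: any
    token span found in `seeds` becomes an entity mention.
    """
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Prefer the longest seed match starting at position i, so a short
        # seed ("New York") does not truncate a longer one ("New York City").
        match_len = 0
        for j in range(len(tokens), i, -1):
            if " ".join(tokens[i:j]) in seeds:
                match_len = j - i
                break
        if match_len:
            labels[i] = "B"
            for k in range(i + 1, i + match_len):
                labels[k] = "I"
            i += match_len
        else:
            i += 1
    return labels

# Illustrative seeds and sentence (not from the thesis corpus).
seeds = {"New York", "New York City"}
print(auto_label("He moved to New York City last year".split(), seeds))
# → ['O', 'O', 'O', 'B', 'I', 'I', 'O', 'O']
```

A naive scan like this is quadratic in sentence length per position; with hundreds of thousands of seeds and millions of sentences, the efficiency concerns raised above motivate faster multi-pattern matching.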
Distant supervision, which collects training sentences from search snippets with known entities, is not new; however, the efficiency of automatic labeling becomes critical when dealing with a large number of known entities (e.g., 550k) and sentences (e.g., 2M). Additionally, to address language-dependent feature mining for supervised learning, we modify tri-training for sequence labeling and derive a proper initialization for large dataset training to improve entity recognition performance on a large corpus. We conduct experiments on five types of entity recognition tasks, including Chinese person names, food names, locations, points of interest (POIs), and activity names, to demonstrate the improvements achieved with the proposed web NER model construction framework.
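The core loop of tri-training can be sketched as follows. This is a toy, classifier-agnostic version of the standard tri-training idea (each model is retrained on unlabeled examples where the other two models agree); the thesis's adaptation to sequence labeling and its initialization for large datasets are not reproduced here, and the models and `predict` function below are stand-in assumptions.

```python
def tri_train_round(models, unlabeled, predict):
    """One round of tri-training-style pseudo-labeling.

    For each of the three models, collect the unlabeled examples on which
    the OTHER two models agree; their agreed prediction serves as a
    pseudo-label for retraining that model.
    """
    new_data = {0: [], 1: [], 2: []}
    for x in unlabeled:
        preds = [predict(m, x) for m in models]
        for i in range(3):
            j, k = [t for t in range(3) if t != i]
            if preds[j] == preds[k]:
                new_data[i].append((x, preds[j]))
    return new_data

# Toy "models": lookup tables mapping an input to a label (-1 = unknown).
models = [{"a": 1}, {"a": 1, "b": 0}, {"b": 0}]
out = tri_train_round(models, ["a", "b"], lambda m, x: m.get(x, -1))
print(out)
```

In the example, models 0 and 1 agree on "a", so ("a", 1) becomes pseudo-labeled training data for model 2, and symmetrically models 1 and 2 agree on "b" for model 0; model 1 receives nothing this round. In the sequence-labeling setting, "agreement" must be defined over whole label sequences (or per token), which is part of what the modification above addresses.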