A Simple and Effective Closed Test for Chinese Word Segmentation Based on Sequence Labeling
Authors: Qian-Xiang Lin, Chia-Hui Chang, Chen-Ling Chen
Publish Year: 2010-12
Updated: March 26, 2025
Abstract
Chinese word segmentation is a required first step in many Chinese text processing tasks. Previous studies have proposed various machine learning methods to address this problem. To achieve high performance, many of them rely on external resources combined with several machine learning algorithms. The goal of this paper is to construct a simple and effective Chinese word segmentation tool without external resources, that is, a closed test for Chinese word segmentation. We use the training data to construct a vocabulary, and we combine maximum matching word segmentation results with sequence labeling methods, including the hidden Markov model (HMM) and conditional random fields (CRF). The main idea is to provide the machine learning algorithm with ambiguity information via forward and backward maximum matching, and with unknown word information via vocabulary masking. The experimental results show that maximum matching and vocabulary masking significantly improve the performance of HMM segmentation (F-measure: 0.812 → 0.948 → 0.953). Meanwhile, combining maximum matching with CRF achieves 0.953, which improves to 0.963 with vocabulary masking.
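As a rough illustration of how forward and backward maximum matching can expose ambiguity to a sequence labeler, the following Python sketch builds a vocabulary from segmented training sentences and tags each character with its position in both matching results. It is not the authors' implementation. The function names, the B/M/E/S tag set, and the maximum word length of 4 are assumptions made for this example, and vocabulary masking is not covered.

```python
from typing import Iterable, List, Set, Tuple


def build_vocabulary(segmented_sentences: Iterable[List[str]]) -> Set[str]:
    """Collect every word seen in the segmented training data (closed test)."""
    return {word for sentence in segmented_sentences for word in sentence}


def forward_max_match(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right matching; max_len is an assumed cap on word length."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in vocab:
                words.append(text[i:i + size])
                i += size
                break
    return words


def backward_max_match(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Greedy right-to-left matching, returned in reading order."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            if size == 1 or text[j - size:j] in vocab:
                words.append(text[j - size:j])
                j -= size
                break
    words.reverse()
    return words


def to_bmes(words: List[str]) -> List[str]:
    """Map a segmentation to per-character tags (assumed B/M/E/S scheme)."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def character_features(text: str, vocab: Set[str]) -> List[Tuple[str, str, str]]:
    """One (character, forward tag, backward tag) triple per character."""
    forward_tags = to_bmes(forward_max_match(text, vocab))
    backward_tags = to_bmes(backward_max_match(text, vocab))
    return list(zip(text, forward_tags, backward_tags))
```

With a toy vocabulary built from `[["研究", "生命", "起源"], ["研究生"]]`, the input `"研究生命起源"` is segmented as 研究生 / 命 / 起源 by forward matching and as 研究 / 生命 / 起源 by backward matching. Characters whose two tags disagree (here 究, 生, and 命) mark the ambiguous region, which is the kind of signal an HMM or CRF could use as an observation or feature.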