OPTIMIZATION OF TEXT DATABASE USING HIERARCHICAL CLUSTERING
Many speech- and language-related techniques employ models that are trained on text data. In this paper, we introduce a novel method for selecting optimized training sets from text databases. The coverage of the subset selected for training is optimized using hierarchical clustering and the generalized Levenshtein distance. The validity of the proposed subset optimization technique is verified in a data-driven syllabification task. The results clearly indicate that the proposed approach meaningfully optimizes the training set, which in turn improves the quality of the trained model. Compared to the existing state-of-the-art data selection technique, the proposed hierarchical clustering approach improves the compactness of the data clusters, reduces the computational complexity, and makes data set selection scalable. The presented idea can be used in a wide variety of language processing applications that require training with text data.
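To illustrate the general idea (this is a minimal sketch, not the authors' implementation), the following Python code clusters a word list with complete-linkage agglomerative clustering over the plain Levenshtein distance and keeps one representative (medoid) per cluster as the selected training subset. The function names, the choice of complete linkage, unit edit costs, and medoid selection are all illustrative assumptions; the paper uses the generalized Levenshtein distance.

```python
# Hypothetical sketch: compact training-subset selection via agglomerative
# clustering over edit distance. Not the paper's actual algorithm or costs.

def levenshtein(a: str, b: str) -> int:
    """Standard edit distance with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cluster_and_select(words, n_clusters):
    """Complete-linkage agglomerative clustering; returns one medoid per cluster."""
    dist = {(a, b): levenshtein(a, b) for a in words for b in words}
    clusters = [[w] for w in words]
    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest complete-linkage
        # (maximum pairwise) distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: max(dist[a, b]
                               for a in clusters[ij[0]]
                               for b in clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))
    # Medoid: the member minimizing the summed distance to its cluster.
    return [min(c, key=lambda w: sum(dist[w, v] for v in c)) for c in clusters]
```

For example, `cluster_and_select(["cat", "cap", "can", "dog", "dot", "fog"], 2)` returns one representative from each of the two natural groups, so a downstream model can be trained on the reduced set while preserving coverage. The naive O(n^2) distance table is only for illustration; scalability is precisely what the hierarchical approach in the paper is claimed to improve.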