Abstract
Lightweight text-to-speech (TTS) models typically depend on external aligners such as the Montreal Forced Aligner (MFA), whose Korean acoustic models and lexicons are limited. We propose a lightweight Korean TTS pipeline that needs no external aligner. A CTC-based aligner classifies mel-spectrogram frames directly into phonemes. Training proceeds in two stages: the CTC aligner is first trained jointly with the acoustic model, and the acoustic model is then retrained from scratch on the resulting fixed alignment targets, which removes the moving-target problem of joint training. On the KSS dataset, the proposed method reaches a mean opinion score (MOS) of 3.67 with only 3.9 M parameters, outperforming an MFA-based counterpart (MOS 3.46) at the same scale.
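To illustrate how frame-level phoneme posteriors from a CTC aligner can be turned into fixed per-phoneme durations, the sketch below runs Viterbi decoding over the standard CTC topology. It is a minimal example under stated assumptions, not the pipeline's actual implementation: the function name `ctc_forced_align`, the blank index `0`, and the rule that blank frames are merged into the preceding phoneme are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def ctc_forced_align(log_probs, tokens, blank=0):
    """Viterbi-align T frames to a token sequence under the CTC topology.

    log_probs: array of shape (T, V) with per-frame log-probabilities.
    tokens:    list of N phoneme ids (no blanks).
    Returns a list of N frame counts that sum to T.
    """
    if len(tokens) == 0:
        raise ValueError("tokens must not be empty")
    T = log_probs.shape[0]
    # Extended sequence: blank, t1, blank, t2, ..., tN, blank
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    S = len(ext)
    neg_inf = -1e30
    score = np.full((T, S), neg_inf)
    back = np.zeros((T, S), dtype=np.int64)
    score[0, 0] = log_probs[0, ext[0]]
    score[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one, or skip a blank
            best_prev, best = s, score[t - 1, s]
            if s >= 1 and score[t - 1, s - 1] > best:
                best_prev, best = s - 1, score[t - 1, s - 1]
            if (s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]
                    and score[t - 1, s - 2] > best):
                best_prev, best = s - 2, score[t - 1, s - 2]
            score[t, s] = best + log_probs[t, ext[s]]
            back[t, s] = best_prev
    # A valid path must end in the last token or the trailing blank
    if max(score[T - 1, S - 1], score[T - 1, S - 2]) < neg_inf / 2:
        raise ValueError("too few frames to align the token sequence")
    s = S - 1 if score[T - 1, S - 1] >= score[T - 1, S - 2] else S - 2
    durations = [0] * len(tokens)
    for t in range(T - 1, -1, -1):
        # Odd states are tokens; even states are blanks, which this sketch
        # assigns to the preceding token (leading blanks go to the first).
        idx = (s - 1) // 2 if s % 2 == 1 else max(s // 2 - 1, 0)
        durations[idx] += 1
        s = back[t, s]
    return durations


# Toy posteriors: 6 frames, vocabulary {0: blank, 1, 2}
probs = np.full((6, 3), 0.05)
for t, k in enumerate([1, 1, 0, 2, 2, 2]):
    probs[t, k] = 0.9
durations = ctc_forced_align(np.log(probs), tokens=[1, 2])
print(durations)  # [3, 3]
```

Durations extracted this way stay constant once the aligner is frozen, which is the kind of fixed target the second training stage relies on.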
Audio Samples (KSS)
GT: ground-truth recording. Recon.: HiFi-GAN reconstruction from the GT mel (vocoder upper bound). Ours + MFA: same acoustic model trained with MFA alignments. Ours + CTC: proposed pipeline. The first row is the headline comparison; additional rows showcase the proposed model across utterance lengths.
| # | Text | GT | Recon. | Ours + MFA | Ours + CTC |
|---|---|---|---|---|---|
Citation
@misc{jo2026ctcalignedkoreantts,
author = {Minhyung Jo},
title = {A {CTC}-Aligned Lightweight {Korean} {TTS} Pipeline without External Aligners},
month = apr,
year = {2026},
publisher = {Zenodo},
version = {v1},
doi = {10.5281/zenodo.19564646},
url = {https://doi.org/10.5281/zenodo.19564646}
}