EfficientSpeech2 — A CTC-Aligned Lightweight Korean TTS Pipeline

Abstract

Lightweight text-to-speech (TTS) models typically depend on external aligners such as the Montreal Forced Aligner (MFA), whose Korean acoustic models and lexicons are limited. We propose a lightweight Korean TTS pipeline that needs no external aligner. A CTC-based aligner classifies mel-spectrogram frames directly into phonemes. Training proceeds in two stages: the CTC aligner is first trained jointly with the acoustic model, and the acoustic model is then retrained from scratch on the resulting fixed alignment targets, which removes the moving-target problem of alignments that shift during joint training. On the KSS dataset the proposed method reaches a mean opinion score (MOS) of 3.67 with only 3.9 M parameters, outperforming an MFA-based counterpart (3.46) at the same scale.
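
To make the alignment step concrete, the following is a minimal sketch of how frame-level phoneme scores could be turned into per-phoneme durations. It is not the method from the paper: the function name `monotonic_align`, the plain Viterbi search, and the omission of the CTC blank symbol are all assumptions made for illustration. The sketch only assumes that an aligner yields a log-probability for every phoneme class at every mel frame.

```python
import math
from typing import List, Sequence


def monotonic_align(log_probs: Sequence[Sequence[float]],
                    phonemes: Sequence[int]) -> List[int]:
    """Hypothetical helper: Viterbi search for the best monotonic alignment.

    log_probs[t][v] is the log-probability of phoneme class v at frame t.
    phonemes is the target phoneme id sequence for the utterance.
    Returns one duration (frame count) per phoneme; durations sum to T.
    Assumption: the CTC blank symbol is ignored in this sketch.
    """
    T, N = len(log_probs), len(phonemes)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per phoneme")

    neg_inf = -math.inf
    # score[t][n]: best log-score of a path that is in phoneme n at frame t.
    score = [[neg_inf] * N for _ in range(T)]
    # advanced[t][n]: True if the best path entered phoneme n exactly at frame t.
    advanced = [[False] * N for _ in range(T)]
    score[0][0] = log_probs[0][phonemes[0]]

    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = score[t - 1][n]
            move = score[t - 1][n - 1] if n > 0 else neg_inf
            advanced[t][n] = move > stay
            score[t][n] = max(stay, move) + log_probs[t][phonemes[n]]

    # Backtrack from the last frame of the last phoneme, counting frames.
    durations = [0] * N
    n = N - 1
    for t in range(T - 1, -1, -1):
        durations[n] += 1
        if advanced[t][n]:
            n -= 1
    return durations


if __name__ == "__main__":
    # Toy input: 6 frames, 3 phoneme classes, frames favour ids 2,2,0,0,0,1.
    favoured = [2, 2, 0, 0, 0, 1]
    toy = [[math.log(0.9 if v == f else 0.05) for v in range(3)]
           for f in favoured]
    print(monotonic_align(toy, [2, 0, 1]))  # prints [2, 3, 1]
```

Durations obtained this way could serve as the fixed targets for a duration predictor in a second training stage, in the spirit of the two-stage procedure described above.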

Audio Samples (KSS)

GT: ground-truth recording. Recon.: HiFi-GAN reconstruction from the GT mel-spectrogram (vocoder upper bound). Ours + MFA: the same acoustic model trained with MFA alignments. Ours + CTC: the proposed pipeline. The first row is the headline comparison; additional rows showcase the proposed model across utterance lengths.

#	Text	GT	Recon.	Ours + MFA	Ours + CTC

Citation

@misc{jo2026ctcalignedkoreantts,
  author    = {Minhyung Jo},
  title     = {A {CTC}-Aligned Lightweight {Korean} {TTS} Pipeline without External Aligners},
  month     = apr,
  year      = {2026},
  publisher = {Zenodo},
  version   = {v1},
  doi       = {10.5281/zenodo.19564646},
  url       = {https://doi.org/10.5281/zenodo.19564646}
}