Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation

This is the demonstration page for the paper “Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation”. It presents selected samples generated with the proposed methods.



Audio token modeling has become a powerful framework for speech synthesis, with two-stage approaches employing semantic tokens remaining prevalent. In this paper, we aim to simplify this process by introducing a semantic knowledge distillation method that enables high-quality speech generation in a single stage. Our proposed model improves speech quality, intelligibility, and speaker similarity compared to a single-stage baseline. Although two-stage systems still lead in intelligibility, our model significantly narrows the gap while delivering comparable speech quality. These findings showcase the potential of single-stage models to achieve efficient, high-quality TTS with a more compact and streamlined architecture.


The following examples are selected from the listening test described in the paper and were generated from the LibriSpeech test-clean subset. The compared models are the single-stage model without semantic knowledge distillation (NARSiSbase), the single-stage model with semantic knowledge distillation (NARSiSavg), and the two-stage model (NAR 2-stage). Each example also includes the ground-truth recording.

Text: “this evening they all said”

NARSiSbase NARSiSavg NAR 2-stage Ground truth

Text: “the paris plant like that at the crystal palace was a temporary exhibit”

NARSiSbase NARSiSavg NAR 2-stage Ground truth

Text: “the greeting of the apostle is refreshing”

NARSiSbase NARSiSavg NAR 2-stage Ground truth

Text: “i doubt whether branwell was maintaining himself at this time”

NARSiSbase NARSiSavg NAR 2-stage Ground truth

Text: “well i’m going as an engineer you can go as one”

NARSiSbase NARSiSavg NAR 2-stage Ground truth

Text: “he seemed to wait for her reply but as she made none he proceeded”

NARSiSbase NARSiSavg NAR 2-stage Ground truth

Text: “there is no fear of that sir”

NARSiSbase NARSiSavg NAR 2-stage Ground truth

Text: “these he gave to three of my brothers”

NARSiSbase NARSiSavg NAR 2-stage Ground truth

Text: “she saw that the bed was gilded and so rich that it seemed that of a prince rather than of a private gentleman”

NARSiSbase NARSiSavg NAR 2-stage Ground truth

Text: “he makes it sort of cozier”

NARSiSbase NARSiSavg NAR 2-stage Ground truth

Text: “he has given them with too much grace not to have others still to give if they are required which is the case at the present moment”

NARSiSbase NARSiSavg NAR 2-stage Ground truth

Text: “the hawk embittered by the loss of his first quarry had become as dogged in pursuit as a weasel not to be shaken off or evaded or deceived”

NARSiSbase NARSiSavg NAR 2-stage Ground truth

Text: “but your power is so superior to any that i can advance as to make us here feel that there is no disgrace in yielding to it”

NARSiSbase NARSiSavg NAR 2-stage Ground truth