Abstract. Building upon advancements in Large Language Models (LLMs), the field of audio processing has seen increased interest in training audio generation models on discrete audio token sequences. However, directly discretizing audio with neural audio codecs often yields sequences that fundamentally differ from text sequences. Unlike text, whose token sequences are deterministic, discrete audio tokens can vary significantly with contextual factors while still representing perceptually identical audio segments. We refer to this phenomenon as Discrete Representation Inconsistency (DRI). This inconsistency means a single audio segment can be represented by multiple divergent token sequences, which creates confusion in neural codec language models and results in omissions and repetitions during speech generation. In this paper, we quantitatively analyze the DRI phenomenon within popular audio tokenizers such as EnCodec, and we address it by training the neural audio codec with a consistency constraint. Our approach effectively mitigates the DRI phenomenon of the neural audio codec. Furthermore, extensive experiments on neural codec language models over the LibriTTS and large-scale MLS (44,000 hours) datasets demonstrate the effectiveness and generality of our method.
Figure: Illustration of the Discrete Representation Inconsistency (DRI) phenomenon. Subfigure (a) shows that text, whether or not it includes contextual information, is encoded by the text tokenizer into the same text tokens. In contrast, Subfigure (b) illustrates that audio, with or without contextual information, is encoded by the audio tokenizer into different audio tokens. The DRI phenomenon within the audio tokenizer poses a many-to-one mapping problem, and the complexity of this many-to-one mapping increases the uncertainty of neural codec language models in predicting the next token.
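The context-independence of text tokenization in subfigure (a) can be sketched with a toy word-level tokenizer (an illustrative simplification; real text tokenizers use subword vocabularies, but a span's tokens likewise do not depend on its surroundings):

```python
# Toy word-level text tokenizer: each distinct word gets a fixed id,
# so the tokens of a span do not depend on surrounding context.
# (Illustrative simplification of subfigure (a), not a real tokenizer.)
vocab = {}

def tokenize(text):
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)  # assign the next free id
        ids.append(vocab[word])
    return ids

sentence = "there is not such an unnatural family on record as ours"
context = "i believe charlie he recommenced suddenly"

alone = tokenize(sentence)
with_context = tokenize(context + " " + sentence)

# The trailing tokens of the contextual encoding equal the context-free ones.
assert with_context[-len(alone):] == alone
```

An audio tokenizer built on a convolutional codec has no such guarantee: the receptive field mixes neighboring frames into each code, which is exactly what the DRI phenomenon measures.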
To analyze the DRI phenomenon, we use neural audio codecs as audio tokenizers to quantize both an entire audio clip and a segment cut from that clip, and then compare their corresponding discrete audio token sequences. As you can hear, the two audio segments are identical; the only difference is whether context is present. Yet their discrete audio token sequences differ substantially, which raises uncertainty for neural codec language models in predicting the next token.
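The consistency-accuracy numbers reported below can be computed along these lines (a hedged sketch of the comparison, not the authors' exact evaluation code): align the segment's tokens with the corresponding frames of the full-audio encoding and count matching codebook entries over the first n quantizer layers.

```python
def consistency_accuracy(full_tokens, segment_tokens, start_frame, n_layers=1):
    """Fraction of codebook entries that match between a segment encoded
    on its own and the same frames taken from the full-audio encoding.

    full_tokens / segment_tokens: list of layers (residual-VQ layout),
    each a list of per-frame token ids; start_frame: frame offset of the
    segment within the full audio. Illustrative sketch, not paper code.
    """
    n_frames = len(segment_tokens[0])
    matches = total = 0
    for layer in range(n_layers):
        for t in range(n_frames):
            total += 1
            if full_tokens[layer][start_frame + t] == segment_tokens[layer][t]:
                matches += 1
    return matches / total

# Toy example: one codebook layer, segment starting at frame 1;
# the last segment frame disagrees with the full encoding (2/3 match).
full = [[894, 894, 12, 894, 256]]
segment = [[894, 12, 999]]
score = consistency_accuracy(full, segment, start_frame=1)
```

A perfectly consistent codec scores 1.0 on this metric regardless of how much context surrounds the segment; the table below shows how far common codecs fall short of that.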
| Ground Truth | Transcript |
|---|---|
| (audio) | I believe, Charlie, he recommenced suddenly, there is not such an unnatural family on record as ours; is there? |
| Neural Audio Codec | Reconstructed Audio | Audio Tokens | Reconstructed Audio Segments | Audio Tokens from Audio Segments | First Layer's Consistency Accuracy | First 3 Layers' Consistency Accuracy |
|---|---|---|---|---|---|---|
| Ours | (audio) | ..., 894, 894, 12, 894, 894, 256, 1009, 928, 895, 798, 498, 303, 731, 367, 388, 663, 396, 95, 155, 937, 577, 933, 595, 33, 715, 33, 33, 535, 535, 15, 587, 20, 20, 280, 21, 21, 327, 21, 17, 17, 150, 17, 150, 126, 126, 126, 126, 126, 126, 126, ... | (audio) | ..., 894, 894, 12, 894, 894, 256, 1009, 928, 895, 798, 498, 303, 731, 367, 388, 663, 396, 95, 155, 937, 577, 933, 595, 33, 715, 33, 33, 535, 535, 15, 587, 20, 20, 280, 21, 21, 327, 21, 17, 17, 150, 17, 150, 126, 126, 126, 126, 126, 126, 126, ... | 100.00% | 92.00% |
| Ours w/o consistency constraint | (audio) | ..., 40, 40, 40, 1020, 549, 432, 439, 923, 193, 642, 375, 570, 282, 745, 864, 317, 807, 184, 807, 723, 566, 861, 128, 934, 934, 717, 150, 507, 592, 738, 943, 786, 166, 786, 786, 166, 786, 786, 786, 117, 117, 117, 117, 117, 117, 117, 882, 882, 882, 882, ... | (audio) | ..., 171, 171, 171, 549, 549, 233, 439, 923, 158, 433, 956, 570, 938, 367, 151, 317, 807, 156, 5, 777, 145, 956, 498, 934, 453, 453, 934, 934, 1023, 679, 784, 784, 784, 422, 422, 784, 784, 784, 422, 422, 846, 784, 407, 882, 637, 637, 515, 637, 637, 445, ... | 14.00% | 8.00% |
| EnCodec | (audio) | ..., 463, 373, 463, 463, 53, 373, 373, 731, 461, 537, 457, 392, 233, 679, 185, 112, 185, 432, 699, 136, 321, 967, 136, 321, 224, 136, 321, 224, 984, 1008, 679, 788, 465, 906, 151, 151, 151, 151, 151, 976, 491, 834, 430, 835, 408, 408, 408, 408, 408, 408, 408, 408, 62, 408, 408, 408, 408, 408, 835, 835, 835, 835, 835, 835, 475, 475, 475, 475, 475, 475, 25, 475, 25, 779, 475, ... | (audio) | ..., 463, 463, 463, 463, 53, 373, 373, 731, 731, 25, 457, 392, 233, 679, 185, 747, 185, 432, 699, 224, 321, 967, 136, 321, 224, 136, 321, 224, 984, 1008, 868, 533, 151, 906, 151, 151, 151, 151, 276, 976, 491, 834, 430, 835, 408, 62, 408, 408, 62, 408, 408, 408, 408, 408, 408, 408, 835, 835, 835, 835, 62, 835, 835, 835, 835, 475, 475, 25, 475, 475, 25, 475, 25, 25, 47, ... | 76.00% | 58.67% |
We demonstrate speech reconstruction results of popular neural audio codecs (e.g., EnCodec) and of the codec trained with the consistency constraint (denoted as Ours).
| Neural Audio Codec | sample1 | sample2 | sample3 | sample4 |
|---|---|---|---|---|
| Ground Truth | (audio) | (audio) | (audio) | (audio) |
| Transcript | This I took for a sign that he had himself something to produce and that we should only have to wait. | She looked at his heavy shoulders and big, determined head, thrust forward like a catapult in leash. | Do you suppose that god for the sake of a few lutheran heretics would disown his entire church? | You must see, lieutenant, I should think, that we are not so near the coast of algeria as you imagined. |
| Ours | (audio) | (audio) | (audio) | (audio) |
| EnCodec | (audio) | (audio) | (audio) | (audio) |
| HiFiCodec | (audio) | (audio) | (audio) | (audio) |
| SpeechTokenizer | (audio) | (audio) | (audio) | (audio) |
| DAC | (audio) | (audio) | (audio) | (audio) |
| FunCodec | (audio) | (audio) | (audio) | (audio) |
We demonstrate speech generation results of neural codec language models. The neural codec language models are based on popular neural audio codecs (e.g., EnCodec) and on the codec trained with the consistency constraint (denoted as Ours). The subscripts of the neural codec language models (e.g., 330M, 44Kh) denote the model size and training data scale.
| Neural Audio Codec | Neural Codec Language Model | sample1 | sample2 | sample3 | sample4 |
|---|---|---|---|---|---|
| Ground Truth | Prompt (text) | And the lectures, and the dissecting rooms, has thee thought of the dissecting rooms? | Infirmities induced by over indulgence are among some peoples freely recognised as manly attributes. | He was particularly attentive to the behavior of their preachers, on whom all depended. | He reminds them of the time when he opposed peter to his face and reproved the chief of the apostles. |
| | Prompt (audio) | (audio) | (audio) | (audio) | (audio) |
| Reference | Text | He would say good night, but not good bye. | but that does not remain the sole purpose of their consumption. | The parliament and the scots laid their proposals before the king. | Humble man that he was, he will not now take a back seat. |
| | Audio | (audio) | (audio) | (audio) | (audio) |
| Ours | VALL-E960h | (audio) | (audio) | (audio) | (audio) |
| | VALL-E44Kh | (audio) | (audio) | (audio) | (audio) |
| Ours w/o consistency constraint | VALL-E960h | (audio) | (audio) | (audio) | (audio) |
| | VALL-E44Kh | (audio) | (audio) | (audio) | (audio) |
| mHuBERT | SpeechGPT | (audio) | (audio) | (audio) | (audio) |
| EnCodec | VoiceCraft330M | (audio) | (audio) | (audio) | (audio) |
| | VoiceCraft880M | (audio) | (audio) | (audio) | (audio) |
| Mel VQ-VAE | XTTS_v2 | (audio) | (audio) | (audio) | (audio) |
| SpeechTokenizer | USLM | (audio) | (audio) | (audio) | (audio) |
| | AnyGPT | (audio) | (audio) | (audio) | (audio) |
| | VALL-E | (audio) | (audio) | (audio) | (audio) |