Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models


Abstract. Building upon advancements in Large Language Models (LLMs), the field of audio processing has seen increased interest in training models for audio generation tasks on discrete audio token sequences. However, directly discretizing audio with neural audio codecs often yields sequences that differ fundamentally from text sequences. Unlike text, whose token sequences are deterministic, discrete audio tokens can vary significantly with contextual factors while still producing perceptually identical audio segments. We refer to this phenomenon as Discrete Representation Inconsistency (DRI). Because of this inconsistency, a single audio segment can be represented by multiple divergent sequences, which confuses neural codec language models and leads to omissions and repetitions during speech generation. In this paper, we quantitatively analyze the DRI phenomenon within popular audio tokenizers such as EnCodec. Our approach, which trains the codec with a consistency constraint, effectively mitigates the DRI phenomenon in the neural audio codec. Furthermore, extensive experiments with neural codec language models on LibriTTS and the large-scale MLS dataset (44,000 hours) demonstrate the effectiveness and generality of our method.



1 Overview

Figure: Illustration of the Discrete Representation Inconsistency (DRI) phenomenon. Subfigure (a) shows that text is encoded by the text tokenizer into the same text tokens whether or not it includes contextual information. In contrast, Subfigure (b) shows that audio is encoded by the audio tokenizer into different audio tokens depending on whether contextual information is present. The DRI phenomenon within the audio tokenizer poses a many-to-one mapping problem, and the complexity of this mapping raises the uncertainty that neural codec language models face when predicting the next token.


2 Analysis on DRI Phenomenon within Neural Audio Codecs

To analyze the DRI phenomenon, we use neural audio codecs as audio tokenizers to quantize both an entire audio clip and a segment taken from that clip, and then compare the resulting discrete audio token sequences. As you can hear, the two audio segments are identical; the only difference is whether the segment is encoded together with its surrounding context. Yet the discrete audio token sequences differ substantially, which raises uncertainty for neural codec language models when predicting the next token.
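The comparison procedure described above can be sketched with a toy example. This is only an illustration under stated assumptions: `toy_tokenize` is a hypothetical stand-in that quantizes a windowed mean with a fixed amount of left context, not EnCodec or any other neural codec, and `HOP`, `CONTEXT`, and `LEVELS` are made-up values. It shows the mechanics of encoding a segment in context and in isolation, and why the two encodings can disagree.

```python
import math

HOP = 4        # samples per token frame (toy value, assumption)
CONTEXT = 8    # samples of left context each frame can see (toy value)
LEVELS = 16    # codebook size of the toy quantizer (toy value)


def toy_tokenize(samples):
    """Toy context-dependent tokenizer. NOT a neural codec."""
    tokens = []
    for start in range(0, len(samples) - HOP + 1, HOP):
        lo = max(0, start - CONTEXT)
        window = samples[lo:start + HOP]
        # Context that falls before the start of the input counts as zeros.
        mean = sum(window) / (CONTEXT + HOP)
        # Map a mean in [-1, 1] onto one of LEVELS integer codes.
        code = int((mean + 1.0) / 2.0 * (LEVELS - 1) + 0.5)
        tokens.append(min(LEVELS - 1, max(0, code)))
    return tokens


def compare_with_and_without_context(samples, seg_start, seg_len):
    """Return (tokens cut from the full encoding, tokens of the isolated segment)."""
    if seg_start % HOP or seg_len % HOP:
        raise ValueError("segment must be aligned to the token frame grid")
    full_tokens = toy_tokenize(samples)
    in_context = full_tokens[seg_start // HOP:(seg_start + seg_len) // HOP]
    isolated = toy_tokenize(samples[seg_start:seg_start + seg_len])
    return in_context, isolated


audio = [math.sin(0.3 * n) for n in range(80)]
in_context, isolated = compare_with_and_without_context(audio, seg_start=40, seg_len=24)
print("with context:   ", in_context)
print("without context:", isolated)
```

In this toy, only the frames whose context window reaches past the segment boundary can change. It reproduces the mechanism of context dependence, not the behavior of a trained codec.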


Ground Truth Transcript
I believe, Charlie, he recommenced suddenly, there is not such an unnatural family on record as ours; is there?

Columns: Neural Audio Codec, Reconstructed Audio, Audio Tokens, Reconstructed Audio Segments, Audio Tokens from Audio Segments, First Layer's Consistency Accuracy, First 3 Layers' Consistency Accuracy. Reconstructed Audio and Reconstructed Audio Segments are audio samples.

Neural Audio Codec: Ours
Audio Tokens:
...,
894, 894, 12, 894, 894,
256, 1009, 928, 895, 798,
498, 303, 731, 367, 388,
663, 396, 95, 155, 937,
577, 933, 595, 33, 715,
33, 33, 535, 535, 15,
587, 20, 20, 280, 21,
21, 327, 21, 17, 17,
150, 17, 150, 126, 126,
126, 126, 126, 126, 126,
...
Audio Tokens from Audio Segments:
...,
894, 894, 12, 894, 894,
256, 1009, 928, 895, 798,
498, 303, 731, 367, 388,
663, 396, 95, 155, 937,
577, 933, 595, 33, 715,
33, 33, 535, 535, 15,
587, 20, 20, 280, 21,
21, 327, 21, 17, 17,
150, 17, 150, 126, 126,
126, 126, 126, 126, 126,
...
First Layer's Consistency Accuracy: 100.00%
First 3 Layers' Consistency Accuracy: 92.00%

Neural Audio Codec: Ours w/o consistency constraint
Audio Tokens:
...,
40, 40, 40, 1020, 549,
432, 439, 923, 193, 642,
375, 570, 282, 745, 864,
317, 807, 184, 807, 723,
566, 861, 128, 934, 934,
717, 150, 507, 592, 738,
943, 786, 166, 786, 786,
166, 786, 786, 786, 117,
117, 117, 117, 117, 117,
117, 882, 882, 882, 882,
...
Audio Tokens from Audio Segments:
...,
171, 171, 171, 549, 549,
233, 439, 923, 158, 433,
956, 570, 938, 367, 151,
317, 807, 156, 5, 777,
145, 956, 498, 934, 453,
453, 934, 934, 1023, 679,
784, 784, 784, 422, 422,
784, 784, 784, 422, 422,
846, 784, 407, 882, 637,
637, 515, 637, 637, 445,
...
First Layer's Consistency Accuracy: 14.00%
First 3 Layers' Consistency Accuracy: 8.00%

Neural Audio Codec: EnCodec
Audio Tokens:
...,
463, 373, 463, 463, 53,
373, 373, 731, 461, 537,
457, 392, 233, 679, 185,
112, 185, 432, 699, 136,
321, 967, 136, 321, 224,
136, 321, 224, 984, 1008,
679, 788, 465, 906, 151,
151, 151, 151, 151, 976,
491, 834, 430, 835, 408,
408, 408, 408, 408, 408,
408, 408, 62, 408, 408,
408, 408, 408, 835, 835,
835, 835, 835, 835, 475,
475, 475, 475, 475, 475,
25, 475, 25, 779, 475,
...
Audio Tokens from Audio Segments:
...,
463, 463, 463, 463, 53,
373, 373, 731, 731, 25,
457, 392, 233, 679, 185,
747, 185, 432, 699, 224,
321, 967, 136, 321, 224,
136, 321, 224, 984, 1008,
868, 533, 151, 906, 151,
151, 151, 151, 276, 976,
491, 834, 430, 835, 408,
62, 408, 408, 62, 408,
408, 408, 408, 408, 408,
408, 835, 835, 835, 835,
62, 835, 835, 835, 835,
475, 475, 25, 475, 475,
25, 475, 25, 25, 47,
...
First Layer's Consistency Accuracy: 76.00%
First 3 Layers' Consistency Accuracy: 58.67%
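The consistency accuracy is not defined on this page. A minimal sketch, assuming it is the percentage of frame positions at which the token from the entire audio equals the token from the isolated segment, is shown below. The function name `consistency_accuracy` is illustrative, and the two lists are the excerpts listed above for "Ours w/o consistency constraint".

```python
def consistency_accuracy(tokens_in_context, tokens_isolated):
    """Percentage of aligned positions where both sequences hold the same token.

    Assumed definition (position-wise match rate); the page itself does not
    spell out how the metric is computed.
    """
    if not tokens_in_context or len(tokens_in_context) != len(tokens_isolated):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(tokens_in_context, tokens_isolated))
    return 100.0 * matches / len(tokens_in_context)


# Excerpt for "Ours w/o consistency constraint": tokens from the entire audio.
with_context = [
    40, 40, 40, 1020, 549, 432, 439, 923, 193, 642,
    375, 570, 282, 745, 864, 317, 807, 184, 807, 723,
    566, 861, 128, 934, 934, 717, 150, 507, 592, 738,
    943, 786, 166, 786, 786, 166, 786, 786, 786, 117,
    117, 117, 117, 117, 117, 117, 882, 882, 882, 882,
]
# Same codec: tokens from the audio segment encoded on its own.
segment_only = [
    171, 171, 171, 549, 549, 233, 439, 923, 158, 433,
    956, 570, 938, 367, 151, 317, 807, 156, 5, 777,
    145, 956, 498, 934, 453, 453, 934, 934, 1023, 679,
    784, 784, 784, 422, 422, 784, 784, 784, 422, 422,
    846, 784, 407, 882, 637, 637, 515, 637, 637, 445,
]

accuracy = consistency_accuracy(with_context, segment_only)
print(f"{accuracy:.2f}%")  # 7 of 50 positions match -> 14.00%
```

On this excerpt the position-wise match rate agrees with the first-layer figure reported for that codec, which is consistent with the assumed definition but does not confirm it.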

3 Speech Reconstruction Results

We demonstrate speech reconstruction results for popular neural audio codecs (e.g., EnCodec) and for the codec trained with the consistency constraint (denoted as Ours).

Neural Audio Codec Samples

Ground Truth transcripts:
sample1: This I took for a sign that he had himself something to produce and that we should only have to wait.
sample2: She looked at his heavy shoulders and big, determined head, thrust forward like a catapult in leash.
sample3: Do you suppose that god for the sake of a few lutheran heretics would disown his entire church?
sample4: You must see, lieutenant, I should think, that we are not so near the coast of algeria as you imagined.

Audio samples (sample1 to sample4) are provided for: Ground Truth, Ours, EnCodec, HiFiCodec, SpeechTokenizer, DAC, FunCodec.

4 Speech Generation Results

We demonstrate speech generation results for neural codec language models. The neural codec language models are built on popular neural audio codecs (e.g., EnCodec) and on the codec trained with the consistency constraint (denoted as Ours). The subscripts of the neural codec language models (e.g., 330M, 44Kh) denote the model size and the training data scale.

Neural Codec Language Model Samples

Ground Truth prompts (text and audio):
sample1: And the lectures, and the dissecting rooms, has thee thought of the dissecting rooms?
sample2: Infirmities induced by over indulgence are among some peoples freely recognised as manly attributes.
sample3: He was particularly attentive to the behavior of their preachers, on whom all depended.
sample4: He reminds them of the time when he opposed peter to his face and reproved the chief of the apostles.

Ground Truth reference texts (text and audio):
sample1: He would say good night, but not good bye.
sample2: but that does not remain the sole purpose of their consumption.
sample3: The parliament and the scots laid their proposals before the king.
sample4: Humble man that he was, he will not now take a back seat.

Audio samples (sample1 to sample4) are provided for each neural codec language model, grouped by neural audio codec:
Ours: VALL-E_960h, VALL-E_44Kh
Ours w/o consistency constraint: VALL-E_960h, VALL-E_44Kh
mHuBERT: SpeechGPT
EnCodec: VoiceCraft_330M, VoiceCraft_880M
Mel VQ-VAE: XTTS_v2
SpeechTokenizer: USLM, AnyGPT, VALL-E