Abstract. Building upon advancements in Large Language Models (LLMs), the field of audio processing has seen increased interest in training audio generation models on discrete audio token sequences. However, directly discretizing audio with neural audio codecs often yields sequences that fundamentally differ from text sequences. Unlike text, whose token sequences are deterministic, discrete audio tokens can vary significantly with contextual factors while still representing perceptually identical audio segments. We refer to this phenomenon as Discrete Representation Inconsistency (DRI). This inconsistency means a single audio segment can be represented by multiple divergent token sequences, which creates confusion in neural codec language models and results in omissions and repetitions during speech generation. In this paper, we quantitatively analyze the DRI phenomenon within popular audio tokenizers such as EnCodec, and we address it by training the neural audio codec with a consistency constraint. Our approach effectively mitigates the DRI phenomenon of the neural audio codec. Furthermore, extensive experiments on neural codec language models over the LibriTTS and large-scale MLS (44,000 hours) datasets demonstrate the effectiveness and generality of our method.
Figure: Illustration of the Discrete Representation Inconsistency (DRI) phenomenon. Subfigure (a) shows that text, whether or not it includes contextual information, is encoded by the text tokenizer into the same text tokens. In contrast, Subfigure (b) illustrates that audio, with or without contextual information, is encoded by the audio tokenizer into different audio tokens. The DRI phenomenon within the audio tokenizer poses a many-to-one mapping problem, and the complexity of this many-to-one mapping increases the uncertainty of neural codec language models in predicting the next token.
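The context-independence of text tokenization in subfigure (a) can be sketched with a toy word-level tokenizer (an illustrative simplification; real text tokenizers use subword vocabularies, but a span's tokens likewise do not depend on its surroundings):

```python
# Toy word-level text tokenizer: each distinct word gets a fixed id,
# so the tokens of a span do not depend on surrounding context.
# (Illustrative simplification of subfigure (a), not a real tokenizer.)
vocab = {}

def tokenize(text):
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)  # assign the next free id
        ids.append(vocab[word])
    return ids

sentence = "there is not such an unnatural family on record as ours"
context = "i believe charlie he recommenced suddenly"

alone = tokenize(sentence)
with_context = tokenize(context + " " + sentence)

# The trailing tokens of the contextual encoding equal the context-free ones.
assert with_context[-len(alone):] == alone
```

An audio tokenizer built on a convolutional codec has no such guarantee: the receptive field mixes neighboring frames into each code, which is exactly what the DRI phenomenon measures.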
To analyze the DRI phenomenon, we use neural audio codecs as audio tokenizers to quantize both an entire audio clip and a segment cut from that clip, and then compare their corresponding discrete audio token sequences. As you can hear, the two audio segments are identical; the only difference is whether context is present. Yet their discrete audio token sequences differ substantially, which raises uncertainty for neural codec language models in predicting the next token.
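The consistency-accuracy numbers reported below can be computed along these lines (a hedged sketch of the comparison, not the authors' exact evaluation code): align the segment's tokens with the corresponding frames of the full-audio encoding and count matching codebook entries over the first n quantizer layers.

```python
def consistency_accuracy(full_tokens, segment_tokens, start_frame, n_layers=1):
    """Fraction of codebook entries that match between a segment encoded
    on its own and the same frames taken from the full-audio encoding.

    full_tokens / segment_tokens: list of layers (residual-VQ layout),
    each a list of per-frame token ids; start_frame: frame offset of the
    segment within the full audio. Illustrative sketch, not paper code.
    """
    n_frames = len(segment_tokens[0])
    matches = total = 0
    for layer in range(n_layers):
        for t in range(n_frames):
            total += 1
            if full_tokens[layer][start_frame + t] == segment_tokens[layer][t]:
                matches += 1
    return matches / total

# Toy example: one codebook layer, segment starting at frame 1;
# the last segment frame disagrees with the full encoding (2/3 match).
full = [[894, 894, 12, 894, 256]]
segment = [[894, 12, 999]]
score = consistency_accuracy(full, segment, start_frame=1)
```

A perfectly consistent codec scores 1.0 on this metric regardless of how much context surrounds the segment; the table below shows how far common codecs fall short of that.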
| Ground Truth | Transcript |
|---|---|
| (audio) | I believe, Charlie, he recommenced suddenly, there is not such an unnatural family on record as ours; is there? |
| Neural Audio Codec | Reconstructed Audio | Audio Tokens | Reconstructed Audio Segments | Audio Tokens from Audio Segments | First Layer's Consistency Accuracy | First 3 Layers' Consistency Accuracy |
|---|---|---|---|---|---|---|
| Ours | (audio) | ..., 894, 894, 12, 894, 894, 256, 1009, 928, 895, 798, 498, 303, 731, 367, 388, 663, 396, 95, 155, 937, 577, 933, 595, 33, 715, 33, 33, 535, 535, 15, 587, 20, 20, 280, 21, 21, 327, 21, 17, 17, 150, 17, 150, 126, 126, 126, 126, 126, 126, 126, ... | (audio) | ..., 894, 894, 12, 894, 894, 256, 1009, 928, 895, 798, 498, 303, 731, 367, 388, 663, 396, 95, 155, 937, 577, 933, 595, 33, 715, 33, 33, 535, 535, 15, 587, 20, 20, 280, 21, 21, 327, 21, 17, 17, 150, 17, 150, 126, 126, 126, 126, 126, 126, 126, ... | 100.00% | 92.00% |
| Ours w/o consistency constraint | (audio) | ..., 40, 40, 40, 1020, 549, 432, 439, 923, 193, 642, 375, 570, 282, 745, 864, 317, 807, 184, 807, 723, 566, 861, 128, 934, 934, 717, 150, 507, 592, 738, 943, 786, 166, 786, 786, 166, 786, 786, 786, 117, 117, 117, 117, 117, 117, 117, 882, 882, 882, 882, ... | (audio) | ..., 171, 171, 171, 549, 549, 233, 439, 923, 158, 433, 956, 570, 938, 367, 151, 317, 807, 156, 5, 777, 145, 956, 498, 934, 453, 453, 934, 934, 1023, 679, 784, 784, 784, 422, 422, 784, 784, 784, 422, 422, 846, 784, 407, 882, 637, 637, 515, 637, 637, 445, ... | 14.00% | 8.00% |
| EnCodec | (audio) | ..., 463, 373, 463, 463, 53, 373, 373, 731, 461, 537, 457, 392, 233, 679, 185, 112, 185, 432, 699, 136, 321, 967, 136, 321, 224, 136, 321, 224, 984, 1008, 679, 788, 465, 906, 151, 151, 151, 151, 151, 976, 491, 834, 430, 835, 408, 408, 408, 408, 408, 408, 408, 408, 62, 408, 408, 408, 408, 408, 835, 835, 835, 835, 835, 835, 475, 475, 475, 475, 475, 475, 25, 475, 25, 779, 475, ... | (audio) | ..., 463, 463, 463, 463, 53, 373, 373, 731, 731, 25, 457, 392, 233, 679, 185, 747, 185, 432, 699, 224, 321, 967, 136, 321, 224, 136, 321, 224, 984, 1008, 868, 533, 151, 906, 151, 151, 151, 151, 276, 976, 491, 834, 430, 835, 408, 62, 408, 408, 62, 408, 408, 408, 408, 408, 408, 408, 835, 835, 835, 835, 62, 835, 835, 835, 835, 475, 475, 25, 475, 475, 25, 475, 25, 25, 47, ... | 76.00% | 58.67% |
We demonstrate speech reconstruction results of popular neural audio codecs (e.g., EnCodec) and of the codec trained with the consistency constraint (denoted as Ours).
| Neural Audio Codec | sample1 | sample2 | sample3 | sample4 |
|---|---|---|---|---|
| Ground Truth | (audio) | (audio) | (audio) | (audio) |
| Transcript | This I took for a sign that he had himself something to produce and that we should only have to wait. | She looked at his heavy shoulders and big, determined head, thrust forward like a catapult in leash. | Do you suppose that god for the sake of a few lutheran heretics would disown his entire church? | You must see, lieutenant, I should think, that we are not so near the coast of algeria as you imagined. |
| Ours | (audio) | (audio) | (audio) | (audio) |
| EnCodec | (audio) | (audio) | (audio) | (audio) |
| HiFiCodec | (audio) | (audio) | (audio) | (audio) |
| SpeechTokenizer | (audio) | (audio) | (audio) | (audio) |
| DAC | (audio) | (audio) | (audio) | (audio) |
| FunCodec | (audio) | (audio) | (audio) | (audio) |
We demonstrate speech generation results of neural codec language models. The neural codec language models are based on popular neural audio codecs (e.g., EnCodec) and on the codec trained with the consistency constraint (denoted as Ours). The subscripts of the neural codec language models (e.g., 330M, 44Kh) denote the model size and training data scale.
| Neural Audio Codec | Neural Codec Language Model | sample1 | sample2 | sample3 | sample4 |
|---|---|---|---|---|---|
| Ground Truth | Prompt (text) | And the lectures, and the dissecting rooms, has thee thought of the dissecting rooms? | Infirmities induced by over indulgence are among some peoples freely recognised as manly attributes. | He was particularly attentive to the behavior of their preachers, on whom all depended. | He reminds them of the time when he opposed peter to his face and reproved the chief of the apostles. |
| | Prompt (audio) | (audio) | (audio) | (audio) | (audio) |
| Reference | Text | He would say good night, but not good bye. | but that does not remain the sole purpose of their consumption. | The parliament and the scots laid their proposals before the king. | Humble man that he was, he will not now take a back seat. |
| | Audio | (audio) | (audio) | (audio) | (audio) |
| Ours | VALL-E960h | (audio) | (audio) | (audio) | (audio) |
| | VALL-E44Kh | (audio) | (audio) | (audio) | (audio) |
| Ours w/o consistency constraint | VALL-E960h | (audio) | (audio) | (audio) | (audio) |
| | VALL-E44Kh | (audio) | (audio) | (audio) | (audio) |
| mHuBERT | SpeechGPT | (audio) | (audio) | (audio) | (audio) |
| EnCodec | VoiceCraft330M | (audio) | (audio) | (audio) | (audio) |
| | VoiceCraft880M | (audio) | (audio) | (audio) | (audio) |
| Mel VQ-VAE | XTTS_v2 | (audio) | (audio) | (audio) | (audio) |
| SpeechTokenizer | USLM | (audio) | (audio) | (audio) | (audio) |
| | AnyGPT | (audio) | (audio) | (audio) | (audio) |
| | VALL-E | (audio) | (audio) | (audio) | (audio) |