Token-Based Audio Inpainting via Discrete Diffusion

Tali Dror*, Iftach Shoham*, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani
Ben-Gurion University of the Negev · University of Haifa
*Equal contribution
Figure: Overview of AIDD. Audio is tokenized using WavTokenizer, inpainted via discrete diffusion, and decoded back to waveform audio.

Abstract

Audio inpainting seeks to restore missing segments in degraded recordings. Prior diffusion-based methods struggle when the missing region is long. We introduce the first approach that applies discrete diffusion over tokenized music representations, enabling stable and semantically coherent restoration of long gaps. Our method incorporates span-based masking and a derivative-based regularization loss, and achieves state-of-the-art performance on MusicNet and MAESTRO for gaps of up to 750 ms.

Method

AIDD operates entirely in a discrete token space. Raw audio waveforms are first converted into compact sequences of discrete tokens using a pretrained WavTokenizer. A Diffusion Transformer (DiT) then performs inpainting by learning the reverse discrete diffusion process with an absorbing mask state.
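For concreteness, the following is a minimal sketch of an absorbing-state discrete diffusion training step over token sequences, in the spirit of the description above. The TokenDenoiser module, MASK_ID, VOCAB_SIZE, and model sizes are illustrative assumptions, not the actual AIDD architecture or hyperparameters.

import torch
import torch.nn as nn

MASK_ID = 4096      # hypothetical index of the absorbing [MASK] token
VOCAB_SIZE = 4097   # hypothetical: WavTokenizer codebook size plus the mask token

class TokenDenoiser(nn.Module):
    # Stand-in for the DiT: maps noisy token ids to logits over the clean vocabulary.
    def __init__(self, vocab=VOCAB_SIZE, dim=512, heads=8, layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, t):
        # A real DiT also conditions on the diffusion time t (e.g. via adaptive layer norm);
        # that conditioning is omitted here for brevity.
        return self.head(self.backbone(self.embed(tokens)))

def absorb(tokens, t):
    # Forward (noising) process with an absorbing state: each token is independently
    # replaced by [MASK] with probability t.
    drop = torch.rand(tokens.shape, device=tokens.device) < t
    return torch.where(drop, torch.full_like(tokens, MASK_ID), tokens)

def diffusion_loss(model, clean_tokens):
    # Reverse-process training objective: predict the clean tokens at masked positions.
    t = torch.rand(clean_tokens.size(0), 1, device=clean_tokens.device)  # per-sample noise level
    noisy = absorb(clean_tokens, t)
    logits = model(noisy, t)
    masked = noisy == MASK_ID
    if not masked.any():  # degenerate draw with nothing masked; skip in this sketch
        return logits.sum() * 0.0
    return nn.functional.cross_entropy(logits[masked], clean_tokens[masked])

At inference, inpainting amounts to keeping the observed tokens fixed and iteratively unmasking the gap region with such a denoiser before decoding back to audio with WavTokenizer.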

To better model structured missing regions, we introduce span-based masking during training. Additionally, a derivative-based regularization loss encourages smooth temporal dynamics in token embedding space, improving perceptual coherence of the reconstructed audio.
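A minimal sketch of both ingredients follows, under assumed names and hyperparameters (span_mask, derivative_reg, the span length, and the weight lam are illustrative, not the paper's exact settings).

import torch

def span_mask(tokens, mask_id, span_len=64, n_spans=1):
    # Span-based masking: hide contiguous runs of tokens rather than independent
    # positions, mimicking the structured gaps seen at inference time.
    noisy = tokens.clone()
    batch, length = tokens.shape
    for b in range(batch):
        for _ in range(n_spans):
            start = torch.randint(0, length - span_len + 1, (1,)).item()
            noisy[b, start:start + span_len] = mask_id
    return noisy

def derivative_reg(pred_embeddings, lam=0.1):
    # Derivative-based regularizer: penalize large first differences between
    # consecutive token embeddings so the reconstruction evolves smoothly in time.
    diff = pred_embeddings[:, 1:, :] - pred_embeddings[:, :-1, :]  # (batch, length - 1, dim)
    return lam * diff.pow(2).mean()

In training, such a smoothness term would simply be added to the masked-token cross-entropy objective.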

Results

AIDD achieves competitive or state-of-the-art results on MusicNet and MAESTRO across objective metrics (FAD, LSD, ODG) and subjective MOS evaluations, particularly for medium and long gaps (150–750 ms), while being more efficient than prior diffusion-based approaches.

Audio Examples

Qualitative examples from MAESTRO with 375 ms and 750 ms gaps. Each row shows Original, Masked, and AIDD Reconstruction.

Gap: 375 ms

Original · Masked · AIDD

Gap: 750 ms

Original · Masked · AIDD

Citation

@article{dror2025token,
  title={Token-based Audio Inpainting via Discrete Diffusion},
  author={Dror, Tali and Shoham, Iftach and Buchris, Moshe and Gal, Oren and Permuter, Haim and Katz, Gilad and Nachmani, Eliya},
  journal={arXiv preprint arXiv:2507.08333},
  year={2025}
}

Acknowledgments

This work builds on prior work, including Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution and WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling. We are grateful to the authors for making their methods and code publicly available.