AIDD operates entirely in a discrete token space. Raw audio waveforms are first converted into compact sequences of discrete tokens using a pretrained WavTokenizer. A Diffusion Transformer (DiT) then performs inpainting by learning the reverse discrete diffusion process with an absorbing mask state.
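For intuition, here is a minimal PyTorch sketch of what inpainting in this token space could look like. The `model` interface, the codebook size, the `MASK_ID` convention, and the confidence-based iterative unmasking schedule are illustrative assumptions rather than AIDD's exact sampler; the completed token sequence would then be decoded back to a waveform with the WavTokenizer decoder.

```python
import torch

VOCAB_SIZE = 4096     # codebook size (assumption; WavTokenizer's actual size may differ)
MASK_ID = VOCAB_SIZE  # extra absorbing "mask" state appended to the vocabulary
NUM_STEPS = 16        # number of reverse diffusion steps (assumption)

@torch.no_grad()
def inpaint_tokens(model, tokens, gap_mask):
    """Fill missing spans in a token sequence by iterative unmasking.

    tokens:   (B, T) LongTensor of discrete audio codes
    gap_mask: (B, T) BoolTensor, True where audio is missing
    model:    DiT-style denoiser mapping (B, T) tokens -> (B, T, VOCAB_SIZE) logits
    """
    x = tokens.clone()
    x[gap_mask] = MASK_ID                    # absorb the missing region into the mask state
    for step in range(NUM_STEPS):
        masked = x.eq(MASK_ID)
        if not masked.any():
            break
        logits = model(x)                    # predict real tokens at every position
        conf, pred = logits.softmax(-1).max(-1)
        # Commit the most confident predictions; leave the rest absorbed for later steps.
        frac = (step + 1) / NUM_STEPS        # linear unmasking schedule (assumption)
        thresh = torch.quantile(conf[masked], 1.0 - frac)
        commit = masked & (conf >= thresh)
        x[commit] = pred[commit]
    return x
```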
To better model structured missing regions, we introduce span-based masking during training. Additionally, a derivative-based regularization loss encourages smooth temporal dynamics in token embedding space, improving perceptual coherence of the reconstructed audio.
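As a rough sketch of these two training components, the snippet below implements a contiguous-span mask sampler and a first-difference smoothness penalty on token embeddings. The single-span assumption, the span-length distribution, and the exact form of the loss are illustrative choices, not the paper's settings:

```python
import torch

def sample_span_mask(batch, length, max_span=60, device="cpu"):
    """Mask one contiguous span per sequence (hypothetical training-time masking)."""
    mask = torch.zeros(batch, length, dtype=torch.bool, device=device)
    span = torch.randint(1, max_span + 1, (batch,), device=device)
    start = (torch.rand(batch, device=device) * (length - span).clamp(min=1)).long()
    for b in range(batch):
        s = int(start[b])
        mask[b, s:s + int(span[b])] = True
    return mask

def derivative_regularizer(embeddings):
    """Penalize large first differences along time in token-embedding space,
    encouraging smooth temporal dynamics. embeddings: (B, T, D)."""
    diff = embeddings[:, 1:, :] - embeddings[:, :-1, :]
    return diff.pow(2).mean()
```

The regularizer would be added to the main diffusion objective with a small weight, e.g. `loss = diffusion_loss + lam * derivative_regularizer(emb)`.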
AIDD achieves competitive or state-of-the-art results on MusicNet and MAESTRO across objective metrics (FAD, LSD, ODG) and subjective MOS evaluations, particularly for medium and long gaps (150–750 ms), while being more efficient than prior diffusion-based approaches.
Qualitative examples from MAESTRO with 375 ms and 750 ms gaps. Each example is presented as the Original, the Masked input, and the AIDD reconstruction.
@article{dror2025token,
title={Token-based Audio Inpainting via Discrete Diffusion},
author={Dror, Tali and Shoham, Iftach and Buchris, Moshe and Gal, Oren and Permuter, Haim and Katz, Gilad and Nachmani, Eliya},
journal={arXiv preprint arXiv:2507.08333},
year={2025}
}
This work builds on prior work, including Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution and WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling. We are grateful to the authors for making their methods and code publicly available.