MIDILM: A Dual-Path Model for Controllable Text-to-MIDI Generation

Li, Shuyu; Choi, Dooho; Sung, Yunsick

doi:10.1609/aaai.v40i28.39483

Citations (SCOPUS): 0

Abstract

Text-to-MIDI generation offers editable and hierarchical control over symbolic music generation. Previous approaches either convert text into a limited set of musical attributes and generate music based on these attributes, which limits semantic controllability, or use end-to-end models that map text directly to music without deeply aligning the features of both modalities, often resulting in a lack of structural coherence and mismatches in key, meter, and tempo. We propose MIDILM, which addresses these limitations by employing text conditioning with a dual-path decoder that processes textual and musical information through separate feedforward paths following a shared masked self-attention mechanism. On the MidiCaps benchmark, MIDILM outperformed the strongest baseline, with relative improvements ranging from 6.07% on CLAP to 144.77% on TB across semantic alignment and structural metrics. These gains confirm its ability to enhance both semantic controllability and structural coherence. Collectively, we expect that MIDILM will serve as a useful reference framework for future investigations into controllable and structurally faithful cross-modal music generation. © 2026, Association for the Advancement of Artificial Intelligence. All rights reserved.
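The dual-path idea described in the abstract (a shared masked self-attention step followed by separate feedforward paths for text and music) can be illustrated with a minimal sketch. This is not the authors' implementation: the routing of each position by a per-token modality flag, the single attention head, the ReLU feedforward layers, the omission of layer normalization, and every name below (`dual_path_block`, `make_params`, `is_text`) are assumptions made only for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weight matrix; stands in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head masked self-attention: position i attends only to j <= i."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    scale = math.sqrt(len(Q[0]))
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, K[j])) / scale for j in range(i + 1)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * V[j][d] for j, w in enumerate(weights))
                    for d in range(len(V[0]))])
    return out


def feedforward(x, W1, W2):
    """Two-layer feedforward path with a ReLU in between."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def dual_path_block(X, is_text, params):
    """Shared masked attention, then a modality-specific feedforward path.

    ASSUMPTION: each position is routed by its is_text flag; the abstract
    does not specify the routing rule. Layer normalization is omitted.
    """
    attn = causal_self_attention(X, params["Wq"], params["Wk"], params["Wv"])
    H = [[x + a for x, a in zip(xs, at)] for xs, at in zip(X, attn)]  # residual
    out = []
    for h, text_flag in zip(H, is_text):
        W1, W2 = params["text_ffn"] if text_flag else params["music_ffn"]
        f = feedforward(h, W1, W2)
        out.append([a + b for a, b in zip(h, f)])  # residual
    return out


def make_params(d_model, d_hidden, seed=0):
    """Hypothetical helper that builds random weights for one block."""
    rng = random.Random(seed)

    def ffn():
        return (rand_matrix(d_hidden, d_model, rng),
                rand_matrix(d_model, d_hidden, rng))

    return {
        "Wq": rand_matrix(d_model, d_model, rng),
        "Wk": rand_matrix(d_model, d_model, rng),
        "Wv": rand_matrix(d_model, d_model, rng),
        "text_ffn": ffn(),
        "music_ffn": ffn(),
    }
```

In this sketch both modalities see the same attention context, so music positions can attend to the preceding text prompt, while the feedforward weights stay separate per modality. A real model would presumably stack many such blocks with multiple heads and normalization.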

Published: 2026

Type: Conference paper

Journal: Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence

Volume: 40

Issue: 28

Pages: 23160-23168