Enhancing Diffusion-Based Music Generation Performance with LoRA

Kim, Seonpyo; Kim, Geonhui; Yagishita, Shoki; Han, Daewoon; Im, Jeonghyeon; Sung, Yunsick

Access: Open access

Issue Date: Aug-2025

Publisher: MDPI

Keywords: text-to-music generation; Parameter-Efficient Fine-Tuning (PEFT); low-rank adaptation (LoRA)

Citation: Applied Sciences, vol. 15, no. 15, pp. 1-17

Pages: 17

Indexed: SCIE, SCOPUS

Journal Title: Applied Sciences

Volume: 15

Number: 15

Start Page: 1

End Page: 17

URI: https://scholarworks.dongguk.edu/handle/sw.dongguk/58992

DOI: 10.3390/app15158646

ISSN: 2076-3417

Abstract: Recent progress in generative artificial intelligence has significantly advanced text-to-music generation, enabling users to create music from natural language descriptions. Despite the success of models such as MusicLM, MusicGen, and AudioLDM, current approaches struggle to capture fine-grained genre-specific characteristics, precisely control musical attributes, and handle underrepresented cultural data. This paper introduces a novel, lightweight fine-tuning method for the AudioLDM framework using low-rank adaptation (LoRA). By updating only selected attention and projection layers, the proposed method enables efficient adaptation to musical genres with limited data and computational cost. It enhances controllability over key musical parameters such as rhythm, emotion, and timbre while maintaining the overall quality of music generation. This paper represents the first application of LoRA to AudioLDM, offering a scalable solution for fine-grained, genre-aware music generation and customization. The experimental results show that the proposed method improves semantic alignment and statistical similarity relative to the baseline. The contrastive language-audio pretraining (CLAP) score increased by 0.0498, indicating enhanced text-music consistency. The kernel audio distance (KAD) score decreased by 0.8349, reflecting improved similarity to real music distributions. The mean opinion score (MOS) ranged from 3.5 to 3.8, confirming the perceptual quality of the generated music.
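
Illustrative note: the abstract describes updating only selected attention and projection layers through low-rank adaptation. The general LoRA idea can be sketched as below, under stated assumptions. This is not the authors' code and does not use AudioLDM; the class name `LoRALinear`, the 64x64 layer size, the rank of 4, and the scaling value `alpha` are hypothetical choices made only for illustration. The pretrained weight `W` stays frozen, and only the two small matrices `A` and `B` would be trained.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration of the general LoRA idea; sizes and
    hyperparameters are assumptions, not values from the paper.
    """

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable, small random init
        self.B = np.zeros((d_out, rank))                   # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (batch, d_in); the result has shape (batch, d_out).
        frozen = x @ self.weight.T
        update = self.scale * (x @ self.A.T) @ self.B.T
        return frozen + update

    def trainable_params(self):
        # Only A and B count; the frozen weight is excluded.
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))      # stand-in for one projection matrix
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 64))

# Because B starts at zero, the adapted layer initially matches the frozen one.
print(np.allclose(layer(x), x @ W.T))
# Trainable parameters in the adapter versus parameters in the full matrix.
print(layer.trainable_params(), W.size)
```

In this sketch the adapter holds 512 trainable values against 4096 in the full matrix, which is the kind of saving that makes adaptation with limited data and compute plausible. The actual layers, ranks, and training setup are described in the paper itself.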

Files in This Item: There are no files associated with this item.

Appears in Collections: ETC > 1. Journal Articles

Related Researcher

Sung, Yunsick: College of Advanced Convergence Engineering (Department of Computer Science and Artificial Intelligence)
