DiffuseRoll: multi-track multi-attribute music generation based on diffusion model

Regular Paper · Published in Multimedia Systems

Abstract

Recent advances in generative models have shown remarkable progress in music generation. However, since most existing methods focus on generating monophonic or homophonic music, generating polyphonic, multi-track music with rich attributes remains challenging. In this paper, we propose DiffuseRoll, a novel image-based approach that uses diffusion models to generate multi-track, multi-attribute music. Specifically, we generate piano-rolls with diffusion models and map them to MIDI files for output. To capture rich attribute information, we design a color-encoding system that encodes music note sequences into color and position information representing note pitch, velocity, tempo, and instrument. This scheme enables a seamless mapping between discrete music sequences and continuous images. We further propose the Music Mini Expert System (MusicMES) to refine the generated music. We conduct subjective experiments with the evaluation metrics Coherence, Diversity, Harmoniousness, Structureness, Orchestration, Overall Preference, and their Average; the results show improvements over state-of-the-art image-based methods.
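To make the color-encoding idea concrete, the following is a minimal sketch of how such a scheme could map a note sequence to an RGB piano-roll image and decode it back. The specific channel layout (R for velocity, G for tempo, B for instrument program), the grid dimensions, and the tempo ceiling are illustrative assumptions for this sketch, not the paper's exact specification.

```python
# Illustrative sketch of a color-encoding scheme in the spirit of DiffuseRoll.
# Assumed note format: (pitch, velocity, tempo, program, onset_step, offset_step)
# on a quantized time grid. Channel layout and constants are assumptions.
import numpy as np

N_PITCHES = 128   # MIDI pitch range (vertical image axis)
N_STEPS = 256     # quantized time steps per image (assumed width)
MAX_TEMPO = 300.0 # assumed tempo ceiling in BPM for normalization

def notes_to_roll(notes):
    """Encode a note sequence as an RGB piano-roll image with values in [0, 1]."""
    roll = np.zeros((N_PITCHES, N_STEPS, 3), dtype=np.float32)
    for pitch, velocity, tempo, program, on, off in notes:
        roll[pitch, on:off, 0] = velocity / 127.0    # R channel: velocity
        roll[pitch, on:off, 1] = tempo / MAX_TEMPO   # G channel: tempo
        roll[pitch, on:off, 2] = program / 127.0     # B channel: instrument
    return roll

def roll_to_notes(roll, threshold=0.05):
    """Decode an RGB piano-roll back into note tuples (inverse mapping)."""
    notes = []
    for pitch in range(N_PITCHES):
        step = 0
        while step < N_STEPS:
            if roll[pitch, step].max() > threshold:  # note active at this step
                on = step
                while step < N_STEPS and roll[pitch, step].max() > threshold:
                    step += 1
                r, g, b = roll[pitch, on]            # read attributes at onset
                notes.append((pitch,
                              int(round(r * 127)),       # velocity
                              int(round(g * MAX_TEMPO)), # tempo
                              int(round(b * 127)),       # program
                              on, step))
            else:
                step += 1
    return notes
```

In a pipeline like the one described, a diffusion model would be trained to generate such images, and the decoded note tuples would then be written out as a MIDI file, e.g. with a MIDI library such as pretty_midi.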


Data availability

The project and related resources are available at https://github.com/Fairywang9/DiffuseRoll.


Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful comments. This work was supported by the Natural Science Foundation of China under Grants No. 62201524, No. 62271455, and No. 61971383, and by the Fundamental Research Funds for the Central Universities under Grant No. CUC23GZ016.

Author information

Corresponding author

Correspondence to Haonan Cheng.

Additional information

Communicated by B. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, H., Zou, Y., Cheng, H. et al. DiffuseRoll: multi-track multi-attribute music generation based on diffusion model. Multimedia Systems 30, 19 (2024). https://doi.org/10.1007/s00530-023-01220-9
