DiffuseRoll: multi-track multi-attribute music generation based on diffusion model

Regular Paper · Published in Multimedia Systems

Abstract

Recent advances in generative models have shown remarkable progress in music generation. However, since most existing methods focus on generating monophonic or homophonic music, generating polyphonic, multi-track music with rich attributes remains challenging. In this paper, we propose DiffuseRoll, a novel image-based approach that uses diffusion models to generate multi-track, multi-attribute music. Specifically, we generate piano-rolls with diffusion models and map them to MIDI files for output. To capture rich attribute information, we design a color-encoding system that encodes music note sequences into color and position information representing note pitch, velocity, tempo, and instrument. This scheme enables a seamless mapping between discrete music sequences and continuous images. We further propose the Music Mini Expert System (MusicMES) to refine the generated music. We conduct subjective experiments with the evaluation metrics Coherence, Diversity, Harmoniousness, Structureness, Orchestration, Overall Preference, and their Average; the results show improvements over state-of-the-art image-based methods.
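To make the color-encoding idea concrete, the following is a minimal sketch of how such a scheme could map a note sequence to an RGB piano-roll image and decode it back. The specific channel layout (R for velocity, G for tempo, B for instrument program), the grid dimensions, and the tempo ceiling are illustrative assumptions for this sketch, not the paper's exact specification.

```python
# Illustrative sketch of a color-encoding scheme in the spirit of DiffuseRoll.
# Assumed note format: (pitch, velocity, tempo, program, onset_step, offset_step)
# on a quantized time grid. Channel layout and constants are assumptions.
import numpy as np

N_PITCHES = 128   # MIDI pitch range (vertical image axis)
N_STEPS = 256     # quantized time steps per image (assumed width)
MAX_TEMPO = 300.0 # assumed tempo ceiling in BPM for normalization

def notes_to_roll(notes):
    """Encode a note sequence as an RGB piano-roll image with values in [0, 1]."""
    roll = np.zeros((N_PITCHES, N_STEPS, 3), dtype=np.float32)
    for pitch, velocity, tempo, program, on, off in notes:
        roll[pitch, on:off, 0] = velocity / 127.0    # R channel: velocity
        roll[pitch, on:off, 1] = tempo / MAX_TEMPO   # G channel: tempo
        roll[pitch, on:off, 2] = program / 127.0     # B channel: instrument
    return roll

def roll_to_notes(roll, threshold=0.05):
    """Decode an RGB piano-roll back into note tuples (inverse mapping)."""
    notes = []
    for pitch in range(N_PITCHES):
        step = 0
        while step < N_STEPS:
            if roll[pitch, step].max() > threshold:  # note active at this step
                on = step
                while step < N_STEPS and roll[pitch, step].max() > threshold:
                    step += 1
                r, g, b = roll[pitch, on]            # read attributes at onset
                notes.append((pitch,
                              int(round(r * 127)),       # velocity
                              int(round(g * MAX_TEMPO)), # tempo
                              int(round(b * 127)),       # program
                              on, step))
            else:
                step += 1
    return notes
```

In a pipeline like the one described, a diffusion model would be trained to generate such images, and the decoded note tuples would then be written out as a MIDI file, e.g. with a MIDI library such as pretty_midi.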


Data availability

The project and related resources are available at https://github.com/Fairywang9/DiffuseRoll.


Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful comments. This work was supported by the Natural Science Foundation of China under Grants No. 62201524, No. 62271455, and No. 61971383, and by the Fundamental Research Funds for the Central Universities under Grant No. CUC23GZ016.

Author information

Corresponding author

Correspondence to Haonan Cheng.

Additional information

Communicated by B. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, H., Zou, Y., Cheng, H. et al. DiffuseRoll: multi-track multi-attribute music generation based on diffusion model. Multimedia Systems 30, 19 (2024). https://doi.org/10.1007/s00530-023-01220-9
