Abstract
Recent advances in generative models have driven remarkable progress in music generation. However, most existing methods focus on generating monophonic or homophonic music, so generating polyphonic, multi-track music with rich attributes remains challenging. In this paper, we propose DiffuseRoll, a novel image-based music generation approach that uses diffusion models to generate multi-track, multi-attribute music. Specifically, we generate music piano-rolls with diffusion models and map them to MIDI files for output. To capture rich attribute information, we design a color-encoding system that encodes music note sequences into color and position information representing note pitch, velocity, tempo, and instrument. This scheme enables a seamless mapping between discrete music sequences and continuous images. We further propose the Music Mini Expert System (MusicMES) to optimize the generated music for better performance. We conduct subjective experiments on the metrics Coherence, Diversity, Harmoniousness, Structureness, Orchestration, and Overall Preference, together with their Average. The results improve on state-of-the-art image-based methods.
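The abstract's color-encoding idea can be illustrated with a minimal sketch: each note becomes a colored span in a piano-roll image, where position encodes pitch and time, and the color channels carry the remaining attributes. The specific channel assignment below (R = velocity, G = tempo, B = instrument) and all constants are assumptions for illustration, not the paper's actual mapping.

```python
import numpy as np

# Hypothetical color-encoding sketch: position encodes (pitch, time),
# RGB channels encode (velocity, tempo, instrument). The channel
# assignment and normalization constants are assumptions, not the
# scheme actually used in the paper.

N_PITCHES = 128          # MIDI pitch range
TICKS = 256              # assumed time resolution of one piano-roll image
N_INSTRUMENTS = 128      # General MIDI program numbers
MAX_TEMPO = 300.0        # assumed tempo ceiling for normalization

def encode_note(roll, pitch, start, end, velocity, tempo, program):
    """Paint one note onto an H x W x 3 float image in [0, 1]."""
    color = np.array([
        velocity / 127.0,               # R: note velocity
        tempo / MAX_TEMPO,              # G: tempo
        program / (N_INSTRUMENTS - 1),  # B: instrument
    ])
    roll[pitch, start:end] = color
    return roll

def decode_pixel(pixel):
    """Invert the channel mapping back to discrete note attributes."""
    velocity = int(round(pixel[0] * 127))
    tempo = pixel[1] * MAX_TEMPO
    program = int(round(pixel[2] * (N_INSTRUMENTS - 1)))
    return velocity, tempo, program

# Encode one note (middle C, 16 ticks, velocity 100, 120 BPM, piano).
roll = np.zeros((N_PITCHES, TICKS, 3))
roll = encode_note(roll, pitch=60, start=0, end=16,
                   velocity=100, tempo=120.0, program=0)
```

Because the mapping is invertible up to quantization, a generated image can be decoded pixel by pixel back into a discrete note sequence, which is what makes the round trip between MIDI and continuous images possible.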
Data availability
The project and related resources are available at https://github.com/Fairywang9/DiffuseRoll.
Acknowledgements
The authors would like to thank the anonymous reviewers for their insightful comments. This work was supported by the Natural Science Foundation of China under grant No. 62201524, No. 62271455, No. 61971383 and the Fundamental Research Funds for the Central Universities under grant No. CUC23GZ016.
Additional information
Communicated by B. Bao.
Cite this article
Wang, H., Zou, Y., Cheng, H. et al. DiffuseRoll: multi-track multi-attribute music generation based on diffusion model. Multimedia Systems 30, 19 (2024). https://doi.org/10.1007/s00530-023-01220-9