Google Announces MusicLM—Generating Music from Text Prompts

Ever since being astonished by the output of generative image synthesis systems such as DALL-E 2 and the text generated by large language models like ChatGPT, I’m sure I’m not the only one to have wondered, “How long before the AIs start composing music?”

Well, that wasn’t long. Google has just announced MusicLM, a generative text-to-music synthesiser which was trained on a data set containing five million audio clips, amounting to 280,000 hours of music. As with DALL-E 2 and ChatGPT, the desired output is described by an English text prompt, which can be as generic or specific as needed, and the result is music synthesised at a 24 kHz sampling rate that matches the description. For example, here is the result from the following text prompt:

Epic soundtrack using orchestral instruments. The piece builds tension, creates a sense of urgency. An a cappella chorus sing in unison, it creates a sense of power and strength.

Here are five minutes of music produced from a prompt of just “relaxing jazz”:

MusicLM can create music to accompany artworks by prompting with a text description of the image. Salvador Dali’s The Persistence of Memory:

is described in Encyclopedia Britannica as:

His melting-clock imagery mocks the rigidity of chronometric time. The watches themselves look like soft cheese—indeed, by Dali’s own account they were inspired by hallucinations after eating Camembert cheese. In the center of the picture, under one of the watches, is a distorted human face in profile. The ants on the plate represent decay.

Feeding this as a prompt to MusicLM causes it to compose:

These are just a few of the hundreds of examples in the page “MusicLM: Generating Music from Text” published on Google Research’s GitHub site. Visit it and be amazed.

Technical details of how the model operates and was trained are in the research paper (confusingly with the same title) “MusicLM: Generating Music From Text” posted on arXiv on 2023-01-26 with full text [PDF] linked to the paper. Here is the abstract.

We introduce MusicLM, a model generating high-fidelity music from text descriptions such as “a calming violin melody backed by a distorted guitar riff”. MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.

Here is the MusicCaps data set for music captioning, which you can download from GitHub.

MusicLM is not available for public access at this time. If and when that happens, the market for human-composed background and mood music for videos evaporates.

Here we go again: time to haul out the picture from Max Tegmark’s Life 3.0 once more.

I’ll leave you with:

Slow tempo, bass-and-drums-led reggae song. Sustained electric guitar. High-pitched bongos with ringing tones. Vocals are relaxed with a laid-back feel, very expressive.


Damn! The machine still needs those human serfs to generate audio clips and shovel them into the database. 🙂

More seriously, it suggests that the real test of Artificial Intelligence will be to come up with something – anything – that is not derivative of human creativity.


This sounds like a “fighting retreat” slogan. How much of anything that human intelligence creates is not derivative in numerous ways from previous human creativity or something which exists in the natural world? When a composer writes a new piece of music, they incorporate the musical language of the genre (for example, the chromatic scale of 12 notes in Western music, conventions of note length, tempo, rhythm, and a tower of higher level structure). Composers who have tried to liberate themselves from these shackles, such as the serialists (Schönberg, Berg, Webern, etc.) and those who introduced randomness into their music (Charles Ives, John Cage), often ended up producing music which resembled the output of crude 1960s computer algorithmic compositions which were, to many, unlistenable and boring.

Composers train themselves by listening to and reading music, absorbing its conventions, and then trying to express something new within that system, or expanding the system in various ways. If they veer too close to the work of others, they may find themselves in a copyright fight, even when the similarity was coincidental and unintended.

I don’t see a big difference in what a generative model that produces text, images, or music does, having been trained on a variety of existing material to absorb its conventions, from what a human does, absorbing a corpus of material and then building on it to produce something new. Both the AI and the human produce novel work which resembles earlier work in structure but is in no way a cut-and-paste of the work on which the training was done.

In popular music, most melodies are 12 or fewer notes long. If we restrict ourselves to the eight pitches in one octave of the C major scale (other keys are a trivial transposition), there are only 8^12 = 68,719,476,736 (68.7 billion) possible 12-note melodies. Expressed in the simplest, uncompressed form, this fits easily on a 1 terabyte drive with space left over. This is not just a mental exercise. See the 2020 article in The Atlantic, “The Hard Drive With 68 Billion Melodies”, about Damien Riehl and Noah Rubin, who cranked out all 68.7 billion melodies, claimed copyright in them for their All the Music LLC under the “born in copyright” doctrine, and then contributed them to the public domain under the Creative Commons Zero license. See the All the Music Web site and its FAQ page for more information.
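For anyone who wants to check the arithmetic, here is a rough Python sketch. It is my own illustration, not the All the Music code. It assumes eight pitches (one octave of the C major scale, which is what the 68,719,476,736 figure works out to as 8^12), twelve notes per melody, and one byte per note as the "simplest, uncompressed form". The function names are made up for this example.

```python
# Sketch only: counts and enumerates fixed-length melodies.
# Assumptions (mine, not stated by the project): 8 pitches, 12 notes,
# and 1 byte of storage per note.

PITCHES = 8          # assumed: C D E F G A B C, one octave of C major
MELODY_LENGTH = 12   # notes per melody


def melody_count(pitches=PITCHES, length=MELODY_LENGTH):
    """Number of distinct melodies: each note is an independent choice."""
    return pitches ** length


def melody_from_index(index, pitches=PITCHES, length=MELODY_LENGTH):
    """Return melody number `index` as a tuple of scale degrees (0..pitches-1).

    This treats the index as a base-`pitches` number with `length` digits,
    so every melody can be generated without storing any of the others.
    """
    if not 0 <= index < pitches ** length:
        raise ValueError("index out of range")
    degrees = []
    for _ in range(length):
        index, degree = divmod(index, pitches)
        degrees.append(degree)
    return tuple(reversed(degrees))


def storage_bytes(bytes_per_note=1):
    """Uncompressed size of the full catalogue at `bytes_per_note` per note."""
    return melody_count() * MELODY_LENGTH * bytes_per_note


print(melody_count())            # 68719476736
print(storage_bytes() / 1e9)     # size in gigabytes (decimal)
print(melody_from_index(12345))  # one melody, as scale degrees
```

At one byte per note the whole set comes to about 825 GB, which is under the 1,000 GB of a 1 terabyte drive. Packing each note into 3 bits would shrink it further, but that is no longer the simplest form.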

Here is a TEDx talk by Damien Riehl describing the project, technology, and goal.