Audiocraft relies on what Meta calls the "EnCodec" neural audio codec, which processes audio in the same tokenized format used by familiar AI chatbots like ChatGPT or Bard. Judging from the samples Meta has shared so far, you can dictate the kind of tones you want and the sound sources, which can be anything from a musical instrument to a bird or a bus, to generate a sound clip from a text prompt.
Here is a sample text prompt: "Earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves." It produces a 30-second clip that actually doesn't sound half bad, as you can hear for yourself in Meta's blog post. As convenient as that sounds, you won't have the kind of granular control over the generated clips that you would have with a real instrument in your hands or a professional synth.
MusicGen, which Meta claims was "specifically tailored for music generation," was trained on roughly 400,000 recordings and accompanying metadata, amounting to 20,000 hours of music. Once again, though, the diversity of the training data is a problem, and Meta acknowledges as much. The training dataset is predominantly Western-style music, with the corresponding audio-text pairs written in English. Put simply, you'll have better luck generating a country-inspired tune than a Persian folk melody. Improving that diversity appears to be one of the key goals behind pushing the project into the open-source world.