---
name: audiocraft-audio-generation
description: PyTorch library for audio generation including text-to-music (MusicGen) and text-to-sound (AudioGen). Use when you need to generate music from text descriptions, create sound effects, or perform melody-conditioned music generation.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [audiocraft, torch>=2.0.0, transformers>=4.30.0]
metadata:
  hermes:
    tags: [Multimodal, Audio Generation, Text-to-Music, Text-to-Audio, MusicGen]
---

# AudioCraft: Audio Generation

Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.

## When to use AudioCraft

**Use AudioCraft when:**
- Need to generate music from text descriptions
- Creating sound effects and environmental audio
- Building music generation applications
- Need melody-conditioned music generation
- Want stereo audio output
- Require controllable music generation with style transfer

**Key features:**
- **MusicGen**: Text-to-music generation with melody conditioning
- **AudioGen**: Text-to-sound-effects generation
- **EnCodec**: High-fidelity neural audio codec
- **Multiple model sizes**: Small (300M) to Large (3.3B)
- **Stereo support**: Full stereo audio generation
- **Style conditioning**: MusicGen-Style for reference-based generation

**Use alternatives instead:**
- **Stable Audio**: For longer commercial music generation
- **Bark**: For text-to-speech with music/sound effects
- **Riffusion**: For spectrogram-based music generation
- **OpenAI Jukebox**: For raw audio generation with lyrics

## Quick start

### Installation

```bash
# From PyPI
pip install audiocraft

# From GitHub (latest)
pip install git+https://github.com/facebookresearch/audiocraft.git

# Or use HuggingFace Transformers
pip install transformers torch torchaudio
```

### Basic text-to-music (AudioCraft)

```python
import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,       # seconds
    top_k=250,
    temperature=1.0
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)
```

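Before saving, it can be worth checking the waveform's peak level: 16-bit WAV clips anything outside [-1, 1]. A minimal sketch of peak normalization (pure Python for illustration; `normalize_peak` is a name chosen here, and with a real tensor you would use `wav.abs().max()` and a tensor multiply instead):

```python
# Sketch: peak-normalize a waveform to a target peak before writing to disk,
# so out-of-range samples don't clip when encoded as 16-bit PCM.
def normalize_peak(samples, target=0.99):
    peak = max(abs(s) for s in samples)
    if peak == 0 or peak <= target:
        return list(samples)  # already in range; nothing to do
    scale = target / peak
    return [s * scale for s in samples]

loud = [0.5, -1.4, 1.2]  # contains out-of-range samples
safe = normalize_peak(loud)
print(round(max(abs(s) for s in safe), 6))  # 0.99
```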
### Using HuggingFace Transformers

```python
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt"
).to("cuda")

audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
```

### Text-to-sound with AudioGen

```python
import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)

# AudioGen outputs 16 kHz audio
torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)
```

## Core concepts

### Architecture overview

```
AudioCraft Architecture:
┌──────────────────────────────────────────────────────────────┐
│                      Text Encoder (T5)                       │
│                             │                                │
│                      Text Embeddings                         │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                  Transformer Decoder (LM)                    │
│          Auto-regressively generates audio tokens            │
│         using efficient token interleaving patterns          │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                   EnCodec Audio Decoder                      │
│           Converts tokens back to audio waveform             │
└──────────────────────────────────────────────────────────────┘
```

### Model variants

| Model | Size | Description | Use Case |
|-------|------|-------------|----------|
| `musicgen-small` | 300M | Text-to-music | Quick generation |
| `musicgen-medium` | 1.5B | Text-to-music | Balanced |
| `musicgen-large` | 3.3B | Text-to-music | Best quality |
| `musicgen-melody` | 1.5B | Text + melody | Melody conditioning |
| `musicgen-melody-large` | 3.3B | Text + melody | Best melody |
| `musicgen-stereo-*` | Varies | Stereo output | Stereo generation |
| `musicgen-style` | 1.5B | Style transfer | Reference-based |
| `audiogen-medium` | 1.5B | Text-to-sound | Sound effects |

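The checkpoint ids in the table follow a predictable naming pattern. A small sketch that assembles an id from that pattern (`musicgen_checkpoint` is our own helper, not an AudioCraft API; verify that a given checkpoint actually exists on the HuggingFace Hub before loading it):

```python
# Sketch: build a Hub checkpoint id from the naming convention in the table.
# This only mirrors the table's pattern -- it does not guarantee the
# checkpoint exists on the Hub.
def musicgen_checkpoint(size="medium", stereo=False, melody=False):
    parts = ["musicgen"]
    if stereo:
        parts.append("stereo")
    if melody:
        parts.append("melody")
        # the table lists melody models at 1.5B (default) and large only
        if size == "large":
            parts.append("large")
    else:
        parts.append(size)
    return "facebook/" + "-".join(parts)

print(musicgen_checkpoint("small"))                # facebook/musicgen-small
print(musicgen_checkpoint("medium", stereo=True))  # facebook/musicgen-stereo-medium
print(musicgen_checkpoint(melody=True))            # facebook/musicgen-melody
```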
### Generation parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `duration` | 8.0 | Length in seconds (1-120) |
| `top_k` | 250 | Top-k sampling |
| `top_p` | 0.0 | Nucleus sampling (0 = disabled) |
| `temperature` | 1.0 | Sampling temperature |
| `cfg_coef` | 3.0 | Classifier-free guidance |

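The `duration` parameter has no direct equivalent in the HuggingFace API, which counts `max_new_tokens` instead. MusicGen's 32 kHz EnCodec emits about 50 token frames per second, so the two relate roughly as follows (a sketch; the helper names are ours):

```python
# Sketch: relate clip duration to max_new_tokens for the HF MusicGen API.
# The 32 kHz EnCodec behind MusicGen emits ~50 token frames per second,
# which is why max_new_tokens=256 in the Transformers example yields ~5 s.
FRAME_RATE_HZ = 50

def duration_to_tokens(seconds):
    return int(seconds * FRAME_RATE_HZ)

def tokens_to_duration(tokens):
    return tokens / FRAME_RATE_HZ

print(duration_to_tokens(8))              # 400
print(round(tokens_to_duration(256), 2))  # 5.12
```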
## MusicGen usage

### Text-to-music generation

```python
from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,       # Up to 30 seconds
    top_k=250,         # Sampling diversity
    top_p=0.0,         # 0 = use top_k only
    temperature=1.0,   # Creativity (higher = more varied)
    cfg_coef=3.0       # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar"
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)
```

### Melody-conditioned generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)

torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)
```

### Stereo generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000]
torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
```

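A quick way to sanity-check outputs is to predict the tensor shape from the duration: MusicGen decodes at 32 kHz, and stereo variants emit two channels. A minimal sketch (`expected_shape` is a name chosen here):

```python
# Sketch: predict the shape of a generated MusicGen tensor so outputs can be
# sanity-checked. MusicGen decodes at 32 kHz; stereo models emit 2 channels.
SAMPLE_RATE = 32000

def expected_shape(batch, duration_s, stereo=False):
    channels = 2 if stereo else 1
    return (batch, channels, int(duration_s * SAMPLE_RATE))

print(expected_shape(1, 15, stereo=True))  # (1, 2, 480000) -- matches above
```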
### Audio continuation

```python
import torchaudio
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue
audio, sr = torchaudio.load("intro.wav")

# Process with text and audio
inputs = processor(
    audio=audio.squeeze().numpy(),
    sampling_rate=sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt"
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)
```

## MusicGen-Style usage

### Style-conditioned generation

```python
import torchaudio
from audiocraft.models import MusicGen

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0   # Style influence
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,           # RVQ quantizers (1-6)
    excerpt_length=3.0  # Style excerpt length in seconds
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style
descriptions = ["upbeat dance track"]
wav = model.generate_with_style(descriptions, style_audio, sr)
```

### Style-only generation (no text)

```python
# Generate matching the style reference without a text prompt
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None  # Disable double CFG for style-only generation
)

wav = model.generate_with_style([None], style_audio, sr)
```

## AudioGen usage

### Sound effect generation

```python
from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest"
]

wav = model.generate(descriptions)

for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)
```

## EnCodec usage

### Audio compression

```python
from audiocraft.models import CompressionModel
import torch
import torchaudio

# Load EnCodec
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# Ensure correct sample rate
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)

# Encode to tokens (expects [batch, channels, samples])
with torch.no_grad():
    encoded = model.encode(wav.unsqueeze(0))
    codes = encoded[0]  # Audio codes

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
```

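To get a feel for how aggressive this compression is, you can compare the token bitrate against raw PCM. The numbers below assume the configuration reported for MusicGen's 32 kHz EnCodec (50 Hz frame rate, 4 codebooks of 2048 entries); treat them as assumptions if your checkpoint differs:

```python
# Sketch: estimate EnCodec's compression ratio versus raw PCM. Assumes the
# configuration reported for MusicGen's 32 kHz EnCodec: 50 Hz frame rate,
# 4 codebooks of 2048 entries (11 bits per code).
import math

frame_rate_hz = 50
n_codebooks = 4
codebook_size = 2048
bits_per_code = int(math.log2(codebook_size))  # 11

token_bitrate = frame_rate_hz * n_codebooks * bits_per_code  # bits per second
pcm_bitrate = 32000 * 16                                     # 16-bit mono PCM

print(token_bitrate)                       # 2200
print(round(pcm_bitrate / token_bitrate))  # ~233x smaller than raw PCM
```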
## Common workflows

### Workflow 1: Music generation pipeline

```python
import torch
import torchaudio
from audiocraft.models import MusicGen

class MusicGenerator:
    def __init__(self, model_name="facebook/musicgen-medium"):
        self.model = MusicGen.get_pretrained(model_name)
        self.sample_rate = 32000

    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
        self.model.set_generation_params(
            duration=duration,
            top_k=250,
            temperature=temperature,
            cfg_coef=cfg
        )

        with torch.no_grad():
            wav = self.model.generate([prompt])

        return wav[0].cpu()

    def generate_batch(self, prompts, duration=30):
        self.model.set_generation_params(duration=duration)

        with torch.no_grad():
            wav = self.model.generate(prompts)

        return wav.cpu()

    def save(self, audio, path):
        torchaudio.save(path, audio, sample_rate=self.sample_rate)

# Usage
generator = MusicGenerator()
audio = generator.generate(
    "epic cinematic orchestral music",
    duration=30,
    temperature=1.0
)
generator.save(audio, "epic_music.wav")
```

### Workflow 2: Sound design batch processing

```python
from pathlib import Path

import torchaudio
from audiocraft.models import AudioGen

def batch_generate_sounds(sound_specs, output_dir):
    """
    Generate multiple sounds from specifications.

    Args:
        sound_specs: list of {"name": str, "description": str, "duration": float}
        output_dir: output directory path
    """
    model = AudioGen.get_pretrained('facebook/audiogen-medium')
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    results = []

    for spec in sound_specs:
        model.set_generation_params(duration=spec.get("duration", 5))

        wav = model.generate([spec["description"]])

        output_path = output_dir / f"{spec['name']}.wav"
        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)

        results.append({
            "name": spec["name"],
            "path": str(output_path),
            "description": spec["description"]
        })

    return results

# Usage
sounds = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
    {"name": "door", "description": "wooden door creaking and closing", "duration": 2}
]

results = batch_generate_sounds(sounds, "sound_effects/")
```

### Workflow 3: Gradio demo

```python
import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

def generate_music(prompt, duration, temperature, cfg_coef):
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        cfg_coef=cfg_coef
    )

    with torch.no_grad():
        wav = model.generate([prompt])

    # Save to temp file
    path = "temp_output.wav"
    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
    return path

demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
    ],
    outputs=gr.Audio(label="Generated Music"),
    title="MusicGen Demo"
)

demo.launch()
```

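One caveat with a demo like this: every request writes to the same `temp_output.wav`, so concurrent users can overwrite each other's output. A sketch of per-request paths using the stdlib `tempfile` module (Gradio accepts any returned file path):

```python
# Sketch: generate a unique output path per request so concurrent demo users
# don't overwrite each other's audio files.
import tempfile

def unique_wav_path():
    # delete=False keeps the file on disk for the UI to read after we close it
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.close()
    return tmp.name

a, b = unique_wav_path(), unique_wav_path()
print(a.endswith(".wav"), a != b)  # True True
```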
## Performance optimization

### Memory optimization

```python
import torch
from audiocraft.models import MusicGen

# Use a smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Generate shorter durations
model.set_generation_params(duration=10)  # Instead of 30

# Use half precision
model = model.half()

# Clear cache between generations
torch.cuda.empty_cache()
```

### Batch processing efficiency

```python
# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions)  # Single batch

# Instead of
for desc in descriptions:
    wav = model.generate([desc])  # One call per prompt (slower)
```

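When the prompt list is too large to fit in one batch, a middle ground is fixed-size chunks: one `model.generate` call per chunk rather than per prompt. A small pure-Python helper (`chunked` is a name chosen here):

```python
# Sketch: split a long prompt list into fixed-size batches. Each batch would
# then be passed to a single model.generate call -- far fewer calls than one
# per prompt, without exceeding GPU memory on a huge batch.
def chunked(items, batch_size):
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

prompts = [f"prompt{i}" for i in range(10)]
batches = chunked(prompts, 4)
print([len(batch) for batch in batches])  # [4, 4, 2]
```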
### GPU memory requirements

| Model | FP32 VRAM | FP16 VRAM |
|-------|-----------|-----------|
| musicgen-small | ~4GB | ~2GB |
| musicgen-medium | ~8GB | ~4GB |
| musicgen-large | ~16GB | ~8GB |

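The table can be turned into a simple model picker: choose the largest variant whose (approximate) requirement fits the VRAM you have free, leaving headroom for activations and longer durations. A sketch (`pick_model` and the dict are ours; the figures are the rough ones from the table):

```python
# Sketch: pick the largest MusicGen variant that fits in available VRAM,
# using the approximate requirements from the table above. The figures are
# rough; leave headroom for activations and longer generation durations.
REQUIREMENTS_GB = {  # (fp32, fp16) per the table
    "facebook/musicgen-small": (4, 2),
    "facebook/musicgen-medium": (8, 4),
    "facebook/musicgen-large": (16, 8),
}

def pick_model(free_vram_gb, fp16=False):
    idx = 1 if fp16 else 0
    fitting = [(req[idx], name) for name, req in REQUIREMENTS_GB.items()
               if req[idx] <= free_vram_gb]
    return max(fitting)[1] if fitting else None

print(pick_model(6))             # facebook/musicgen-small
print(pick_model(6, fp16=True))  # facebook/musicgen-medium
```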
## Common issues

| Issue | Solution |
|-------|----------|
| CUDA OOM | Use a smaller model, reduce duration |
| Poor quality | Increase `cfg_coef`, write more specific prompts |
| Generation too short | Check max duration setting |
| Audio artifacts | Try a different temperature |
| Stereo not working | Use a stereo model variant |

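For the "poor quality" row, the cheapest fix is often a more specific prompt: genre plus mood, instrumentation, and tempo. A tiny sketch of that convention (`build_prompt` is our own helper, not an AudioCraft API):

```python
# Sketch: assemble a specific MusicGen prompt from genre, mood,
# instrumentation, and tempo -- more detail usually means better output.
def build_prompt(genre, mood=None, instruments=(), bpm=None):
    parts = [p for p in (mood, genre) if p]
    if instruments:
        parts.append("with " + " and ".join(instruments))
    if bpm:
        parts.append(f"at {bpm} bpm")
    return " ".join(parts)

print(build_prompt("lo-fi hip hop", mood="chill",
                   instruments=["jazzy piano", "vinyl crackle"], bpm=80))
# chill lo-fi hip hop with jazzy piano and vinyl crackle at 80 bpm
```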
## References

- **[Advanced Usage](references/advanced-usage.md)** - Training, fine-tuning, deployment
- **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions

## Resources

- **GitHub**: https://github.com/facebookresearch/audiocraft
- **Paper (MusicGen)**: https://arxiv.org/abs/2306.05284
- **Paper (AudioGen)**: https://arxiv.org/abs/2209.15352
- **HuggingFace**: https://huggingface.co/facebook/musicgen-small
- **Demo**: https://huggingface.co/spaces/facebook/MusicGen