Sync all skills and memories 2026-04-14 07:27
170
skills/media/heartmula/SKILL.md
Normal file
@@ -0,0 +1,170 @@
---
name: heartmula
description: Set up and run HeartMuLa, the open-source music generation model family (Suno-like). Generates full songs from lyrics + tags with multilingual support.
version: 1.0.0
metadata:
  hermes:
    tags: [music, audio, generation, ai, heartmula, heartcodec, lyrics, songs]
    related_skills: [audiocraft]
---

# HeartMuLa - Open-Source Music Generation

## Overview
HeartMuLa is a family of open-source music foundation models (Apache-2.0) that generate music conditioned on lyrics and tags, making it an open-source counterpart to Suno. The family includes:
- **HeartMuLa** - Music language model (3B/7B) for generation from lyrics + tags
- **HeartCodec** - 12.5 Hz music codec for high-fidelity audio reconstruction
- **HeartTranscriptor** - Whisper-based lyrics transcription
- **HeartCLAP** - Audio-text alignment model

## When to Use
- User wants to generate music/songs from text descriptions
- User wants an open-source Suno alternative
- User wants local/offline music generation
- User asks about HeartMuLa, heartlib, or AI music generation

## Hardware Requirements
- **Minimum**: 8GB VRAM with `--lazy_load true` (loads/unloads models sequentially)
- **Recommended**: 16GB+ VRAM for comfortable single-GPU usage
- **Multi-GPU**: Use `--mula_device cuda:0 --codec_device cuda:1` to split across GPUs
- 3B model with lazy_load peaks at ~6.2GB VRAM
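
The thresholds in the list above can be turned into a small decision helper. The sketch below is illustrative only: `suggest_device_flags` is a hypothetical name, and the 8GB/16GB cut-offs are taken from the list as a rough heuristic, not from any logic in heartlib itself.

```python
def suggest_device_flags(vram_gb_per_gpu: float, num_gpus: int = 1) -> list[str]:
    """Map available VRAM to the CLI flags named in the hardware list.

    Hypothetical helper: the thresholds mirror the documented guidance
    (8GB minimum, 16GB+ recommended) and are assumptions, not heartlib logic.
    """
    if num_gpus < 1 or vram_gb_per_gpu < 8:
        raise ValueError("below the documented 8GB VRAM minimum")
    if num_gpus >= 2:
        # Split the language model and the codec across two GPUs.
        return ["--mula_device", "cuda:0", "--codec_device", "cuda:1"]
    if vram_gb_per_gpu < 16:
        # Single small GPU: load and unload models sequentially.
        return ["--lazy_load", "true"]
    # 16GB+ on one GPU: the defaults are fine.
    return []


print(suggest_device_flags(8))
print(suggest_device_flags(12, num_gpus=2))
```

The returned list can be appended to the generation command's argument list.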

## Installation Steps

### 1. Clone Repository
```bash
cd ~/  # or desired directory
git clone https://github.com/HeartMuLa/heartlib.git
cd heartlib
```

### 2. Create Virtual Environment (Python 3.10 required)
```bash
uv venv --python 3.10 .venv
. .venv/bin/activate
uv pip install -e .
```

### 3. Fix Dependency Compatibility Issues

**IMPORTANT**: As of Feb 2026, the pinned dependencies have conflicts with newer packages. Apply these fixes:

```bash
# Upgrade datasets (old version incompatible with current pyarrow)
uv pip install --upgrade datasets

# Upgrade transformers (needed for huggingface-hub 1.x compatibility)
uv pip install --upgrade transformers
```
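
After upgrading, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the package list is an assumption based on the packages named in this document, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages mentioned in this document (assumed distribution names).
PACKAGES = ["datasets", "transformers", "pyarrow", "huggingface-hub", "torch", "torchtune"]


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:16} {ver or 'not installed'}")
```

Run it inside the activated `.venv` so that it reports the environment heartlib will use.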

### 4. Patch Source Code (Required for transformers 5.x)

**Patch 1 - RoPE cache fix** in `src/heartlib/heartmula/modeling_heartmula.py`:

In the `setup_caches` method of the `HeartMuLa` class, add RoPE reinitialization after the `reset_caches` try/except block and before the `with device:` block:

```python
# Re-initialize RoPE caches that were skipped during meta-device loading
from torchtune.models.llama3_1._position_embeddings import Llama3ScaledRoPE
for module in self.modules():
    if isinstance(module, Llama3ScaledRoPE) and not module.is_cache_built:
        module.rope_init()
        module.to(device)
```

**Why**: `from_pretrained` creates the model on the meta device first. `Llama3ScaledRoPE.rope_init()` skips cache building on meta tensors, and the cache is never rebuilt after the weights are loaded onto the real device.

**Patch 2 - HeartCodec loading fix** in `src/heartlib/pipelines/music_generation.py`:

Add `ignore_mismatched_sizes=True` to ALL `HeartCodec.from_pretrained()` calls (there are 2: the eager load in `__init__` and the lazy load in the `codec` property).

**Why**: The VQ codebook `initted` buffers have shape `[1]` in the checkpoint but `[]` in the model. Both hold the same single value (a one-element 1-D tensor versus a 0-d scalar tensor), so the mismatch is safe to ignore.
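
Patch 2 can be applied by hand, but a scripted version makes it repeatable after a fresh clone. The sketch below is a hypothetical text-rewriting helper (`add_ignore_mismatched_sizes` is not part of heartlib): it appends the keyword argument to every `HeartCodec.from_pretrained(...)` call that lacks it. It matches parentheses naively, so it assumes no unbalanced parentheses inside string literals in the call arguments, and it may collapse the whitespace before the closing parenthesis.

```python
CALL = "HeartCodec.from_pretrained("
KWARG = "ignore_mismatched_sizes=True"


def add_ignore_mismatched_sizes(source: str) -> str:
    """Append KWARG to every HeartCodec.from_pretrained(...) call that lacks it."""
    pieces = []
    pos = 0
    while True:
        start = source.find(CALL, pos)
        if start == -1:
            pieces.append(source[pos:])
            return "".join(pieces)
        args_start = start + len(CALL)
        # Scan forward to the parenthesis that closes this call.
        depth, end = 1, args_start
        while end < len(source) and depth > 0:
            if source[end] == "(":
                depth += 1
            elif source[end] == ")":
                depth -= 1
            end += 1
        if depth != 0:
            # Unbalanced parentheses: leave the rest of the text untouched.
            pieces.append(source[pos:])
            return "".join(pieces)
        args = source[args_start:end - 1]
        if "ignore_mismatched_sizes" not in args:
            stripped = args.rstrip()
            if not stripped:
                args = KWARG
            elif stripped.endswith(","):
                args = f"{stripped} {KWARG}"
            else:
                args = f"{stripped}, {KWARG}"
        pieces.append(source[pos:args_start] + args + ")")
        pos = end


print(add_ignore_mismatched_sizes("codec = HeartCodec.from_pretrained(path)"))
```

To use it, read `src/heartlib/pipelines/music_generation.py`, pass its text through the function, and write the result back. The function is idempotent, so re-running it on an already patched file changes nothing.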

### 5. Download Model Checkpoints
```bash
cd heartlib  # project root
hf download --local-dir './ckpt' 'HeartMuLa/HeartMuLaGen'
hf download --local-dir './ckpt/HeartMuLa-oss-3B' 'HeartMuLa/HeartMuLa-oss-3B-happy-new-year'
hf download --local-dir './ckpt/HeartCodec-oss' 'HeartMuLa/HeartCodec-oss-20260123'
```

All 3 can be downloaded in parallel. Total size is several GB.
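
One way to run the three downloads concurrently is to launch the same `hf download` commands from a thread pool. This is a sketch under the assumption that the `hf` CLI is on `PATH` and that the working directory is the heartlib project root; `build_command` and `download_all` are hypothetical helper names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# (local directory, Hugging Face repo id) pairs, copied from the commands above.
DOWNLOADS = [
    ("./ckpt", "HeartMuLa/HeartMuLaGen"),
    ("./ckpt/HeartMuLa-oss-3B", "HeartMuLa/HeartMuLa-oss-3B-happy-new-year"),
    ("./ckpt/HeartCodec-oss", "HeartMuLa/HeartCodec-oss-20260123"),
]


def build_command(local_dir, repo_id):
    """Build the argv list for one `hf download` invocation."""
    return ["hf", "download", "--local-dir", local_dir, repo_id]


def download_all(downloads=DOWNLOADS):
    """Run all downloads concurrently and return their exit codes in order."""
    commands = [build_command(local_dir, repo_id) for local_dir, repo_id in downloads]
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        results = list(pool.map(lambda cmd: subprocess.run(cmd, check=False), commands))
    return [result.returncode for result in results]


# Call download_all() from the project root; an exit code of 0 means success.
```

Threads are sufficient here because each worker only waits on an external process.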

## GPU / CUDA

HeartMuLa uses CUDA by default (`--mula_device cuda --codec_device cuda`). No extra setup is needed if the user has an NVIDIA GPU with PyTorch CUDA support installed.

- The installed `torch==2.4.1` includes CUDA 12.1 support out of the box
- `torchtune` may report version `0.4.0+cpu`; this is just package metadata, and it still uses CUDA via PyTorch
- To verify the GPU is being used, look for "CUDA memory" lines in the output (e.g. "CUDA memory before unloading: 6.20 GB")
- **No GPU?** You can run on CPU with `--mula_device cpu --codec_device cpu`, but expect generation to be **extremely slow** (potentially 30-60+ minutes for a single song vs ~4 minutes on GPU). CPU mode also requires significant RAM (~12GB+ free). If the user has no NVIDIA GPU, recommend a cloud GPU service (Google Colab free tier with T4, Lambda Labs, etc.) or the online demo at https://heartmula.github.io/ instead.
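
The "CUDA memory" check can be automated when the script output is captured to a string. The sketch below assumes the log lines follow the shape of the single example quoted above (`CUDA memory <label>: <number> GB`); the exact format of other lines is not confirmed, and `find_cuda_memory` is a hypothetical helper.

```python
import re

# Pattern inferred from the one example line; other log shapes are an assumption.
CUDA_MEMORY_RE = re.compile(
    r"CUDA memory\s+(?P<label>[^:\n]+):\s*(?P<gb>\d+(?:\.\d+)?)\s*GB"
)


def find_cuda_memory(output: str):
    """Return (label, gigabytes) pairs for each 'CUDA memory' line found."""
    return [
        (match["label"].strip(), float(match["gb"]))
        for match in CUDA_MEMORY_RE.finditer(output)
    ]


sample = "loading model\nCUDA memory before unloading: 6.20 GB\ndone"
print(find_cuda_memory(sample))
```

An empty result suggests that the run did not report any CUDA usage and may have fallen back to CPU.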

## Usage

### Basic Generation
```bash
cd heartlib
. .venv/bin/activate
python ./examples/run_music_generation.py \
  --model_path=./ckpt \
  --version="3B" \
  --lyrics="./assets/lyrics.txt" \
  --tags="./assets/tags.txt" \
  --save_path="./assets/output.mp3" \
  --lazy_load true
```

### Input Formatting

**Tags** (comma-separated, no spaces):
```
piano,happy,wedding,synthesizer,romantic
```
or
```
rock,energetic,guitar,drums,male-vocal
```

**Lyrics** (use bracketed structural tags):
```
[Intro]

[Verse]
Your lyrics here...

[Chorus]
Chorus lyrics...

[Bridge]
Bridge lyrics...

[Outro]
```
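
The two input formats above are simple enough to generate programmatically. The following sketch uses hypothetical helpers (`format_tags`, `format_lyrics`) that produce text in the same shape as the examples. It only trims whitespace around each tag and does not rewrite spaces inside a tag, since how the model treats those is not stated here.

```python
def format_tags(tags):
    """Join tags with commas and no surrounding spaces, dropping empty entries."""
    cleaned = [tag.strip() for tag in tags if tag.strip()]
    return ",".join(cleaned)


def format_lyrics(sections):
    """Render (section name, text) pairs as bracketed blocks separated by blank lines."""
    blocks = []
    for name, text in sections:
        header = f"[{name}]"
        body = text.strip()
        # Sections such as [Intro] may have no lyrics at all.
        blocks.append(f"{header}\n{body}" if body else header)
    return "\n\n".join(blocks) + "\n"


print(format_tags(["piano", " happy", "wedding "]))
print(format_lyrics([("Intro", ""), ("Verse", "Your lyrics here..."), ("Outro", "")]))
```

The resulting strings can be written to the files passed via `--tags` and `--lyrics`.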

### Key Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--max_audio_length_ms` | 240000 | Max length in ms (240s = 4 min) |
| `--topk` | 50 | Top-k sampling |
| `--temperature` | 1.0 | Sampling temperature |
| `--cfg_scale` | 1.5 | Classifier-free guidance scale |
| `--lazy_load` | false | Load/unload models on demand (saves VRAM) |
| `--mula_dtype` | bfloat16 | Dtype for HeartMuLa (bf16 recommended) |
| `--codec_dtype` | float32 | Dtype for HeartCodec (fp32 recommended for quality) |
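
When driving the script from Python, the table can be encoded as a defaults dictionary so that only changed parameters are passed on the command line. This is a sketch: `build_generation_command` is a hypothetical wrapper, and the defaults are copied from the table, not read from the script itself.

```python
# Defaults copied from the Key Parameters table.
DEFAULTS = {
    "max_audio_length_ms": 240000,
    "topk": 50,
    "temperature": 1.0,
    "cfg_scale": 1.5,
    "lazy_load": "false",
    "mula_dtype": "bfloat16",
    "codec_dtype": "float32",
}


def build_generation_command(lyrics, tags, save_path,
                             model_path="./ckpt", version="3B", **overrides):
    """Build an argv list for examples/run_music_generation.py."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    cmd = [
        "python", "./examples/run_music_generation.py",
        f"--model_path={model_path}", f"--version={version}",
        f"--lyrics={lyrics}", f"--tags={tags}", f"--save_path={save_path}",
    ]
    for name, default in DEFAULTS.items():
        value = overrides.get(name, default)
        if isinstance(value, bool):
            # The CLI examples spell booleans as lowercase words.
            value = "true" if value else "false"
        if value != default:
            cmd += [f"--{name}", str(value)]
    return cmd


print(build_generation_command("./assets/lyrics.txt", "./assets/tags.txt",
                               "./assets/output.mp3", lazy_load=True))
```

The list can be passed directly to `subprocess.run` from the project root with the virtual environment active.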

### Performance
- RTF (Real-Time Factor) ≈ 1.0: a 4-minute song takes ~4 minutes to generate
- Output: MP3, 48 kHz stereo, 128 kbps
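
The two figures above make rough planning straightforward: generation time is the audio duration multiplied by the RTF, and the MP3 size follows from the bitrate. The sketch below works through that arithmetic, assuming a constant bitrate and ignoring container and metadata overhead; both function names are hypothetical.

```python
def estimate_generation_seconds(audio_ms, rtf=1.0):
    """Generation time = audio duration x real-time factor."""
    return audio_ms / 1000 * rtf


def estimate_mp3_bytes(audio_ms, bitrate_kbps=128):
    """Approximate file size at a constant bitrate (kilobits per second)."""
    return int(audio_ms / 1000 * bitrate_kbps * 1000 / 8)


# Default maximum length from the parameter table: 240000 ms (4 minutes).
print(estimate_generation_seconds(240000))  # 240.0 seconds
print(estimate_mp3_bytes(240000))           # 3840000 bytes, about 3.84 MB
```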

## Pitfalls
1. **Do NOT use bf16 for HeartCodec**: it degrades audio quality. Use fp32 (default).
2. **Tags may be ignored**: known issue (#90). Lyrics tend to dominate; experiment with tag ordering.
3. **Triton is not available on macOS**: GPU acceleration is Linux/CUDA only.
4. **RTX 5080 incompatibility** has been reported in upstream issues.
5. The dependency pin conflicts require the manual upgrades and patches described above.

## Links
- Repo: https://github.com/HeartMuLa/heartlib
- Models: https://huggingface.co/HeartMuLa
- Paper: https://arxiv.org/abs/2601.10547
- License: Apache-2.0