Sync all skills and memories 2026-04-14 07:27

2026-04-14 07:27:20 +09:00
parent 516bb44fe6
commit 1eba2bca95
386 changed files with 167655 additions and 0 deletions
--- a/skills/media/DESCRIPTION.md
+++ b/skills/media/DESCRIPTION.md
@@ -0,0 +1,3 @@
+---
+description: Skills for working with media content — YouTube transcripts, GIF search, music generation, and audio visualization.
+---
--- a/skills/media/gif-search/SKILL.md
+++ b/skills/media/gif-search/SKILL.md
@@ -0,0 +1,86 @@
+---
+name: gif-search
+description: Search and download GIFs from Tenor using curl. No dependencies beyond curl and jq. Useful for finding reaction GIFs, creating visual content, and sending GIFs in chat.
+version: 1.1.0
+author: Hermes Agent
+license: MIT
+prerequisites:
+  env_vars: [TENOR_API_KEY]
+  commands: [curl, jq]
+metadata:
+  hermes:
+    tags: [GIF, Media, Search, Tenor, API]
+---
+
+# GIF Search (Tenor API)
+
+Search and download GIFs directly via the Tenor API using `curl` and `jq`. No other tools needed.
+
+## Setup
+
+Set your Tenor API key in your environment (add to `~/.hermes/.env`):
+
+```bash
+TENOR_API_KEY=your_key_here
+```
+
+Get an API key at https://developers.google.com/tenor/guides/quickstart. Keys created through the Google Cloud Console are free and have generous rate limits.
+
+## Prerequisites
+
+- `curl` and `jq` (`curl` ships with macOS and most Linux distributions; `jq` may need to be installed separately)
+- `TENOR_API_KEY` environment variable
+
+## Search for GIFs
+
+```bash
+# Search and get GIF URLs
+curl -s "https://tenor.googleapis.com/v2/search?q=thumbs+up&limit=5&key=${TENOR_API_KEY}" | jq -r '.results[].media_formats.gif.url'
+
+# Get smaller/preview versions
+curl -s "https://tenor.googleapis.com/v2/search?q=nice+work&limit=3&key=${TENOR_API_KEY}" | jq -r '.results[].media_formats.tinygif.url'
+```
+
+## Download a GIF
+
+```bash
+# Search and download the top result
+URL=$(curl -s "https://tenor.googleapis.com/v2/search?q=celebration&limit=1&key=${TENOR_API_KEY}" | jq -r '.results[0].media_formats.gif.url')
+curl -sL "$URL" -o celebration.gif
+```
+
+## Get Full Metadata
+
+```bash
+curl -s "https://tenor.googleapis.com/v2/search?q=cat&limit=3&key=${TENOR_API_KEY}" | jq '.results[] | {title: .title, url: .media_formats.gif.url, preview: .media_formats.tinygif.url, dimensions: .media_formats.gif.dims}'
+```
+
+## API Parameters
+
+| Parameter | Description |
+|-----------|-------------|
+| `q` | Search query (URL-encode spaces as `+`) |
+| `limit` | Max results (1-50, default 20) |
+| `key` | API key (from `$TENOR_API_KEY` env var) |
+| `media_filter` | Filter formats: `gif`, `tinygif`, `mp4`, `tinymp4`, `webm` |
+| `contentfilter` | Safety: `off`, `low`, `medium`, `high` |
+| `locale` | Language: `en_US`, `es`, `fr`, etc. |
+
+## Available Media Formats
+
+Each result has multiple formats under `.media_formats`:
+
+| Format | Use case |
+|--------|----------|
+| `gif` | Full quality GIF |
+| `tinygif` | Small preview GIF |
+| `mp4` | Video version (smaller file size) |
+| `tinymp4` | Small preview video |
+| `webm` | WebM video |
+| `nanogif` | Tiny thumbnail |
+
+## Notes
+
+- URL-encode the query: spaces as `+`, special chars as `%XX`
+- For sending in chat, `tinygif` URLs are lighter weight
+- GIF URLs can be used directly in markdown: `![alt](url)`
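The query-encoding rule in the Notes of `gif-search/SKILL.md` (spaces as `+`, special characters as `%XX`) can also be handled programmatically. The following is a minimal Python sketch, not part of the committed skill: `build_search_url` and `extract_urls` are hypothetical helper names, while the endpoint and the `.results[].media_formats.<format>.url` layout are taken from the curl examples in that file.

```python
from urllib.parse import urlencode

# Endpoint used by the curl examples in the skill.
TENOR_SEARCH = "https://tenor.googleapis.com/v2/search"


def build_search_url(query, api_key, limit=5):
    """Return a Tenor v2 search URL with the query safely encoded (hypothetical helper)."""
    # urlencode() applies quote_plus by default: spaces become "+",
    # reserved characters such as "&" become %XX escapes.
    params = {"q": query, "limit": limit, "key": api_key}
    return f"{TENOR_SEARCH}?{urlencode(params)}"


def extract_urls(payload, fmt="gif"):
    """Collect one URL per result for the given media format, skipping gaps."""
    urls = []
    for result in payload.get("results", []):
        media = result.get("media_formats", {}).get(fmt)
        if media and "url" in media:
            urls.append(media["url"])
    return urls


if __name__ == "__main__":
    # "YOUR_KEY" is a placeholder, not a real credential.
    print(build_search_url("rock & roll", "YOUR_KEY", limit=3))
```

Building the URL from a parameter dictionary avoids hand-encoding mistakes when the query comes from user input.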
--- a/skills/media/heartmula/SKILL.md
+++ b/skills/media/heartmula/SKILL.md
@@ -0,0 +1,170 @@
+---
+name: heartmula
+description: Set up and run HeartMuLa, the open-source music generation model family (Suno-like). Generates full songs from lyrics + tags with multilingual support.
+version: 1.0.0
+metadata:
+  hermes:
+    tags: [music, audio, generation, ai, heartmula, heartcodec, lyrics, songs]
+    related_skills: [audiocraft]
+---
+
+# HeartMuLa - Open-Source Music Generation
+
+## Overview
+HeartMuLa is a family of open-source music foundation models (Apache-2.0) that generate music conditioned on lyrics and tags, positioned as an open-source counterpart to Suno. It includes:
+- **HeartMuLa** - Music language model (3B/7B) for generation from lyrics + tags
+- **HeartCodec** - 12.5Hz music codec for high-fidelity audio reconstruction
+- **HeartTranscriptor** - Whisper-based lyrics transcription
+- **HeartCLAP** - Audio-text alignment model
+
+## When to Use
+- User wants to generate music/songs from text descriptions
+- User wants an open-source Suno alternative
+- User wants local/offline music generation
+- User asks about HeartMuLa, heartlib, or AI music generation
+
+## Hardware Requirements
+- **Minimum**: 8GB VRAM with `--lazy_load true` (loads/unloads models sequentially)
+- **Recommended**: 16GB+ VRAM for comfortable single-GPU usage
+- **Multi-GPU**: Use `--mula_device cuda:0 --codec_device cuda:1` to split across GPUs
+- 3B model with lazy_load peaks at ~6.2GB VRAM
+
+## Installation Steps
+
+### 1. Clone Repository
+```bash
+cd ~/  # or desired directory
+git clone https://github.com/HeartMuLa/heartlib.git
+cd heartlib
+```
+
+### 2. Create Virtual Environment (Python 3.10 required)
+```bash
+uv venv --python 3.10 .venv
+. .venv/bin/activate
+uv pip install -e .
+```
+
+### 3. Fix Dependency Compatibility Issues
+
+**IMPORTANT**: As of Feb 2026, the pinned dependencies have conflicts with newer packages. Apply these fixes:
+
+```bash
+# Upgrade datasets (old version incompatible with current pyarrow)
+uv pip install --upgrade datasets
+
+# Upgrade transformers (needed for huggingface-hub 1.x compatibility)
+uv pip install --upgrade transformers
+```
+
+### 4. Patch Source Code (Required for transformers 5.x)
+
+**Patch 1 - RoPE cache fix** in `src/heartlib/heartmula/modeling_heartmula.py`:
+
+In the `setup_caches` method of the `HeartMuLa` class, add RoPE reinitialization after the `reset_caches` try/except block and before the `with device:` block:
+
+```python
+# Re-initialize RoPE caches that were skipped during meta-device loading
+from torchtune.models.llama3_1._position_embeddings import Llama3ScaledRoPE
+for module in self.modules():
+    if isinstance(module, Llama3ScaledRoPE) and not module.is_cache_built:
+        module.rope_init()
+        module.to(device)
+```
+
+**Why**: `from_pretrained` creates the model on the meta device first; `Llama3ScaledRoPE.rope_init()` skips cache building on meta tensors, and the cache is never rebuilt after the weights are loaded onto the real device.
+
+**Patch 2 - HeartCodec loading fix** in `src/heartlib/pipelines/music_generation.py`:
+
+Add `ignore_mismatched_sizes=True` to ALL `HeartCodec.from_pretrained()` calls (there are 2: the eager load in `__init__` and the lazy load in the `codec` property).
+
+**Why**: VQ codebook `initted` buffers have shape `[1]` in the checkpoint vs `[]` in the model. Same data, just a 1-element tensor vs a 0-d scalar tensor. Safe to ignore.
+
+### 5. Download Model Checkpoints
+```bash
+cd heartlib  # project root
+hf download --local-dir './ckpt' 'HeartMuLa/HeartMuLaGen'
+hf download --local-dir './ckpt/HeartMuLa-oss-3B' 'HeartMuLa/HeartMuLa-oss-3B-happy-new-year'
+hf download --local-dir './ckpt/HeartCodec-oss' 'HeartMuLa/HeartCodec-oss-20260123'
+```
+
+All 3 can be downloaded in parallel. Total size is several GB.
+
+## GPU / CUDA
+
+HeartMuLa uses CUDA by default (`--mula_device cuda --codec_device cuda`). No extra setup needed if the user has an NVIDIA GPU with PyTorch CUDA support installed.
+
+- The installed `torch==2.4.1` includes CUDA 12.1 support out of the box
+- `torchtune` may report version `0.4.0+cpu` — this is just package metadata, it still uses CUDA via PyTorch
+- To verify GPU is being used, look for "CUDA memory" lines in the output (e.g. "CUDA memory before unloading: 6.20 GB")
+- **No GPU?** You can run on CPU with `--mula_device cpu --codec_device cpu`, but expect generation to be **extremely slow** (potentially 30-60+ minutes for a single song vs ~4 minutes on GPU). CPU mode also requires significant RAM (~12GB+ free). If the user has no NVIDIA GPU, recommend using a cloud GPU service (Google Colab free tier with T4, Lambda Labs, etc.) or the online demo at https://heartmula.github.io/ instead.
+
+## Usage
+
+### Basic Generation
+```bash
+cd heartlib
+. .venv/bin/activate
+python ./examples/run_music_generation.py \
+  --model_path=./ckpt \
+  --version="3B" \
+  --lyrics="./assets/lyrics.txt" \
+  --tags="./assets/tags.txt" \
+  --save_path="./assets/output.mp3" \
+  --lazy_load true
+```
+
+### Input Formatting
+
+**Tags** (comma-separated, no spaces):
+```
+piano,happy,wedding,synthesizer,romantic
+```
+or
+```
+rock,energetic,guitar,drums,male-vocal
+```
+
+**Lyrics** (use bracketed structural tags):
+```
+[Intro]
+
+[Verse]
+Your lyrics here...
+
+[Chorus]
+Chorus lyrics...
+
+[Bridge]
+Bridge lyrics...
+
+[Outro]
+```
+
+### Key Parameters
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `--max_audio_length_ms` | 240000 | Max length in ms (240s = 4 min) |
+| `--topk` | 50 | Top-k sampling |
+| `--temperature` | 1.0 | Sampling temperature |
+| `--cfg_scale` | 1.5 | Classifier-free guidance scale |
+| `--lazy_load` | false | Load/unload models on demand (saves VRAM) |
+| `--mula_dtype` | bfloat16 | Dtype for HeartMuLa (bf16 recommended) |
+| `--codec_dtype` | float32 | Dtype for HeartCodec (fp32 recommended for quality) |
+
+### Performance
+- RTF (Real-Time Factor) ≈ 1.0 — a 4-minute song takes ~4 minutes to generate
+- Output: MP3, 48kHz stereo, 128kbps
+
+## Pitfalls
+1. **Do NOT use bf16 for HeartCodec** — degrades audio quality. Use fp32 (default).
+2. **Tags may be ignored** — known issue (#90). Lyrics tend to dominate; experiment with tag ordering.
+3. **Triton not available on macOS** — Linux/CUDA only for GPU acceleration.
+4. **RTX 5080 incompatibility** reported in upstream issues.
+5. The dependency pin conflicts require the manual upgrades and patches described above.
+
+## Links
+- Repo: https://github.com/HeartMuLa/heartlib
+- Models: https://huggingface.co/HeartMuLa
+- Paper: https://arxiv.org/abs/2601.10547
+- License: Apache-2.0
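The input formats described in `heartmula/SKILL.md` (comma-separated tags without spaces, lyrics with bracketed structural tags) are easy to get subtly wrong by hand. A minimal Python sketch for producing both strings follows; it is not part of heartlib, all three function names are hypothetical, and replacing spaces inside a tag with hyphens is an assumption modeled on the `male-vocal` example. The time estimate assumes the RTF stays near the ≈ 1.0 figure quoted in the Performance section.

```python
def format_tags(tags):
    """Join tags with commas and no spaces, as in the Tags example."""
    # Assumption: multi-word tags are hyphenated, like "male-vocal".
    cleaned = [t.strip().replace(" ", "-") for t in tags if t.strip()]
    return ",".join(cleaned)


def format_lyrics(sections):
    """Render (section_name, text) pairs with bracketed structural tags."""
    blocks = []
    for name, text in sections:
        header = f"[{name}]"
        body = text.strip()
        # Sections such as [Intro] or [Outro] may have no lyric lines.
        blocks.append(f"{header}\n{body}" if body else header)
    return "\n\n".join(blocks) + "\n"


def estimated_minutes(max_audio_length_ms, rtf=1.0):
    """Rough generation time in minutes: audio length times real-time factor."""
    return max_audio_length_ms / 1000 / 60 * rtf


if __name__ == "__main__":
    print(format_tags(["rock", "energetic", "male vocal"]))
    print(format_lyrics([("Intro", ""), ("Verse", "Your lyrics here...")]), end="")
    print(estimated_minutes(240000))
```

The two strings can then be written to the files passed via `--tags` and `--lyrics`.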
--- a/skills/media/songsee/SKILL.md
+++ b/skills/media/songsee/SKILL.md
@@ -0,0 +1,82 @@
+---
+name: songsee
+description: Generate spectrograms and audio feature visualizations (mel, chroma, MFCC, tempogram, etc.) from audio files via CLI. Useful for audio analysis, music production debugging, and visual documentation.
+version: 1.0.0
+author: community
+license: MIT
+metadata:
+  hermes:
+    tags: [Audio, Visualization, Spectrogram, Music, Analysis]
+    homepage: https://github.com/steipete/songsee
+prerequisites:
+  commands: [songsee]
+---
+
+# songsee
+
+Generate spectrograms and multi-panel audio feature visualizations from audio files.
+
+## Prerequisites
+
+Requires [Go](https://go.dev/doc/install):
+```bash
+go install github.com/steipete/songsee/cmd/songsee@latest
+```
+
+Optional: `ffmpeg` for formats beyond WAV/MP3.
+
+## Quick Start
+
+```bash
+# Basic spectrogram
+songsee track.mp3
+
+# Save to specific file
+songsee track.mp3 -o spectrogram.png
+
+# Multi-panel visualization grid
+songsee track.mp3 --viz spectrogram,mel,chroma,hpss,selfsim,loudness,tempogram,mfcc,flux
+
+# Time slice (start at 12.5s, 8s duration)
+songsee track.mp3 --start 12.5 --duration 8 -o slice.jpg
+
+# From stdin
+cat track.mp3 | songsee - --format png -o out.png
+```
+
+## Visualization Types
+
+Use `--viz` with comma-separated values:
+
+| Type | Description |
+|------|-------------|
+| `spectrogram` | Standard frequency spectrogram |
+| `mel` | Mel-scaled spectrogram |
+| `chroma` | Pitch class distribution |
+| `hpss` | Harmonic/percussive separation |
+| `selfsim` | Self-similarity matrix |
+| `loudness` | Loudness over time |
+| `tempogram` | Tempo estimation |
+| `mfcc` | Mel-frequency cepstral coefficients |
+| `flux` | Spectral flux (onset detection) |
+
+Multiple `--viz` types render as a grid in a single image.
+
+## Common Flags
+
+| Flag | Description |
+|------|-------------|
+| `--viz` | Visualization types (comma-separated) |
+| `--style` | Color palette: `classic`, `magma`, `inferno`, `viridis`, `gray` |
+| `--width` / `--height` | Output image dimensions |
+| `--window` / `--hop` | FFT window and hop size |
+| `--min-freq` / `--max-freq` | Frequency range filter |
+| `--start` / `--duration` | Time slice of the audio |
+| `--format` | Output format: `jpg` or `png` |
+| `-o` | Output file path |
+
+## Notes
+
+- WAV and MP3 are decoded natively; other formats require `ffmpeg`
+- Output images can be inspected with `vision_analyze` for automated audio analysis
+- Useful for comparing audio outputs, debugging synthesis, or documenting audio processing pipelines
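When `songsee` is driven from a script rather than typed by hand, building the argument list explicitly avoids shell-quoting problems. The sketch below is an illustration only: `songsee_command` is a hypothetical helper, and it uses just the flags listed in the Common Flags table of `songsee/SKILL.md`.

```python
import shlex


def songsee_command(audio, viz=None, start=None, duration=None, fmt=None, output=None):
    """Build a songsee argument list from the flags documented in the skill."""
    args = ["songsee", audio]
    if viz:
        # Multiple visualization types are passed as one comma-separated value.
        args += ["--viz", ",".join(viz)]
    if start is not None:
        args += ["--start", str(start)]
    if duration is not None:
        args += ["--duration", str(duration)]
    if fmt:
        args += ["--format", fmt]
    if output:
        args += ["-o", output]
    return args


if __name__ == "__main__":
    cmd = songsee_command("track.mp3", viz=["mel", "chroma"],
                          start=12.5, duration=8, output="slice.jpg")
    # Print a shell-safe rendering; pass `cmd` to subprocess.run() to execute it.
    print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once `songsee` is on the `PATH`.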
--- a/skills/media/youtube-content/SKILL.md
+++ b/skills/media/youtube-content/SKILL.md
@@ -0,0 +1,72 @@
+---
+name: youtube-content
+description: >
+  Fetch YouTube video transcripts and transform them into structured content
+  (chapters, summaries, threads, blog posts). Use when the user shares a YouTube
+  URL or video link, asks to summarize a video, requests a transcript, or wants
+  to extract and reformat content from any YouTube video.
+---
+
+# YouTube Content Tool
+
+Extract transcripts from YouTube videos and convert them into useful formats.
+
+## Setup
+
+```bash
+pip install youtube-transcript-api
+```
+
+## Helper Script
+
+`SKILL_DIR` is the directory containing this SKILL.md file. The script accepts any standard YouTube URL format, short links (youtu.be), shorts, embeds, live links, or a raw 11-character video ID.
+
+```bash
+# JSON output with metadata
+python3 SKILL_DIR/scripts/fetch_transcript.py "https://youtube.com/watch?v=VIDEO_ID"
+
+# Plain text (good for piping into further processing)
+python3 SKILL_DIR/scripts/fetch_transcript.py "URL" --text-only
+
+# With timestamps
+python3 SKILL_DIR/scripts/fetch_transcript.py "URL" --timestamps
+
+# Specific language with fallback chain
+python3 SKILL_DIR/scripts/fetch_transcript.py "URL" --language tr,en
+```
+
+## Output Formats
+
+After fetching the transcript, format it based on what the user asks for:
+
+- **Chapters**: Group by topic shifts, output timestamped chapter list
+- **Summary**: Concise 5-10 sentence overview of the entire video
+- **Chapter summaries**: Chapters with a short paragraph summary for each
+- **Thread**: Twitter/X thread format — numbered posts, each under 280 chars
+- **Blog post**: Full article with title, sections, and key takeaways
+- **Quotes**: Notable quotes with timestamps
+
+### Example — Chapters Output
+
+```
+00:00 Introduction — host opens with the problem statement
+03:45 Background — prior work and why existing solutions fall short
+12:20 Core method — walkthrough of the proposed approach
+24:10 Results — benchmark comparisons and key takeaways
+31:55 Q&A — audience questions on scalability and next steps
+```
+
+## Workflow
+
+1. **Fetch** the transcript using the helper script with `--text-only --timestamps`.
+2. **Validate**: confirm the output is non-empty and in the expected language. If empty, retry without `--language` to get any available transcript. If still empty, tell the user the video likely has transcripts disabled.
+3. **Chunk if needed**: if the transcript exceeds ~50K characters, split into overlapping chunks (~40K with 2K overlap) and summarize each chunk before merging.
+4. **Transform** into the requested output format. If the user did not specify a format, default to a summary.
+5. **Verify**: re-read the transformed output to check for coherence, correct timestamps, and completeness before presenting.
+
+## Error Handling
+
+- **Transcript disabled**: tell the user; suggest they check if subtitles are available on the video page.
+- **Private/unavailable video**: relay the error and ask the user to verify the URL.
+- **No matching language**: retry without `--language` to fetch any available transcript, then note the actual language to the user.
+- **Dependency missing**: run `pip install youtube-transcript-api` and retry.
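Workflow step 3 in `youtube-content/SKILL.md` (split transcripts longer than ~50K characters into ~40K-character chunks with 2K overlap) could be sketched as follows. This is one possible reading of that step, not code from the commit: `chunk_text` and its parameter names are hypothetical, and only the three sizes come from the skill.

```python
def chunk_text(text, chunk_size=40_000, overlap=2_000, threshold=50_000):
    """Split text into overlapping chunks once it exceeds the threshold."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if len(text) <= threshold:
        # Short transcripts are processed in one piece.
        return [text]

    step = chunk_size - overlap  # each chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    transcript = "word " * 20_000  # 100,000 characters of filler
    parts = chunk_text(transcript)
    print(len(parts), [len(p) for p in parts])
```

Each chunk repeats the final 2K characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.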
--- a/skills/media/youtube-content/references/output-formats.md
+++ b/skills/media/youtube-content/references/output-formats.md
@@ -0,0 +1,56 @@
+# Output Format Examples
+
+## Chapters
+
+```
+00:00 Introduction
+02:15 Background and motivation
+05:30 Main approach
+12:45 Results and evaluation
+18:20 Limitations and future work
+21:00 Q&A
+```
+
+## Summary
+
+A 5-10 sentence overview covering the video's main points, key arguments, and conclusions. Written in third person, present tense.
+
+## Chapter Summaries
+
+```
+## 00:00 Introduction (2 min)
+The speaker introduces the topic of X and explains why it matters for Y.
+
+## 02:15 Background (3 min)
+A review of prior work in the field, covering approaches A, B, and C.
+```
+
+## Thread (Twitter/X)
+
+```
+1/ Just watched an incredible talk on [topic]. Here are the key takeaways: 🧵
+
+2/ First insight: [point]. This matters because [reason].
+
+3/ The surprising part: [unexpected finding]. Most people assume [common belief], but the data shows otherwise.
+
+4/ Practical takeaway: [actionable advice].
+
+5/ Full video: [URL]
+```
+
+## Blog Post
+
+Full article with:
+- Title
+- Introduction paragraph
+- H2 sections for each major topic
+- Key quotes (with timestamps)
+- Conclusion / takeaways
+
+## Quotes
+
+```
+"The most important thing is not the model size, but the data quality." — 05:32
+"We found that scaling past 70B parameters gave diminishing returns." — 12:18
+```
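The Thread format in `output-formats.md` numbers each post as `N/`, and SKILL.md requires every post to stay under 280 characters. A small Python sketch of that check follows; `format_thread` is a hypothetical helper, and rejecting an over-long post (rather than truncating it) is an assumption about the desired behavior.

```python
def format_thread(points, limit=280):
    """Number each point as 'N/ text' and enforce the per-post length limit."""
    posts = []
    for index, point in enumerate(points, start=1):
        post = f"{index}/ {point.strip()}"
        # The skill asks for posts *under* the limit, so the limit itself is rejected.
        if len(post) >= limit:
            raise ValueError(f"post {index} has {len(post)} characters; must be under {limit}")
        posts.append(post)
    # Posts are separated by a blank line, as in the Thread example.
    return "\n\n".join(posts)


if __name__ == "__main__":
    print(format_thread([
        "First insight: [point]. This matters because [reason].",
        "Practical takeaway: [actionable advice].",
    ]))
```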
--- a/skills/media/youtube-content/scripts/fetch_transcript.py
+++ b/skills/media/youtube-content/scripts/fetch_transcript.py
@@ -0,0 +1,124 @@
+#!/usr/bin/env python3
+"""
+Fetch a YouTube video transcript and output it as structured JSON.
+
+Usage:
+    python fetch_transcript.py <url_or_video_id> [--language en,tr] [--timestamps] [--text-only]
+
+Output (JSON; "timestamped_text" is present only with --timestamps):
+    {
+        "video_id": "...",
+        "segment_count": 123,
+        "duration": "9:58",
+        "full_text": "complete transcript as plain text",
+        "timestamped_text": "0:00 first line\n0:05 second line\n..."
+    }
+
+Install dependency:  pip install youtube-transcript-api
+"""
+
+import argparse
+import json
+import re
+import sys
+
+
+def extract_video_id(url_or_id: str) -> str:
+    """Extract the 11-character video ID from various YouTube URL formats."""
+    url_or_id = url_or_id.strip()
+    patterns = [
+        r'(?:v=|youtu\.be/|shorts/|embed/|live/)([a-zA-Z0-9_-]{11})',
+        r'^([a-zA-Z0-9_-]{11})$',
+    ]
+    for pattern in patterns:
+        match = re.search(pattern, url_or_id)
+        if match:
+            return match.group(1)
+    return url_or_id
+
+
+def format_timestamp(seconds: float) -> str:
+    """Convert seconds to H:MM:SS or M:SS format (the leading unit is not zero-padded)."""
+    total = int(seconds)
+    h, remainder = divmod(total, 3600)
+    m, s = divmod(remainder, 60)
+    if h > 0:
+        return f"{h}:{m:02d}:{s:02d}"
+    return f"{m}:{s:02d}"
+
+
+def fetch_transcript(video_id: str, languages: list = None):
+    """Fetch transcript segments from YouTube.
+
+    Returns a list of dicts with 'text', 'start', and 'duration' keys.
+    Compatible with youtube-transcript-api v1.x.
+    """
+    try:
+        from youtube_transcript_api import YouTubeTranscriptApi
+    except ImportError:
+        print("Error: youtube-transcript-api not installed. Run: pip install youtube-transcript-api",
+              file=sys.stderr)
+        sys.exit(1)
+
+    api = YouTubeTranscriptApi()
+    if languages:
+        result = api.fetch(video_id, languages=languages)
+    else:
+        result = api.fetch(video_id)
+
+    # v1.x returns FetchedTranscriptSnippet objects; normalize to dicts
+    return [
+        {"text": seg.text, "start": seg.start, "duration": seg.duration}
+        for seg in result
+    ]
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Fetch YouTube transcript as JSON")
+    parser.add_argument("url", help="YouTube URL or video ID")
+    parser.add_argument("--language", "-l", default=None,
+                        help="Comma-separated language codes (e.g. en,tr). Default: auto")
+    parser.add_argument("--timestamps", "-t", action="store_true",
+                        help="Include timestamped text in output")
+    parser.add_argument("--text-only", action="store_true",
+                        help="Output plain text instead of JSON")
+    args = parser.parse_args()
+
+    video_id = extract_video_id(args.url)
+    languages = [code.strip() for code in args.language.split(",")] if args.language else None
+
+    try:
+        segments = fetch_transcript(video_id, languages)
+    except Exception as e:
+        error_msg = str(e)
+        if "disabled" in error_msg.lower():
+            print(json.dumps({"error": "Transcripts are disabled for this video."}))
+        elif "no transcript" in error_msg.lower():
+            print(json.dumps({"error": "No transcript found. Try specifying a language with --language."}))
+        else:
+            print(json.dumps({"error": error_msg}))
+        sys.exit(1)
+
+    full_text = " ".join(seg["text"] for seg in segments)
+    timestamped = "\n".join(
+        f"{format_timestamp(seg['start'])} {seg['text']}" for seg in segments
+    )
+
+    if args.text_only:
+        print(timestamped if args.timestamps else full_text)
+        return
+
+    result = {
+        "video_id": video_id,
+        "segment_count": len(segments),
+        "duration": format_timestamp(segments[-1]["start"] + segments[-1]["duration"]) if segments else "0:00",
+        "full_text": full_text,
+    }
+    if args.timestamps:
+        result["timestamped_text"] = timestamped
+
+    print(json.dumps(result, ensure_ascii=False, indent=2))
+
+
+if __name__ == "__main__":
+    main()
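The two pure helpers in `fetch_transcript.py` can be exercised without network access. The sketch below copies their logic from the script so it runs standalone; the sample video ID and URLs are illustrative inputs, not values from the commit. It shows which URL shapes the regex accepts and that timestamps are not zero-padded in the leading unit.

```python
import re

# Patterns copied from extract_video_id() in fetch_transcript.py.
PATTERNS = [
    r'(?:v=|youtu\.be/|shorts/|embed/|live/)([a-zA-Z0-9_-]{11})',
    r'^([a-zA-Z0-9_-]{11})$',
]


def extract_video_id(url_or_id):
    """Return the 11-character ID, or the input unchanged if nothing matches."""
    url_or_id = url_or_id.strip()
    for pattern in PATTERNS:
        match = re.search(pattern, url_or_id)
        if match:
            return match.group(1)
    return url_or_id


def format_timestamp(seconds):
    """Same logic as the script: H:MM:SS above one hour, otherwise M:SS."""
    total = int(seconds)
    h, remainder = divmod(total, 3600)
    m, s = divmod(remainder, 60)
    return f"{h}:{m:02d}:{s:02d}" if h > 0 else f"{m}:{s:02d}"


if __name__ == "__main__":
    samples = [
        "https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=42s",
        "https://youtu.be/dQw4w9WgXcQ",
        "https://www.youtube.com/shorts/dQw4w9WgXcQ",
        "dQw4w9WgXcQ",
    ]
    for sample in samples:
        print(extract_video_id(sample))  # prints dQw4w9WgXcQ each time
    print(format_timestamp(5))      # prints 0:05
    print(format_timestamp(3725))   # prints 1:02:05
```

Note that an unrecognized input is returned as-is, so a malformed URL only fails later, when the transcript API rejects it.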