Sync all skills and memories 2026-04-14 07:27

skills/mlops/cloud/DESCRIPTION.md

---
description: GPU cloud providers and serverless compute platforms for ML workloads.
---

skills/mlops/cloud/modal/SKILL.md

---
name: modal-serverless-gpu
description: Serverless GPU cloud platform for running ML workloads. Use when you need on-demand GPU access without infrastructure management, when deploying ML models as APIs, or when running batch jobs with automatic scaling.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [modal>=0.64.0]
metadata:
  hermes:
    tags: [Infrastructure, Serverless, GPU, Cloud, Deployment, Modal]
---

# Modal Serverless GPU

Comprehensive guide to running ML workloads on Modal's serverless GPU cloud platform.

## When to use Modal

**Use Modal when:**
- Running GPU-intensive ML workloads without managing infrastructure
- Deploying ML models as auto-scaling APIs
- Running batch processing jobs (training, inference, data processing)
- Paying per second for GPU time, with no idle costs
- Prototyping ML applications quickly
- Running scheduled jobs (cron-like workloads)

**Key features:**
- **Serverless GPUs**: T4, L4, A10G, L40S, A100, H100, H200, B200 on-demand
- **Python-native**: Define infrastructure in Python code, no YAML
- **Auto-scaling**: Scale to zero, scale to 100+ GPUs instantly
- **Sub-second cold starts**: Rust-based infrastructure for fast container launches
- **Container caching**: Image layers cached for rapid iteration
- **Web endpoints**: Deploy functions as REST APIs with zero-downtime updates

**Consider alternatives instead:**
- **RunPod**: For longer-running pods with persistent state
- **Lambda Labs**: For reserved GPU instances
- **SkyPilot**: For multi-cloud orchestration and cost optimization
- **Kubernetes**: For complex multi-service architectures

## Quick start

### Installation

```bash
pip install modal
modal setup  # Opens browser for authentication
```

### Hello World with GPU

```python
import modal

app = modal.App("hello-gpu")

@app.function(gpu="T4")
def gpu_info():
    import subprocess
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

@app.local_entrypoint()
def main():
    print(gpu_info.remote())
```

Run: `modal run hello_gpu.py`

### Basic inference endpoint

```python
import modal

app = modal.App("text-generation")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

@app.cls(gpu="A10G", image=image)
class TextGenerator:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="gpt2", device=0)

    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_length=100)[0]["generated_text"]

@app.local_entrypoint()
def main():
    print(TextGenerator().generate.remote("Hello, world"))
```

## Core concepts

### Key components

| Component | Purpose |
|-----------|---------|
| `App` | Container for functions and resources |
| `Function` | Serverless function with compute specs |
| `Cls` | Class-based functions with lifecycle hooks |
| `Image` | Container image definition |
| `Volume` | Persistent storage for models/data |
| `Secret` | Secure credential storage |

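The table above maps to code roughly as follows. A minimal sketch, assuming illustrative names (`demo-app`, `model-weights`, and `my-api-key` are examples, not part of this skill):

```python
import modal

app = modal.App("demo-app")                                                # App: groups functions and resources
image = modal.Image.debian_slim().pip_install("torch")                     # Image: container definition
volume = modal.Volume.from_name("model-weights", create_if_missing=True)   # Volume: persistent storage
secret = modal.Secret.from_name("my-api-key")                              # Secret: injected as env vars

@app.function(image=image)   # Function: a plain serverless function
def ping():
    return "pong"

@app.cls(gpu="T4", image=image, volumes={"/weights": volume}, secrets=[secret])
class Predictor:             # Cls: lifecycle hooks plus callable methods
    @modal.enter()
    def setup(self):
        self.ready = True

    @modal.method()
    def predict(self, x):
        return x
```
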
### Execution modes

| Command | Description |
|---------|-------------|
| `modal run script.py` | Execute and exit |
| `modal serve script.py` | Development with live reload |
| `modal deploy script.py` | Persistent cloud deployment |

## GPU configuration

### Available GPUs

| GPU | VRAM | Best For |
|-----|------|----------|
| `T4` | 16GB | Budget inference, small models |
| `L4` | 24GB | Inference, Ada Lovelace arch |
| `A10G` | 24GB | Training/inference, 3.3x faster than T4 |
| `L40S` | 48GB | Recommended for inference (best cost/perf) |
| `A100-40GB` | 40GB | Large model training |
| `A100-80GB` | 80GB | Very large models |
| `H100` | 80GB | Fastest, FP8 + Transformer Engine |
| `H200` | 141GB | Auto-upgrade from H100, 4.8TB/s bandwidth |
| `B200` | 192GB | Latest Blackwell architecture |

### GPU specification patterns

```python
# Single GPU
@app.function(gpu="A100")

# Specific memory variant
@app.function(gpu="A100-80GB")

# Multiple GPUs (up to 8)
@app.function(gpu="H100:4")

# GPU with fallbacks
@app.function(gpu=["H100", "A100", "L40S"])

# Any available GPU
@app.function(gpu="any")
```

## Container images

```python
# Basic image with pip
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch==2.1.0", "transformers==4.36.0", "accelerate"
)

# From CUDA base
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
    add_python="3.11"
).pip_install("torch", "transformers")

# With system packages
image = modal.Image.debian_slim().apt_install("git", "ffmpeg").pip_install("whisper")
```

## Persistent storage

```python
volume = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(gpu="A10G", volumes={"/models": volume})
def load_model():
    import os
    model_path = "/models/llama-7b"
    if not os.path.exists(model_path):
        model = download_model()          # placeholder: fetch the weights however you like
        model.save_pretrained(model_path)
        volume.commit()  # Persist changes
    return load_from_path(model_path)     # placeholder: load weights from the volume
```

## Web endpoints

### FastAPI endpoint decorator

```python
@app.function()
@modal.fastapi_endpoint(method="POST")
def predict(text: str) -> dict:
    return {"result": model.predict(text)}  # `model` assumed loaded elsewhere
```

### Full ASGI app

```python
from fastapi import FastAPI

web_app = FastAPI()

@web_app.post("/predict")
async def predict(text: str):
    return {"result": await model.predict.remote.aio(text)}

@app.function()
@modal.asgi_app()
def fastapi_app():
    return web_app
```

### Web endpoint types

| Decorator | Use Case |
|-----------|----------|
| `@modal.fastapi_endpoint()` | Simple function → API |
| `@modal.asgi_app()` | Full FastAPI/Starlette apps |
| `@modal.wsgi_app()` | Django/Flask apps |
| `@modal.web_server(port)` | Arbitrary HTTP servers |

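The first three decorators are illustrated above. For `@modal.web_server`, a minimal sketch, assuming a plain HTTP server on an example port (the command and port are illustrative):

```python
import subprocess

@app.function()
@modal.web_server(8000)
def http_server():
    # Start any process that listens on 0.0.0.0:8000; Modal proxies HTTPS traffic to it
    subprocess.Popen("python -m http.server 8000 --bind 0.0.0.0", shell=True)
```
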
## Dynamic batching

```python
@app.function()
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(inputs: list[str]) -> list[dict]:
    # Inputs automatically batched
    return model.batch_predict(inputs)
```

## Secrets management

```bash
# Create secret
modal secret create huggingface HF_TOKEN=hf_xxx
```

```python
@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model():
    import os
    token = os.environ["HF_TOKEN"]
```

## Scheduling

```python
@app.function(schedule=modal.Cron("0 0 * * *"))  # Daily at midnight UTC
def daily_job():
    pass

@app.function(schedule=modal.Period(hours=1))
def hourly_job():
    pass
```

## Performance optimization

### Cold start mitigation

```python
@app.function(
    container_idle_timeout=300,   # Keep warm 5 min
    allow_concurrent_inputs=10,   # Handle concurrent requests
)
def inference():
    pass
```

### Model loading best practices

```python
@app.cls(gpu="A100")
class Model:
    @modal.enter()  # Run once at container start
    def load(self):
        self.model = load_model()  # Load during warm-up

    @modal.method()
    def predict(self, x):
        return self.model(x)
```

## Parallel processing

```python
@app.function()
def process_item(item):
    return expensive_computation(item)

@app.function()
def run_parallel():
    items = list(range(1000))
    # Fan out to parallel containers
    results = list(process_item.map(items))
    return results
```

## Common configuration

```python
@app.function(
    gpu="A100",
    memory=32768,                # 32GB RAM
    cpu=4,                       # 4 CPU cores
    timeout=3600,                # 1 hour max
    container_idle_timeout=120,  # Keep warm 2 min
    retries=3,                   # Retry on failure
    concurrency_limit=10,        # Max concurrent containers
)
def my_function():
    pass
```

## Debugging

```python
# Test locally
if __name__ == "__main__":
    result = my_function.local()

# View logs
# modal app logs my-app
```

## Common issues

| Issue | Solution |
|-------|----------|
| Cold start latency | Increase `container_idle_timeout`, use `@modal.enter()` |
| GPU OOM | Use larger GPU (`A100-80GB`), enable gradient checkpointing |
| Image build fails | Pin dependency versions, check CUDA compatibility |
| Timeout errors | Increase `timeout`, add checkpointing |

## References

- **[Advanced Usage](references/advanced-usage.md)** - Multi-GPU, distributed training, cost optimization
- **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions

## Resources

- **Documentation**: https://modal.com/docs
- **Examples**: https://github.com/modal-labs/modal-examples
- **Pricing**: https://modal.com/pricing
- **Discord**: https://discord.gg/modal

skills/mlops/cloud/modal/references/advanced-usage.md

# Modal Advanced Usage Guide

## Multi-GPU Training

### Single-node multi-GPU

```python
import modal

app = modal.App("multi-gpu-training")
image = modal.Image.debian_slim().pip_install("torch", "transformers", "accelerate")

@app.function(gpu="H100:4", image=image, timeout=7200)
def train_multi_gpu():
    from accelerate import Accelerator

    # model, optimizer, and dataloader come from your own training setup
    accelerator = Accelerator()
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for batch in dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
```

### DeepSpeed integration

```python
image = modal.Image.debian_slim().pip_install(
    "torch", "transformers", "deepspeed", "accelerate"
)

@app.function(gpu="A100:8", image=image, timeout=14400)
def deepspeed_train(config: dict):
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir="/outputs",
        deepspeed="ds_config.json",
        fp16=True,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4
    )

    # model and dataset come from your own setup code
    trainer = Trainer(model=model, args=args, train_dataset=dataset)
    trainer.train()
```

### Multi-GPU considerations

For frameworks that re-execute the Python entrypoint (like PyTorch Lightning), use:
- the `ddp_spawn` or `ddp_notebook` strategy (a Lightning sketch follows at the end of this section), or
- run training as a subprocess to avoid re-execution issues

```python
@app.function(gpu="H100:4")
def train_with_subprocess():
    import subprocess
    subprocess.run(["python", "-m", "torch.distributed.launch", "train.py"])
```

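For the strategy-based route, a minimal PyTorch Lightning sketch. It assumes the image also installs `pytorch-lightning`, and that `LitModel` and `train_loader` are defined in your own code; both names are placeholders, not part of this guide:

```python
@app.function(gpu="H100:4", image=image, timeout=7200)
def train_lightning():
    import pytorch_lightning as pl

    # ddp_spawn launches worker processes instead of re-executing the entrypoint
    trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp_spawn", max_epochs=1)
    trainer.fit(LitModel(), train_loader)  # placeholders for your LightningModule and dataloader
```
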
## Advanced Container Configuration

### Multi-stage builds for caching

```python
# Stage 1: Base dependencies (cached)
base_image = modal.Image.debian_slim().pip_install("torch", "numpy", "scipy")

# Stage 2: ML libraries (cached separately)
ml_image = base_image.pip_install("transformers", "datasets", "accelerate")

# Stage 3: Custom code (rebuilt on changes)
final_image = ml_image.copy_local_dir("./src", "/app/src")
```

### Custom Dockerfiles

```python
image = modal.Image.from_dockerfile("./Dockerfile")
```

### Installing from Git

```python
image = modal.Image.debian_slim().pip_install(
    "git+https://github.com/huggingface/transformers.git@main"
)
```

### Using uv for faster installs

```python
image = modal.Image.debian_slim().uv_pip_install(
    "torch", "transformers", "accelerate"
)
```

## Advanced Class Patterns

### Lifecycle hooks

```python
@app.cls(gpu="A10G")
class InferenceService:
    @modal.enter()
    def startup(self):
        """Called once when container starts"""
        self.model = load_model()
        self.tokenizer = load_tokenizer()

    @modal.exit()
    def shutdown(self):
        """Called when container shuts down"""
        cleanup_resources()

    @modal.method()
    def predict(self, text: str):
        return self.model(self.tokenizer(text))
```

### Concurrent request handling

```python
@app.cls(
    gpu="A100",
    allow_concurrent_inputs=20,  # Handle 20 requests per container
    container_idle_timeout=300
)
class BatchInference:
    @modal.enter()
    def load(self):
        self.model = load_model()

    @modal.method()
    def predict(self, inputs: list):
        return self.model.batch_predict(inputs)
```

### Input concurrency vs batching

- **Input concurrency**: Multiple requests processed simultaneously (async I/O)
- **Dynamic batching**: Requests accumulated and processed together (GPU efficiency)

```python
# Input concurrency - good for I/O-bound work
@app.function(allow_concurrent_inputs=10)
async def fetch_data(url: str):
    import aiohttp
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()

# Dynamic batching - good for GPU inference
@app.function()
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_embed(texts: list[str]) -> list[list[float]]:
    return model.encode(texts)
```

## Advanced Volumes

### Volume operations

```python
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

@app.function(volumes={"/data": volume})
def volume_operations():
    import os

    # Write data
    with open("/data/output.txt", "w") as f:
        f.write("Results")

    # Commit changes (persist to volume)
    volume.commit()

    # Reload from remote (get latest)
    volume.reload()
```

### Shared volumes between functions

```python
shared_volume = modal.Volume.from_name("shared-data", create_if_missing=True)

@app.function(volumes={"/shared": shared_volume})
def writer():
    with open("/shared/data.txt", "w") as f:
        f.write("Hello from writer")
    shared_volume.commit()

@app.function(volumes={"/shared": shared_volume})
def reader():
    shared_volume.reload()  # Get latest
    with open("/shared/data.txt", "r") as f:
        return f.read()
```

### Cloud bucket mounts

```python
# Mount S3 bucket
bucket = modal.CloudBucketMount(
    bucket_name="my-bucket",
    secret=modal.Secret.from_name("aws-credentials")
)

@app.function(volumes={"/s3": bucket})
def process_s3_data():
    # Access S3 files like local filesystem
    data = open("/s3/data.parquet").read()
```

## Function Composition

### Chaining functions

```python
@app.function()
def preprocess(data):
    return cleaned_data       # placeholder transform

@app.function(gpu="T4")
def inference(data):
    return predictions        # placeholder model call

@app.function()
def postprocess(predictions):
    return formatted_results  # placeholder formatting

@app.function()
def pipeline(raw_data):
    cleaned = preprocess.remote(raw_data)
    predictions = inference.remote(cleaned)
    results = postprocess.remote(predictions)
    return results
```

### Parallel fan-out

```python
@app.function()
def process_item(item):
    return expensive_computation(item)

@app.function()
def parallel_pipeline(items):
    # Fan out: process all items in parallel
    results = list(process_item.map(items))
    return results
```

### Starmap for multiple arguments

```python
@app.function()
def process(x, y, z):
    return x + y + z

@app.function()
def orchestrate():
    args = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
    results = list(process.starmap(args))
    return results
```

## Advanced Web Endpoints

### WebSocket support

```python
from fastapi import FastAPI, WebSocket

app = modal.App("websocket-app")
web_app = FastAPI()

@web_app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    while True:
        data = await websocket.receive_text()
        await websocket.send_text(f"Processed: {data}")

@app.function()
@modal.asgi_app()
def ws_app():
    return web_app
```

### Streaming responses

```python
from fastapi.responses import StreamingResponse

@app.function(gpu="A100")
def generate_stream(prompt: str):
    for token in model.generate_stream(prompt):  # `model` assumed loaded elsewhere
        yield token

@web_app.get("/stream")
async def stream_response(prompt: str):
    return StreamingResponse(
        generate_stream.remote_gen(prompt),
        media_type="text/event-stream"
    )
```

### Authentication

```python
from fastapi import Depends, HTTPException, Header

async def verify_token(authorization: str = Header(None)):
    if not authorization or not authorization.startswith("Bearer "):
        raise HTTPException(status_code=401)
    token = authorization.split(" ")[1]
    if not verify_jwt(token):  # placeholder JWT check
        raise HTTPException(status_code=403)
    return token

@web_app.post("/predict")
async def predict(data: dict, token: str = Depends(verify_token)):
    return model.predict(data)
```

## Cost Optimization

### Right-sizing GPUs

```python
# For inference: smaller GPUs often sufficient
@app.function(gpu="L40S")  # 48GB, best cost/perf for inference
def inference():
    pass

# For training: larger GPUs for throughput
@app.function(gpu="A100-80GB")
def training():
    pass
```

### GPU fallbacks for availability

```python
@app.function(gpu=["H100", "A100", "L40S"])  # Try in order
def flexible_compute():
    pass
```

### Scale to zero

```python
# Default behavior: scale to zero when idle
@app.function(gpu="A100")
def on_demand():
    pass

# Keep containers warm for low latency (costs more)
@app.function(gpu="A100", keep_warm=1)
def always_ready():
    pass
```

### Batch processing for efficiency

```python
# Process in batches to reduce cold starts
@app.function(gpu="A100")
def batch_process(items: list):
    return [process(item) for item in items]

# Better than individual calls
results = batch_process.remote(all_items)
```

## Monitoring and Observability

### Structured logging

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.function()
def structured_logging(request_id: str, data: dict):
    logger.info(json.dumps({
        "event": "inference_start",
        "request_id": request_id,
        "input_size": len(data)
    }))

    result = process(data)

    logger.info(json.dumps({
        "event": "inference_complete",
        "request_id": request_id,
        "output_size": len(result)
    }))

    return result
```

### Custom metrics

```python
@app.function(gpu="A100")
def monitored_inference(inputs):
    import time

    start = time.time()
    results = model.predict(inputs)
    latency = time.time() - start

    # Log metrics (visible in Modal dashboard)
    print(f"METRIC latency={latency:.3f}s batch_size={len(inputs)}")

    return results
```

## Production Deployment

### Environment separation

```python
import os

env = os.environ.get("MODAL_ENV", "dev")
app = modal.App(f"my-service-{env}")

# Environment-specific config
if env == "prod":
    gpu_config = "A100"
    timeout = 3600
else:
    gpu_config = "T4"
    timeout = 300
```

### Zero-downtime deployments

Modal automatically handles zero-downtime deployments:
1. New containers are built and started
2. Traffic gradually shifts to the new version
3. Old containers drain existing requests
4. Old containers are terminated

### Health checks

```python
@app.function()
@modal.web_endpoint()
def health():
    import torch
    return {
        "status": "healthy",
        "model_loaded": hasattr(Model, "_model"),   # `Model` refers to your own @app.cls class
        "gpu_available": torch.cuda.is_available()
    }
```

## Sandboxes

### Interactive execution environments

```python
@app.function()
def run_sandbox():
    sandbox = modal.Sandbox.create(
        app=app,
        image=image,
        gpu="T4"
    )

    # Execute code in sandbox
    result = sandbox.exec("python", "-c", "print('Hello from sandbox')")

    sandbox.terminate()
    return result
```

## Invoking Deployed Functions

### From external code

```python
# Call deployed function from any Python script
import modal

f = modal.Function.lookup("my-app", "my_function")
result = f.remote(arg1, arg2)
```

### REST API invocation

```bash
# Deployed endpoints accessible via HTTPS
curl -X POST https://your-workspace--my-app-predict.modal.run \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world"}'
```

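The same call from Python, as a minimal sketch; the `requests` dependency and the workspace URL are illustrative, matching the placeholder endpoint above:

```python
import requests

url = "https://your-workspace--my-app-predict.modal.run"  # placeholder endpoint
resp = requests.post(url, json={"text": "Hello world"})
resp.raise_for_status()
print(resp.json())
```
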
skills/mlops/cloud/modal/references/troubleshooting.md

# Modal Troubleshooting Guide

## Installation Issues

### Authentication fails

**Error**: `modal setup` doesn't complete or token is invalid

**Solutions**:
```bash
# Re-authenticate
modal token new

# Check current token
modal config show

# Set token via environment
export MODAL_TOKEN_ID=ak-...
export MODAL_TOKEN_SECRET=as-...
```

### Package installation issues

**Error**: `pip install modal` fails

**Solutions**:
```bash
# Upgrade pip
pip install --upgrade pip

# Install with specific Python version
python3.11 -m pip install modal

# Install from wheel
pip install modal --prefer-binary
```

## Container Image Issues

### Image build fails

**Error**: `ImageBuilderError: Failed to build image`

**Solutions**:
```python
# Pin package versions to avoid conflicts
image = modal.Image.debian_slim().pip_install(
    "torch==2.1.0",
    "transformers==4.36.0",  # Pin versions
    "accelerate==0.25.0"
)

# Use compatible CUDA versions
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04",  # Match PyTorch CUDA
    add_python="3.11"
)
```

### Dependency conflicts

**Error**: `ERROR: Cannot install package due to conflicting dependencies`

**Solutions**:
```python
# Layer dependencies separately
base = modal.Image.debian_slim().pip_install("torch")
ml = base.pip_install("transformers")  # Install after torch

# Use uv for better resolution
image = modal.Image.debian_slim().uv_pip_install(
    "torch", "transformers"
)
```

### Large image builds timeout

**Error**: Image build exceeds time limit

**Solutions**:
```python
# Split into multiple layers (better caching)
base = modal.Image.debian_slim().pip_install("torch")  # Cached
ml = base.pip_install("transformers", "datasets")  # Cached
app = ml.copy_local_dir("./src", "/app")  # Rebuilds on code change

# Download models during build, not runtime
image = modal.Image.debian_slim().pip_install("transformers").run_commands(
    "python -c 'from transformers import AutoModel; AutoModel.from_pretrained(\"bert-base-uncased\")'"
)
```

## GPU Issues

### GPU not available

**Error**: `RuntimeError: CUDA not available`

**Solutions**:
```python
# Ensure GPU is specified
@app.function(gpu="T4")  # Must specify GPU
def my_function():
    import torch
    assert torch.cuda.is_available()

# Check CUDA compatibility in image
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
    add_python="3.11"
).pip_install(
    "torch",
    index_url="https://download.pytorch.org/whl/cu121"  # Match CUDA
)
```

### GPU out of memory

**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`

**Solutions**:
```python
# Use larger GPU
@app.function(gpu="A100-80GB")  # More VRAM
def train():
    pass

# Enable memory optimizations
@app.function(gpu="A100")
def memory_optimized():
    import torch
    torch.backends.cuda.enable_flash_sdp(True)

    # Use gradient checkpointing (model/inputs come from your own setup)
    model.gradient_checkpointing_enable()

    # Mixed precision
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(**inputs)
```

### Wrong GPU allocated

**Error**: Got a different GPU than requested

**Solutions**:
```python
# Use strict GPU selection
@app.function(gpu="H100!")  # "H100!" prevents auto-upgrade to H200

# Specify exact memory variant
@app.function(gpu="A100-80GB")  # Not just "A100"

# Check GPU at runtime
@app.function(gpu="A100")
def check_gpu():
    import subprocess
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    print(result.stdout)
```

## Cold Start Issues

### Slow cold starts

**Problem**: First request takes too long

**Solutions**:
```python
# Keep containers warm
@app.function(
    container_idle_timeout=600,  # Keep warm 10 min
    keep_warm=1                  # Always keep 1 container ready
)
def low_latency():
    pass

# Load model during container start
@app.cls(gpu="A100")
class Model:
    @modal.enter()
    def load(self):
        # This runs once at container start, not per request
        self.model = load_heavy_model()

# Cache model in volume
volume = modal.Volume.from_name("models", create_if_missing=True)

@app.function(volumes={"/cache": volume})
def cached_model():
    import os
    if os.path.exists("/cache/model"):
        model = load_from_disk("/cache/model")
    else:
        model = download_model()
        save_to_disk(model, "/cache/model")
        volume.commit()
```

### Container keeps restarting

**Problem**: Containers are killed and restarted frequently

**Solutions**:
```python
# Increase memory
@app.function(memory=32768)  # 32GB RAM
def memory_heavy():
    pass

# Increase timeout
@app.function(timeout=3600)  # 1 hour
def long_running():
    pass

# Handle signals gracefully
import signal

def handler(signum, frame):
    cleanup()
    exit(0)

signal.signal(signal.SIGTERM, handler)
```

## Volume Issues

### Volume changes not persisting

**Error**: Data written to volume disappears

**Solutions**:
```python
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

@app.function(volumes={"/data": volume})
def write_data():
    with open("/data/file.txt", "w") as f:
        f.write("data")

    # CRITICAL: Commit changes!
    volume.commit()
```

### Volume read shows stale data

**Error**: Reading outdated data from volume

**Solutions**:
```python
@app.function(volumes={"/data": volume})
def read_data():
    # Reload to get latest
    volume.reload()

    with open("/data/file.txt", "r") as f:
        return f.read()
```

### Volume mount fails

**Error**: `VolumeError: Failed to mount volume`

**Solutions**:
```python
# Ensure volume exists
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

# Use absolute path
@app.function(volumes={"/data": volume})  # Not "./data"
def my_function():
    pass

# Check volume in dashboard
# modal volume list
```

## Web Endpoint Issues

### Endpoint returns 502

**Error**: Gateway timeout or bad gateway

**Solutions**:
```python
# Increase timeout
@app.function(timeout=300)  # 5 min
@modal.web_endpoint()
def slow_endpoint():
    pass

# Return a streaming response for long operations
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

web_app = FastAPI()

@web_app.get("/stream")
async def stream():
    async def generate():
        for i in range(100):
            yield f"data: {i}\n\n"
            await process_chunk(i)  # placeholder long-running work
    return StreamingResponse(generate(), media_type="text/event-stream")

@app.function()
@modal.asgi_app()
def streaming_app():
    return web_app
```

### Endpoint not accessible

**Error**: 404 or cannot reach endpoint

**Solutions**:
```bash
# Check deployment status
modal app list

# Redeploy
modal deploy my_app.py

# Check logs
modal app logs my-app
```

### CORS errors

**Error**: Cross-origin request blocked

**Solutions**:
```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

web_app = FastAPI()
web_app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.function()
@modal.asgi_app()
def cors_enabled():
    return web_app
```

## Secret Issues

### Secret not found

**Error**: `SecretNotFound: Secret 'my-secret' not found`

**Solutions**:
```bash
# Create secret via CLI
modal secret create my-secret KEY=value

# List secrets
modal secret list

# Check secret name matches exactly
```

### Secret value not accessible

**Error**: Environment variable is empty

**Solutions**:
```python
# Ensure secret is attached
@app.function(secrets=[modal.Secret.from_name("my-secret")])
def use_secret():
    import os
    value = os.environ.get("KEY")  # Use get() to handle missing keys
    if not value:
        raise ValueError("KEY not set in secret")
```

## Scheduling Issues

### Scheduled job not running

**Error**: Cron job doesn't execute

**Solutions**:
```python
# Verify cron syntax
@app.function(schedule=modal.Cron("0 0 * * *"))  # Daily at midnight UTC
def daily_job():
    pass

# Check timezone (Modal uses UTC)
# "0 8 * * *" = 8am UTC, not local time

# Ensure app is deployed
# modal deploy my_app.py
```

### Job runs multiple times

**Problem**: Scheduled job executes more often than expected

**Solutions**:
```python
# Implement idempotency (helper functions below are placeholders)
@app.function(schedule=modal.Cron("0 * * * *"))
def hourly_job():
    job_id = get_current_hour_id()
    if already_processed(job_id):
        return
    process()
    mark_processed(job_id)
```

## Debugging Tips

### Enable debug logging

```python
import logging
logging.basicConfig(level=logging.DEBUG)

@app.function()
def debug_function():
    logging.debug("Debug message")
    logging.info("Info message")
```

### View container logs

```bash
# Stream logs
modal app logs my-app

# View specific function
modal app logs my-app --function my_function

# View historical logs
modal app logs my-app --since 1h
```

### Test locally

```python
# Run function locally without Modal
if __name__ == "__main__":
    result = my_function.local()  # Runs on your machine
    print(result)
```

### Inspect container

```python
@app.function(gpu="T4")
def debug_environment():
    import subprocess
    import sys

    # System info
    print(f"Python: {sys.version}")
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
    print(subprocess.run(["pip", "list"], capture_output=True, text=True).stdout)

    # CUDA info
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```

## Common Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| `FunctionTimeoutError` | Function exceeded timeout | Increase `timeout` parameter |
| `ContainerMemoryExceeded` | OOM killed | Increase `memory` parameter |
| `ImageBuilderError` | Build failed | Check dependencies, pin versions |
| `ResourceExhausted` | No GPUs available | Use GPU fallbacks, try later |
| `AuthenticationError` | Invalid token | Run `modal token new` |
| `VolumeNotFound` | Volume doesn't exist | Use `create_if_missing=True` |
| `SecretNotFound` | Secret doesn't exist | Create secret via CLI |

## Getting Help

1. **Documentation**: https://modal.com/docs
2. **Examples**: https://github.com/modal-labs/modal-examples
3. **Discord**: https://discord.gg/modal
4. **Status**: https://status.modal.com

### Reporting Issues

Include:
- Modal client version: `modal --version`
- Python version: `python --version`
- Full error traceback
- Minimal reproducible code
- GPU type if relevant