Cache-Aware Routing
The biggest factor in workflow execution time is model loading. cmfy.cloud uses cache-aware routing to minimize this by sending jobs to nodes that already have your models loaded.
Why Caching Matters
AI models are large files; a single checkpoint can run from 2 GB to more than 12 GB. Loading them takes time:
| Model Type | Typical Size | Download Time* | Load Time* |
|---|---|---|---|
| SDXL Checkpoint | 6.5 GB | 30-60s | 5-10s |
| SD 1.5 Checkpoint | 4 GB | 20-40s | 3-5s |
| LoRA | 50-200 MB | 1-5s | <1s |
| VAE | 300-800 MB | 5-10s | 1-2s |
| ControlNet | 1-2 GB | 10-20s | 2-4s |
*Times vary based on network and hardware conditions
Without caching, every job would need to download and load models fresh. With caching, a job can start executing in seconds instead of minutes.
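As a rough illustration using midpoints from the table above (so the numbers are indicative only), compare a cold start with a warm start for a workflow that uses an SDXL checkpoint, a LoRA, and a VAE:

```python
# Illustrative only: midpoint timings (seconds) taken from the table above.
TIMINGS = {
    "sdxl_checkpoint": {"download": 45, "load": 7},
    "lora":            {"download": 3,  "load": 1},
    "vae":             {"download": 7,  "load": 2},
}

# Cold start: every model must be downloaded and then loaded.
cold = sum(t["download"] + t["load"] for t in TIMINGS.values())

# Warm start: the models are already on the node; only loading remains.
warm = sum(t["load"] for t in TIMINGS.values())

print(f"cold ≈ {cold}s, warm ≈ {warm}s")  # cold ≈ 65s, warm ≈ 10s
```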
How It Works
When you submit a job, the router makes its decision in four steps:
1. Extract Model URLs - the router identifies all model URLs in your workflow
2. Query Cache Index - it checks which GPU nodes have these models cached
3. Score Candidates - it ranks nodes by cache coverage and availability
4. Route Decision - it sends the job to the best match or to the general queue
For example, if a workflow requires three models and Node A has all three cached while the other candidates hold only one or two, Node A is the best choice.
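A minimal sketch of this decision logic, assuming a ComfyUI-style workflow dict and a hypothetical `cache_index` mapping node IDs to their cached model URLs (the real router also weighs current load):

```python
def extract_model_urls(workflow: dict) -> set[str]:
    """Step 1: collect every model URL referenced by the workflow."""
    return {
        value
        for node in workflow.values()
        for value in node.get("inputs", {}).values()
        if isinstance(value, str) and value.startswith("https://")
    }

def pick_node(workflow: dict, cache_index: dict[str, set[str]]) -> str | None:
    required = extract_model_urls(workflow)
    if not required or not cache_index:
        return None  # nothing to match on; fall back to the general queue
    # Steps 2-3: query the index and score each node by how many
    # required models it already has cached.
    coverage = {node: len(required & cached) for node, cached in cache_index.items()}
    # Step 4: route to the node with the best coverage.
    return max(coverage, key=coverage.get)
```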
Cache Coverage Score
The router calculates a coverage score for each node:
Score = (Cached Models / Required Models) × 100
| Scenario | Score | Routing Decision |
|---|---|---|
| All models cached | 100% | Route directly to node |
| Most models cached | 60-99% | Route directly (still faster) |
| Some models cached | 1-59% | May route to general queue |
| No models cached | 0% | Route to general queue |
When coverage is high (>60%), routing directly is almost always faster, even if the node is slightly busier.
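In code, the score and the table's thresholds look roughly like this sketch (production tie-breaking and exact thresholds may differ):

```python
def coverage_score(required: set[str], cached: set[str]) -> float:
    """Score = (cached models / required models) × 100."""
    if not required:
        return 100.0
    return len(required & cached) / len(required) * 100

def routing_decision(score: float) -> str:
    # Thresholds mirror the table above.
    if score >= 60:
        return "route directly to node"
    if score > 0:
        return "may route to general queue"
    return "route to general queue"

# Worked example: 2 of 3 required models cached -> ~67%, route directly.
required = {"ckpt.safetensors", "lora.safetensors", "vae.safetensors"}
cached = {"ckpt.safetensors", "lora.safetensors"}
score = coverage_score(required, cached)
print(f"{score:.0f}% -> {routing_decision(score)}")  # 67% -> route directly to node
```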
Tips for Maximizing Cache Hits
1. Use Consistent Model URLs
The cache index tracks models by exact URL. These are treated as different models:
❌ Different URLs = no cache hit:

```
https://huggingface.co/stabilityai/sdxl/model.safetensors
https://huggingface.co/stabilityai/sdxl/resolve/main/model.safetensors
```
Pick one URL format and use it consistently across all your workflows.
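One low-effort way to enforce this is to define each URL once as a constant in your client code. A sketch assuming a ComfyUI-style workflow dict; whether the URL goes directly in `ckpt_name` is an assumption, so check the workflow reference for the exact field:

```python
# Define each model URL once and reuse the constant, so every workflow you
# submit contains the byte-identical string and maps to one cache entry.
SDXL_BASE_URL = "https://huggingface.co/stabilityai/sdxl/resolve/main/model.safetensors"

def make_workflow(prompt: str) -> dict:
    # ComfyUI-style nodes; passing a URL in `ckpt_name` is an assumption here.
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": SDXL_BASE_URL}},
        "2": {"class_type": "CLIPTextEncode",
              "inputs": {"text": prompt, "clip": ["1", 1]}},
    }
```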
2. Use Popular Models
Models used by many customers are more likely to be cached across the fleet:
- Stable Diffusion XL Base
- Stable Diffusion 1.5
- Popular LoRAs from Civitai
- Standard VAEs and ControlNets
3. Batch Similar Workflows
If you're submitting multiple jobs with the same models, submit them close together. The first job "warms" the cache for subsequent jobs.
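As a sketch, a client can submit the whole batch in one loop. The endpoint, payload shape, and auth header below are assumptions for illustration; only the `job_id` field comes from the response format shown later on this page:

```python
import requests

API_URL = "https://api.cmfy.cloud/v1/jobs"  # hypothetical endpoint; check the API docs
API_KEY = "YOUR_API_KEY"

def submit_job(workflow: dict) -> str:
    """Submit one workflow and return its job_id."""
    resp = requests.post(
        API_URL,
        json={"workflow": workflow},  # assumed payload shape
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]

def submit_batch(workflows: list[dict]) -> list[str]:
    """Submit related jobs back to back: the first job warms the cache,
    so later jobs are more likely to land on nodes that already hold
    the models."""
    return [submit_job(wf) for wf in workflows]
```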
4. Minimize Model Variety
Workflows using fewer unique models have better cache hit rates:
- ✓ Good: 1 checkpoint + 1 LoRA → high chance all are cached
- ✗ Challenging: 1 checkpoint + 5 LoRAs + 3 ControlNets → lower chance all are cached
5. Use Standard Model Types
The router recognizes common node types for model loading:
- CheckpointLoaderSimple
- LoraLoader
- VAELoader
- ControlNetLoader
- CLIPLoader
- UNETLoader
Using standard node types ensures models are correctly tracked in the cache index.
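For instance, a workflow built only from these loader types keeps every model visible to the cache index. The wiring follows ComfyUI's API format; the URLs are placeholders and the URL-in-inputs convention is an assumption, as above:

```python
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "https://example.com/models/sdxl_base.safetensors"}},
    "2": {"class_type": "LoraLoader",
          "inputs": {"lora_name": "https://example.com/models/detail.safetensors",
                     "strength_model": 0.8, "strength_clip": 0.8,
                     "model": ["1", 0], "clip": ["1", 1]}},
    "3": {"class_type": "VAELoader",
          "inputs": {"vae_name": "https://example.com/models/sdxl_vae.safetensors"}},
}
```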
Cache Warming
cmfy.cloud proactively warms caches to improve hit rates:
- Predictive warming - Popular models are pre-loaded on multiple nodes
- User pattern warming - If you frequently use certain models, they're kept cached
- Job queue warming - When your job is queued, missing models start downloading
This means even "cache misses" are often faster because warming started before your job reached the front of the queue.
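Conceptually, predictive warming can be as simple as ranking models by recent popularity and pre-loading the top entries onto idle nodes. A sketch of the idea, not the actual cmfy.cloud implementation:

```python
from collections import Counter

def plan_predictive_warming(recent_jobs: list[set[str]], top_n: int = 10) -> list[str]:
    """Count how often each model URL appears across recent jobs and
    return the most popular ones as pre-load candidates."""
    counts = Counter(url for job in recent_jobs for url in job)
    return [url for url, _ in counts.most_common(top_n)]

# Example: the checkpoint used in two of three recent jobs ranks first.
jobs = [{"https://example.com/ckpt", "https://example.com/lora"},
        {"https://example.com/ckpt"},
        {"https://example.com/vae"}]
print(plan_predictive_warming(jobs, top_n=2))
```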
Understanding Wait Times
The estimated wait time in your job response accounts for caching:
```json
{
  "job_id": "...",
  "status": "queued",
  "queue_position": 3,
  "estimated_wait_seconds": 45
}
```
This estimate includes:
- Queue wait time
- Model download time (if needed)
- Expected execution time
Jobs routed to nodes with cached models will have lower estimates.
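A client can use this estimate to pace its own polling. The endpoint below is a hypothetical placeholder, as above; only the response fields come from the example shown here:

```python
import time
import requests

API_URL = "https://api.cmfy.cloud/v1/jobs"  # hypothetical endpoint

def wait_for_job(job_id: str, api_key: str) -> dict:
    """Poll a job until it leaves the queue, pacing requests off the
    server's own estimate rather than a fixed interval."""
    while True:
        resp = requests.get(
            f"{API_URL}/{job_id}",
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        resp.raise_for_status()
        job = resp.json()
        if job["status"] != "queued":
            return job
        # Sleep a fraction of the estimate, bounded so polling stays responsive.
        time.sleep(min(max(job["estimated_wait_seconds"] / 4, 1), 15))
```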
What's Next?
- Fair Queuing - How jobs are scheduled across users
- Rate Limiting - Understanding your tier's limits