Cache-Aware Routing

The biggest factor in workflow execution time is model loading. cmfy.cloud uses cache-aware routing to minimize this by sending jobs to nodes that already have your models loaded.

Why Caching Matters

AI models are large files (checkpoints alone run 2-12+ GB), and loading them takes time:

| Model Type | Typical Size | Download Time* | Load Time* |
| --- | --- | --- | --- |
| SDXL Checkpoint | 6.5 GB | 30-60s | 5-10s |
| SD 1.5 Checkpoint | 4 GB | 20-40s | 3-5s |
| LoRA | 50-200 MB | 1-5s | <1s |
| VAE | 300-800 MB | 5-10s | 1-2s |
| ControlNet | 1-2 GB | 10-20s | 2-4s |

*Times vary based on network and hardware conditions

Without caching, every job would need to download and load models fresh. With caching, a job can start executing in seconds instead of minutes.

How It Works

When you submit a job, the router:

Step by Step

  1. Extract Model URLs - The router identifies all model URLs in your workflow
  2. Query Cache Index - It checks which GPU nodes already have these models cached
  3. Score Candidates - It ranks the nodes by cache coverage and current availability
  4. Route Decision - It sends the job to the best match, or to the general queue if no node qualifies
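
Putting the four steps together, here is a minimal sketch of the routing logic in Python. The node names, cache contents, and `extract_model_urls` helper are illustrative assumptions, not the actual router implementation:

```python
# Illustrative sketch of cache-aware routing -- not the real router code.
# Node names, cache contents, and URLs below are made-up examples.

def extract_model_urls(workflow: dict) -> set:
    """Step 1: collect every model URL referenced by the workflow."""
    urls = set()
    for node in workflow.values():
        for value in node.get("inputs", {}).values():
            if isinstance(value, str) and value.startswith("https://"):
                urls.add(value)
    return urls

def coverage_score(required: set, cached: set) -> float:
    """Step 3: Score = (Cached Models / Required Models) x 100."""
    if not required:
        return 100.0
    return len(required & cached) / len(required) * 100

# Step 2: a toy cache index -- which node has which models cached.
CACHE_INDEX = {
    "Node A": {"https://example.com/sdxl.safetensors",
               "https://example.com/style.lora.safetensors",
               "https://example.com/sdxl.vae.safetensors"},
    "Node B": {"https://example.com/sdxl.safetensors"},
    "Node C": set(),
}

required = {"https://example.com/sdxl.safetensors",
            "https://example.com/style.lora.safetensors",
            "https://example.com/sdxl.vae.safetensors"}

# Step 4: pick the node with the highest coverage.
best = max(CACHE_INDEX, key=lambda n: coverage_score(required, CACHE_INDEX[n]))
print(best)  # Node A (100% coverage)
```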

In this example, Node A is the best choice - it has all three required models cached, so its coverage score is 100%.

Cache Coverage Score

The router calculates a coverage score for each node:

Score = (Cached Models / Required Models) × 100

| Scenario | Score | Routing Decision |
| --- | --- | --- |
| All models cached | 100% | Route directly to node |
| Most models cached | 60-99% | Route directly (still faster) |
| Some models cached | 1-59% | May route to general queue |
| No models cached | 0% | Route to general queue |

When coverage is high (60% or above), routing directly is almost always faster, even if that node is slightly busier than the alternatives.
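
The decision logic implied by the table can be sketched as follows; the exact thresholds are assumptions read off the ranges above, not confirmed router internals:

```python
def routing_decision(score: float) -> str:
    # Thresholds assumed from the table above; the real router may differ.
    if score >= 60:
        return "route directly to node"   # cached models outweigh queue depth
    if score > 0:
        return "may route to general queue"
    return "route to general queue"

print(routing_decision(2 / 3 * 100))  # 2 of 3 models cached -> 66.7% -> route directly
```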

Tips for Maximizing Cache Hits

1. Use Consistent Model URLs

The cache index tracks models by exact URL, so two URL forms that point at the same file are treated as different models:

❌ Different URLs = No cache hit:
https://huggingface.co/stabilityai/sdxl/model.safetensors
https://huggingface.co/stabilityai/sdxl/resolve/main/model.safetensors

Pick one URL format and use it consistently across all your workflows.
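
One way to stay consistent is to build every model URL through a single helper. This `canonical_hf_url` function is a hypothetical convenience, and the `resolve/main` form is just one valid choice:

```python
def canonical_hf_url(repo: str, filename: str) -> str:
    """Build one consistent Hugging Face URL and use it everywhere.
    Any single format works, as long as all your workflows share it."""
    return f"https://huggingface.co/{repo}/resolve/main/{filename}"

url = canonical_hf_url("stabilityai/sdxl", "model.safetensors")
# -> https://huggingface.co/stabilityai/sdxl/resolve/main/model.safetensors
```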

2. Use Popular Models

Models used by many customers are more likely to be cached across the fleet:

  • Stable Diffusion XL Base
  • Stable Diffusion 1.5
  • Popular LoRAs from Civitai
  • Standard VAEs and ControlNets

3. Batch Similar Workflows

If you're submitting multiple jobs with the same models, submit them close together. The first job "warms" the cache for subsequent jobs.
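
For example, a batch submission loop might look like the following. The endpoint, auth header, and payload shape are assumptions for illustration, not the documented cmfy.cloud API:

```python
import requests

# Hypothetical endpoint and payload shape -- adapt to the real API/SDK.
API_URL = "https://api.cmfy.cloud/v1/jobs"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Every job in the batch references the same models.
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "https://example.com/sdxl.safetensors"}},
}

# Submit back-to-back: the first job warms the cache for the rest.
for seed in (1, 2, 3):
    payload = {"workflow": workflow, "seed": seed}
    requests.post(API_URL, json=payload, headers=HEADERS, timeout=30)
```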

4. Minimize Model Variety

Workflows using fewer unique models have better cache hit rates:

✓ Good: 1 checkpoint + 1 LoRA
→ High chance all are cached

✗ Challenging: 1 checkpoint + 5 LoRAs + 3 ControlNets
→ Lower chance all are cached

5. Use Standard Model Types

The router recognizes common node types for model loading:

  • CheckpointLoaderSimple
  • LoraLoader
  • VAELoader
  • ControlNetLoader
  • CLIPLoader
  • UNETLoader

Using standard node types ensures models are correctly tracked in the cache index.
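
As an illustration, here is a workflow fragment (in Python dict form) that uses two of the standard loader types. The input field names follow common ComfyUI conventions, and URL-valued inputs are an assumption about how remote models are referenced:

```python
# Workflow fragment using standard loader node types.
# Field names follow common ComfyUI conventions; URL-valued inputs are
# an assumption about how cmfy.cloud references remote models.
workflow_fragment = {
    "1": {
        "class_type": "CheckpointLoaderSimple",
        "inputs": {
            "ckpt_name": "https://huggingface.co/stabilityai/sdxl/resolve/main/model.safetensors",
        },
    },
    "2": {
        "class_type": "LoraLoader",
        "inputs": {
            "lora_name": "https://example.com/style.lora.safetensors",
            "strength_model": 0.8,
            "strength_clip": 0.8,
            "model": ["1", 0],  # wire to the checkpoint's MODEL output
            "clip": ["1", 1],   # wire to the checkpoint's CLIP output
        },
    },
}
```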

Cache Warming

cmfy.cloud proactively warms caches to improve hit rates:

  1. Predictive warming - Popular models are pre-loaded on multiple nodes
  2. User pattern warming - If you frequently use certain models, they're kept cached
  3. Job queue warming - When your job is queued, missing models start downloading

This means even a "cache miss" is often faster than a cold start, because warming began before your job reached the front of the queue.
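
Conceptually, job-queue warming overlaps model downloads with your time in the queue. A simplified sketch of the idea, not the platform's actual implementation:

```python
import threading

def warm_missing_models(required: set, cached: set, download) -> None:
    """Kick off downloads of missing models the moment a job is queued,
    so the downloads run in the background during the queue wait."""
    for url in required - cached:
        threading.Thread(target=download, args=(url,), daemon=True).start()
    # By the time the job reaches the front of the queue, some or all
    # of these downloads may already have finished.
```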

Understanding Wait Times

The estimated wait time in your job response accounts for caching:

```json
{
  "job_id": "...",
  "status": "queued",
  "queue_position": 3,
  "estimated_wait_seconds": 45
}
```

This estimate includes:

  • Queue wait time
  • Model download time (if needed)
  • Expected execution time

Jobs routed to nodes with cached models will have lower estimates.
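
For example, a client might use the estimate to pace its status polling. The status endpoint and auth header here are assumptions; only the response fields shown above come from this page:

```python
import time
import requests

# Hypothetical status endpoint -- the response shape matches the example above.
STATUS_URL = "https://api.cmfy.cloud/v1/jobs/{job_id}"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def wait_for_job(job_id: str) -> dict:
    while True:
        job = requests.get(STATUS_URL.format(job_id=job_id),
                           headers=HEADERS, timeout=30).json()
        if job["status"] not in ("queued", "running"):
            return job
        # Sleep roughly as long as the router expects us to wait,
        # but poll at least every few seconds.
        time.sleep(max(3, min(job.get("estimated_wait_seconds", 5), 30)))
```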
