Cache-Aware Routing
The biggest factor in workflow execution time is model loading. cmfy.cloud uses cache-aware routing to minimize this by sending jobs to nodes that already have your models loaded.
Why Caching Matters
AI models are large files; a single checkpoint can run from 2 GB to more than 12 GB. Loading them takes time:
| Model Type | Typical Size | Download Time* | Load Time* |
|---|---|---|---|
| SDXL Checkpoint | 6.5 GB | 30-60s | 5-10s |
| SD 1.5 Checkpoint | 4 GB | 20-40s | 3-5s |
| LoRA | 50-200 MB | 1-5s | <1s |
| VAE | 300-800 MB | 5-10s | 1-2s |
| ControlNet | 1-2 GB | 10-20s | 2-4s |
*Times vary based on network and hardware conditions
Without caching, every job would need to download and load models fresh. With caching, a job can start executing in seconds instead of minutes.
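As a rough illustration using midpoints from the table above (so the numbers are indicative only), compare a cold start with a warm start for a workflow that uses an SDXL checkpoint, a LoRA, and a VAE:

```python
# Illustrative only: midpoint timings (seconds) taken from the table above.
TIMINGS = {
    "sdxl_checkpoint": {"download": 45, "load": 7},
    "lora":            {"download": 3,  "load": 1},
    "vae":             {"download": 7,  "load": 2},
}

# Cold start: every model must be downloaded and then loaded.
cold = sum(t["download"] + t["load"] for t in TIMINGS.values())

# Warm start: the models are already on the node; only loading remains.
warm = sum(t["load"] for t in TIMINGS.values())

print(f"cold ≈ {cold}s, warm ≈ {warm}s")  # cold ≈ 65s, warm ≈ 10s
```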
How It Works
When you submit a job, the router makes its decision in four steps:
1. Extract Model URLs - the router identifies all model URLs in your workflow
2. Query Cache Index - it checks which GPU nodes have these models cached
3. Score Candidates - it ranks nodes by cache coverage and availability
4. Route Decision - it sends the job to the best match or to the general queue
For example, if a workflow requires three models and Node A has all three cached while the other candidates hold only one or two, Node A is the best choice.
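A minimal sketch of this decision logic, assuming a ComfyUI-style workflow dict and a hypothetical `cache_index` mapping node IDs to their cached model URLs (the real router also weighs current load):

```python
def extract_model_urls(workflow: dict) -> set[str]:
    """Step 1: collect every model URL referenced by the workflow."""
    return {
        value
        for node in workflow.values()
        for value in node.get("inputs", {}).values()
        if isinstance(value, str) and value.startswith("https://")
    }

def pick_node(workflow: dict, cache_index: dict[str, set[str]]) -> str | None:
    required = extract_model_urls(workflow)
    if not required or not cache_index:
        return None  # nothing to match on; fall back to the general queue
    # Steps 2-3: query the index and score each node by how many
    # required models it already has cached.
    coverage = {node: len(required & cached) for node, cached in cache_index.items()}
    # Step 4: route to the node with the best coverage.
    return max(coverage, key=coverage.get)
```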
Cache Coverage Score
The router calculates a coverage score for each node:
Score = (Cached Models / Required Models) × 100
| Scenario | Score | Routing Decision |
|---|---|---|
| All models cached | 100% | Route directly to node |
| Most models cached | 60-99% | Route directly (still faster) |
| Some models cached | 1-59% | May route to general queue |
| No models cached | 0% | Route to general queue |
When coverage is high (>60%), routing directly is almost always faster, even if the node is slightly busier.
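In code, the score and the table's thresholds look roughly like this sketch (production tie-breaking and exact thresholds may differ):

```python
def coverage_score(required: set[str], cached: set[str]) -> float:
    """Score = (cached models / required models) × 100."""
    if not required:
        return 100.0
    return len(required & cached) / len(required) * 100

def routing_decision(score: float) -> str:
    # Thresholds mirror the table above.
    if score >= 60:
        return "route directly to node"
    if score > 0:
        return "may route to general queue"
    return "route to general queue"

# Worked example: 2 of 3 required models cached -> ~67%, route directly.
required = {"ckpt.safetensors", "lora.safetensors", "vae.safetensors"}
cached = {"ckpt.safetensors", "lora.safetensors"}
score = coverage_score(required, cached)
print(f"{score:.0f}% -> {routing_decision(score)}")  # 67% -> route directly to node
```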
Tips for Maximizing Cache Hits
1. Use Consistent Model URLs
The cache index tracks models by exact URL. These are treated as different models:
❌ Different URLs = no cache hit:

```
https://huggingface.co/stabilityai/sdxl/model.safetensors
https://huggingface.co/stabilityai/sdxl/resolve/main/model.safetensors
```
Pick one URL format and use it consistently across all your workflows.
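One low-effort way to enforce this is to define each URL once as a constant in your client code. A sketch assuming a ComfyUI-style workflow dict; whether the URL goes directly in `ckpt_name` is an assumption, so check the workflow reference for the exact field:

```python
# Define each model URL once and reuse the constant, so every workflow you
# submit contains the byte-identical string and maps to one cache entry.
SDXL_BASE_URL = "https://huggingface.co/stabilityai/sdxl/resolve/main/model.safetensors"

def make_workflow(prompt: str) -> dict:
    # ComfyUI-style nodes; passing a URL in `ckpt_name` is an assumption here.
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": SDXL_BASE_URL}},
        "2": {"class_type": "CLIPTextEncode",
              "inputs": {"text": prompt, "clip": ["1", 1]}},
    }
```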
2. Use Popular Models
Models used by many customers are more likely to be cached across the fleet:
- Stable Diffusion XL Base
- Stable Diffusion 1.5
- Popular LoRAs from Civitai
- Standard VAEs and ControlNets
3. Batch Similar Workflows
If you're submitting multiple jobs with the same models, submit them close together. The first job "warms" the cache for subsequent jobs.
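As a sketch, a client can submit the whole batch in one loop. The endpoint, payload shape, and auth header below are assumptions for illustration; only the `job_id` field comes from the response format shown later on this page:

```python
import requests

API_URL = "https://api.cmfy.cloud/v1/jobs"  # hypothetical endpoint; check the API docs
API_KEY = "YOUR_API_KEY"

def submit_job(workflow: dict) -> str:
    """Submit one workflow and return its job_id."""
    resp = requests.post(
        API_URL,
        json={"workflow": workflow},  # assumed payload shape
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]

def submit_batch(workflows: list[dict]) -> list[str]:
    """Submit related jobs back to back: the first job warms the cache,
    so later jobs are more likely to land on nodes that already hold
    the models."""
    return [submit_job(wf) for wf in workflows]
```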
4. Minimize Model Variety
Workflows using fewer unique models have better cache hit rates:
- ✓ Good: 1 checkpoint + 1 LoRA → high chance all are cached
- ✗ Challenging: 1 checkpoint + 5 LoRAs + 3 ControlNets → lower chance all are cached
5. Use Standard Model Types
The router recognizes common node types for model loading:
- CheckpointLoaderSimple
- LoraLoader
- VAELoader
- ControlNetLoader
- CLIPLoader
- UNETLoader
Using standard node types ensures models are correctly tracked in the cache index.
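For instance, a workflow built only from these loader types keeps every model visible to the cache index. The wiring follows ComfyUI's API format; the URLs are placeholders and the URL-in-inputs convention is an assumption, as above:

```python
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "https://example.com/models/sdxl_base.safetensors"}},
    "2": {"class_type": "LoraLoader",
          "inputs": {"lora_name": "https://example.com/models/detail.safetensors",
                     "strength_model": 0.8, "strength_clip": 0.8,
                     "model": ["1", 0], "clip": ["1", 1]}},
    "3": {"class_type": "VAELoader",
          "inputs": {"vae_name": "https://example.com/models/sdxl_vae.safetensors"}},
}
```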
Cache Warming
cmfy.cloud proactively warms caches to improve hit rates:
- Predictive warming - Popular models are pre-loaded on multiple nodes
- User pattern warming - If you frequently use certain models, they're kept cached
- Job queue warming - When your job is queued, missing models start downloading
This means even "cache misses" are often faster because warming started before your job reached the front of the queue.
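Conceptually, predictive warming can be as simple as ranking models by recent popularity and pre-loading the top entries onto idle nodes. A sketch of the idea, not the actual cmfy.cloud implementation:

```python
from collections import Counter

def plan_predictive_warming(recent_jobs: list[set[str]], top_n: int = 10) -> list[str]:
    """Count how often each model URL appears across recent jobs and
    return the most popular ones as pre-load candidates."""
    counts = Counter(url for job in recent_jobs for url in job)
    return [url for url, _ in counts.most_common(top_n)]

# Example: the checkpoint used in two of three recent jobs ranks first.
jobs = [{"https://example.com/ckpt", "https://example.com/lora"},
        {"https://example.com/ckpt"},
        {"https://example.com/vae"}]
print(plan_predictive_warming(jobs, top_n=2))
```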
Understanding Wait Times
The estimated wait time in your job response accounts for caching:
```json
{
  "job_id": "...",
  "status": "queued",
  "queue_position": 3,
  "estimated_wait_seconds": 45
}
```
This estimate includes:
- Queue wait time
- Model download time (if needed)
- Expected execution time
Jobs routed to nodes with cached models will have lower estimates.
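A client can use this estimate to pace its own polling. The endpoint below is a hypothetical placeholder, as above; only the response fields come from the example shown here:

```python
import time
import requests

API_URL = "https://api.cmfy.cloud/v1/jobs"  # hypothetical endpoint

def wait_for_job(job_id: str, api_key: str) -> dict:
    """Poll a job until it leaves the queue, pacing requests off the
    server's own estimate rather than a fixed interval."""
    while True:
        resp = requests.get(
            f"{API_URL}/{job_id}",
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        resp.raise_for_status()
        job = resp.json()
        if job["status"] != "queued":
            return job
        # Sleep a fraction of the estimate, bounded so polling stays responsive.
        time.sleep(min(max(job["estimated_wait_seconds"] / 4, 1), 15))
```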
What's Next?
- Fair Queuing - How jobs are scheduled across users
- Rate Limiting - Understanding your tier's limits