Performance Best Practices
This guide covers techniques to maximize performance, reduce latency, and make efficient use of your cmfy.cloud resources.
Maximizing Cache Hits
The biggest performance gains come from cache-aware routing. When the models a workflow needs are already loaded on a GPU node, jobs start almost immediately instead of waiting for model downloads.
Use Consistent Model URLs
Cache hits depend on exact URL matches. Always use the same URL for the same model:
// Good: Define URLs once, reuse everywhere
const MODELS = {
  SDXL_BASE: 'https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors',
  SDXL_VAE: 'https://huggingface.co/stabilityai/sdxl-vae/resolve/main/sdxl_vae.safetensors',
};

// Bad: Different URLs for same model
// 'https://huggingface.co/stabilityai/sdxl/...'
// 'https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/...'
Batch Similar Workflows
Jobs using the same models are more likely to hit cache:
# Good: Process similar jobs together
sdxl_jobs = [job for job in pending_jobs if job.model == 'sdxl']
flux_jobs = [job for job in pending_jobs if job.model == 'flux']

# Submit SDXL jobs first (hit same cached nodes)
for job in sdxl_jobs:
    submit(job)

# Then Flux jobs (different cached nodes)
for job in flux_jobs:
    submit(job)
Use Popular Models
Popular models are more likely to be cached across the node pool:
| Model | Cache Likelihood | Typical Wait |
|---|---|---|
| SDXL Base 1.0 | Very High | 1-2s |
| SD 1.5 | Very High | 1-2s |
| Popular LoRAs | High | 2-5s |
| Custom/Rare Models | Low | 30-60s |
Reducing Latency
1. Use Webhooks Instead of Polling
Polling adds a round trip on every status check and wastes API quota; webhooks deliver the result as soon as the job finishes:
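The sketch below contrasts the two approaches. The GET status endpoint and the webhook_url submission field are assumptions made for this example; check the platform's webhook documentation for the exact names.

import time
import requests

API = 'https://api.cmfy.cloud/v1'
HEADERS = {'Authorization': f'Bearer {API_KEY}'}

# Slower: every poll costs a round trip and an API call
# (assumes a GET /v1/jobs/{id} status endpoint)
def wait_by_polling(job_id):
    while True:
        job = requests.get(f'{API}/jobs/{job_id}', headers=HEADERS).json()
        if job['status'] == 'completed':
            return job
        time.sleep(2)  # Latency and quota spent on every iteration

# Faster: submit once and let the platform call you back
# ('webhook_url' is a hypothetical field name for this sketch)
requests.post(
    f'{API}/jobs',
    headers=HEADERS,
    json={'prompt': workflow, 'webhook_url': 'https://example.com/webhook'},
)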
2. Optimize Workflow Complexity
Simpler workflows execute faster:
// Slower: Many nodes, complex routing
{
  nodes: 50,
  estimated_time: "45s"
}

// Faster: Minimal required nodes
{
  nodes: 12,
  estimated_time: "15s"
}
Tips:
- Remove unused nodes
- Combine operations where possible
- Use efficient samplers (DPM++ 2M Karras, Euler a)
- Reduce inference steps if quality allows
3. Right-Size Your Images
Larger images take longer to generate:
| Resolution | Relative Time | Use Case |
|---|---|---|
| 512x512 | 1x | Thumbnails, previews |
| 768x768 | 1.5x | Standard output |
| 1024x1024 | 2-3x | High quality |
| 2048x2048 | 6-10x | Print quality |
Generate at the minimum resolution you need, then upscale if required.
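For example (a sketch that assumes a ComfyUI-style workflow where an EmptyLatentImage node sets the output size; your workflow layout may differ):

# Sketch: drop the latent size before submitting, then upscale only the keepers
def set_resolution(workflow, width, height):
    for node in workflow.values():
        if node.get('class_type') == 'EmptyLatentImage':
            node['inputs']['width'] = width
            node['inputs']['height'] = height
    return workflow

# 768x768 costs roughly half as much as 1024x1024 per the table above
draft_workflow = set_resolution(workflow, 768, 768)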
4. Optimize Sampling Steps
More steps = better quality but slower:
| Steps | Quality | Speed |
|---|---|---|
| 10-15 | Draft | Fast |
| 20-30 | Good | Medium |
| 40-50 | High | Slow |
| 50+ | Diminishing returns | Very slow |
For most use cases, 20-30 steps is the sweet spot.
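For example (a sketch assuming ComfyUI-style KSampler nodes; adjust to your workflow's actual node types):

# Sketch: cap steps and pick an efficient sampler on every KSampler node
def tune_sampling(workflow, steps=25, sampler_name='dpmpp_2m', scheduler='karras'):
    for node in workflow.values():
        if node.get('class_type') == 'KSampler':
            node['inputs']['steps'] = steps                # 20-30 is the sweet spot
            node['inputs']['sampler_name'] = sampler_name  # DPM++ 2M
            node['inputs']['scheduler'] = scheduler        # Karras
    return workflow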
Efficient Resource Usage
1. Respect Concurrency Limits
Don't exceed your concurrent job limit:
class JobQueue {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.running = new Set();
    this.pending = [];  // Resolvers waiting for a free slot
  }

  async submit(workflow) {
    if (this.running.size >= this.maxConcurrent) {
      // Wait for a slot to free up
      await this.waitForSlot();
    }
    const job = await api.submitJob(workflow);
    this.running.add(job.id);
    // Track completion via webhook
    return job;
  }

  waitForSlot() {
    // Resolved by processNext() when a running job completes
    return new Promise((resolve) => this.pending.push(resolve));
  }

  onJobComplete(jobId) {
    this.running.delete(jobId);
    this.processNext();
  }

  processNext() {
    // Wake the next waiting submit() call, if any
    const next = this.pending.shift();
    if (next) next();
  }
}
2. Use Connection Pooling
Reuse HTTP connections for better performance:
import requests

# Create a session for connection pooling
session = requests.Session()
session.headers.update({
    'Authorization': f'Bearer {API_KEY}',
    'Content-Type': 'application/json'
})

# All requests reuse the same connection
for workflow in workflows:
    response = session.post(
        'https://api.cmfy.cloud/v1/jobs',
        json={'prompt': workflow}
    )
3. Implement Request Queuing
Queue requests locally before sending:
from queue import Queue
from threading import Thread
import time

class RequestQueue:
    def __init__(self, rate_limit=60):
        self.queue = Queue()
        self.rate_limit = rate_limit
        self.interval = 60 / rate_limit  # Seconds between requests

    def enqueue(self, workflow):
        self.queue.put(workflow)

    def process(self):
        while True:
            workflow = self.queue.get()
            try:
                api.submit_job(workflow)
            finally:
                self.queue.task_done()
            time.sleep(self.interval)

# Start background processor
queue = RequestQueue(rate_limit=50)  # Stay under 60 RPM
Thread(target=queue.process, daemon=True).start()

# Enqueue without blocking
for workflow in workflows:
    queue.enqueue(workflow)
High-Throughput Patterns
Parallel Webhook Processing
Handle many concurrent webhook callbacks:
from flask import Flask, request
from concurrent.futures import ThreadPoolExecutor
import queue

app = Flask(__name__)
executor = ThreadPoolExecutor(max_workers=10)
result_queue = queue.Queue()

@app.route('/webhook', methods=['POST'])
def handle_webhook():
    payload = request.json
    # Hand off to a worker thread for async processing
    executor.submit(process_result, payload)
    # Return immediately so the sender isn't kept waiting
    return {'received': True}, 200

def process_result(payload):
    if payload['status'] == 'completed':
        # Download and process images
        for url in payload['outputs']['images']:
            image = download_image(url)
            save_to_storage(image)
    result_queue.put(payload)
Batch Submission with Rate Limiting
Submit many jobs while respecting limits:
async function submitBatch(workflows, options = {}) {
  const {
    maxConcurrent = 5,
    delayMs = 100,  // Pace requests
  } = options;

  const results = [];
  const pending = [...workflows];

  async function worker() {
    while (pending.length > 0) {
      const workflow = pending.shift();
      if (!workflow) break;

      try {
        const result = await submitJob(workflow);
        results.push({ success: true, job: result });
      } catch (error) {
        if (error.status === 429) {
          // Rate limited - put the job back and wait
          pending.unshift(workflow);
          await sleep(error.retryAfter * 1000);
        } else {
          results.push({ success: false, error });
        }
      }

      await sleep(delayMs);
    }
  }

  // Run workers in parallel
  const workers = Array(maxConcurrent).fill(null).map(() => worker());
  await Promise.all(workers);

  return results;
}
Monitoring and Optimization
Track Key Metrics
Monitor these metrics to identify bottlenecks:
const metrics = {
  // Timing
  queueTime: [],      // Time from submit to start
  executionTime: [],  // GPU execution time
  totalTime: [],      // End-to-end time
  // Cache
  cacheHits: 0,
  cacheMisses: 0,
  // Errors
  rateLimits: 0,
  failures: 0
};

async function submitWithMetrics(workflow) {
  const start = Date.now();
  const job = await api.submitJob(workflow);

  // Track via webhook
  onJobComplete(job.id, (result) => {
    const submitTime = new Date(result.created_at).getTime();
    const startTime = new Date(result.started_at).getTime();
    const endTime = new Date(result.completed_at).getTime();

    metrics.queueTime.push(startTime - submitTime);
    metrics.executionTime.push(result.execution_time_ms);
    metrics.totalTime.push(endTime - submitTime);
  });

  return job;
}
Analyze Performance Patterns
import statistics

def analyze_metrics(metrics):
    report = {
        'queue_time': {
            'mean': statistics.mean(metrics['queue_time']),
            'p95': sorted(metrics['queue_time'])[int(len(metrics['queue_time']) * 0.95)],
            'max': max(metrics['queue_time'])
        },
        'execution_time': {
            'mean': statistics.mean(metrics['execution_time']),
            'p95': sorted(metrics['execution_time'])[int(len(metrics['execution_time']) * 0.95)]
        },
        'cache_hit_rate': metrics['cache_hits'] / (metrics['cache_hits'] + metrics['cache_misses'])
    }

    # Identify issues
    if report['queue_time']['p95'] > 30000:  # 30s
        print("Warning: High queue times - consider spreading load")
    if report['cache_hit_rate'] < 0.5:
        print("Warning: Low cache hit rate - use consistent model URLs")

    return report
Checklist
Before going to production, verify:
- Using webhooks instead of polling
- Model URLs are consistent and stored as constants
- Concurrency limits are respected
- Retry logic uses idempotency keys
- Rate limit responses are handled with backoff
- Connection pooling is enabled
- Metrics are being tracked
- Workflow complexity is minimized
Summary
| Optimization | Impact | Effort |
|---|---|---|
| Use webhooks | High | Low |
| Consistent model URLs | High | Low |
| Connection pooling | Medium | Low |
| Batch similar workflows | Medium | Medium |
| Optimize workflow nodes | Medium | Medium |
| Right-size images | Medium | Low |
| Local request queuing | Low | Medium |
What's Next?
- Cache-Aware Routing - Deep dive into caching
- Rate Limiting - Understand limits
- Fair Queuing - How jobs are scheduled