Performance Best Practices

This guide covers techniques to maximize performance, reduce latency, and make efficient use of your cmfy.cloud resources.

Maximizing Cache Hits

The biggest performance gains come from cache-aware routing. When the models a job needs are already cached on a GPU node, the job starts in seconds instead of waiting for multi-gigabyte model downloads.

Use Consistent Model URLs

Cache hits depend on exact URL matches. Always use the same URL for the same model:

// Good: Define URLs once, reuse everywhere
const MODELS = {
  SDXL_BASE: 'https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors',
  SDXL_VAE: 'https://huggingface.co/stabilityai/sdxl-vae/resolve/main/sdxl_vae.safetensors',
};

// Bad: Different URLs for same model
// 'https://huggingface.co/stabilityai/sdxl/...'
// 'https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/...'

Batch Similar Workflows

Jobs using the same models are more likely to hit cache:

# Good: Process similar jobs together
sdxl_jobs = [job for job in pending_jobs if job.model == 'sdxl']
flux_jobs = [job for job in pending_jobs if job.model == 'flux']

# Submit SDXL jobs first (hit same cached nodes)
for job in sdxl_jobs:
    submit(job)

# Then Flux jobs (different cached nodes)
for job in flux_jobs:
    submit(job)

Popular models are more likely to be cached across the node pool:

Model                 Cache Likelihood   Typical Wait
SDXL Base 1.0         Very High          1-2s
SD 1.5                Very High          1-2s
Popular LoRAs         High               2-5s
Custom/Rare Models    Low                30-60s

Reducing Latency

1. Use Webhooks Instead of Polling

Polling adds round-trip latency and wastes API quota:
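A minimal sketch of the difference, assuming a GET status endpoint at /v1/jobs/{job_id} and a webhook_url field on submission (both illustrative; check the API reference for the exact names):

import time
import requests

HEADERS = {'Authorization': f'Bearer {API_KEY}'}

# Polling: every status check is an extra round trip and counts against your quota
while True:
    job = requests.get(f'https://api.cmfy.cloud/v1/jobs/{job_id}', headers=HEADERS).json()
    if job['status'] in ('completed', 'failed'):
        break
    time.sleep(2)

# Webhooks: submit once, then do nothing until the platform calls your endpoint
requests.post(
    'https://api.cmfy.cloud/v1/jobs',
    headers=HEADERS,
    json={'prompt': workflow, 'webhook_url': 'https://example.com/cmfy-webhook'},
)

See the webhook handler example under High-Throughput Patterns for the receiving side.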

2. Optimize Workflow Complexity

Simpler workflows execute faster:

// Slower: Many nodes, complex routing
{
  nodes: 50,
  estimated_time: "45s"
}

// Faster: Minimal required nodes
{
  nodes: 12,
  estimated_time: "15s"
}

Tips:

  • Remove unused nodes
  • Combine operations where possible
  • Use efficient samplers (DPM++ 2M Karras, Euler a)
  • Reduce inference steps if quality allows

3. Right-Size Your Images

Larger images take longer to generate:

Resolution   Relative Time   Use Case
512x512      1x              Thumbnails, previews
768x768      1.5x            Standard output
1024x1024    2-3x            High quality
2048x2048    6-10x           Print quality

Generate at the minimum resolution you need, then upscale if required.
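As an illustrative sketch (ComfyUI API-format workflow; the node ID and surrounding wiring are assumptions), the output size is set on the empty-latent node, so dropping from 1024x1024 to 768x768 only touches one place:

# Illustrative: set the generation size on the EmptyLatentImage node (node ID '5' is arbitrary)
workflow['5'] = {
    'class_type': 'EmptyLatentImage',
    'inputs': {
        'width': 768,    # generate at the smallest size that works for your use case
        'height': 768,
        'batch_size': 1,
    },
}
# Upscale only the outputs you actually keep, rather than rendering everything large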

4. Optimize Sampling Steps

More steps = better quality but slower:

Steps   Quality               Speed
10-15   Draft                 Fast
20-30   Good                  Medium
40-50   High                  Slow
50+     Diminishing returns   Very slow

For most use cases, 20-30 steps is the sweet spot.
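For example, an illustrative KSampler node in a ComfyUI API-format workflow (the node ID and the omitted connections are assumptions):

# Illustrative: 25 steps with an efficient sampler/scheduler pairing
workflow['3'] = {
    'class_type': 'KSampler',
    'inputs': {
        'seed': 42,
        'steps': 25,                 # 20-30 is usually enough
        'cfg': 7.0,
        'sampler_name': 'dpmpp_2m',  # DPM++ 2M
        'scheduler': 'karras',
        'denoise': 1.0,
        # model, positive, negative, latent_image connections omitted for brevity
    },
}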

Efficient Resource Usage

1. Respect Concurrency Limits

Don't exceed your concurrent job limit:

class JobQueue {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.running = new Set();
    this.pending = [];  // resolvers for callers waiting on a free slot
  }

  async submit(workflow) {
    if (this.running.size >= this.maxConcurrent) {
      // Wait for a slot
      await this.waitForSlot();
    }

    const job = await api.submitJob(workflow);
    this.running.add(job.id);

    // Track completion via webhook
    return job;
  }

  waitForSlot() {
    return new Promise((resolve) => this.pending.push(resolve));
  }

  onJobComplete(jobId) {
    this.running.delete(jobId);
    this.processNext();
  }

  processNext() {
    // Release one caller waiting for a slot, if any
    const resolve = this.pending.shift();
    if (resolve) resolve();
  }
}

2. Use Connection Pooling

Reuse HTTP connections for better performance:

import requests

# Create a session for connection pooling
session = requests.Session()
session.headers.update({
    'Authorization': f'Bearer {API_KEY}',
    'Content-Type': 'application/json'
})

# All requests reuse the same connection
for workflow in workflows:
    response = session.post(
        'https://api.cmfy.cloud/v1/jobs',
        json={'prompt': workflow}
    )
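If several threads share the session, you can also size the underlying connection pool; this is standard requests behaviour, not a cmfy.cloud-specific setting:

from requests.adapters import HTTPAdapter

# Keep more connections open when submitting from multiple threads
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=25)
session.mount('https://', adapter)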

3. Implement Request Queuing

Queue requests locally before sending:

from queue import Queue
from threading import Thread
import time

class RequestQueue:
    def __init__(self, rate_limit=60):
        self.queue = Queue()
        self.rate_limit = rate_limit
        self.interval = 60 / rate_limit  # Seconds between requests

    def enqueue(self, workflow):
        self.queue.put(workflow)

    def process(self):
        while True:
            workflow = self.queue.get()
            try:
                api.submit_job(workflow)
            finally:
                self.queue.task_done()
            time.sleep(self.interval)

# Start background processor
queue = RequestQueue(rate_limit=50)  # Stay under 60 RPM
Thread(target=queue.process, daemon=True).start()

# Enqueue without blocking
for workflow in workflows:
    queue.enqueue(workflow)

High-Throughput Patterns

Parallel Webhook Processing

Handle many concurrent webhook callbacks:

from flask import Flask, request
from concurrent.futures import ThreadPoolExecutor
import queue

app = Flask(__name__)
executor = ThreadPoolExecutor(max_workers=10)
result_queue = queue.Queue()

@app.route('/webhook', methods=['POST'])
def handle_webhook():
    payload = request.json

    # Queue for async processing
    executor.submit(process_result, payload)

    # Return immediately
    return {'received': True}, 200

def process_result(payload):
    if payload['status'] == 'completed':
        # Download and process images
        for url in payload['outputs']['images']:
            image = download_image(url)
            save_to_storage(image)

    result_queue.put(payload)

Batch Submission with Rate Limiting

Submit many jobs while respecting limits:

// Small helper used below to pause between requests
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function submitBatch(workflows, options = {}) {
  const {
    maxConcurrent = 5,
    delayMs = 100, // Pace requests
  } = options;

  const results = [];
  const pending = [...workflows];

  async function worker() {
    while (pending.length > 0) {
      const workflow = pending.shift();
      if (!workflow) break;

      try {
        const result = await submitJob(workflow);
        results.push({ success: true, job: result });
      } catch (error) {
        if (error.status === 429) {
          // Rate limited - put back and wait
          pending.unshift(workflow);
          await sleep(error.retryAfter * 1000);
        } else {
          results.push({ success: false, error });
        }
      }

      await sleep(delayMs);
    }
  }

  // Run workers in parallel
  const workers = Array(maxConcurrent).fill(null).map(() => worker());
  await Promise.all(workers);

  return results;
}

Monitoring and Optimization

Track Key Metrics

Monitor these metrics to identify bottlenecks:

const metrics = {
  // Timing
  queueTime: [],     // Time from submit to start
  executionTime: [], // GPU execution time
  totalTime: [],     // End-to-end time

  // Cache
  cacheHits: 0,
  cacheMisses: 0,

  // Errors
  rateLimits: 0,
  failures: 0
};

async function submitWithMetrics(workflow) {
  const job = await api.submitJob(workflow);

  // Track via webhook, using the timestamps reported by the API
  onJobComplete(job.id, (result) => {
    const submitTime = new Date(result.created_at).getTime();
    const startTime = new Date(result.started_at).getTime();
    const endTime = new Date(result.completed_at).getTime();

    metrics.queueTime.push(startTime - submitTime);
    metrics.executionTime.push(result.execution_time_ms);
    metrics.totalTime.push(endTime - submitTime);
  });

  return job;
}

Analyze Performance Patterns

import statistics

def analyze_metrics(metrics):
    report = {
        'queue_time': {
            'mean': statistics.mean(metrics['queue_time']),
            'p95': sorted(metrics['queue_time'])[int(len(metrics['queue_time']) * 0.95)],
            'max': max(metrics['queue_time'])
        },
        'execution_time': {
            'mean': statistics.mean(metrics['execution_time']),
            'p95': sorted(metrics['execution_time'])[int(len(metrics['execution_time']) * 0.95)]
        },
        'cache_hit_rate': metrics['cache_hits'] / (metrics['cache_hits'] + metrics['cache_misses'])
    }

    # Identify issues
    if report['queue_time']['p95'] > 30000:  # 30s
        print("Warning: High queue times - consider spreading load")

    if report['cache_hit_rate'] < 0.5:
        print("Warning: Low cache hit rate - use consistent model URLs")

    return report

Checklist

Before going to production, verify:

  • Using webhooks instead of polling
  • Model URLs are consistent and stored as constants
  • Concurrency limits are respected
  • Retry logic uses idempotency keys
  • Rate limit responses are handled with backoff
  • Connection pooling is enabled
  • Metrics are being tracked
  • Workflow complexity is minimized

Summary

Optimization              Impact   Effort
Use webhooks              High     Low
Consistent model URLs     High     Low
Connection pooling        Medium   Low
Batch similar workflows   Medium   Medium
Optimize workflow nodes   Medium   Medium
Right-size images         Medium   Low
Local request queuing     Low      Medium
