Performance Best Practices
This guide covers techniques to maximize performance, reduce latency, and make efficient use of your cmfy.cloud resources.
Maximizing Cache Hits
The biggest performance gains come from cache-aware routing. When the models a workflow needs are already loaded on a GPU node, jobs start almost immediately instead of waiting for model downloads.
Use Consistent Model URLs
Cache hits depend on exact URL matches. Always use the same URL for the same model:
// Good: Define URLs once, reuse everywhere
const MODELS = {
  SDXL_BASE: 'https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors',
  SDXL_VAE: 'https://huggingface.co/stabilityai/sdxl-vae/resolve/main/sdxl_vae.safetensors',
};

// Bad: Different URLs for same model
// 'https://huggingface.co/stabilityai/sdxl/...'
// 'https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/...'
Batch Similar Workflows
Jobs using the same models are more likely to hit cache:
# Good: Process similar jobs together
sdxl_jobs = [job for job in pending_jobs if job.model == 'sdxl']
flux_jobs = [job for job in pending_jobs if job.model == 'flux']

# Submit SDXL jobs first (hit same cached nodes)
for job in sdxl_jobs:
    submit(job)

# Then Flux jobs (different cached nodes)
for job in flux_jobs:
    submit(job)
Use Popular Models
Popular models are more likely to be cached across the node pool:
| Model | Cache Likelihood | Typical Wait |
|---|---|---|
| SDXL Base 1.0 | Very High | 1-2s |
| SD 1.5 | Very High | 1-2s |
| Popular LoRAs | High | 2-5s |
| Custom/Rare Models | Low | 30-60s |
Reducing Latency
1. Use Webhooks Instead of Polling
Polling adds a round trip on every status check and wastes API quota; webhooks deliver the result as soon as the job finishes:
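The sketch below contrasts the two approaches. The GET status endpoint and the webhook_url submission field are assumptions made for this example; check the platform's webhook documentation for the exact names.

import time
import requests

API = 'https://api.cmfy.cloud/v1'
HEADERS = {'Authorization': f'Bearer {API_KEY}'}

# Slower: every poll costs a round trip and an API call
# (assumes a GET /v1/jobs/{id} status endpoint)
def wait_by_polling(job_id):
    while True:
        job = requests.get(f'{API}/jobs/{job_id}', headers=HEADERS).json()
        if job['status'] == 'completed':
            return job
        time.sleep(2)  # Latency and quota spent on every iteration

# Faster: submit once and let the platform call you back
# ('webhook_url' is a hypothetical field name for this sketch)
requests.post(
    f'{API}/jobs',
    headers=HEADERS,
    json={'prompt': workflow, 'webhook_url': 'https://example.com/webhook'},
)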
2. Optimize Workflow Complexity
Simpler workflows execute faster:
// Slower: Many nodes, complex routing
{
  nodes: 50,
  estimated_time: "45s"
}

// Faster: Minimal required nodes
{
  nodes: 12,
  estimated_time: "15s"
}
Tips:
- Remove unused nodes
- Combine operations where possible
- Use efficient samplers (DPM++ 2M Karras, Euler a)
- Reduce inference steps if quality allows
3. Right-Size Your Images
Larger images take longer to generate:
| Resolution | Relative Time | Use Case |
|---|---|---|
| 512x512 | 1x | Thumbnails, previews |
| 768x768 | 1.5x | Standard output |
| 1024x1024 | 2-3x | High quality |
| 2048x2048 | 6-10x | Print quality |
Generate at the minimum resolution you need, then upscale if required.
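For example (a sketch that assumes a ComfyUI-style workflow where an EmptyLatentImage node sets the output size; your workflow layout may differ):

# Sketch: drop the latent size before submitting, then upscale only the keepers
def set_resolution(workflow, width, height):
    for node in workflow.values():
        if node.get('class_type') == 'EmptyLatentImage':
            node['inputs']['width'] = width
            node['inputs']['height'] = height
    return workflow

# 768x768 costs roughly half as much as 1024x1024 per the table above
draft_workflow = set_resolution(workflow, 768, 768)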
4. Optimize Sampling Steps
More steps = better quality but slower:
| Steps | Quality | Speed |
|---|---|---|
| 10-15 | Draft | Fast |
| 20-30 | Good | Medium |
| 40-50 | High | Slow |
| 50+ | Diminishing returns | Very slow |
For most use cases, 20-30 steps is the sweet spot.
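For example (a sketch assuming ComfyUI-style KSampler nodes; adjust to your workflow's actual node types):

# Sketch: cap steps and pick an efficient sampler on every KSampler node
def tune_sampling(workflow, steps=25, sampler_name='dpmpp_2m', scheduler='karras'):
    for node in workflow.values():
        if node.get('class_type') == 'KSampler':
            node['inputs']['steps'] = steps                # 20-30 is the sweet spot
            node['inputs']['sampler_name'] = sampler_name  # DPM++ 2M
            node['inputs']['scheduler'] = scheduler        # Karras
    return workflow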
Efficient Resource Usage
1. Respect Concurrency Limits
Don't exceed your concurrent job limit:
class JobQueue {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.running = new Set();
    this.pending = [];  // Resolvers waiting for a free slot
  }

  async submit(workflow) {
    if (this.running.size >= this.maxConcurrent) {
      // Wait for a slot to free up
      await this.waitForSlot();
    }
    const job = await api.submitJob(workflow);
    this.running.add(job.id);
    // Track completion via webhook
    return job;
  }

  waitForSlot() {
    // Resolved by processNext() when a running job completes
    return new Promise((resolve) => this.pending.push(resolve));
  }

  onJobComplete(jobId) {
    this.running.delete(jobId);
    this.processNext();
  }

  processNext() {
    // Wake the next waiting submit() call, if any
    const next = this.pending.shift();
    if (next) next();
  }
}
2. Use Connection Pooling
Reuse HTTP connections for better performance:
import requests

# Create a session for connection pooling
session = requests.Session()
session.headers.update({
    'Authorization': f'Bearer {API_KEY}',
    'Content-Type': 'application/json'
})

# All requests reuse the same connection
for workflow in workflows:
    response = session.post(
        'https://api.cmfy.cloud/v1/jobs',
        json={'prompt': workflow}
    )
3. Implement Request Queuing
Queue requests locally before sending:
from queue import Queue
from threading import Thread
import time

class RequestQueue:
    def __init__(self, rate_limit=60):
        self.queue = Queue()
        self.rate_limit = rate_limit
        self.interval = 60 / rate_limit  # Seconds between requests

    def enqueue(self, workflow):
        self.queue.put(workflow)

    def process(self):
        while True:
            workflow = self.queue.get()
            try:
                api.submit_job(workflow)
            finally:
                self.queue.task_done()
            time.sleep(self.interval)

# Start background processor
queue = RequestQueue(rate_limit=50)  # Stay under 60 RPM
Thread(target=queue.process, daemon=True).start()

# Enqueue without blocking
for workflow in workflows:
    queue.enqueue(workflow)
High-Throughput Patterns
Parallel Webhook Processing
Handle many concurrent webhook callbacks:
from flask import Flask, request
from concurrent.futures import ThreadPoolExecutor
import queue

app = Flask(__name__)
executor = ThreadPoolExecutor(max_workers=10)
result_queue = queue.Queue()

@app.route('/webhook', methods=['POST'])
def handle_webhook():
    payload = request.json
    # Hand off to a worker thread for async processing
    executor.submit(process_result, payload)
    # Return immediately so the sender isn't kept waiting
    return {'received': True}, 200

def process_result(payload):
    if payload['status'] == 'completed':
        # Download and process images
        for url in payload['outputs']['images']:
            image = download_image(url)
            save_to_storage(image)
    result_queue.put(payload)
Batch Submission with Rate Limiting
Submit many jobs while respecting limits:
async function submitBatch(workflows, options = {}) {
  const {
    maxConcurrent = 5,
    delayMs = 100,  // Pace requests
  } = options;

  const results = [];
  const pending = [...workflows];

  async function worker() {
    while (pending.length > 0) {
      const workflow = pending.shift();
      if (!workflow) break;

      try {
        const result = await submitJob(workflow);
        results.push({ success: true, job: result });
      } catch (error) {
        if (error.status === 429) {
          // Rate limited - put the job back and wait
          pending.unshift(workflow);
          await sleep(error.retryAfter * 1000);
        } else {
          results.push({ success: false, error });
        }
      }

      await sleep(delayMs);
    }
  }

  // Run workers in parallel
  const workers = Array(maxConcurrent).fill(null).map(() => worker());
  await Promise.all(workers);

  return results;
}
Monitoring and Optimization
Track Key Metrics
Monitor these metrics to identify bottlenecks:
const metrics = {
  // Timing
  queueTime: [],      // Time from submit to start
  executionTime: [],  // GPU execution time
  totalTime: [],      // End-to-end time
  // Cache
  cacheHits: 0,
  cacheMisses: 0,
  // Errors
  rateLimits: 0,
  failures: 0
};

async function submitWithMetrics(workflow) {
  const start = Date.now();
  const job = await api.submitJob(workflow);

  // Track via webhook
  onJobComplete(job.id, (result) => {
    const submitTime = new Date(result.created_at).getTime();
    const startTime = new Date(result.started_at).getTime();
    const endTime = new Date(result.completed_at).getTime();

    metrics.queueTime.push(startTime - submitTime);
    metrics.executionTime.push(result.execution_time_ms);
    metrics.totalTime.push(endTime - submitTime);
  });

  return job;
}
Analyze Performance Patterns
import statistics

def analyze_metrics(metrics):
    report = {
        'queue_time': {
            'mean': statistics.mean(metrics['queue_time']),
            'p95': sorted(metrics['queue_time'])[int(len(metrics['queue_time']) * 0.95)],
            'max': max(metrics['queue_time'])
        },
        'execution_time': {
            'mean': statistics.mean(metrics['execution_time']),
            'p95': sorted(metrics['execution_time'])[int(len(metrics['execution_time']) * 0.95)]
        },
        'cache_hit_rate': metrics['cache_hits'] / (metrics['cache_hits'] + metrics['cache_misses'])
    }

    # Identify issues
    if report['queue_time']['p95'] > 30000:  # 30s
        print("Warning: High queue times - consider spreading load")
    if report['cache_hit_rate'] < 0.5:
        print("Warning: Low cache hit rate - use consistent model URLs")

    return report
Checklist
Before going to production, verify:
- Using webhooks instead of polling
- Model URLs are consistent and stored as constants
- Concurrency limits are respected
- Retry logic uses idempotency keys
- Rate limit responses are handled with backoff
- Connection pooling is enabled
- Metrics are being tracked
- Workflow complexity is minimized
Summary
| Optimization | Impact | Effort |
|---|---|---|
| Use webhooks | High | Low |
| Consistent model URLs | High | Low |
| Connection pooling | Medium | Low |
| Batch similar workflows | Medium | Medium |
| Optimize workflow nodes | Medium | Medium |
| Right-size images | Medium | Low |
| Local request queuing | Low | Medium |
What's Next?
- Cache-Aware Routing - Deep dive into caching
- Rate Limiting - Understand limits
- Fair Queuing - How jobs are scheduled