Building Multi-Agent Creative Systems
Tutorials2026-02-1343 min read

Building Multi-Agent Creative Systems

Multi Agent SystemsCreative AIAutomationPrompt EngineeringContent Pipeline
Building Multi-Agent Creative Systems

Building Multi-Agent Creative Systems: The Technical Blueprint for Prompt-to-Campaign Automation

Part 6 of 6 | By Tamil Selvan Gunasekaran, AI Agent Developer Intern at Autohive


How to Read This Post

This is a builder's guide. It covers the complete technical stack for constructing a multi-agent creative system — one that takes a single text prompt and produces marketing-ready images and videos for every major social media platform.

  • Code examples are in Python, JavaScript, and CLI — copy-paste ready
  • Comparison tables cover pricing, capabilities, and trade-offs for every tool
  • Architecture diagrams are rendered as code blocks
  • Bold terms are defined on first use
  • Each section stands on its own — skip to whatever you need

This is not a product pitch. It is an open blueprint. Every tool referenced here is publicly available, every API is documented, and every pattern can be implemented by a small team — or a single developer with enough coffee.

I built a system like this during my internship at Autohive, developing creative AI agents for real clients. This post extracts the general principles and makes them framework-agnostic, so anyone can follow along regardless of what platform they are building on.

1. The Problem Worth Solving

Here is how a marketing campaign gets produced today at most companies:

Current Creative Workflow:

  Creative Brief (from client or marketing lead)
      │
      ▼
  Creative Director (interprets brief, sets direction)     → 2-4 hours
      │
      ▼
  Graphic Designer (creates images in Photoshop/Figma)     → 4-8 hours
      │
      ▼
  Copywriter (writes captions, hashtags per platform)      → 2-4 hours
      │
      ▼
  Video Editor (cuts clips, adds overlays and music)       → 4-8 hours
      │
      ▼
  Social Media Manager (formats for each platform)         → 2-4 hours
      │
      ▼
  Review Rounds (2-3 iterations across all roles)          → 1-3 days
      │
      ▼
  Final Assets (images + videos + copy for 6 platforms)

  Total: 5 people, 14-28 hours, 2-5 days, ~$2,000-$5,000

Five people. Multiple rounds of revision. Days of turnaround. For a single campaign.

Now here is what the same workflow looks like with a multi-agent creative system:

AI Creative Pipeline:

  Single Text Prompt ("Launch campaign for summer sale, 30% off, beachwear")
      │
      ▼
  Creative Director Agent (interprets, sets direction)     → 5-10 seconds
      │
      ├─── Visual Agent (generates images in all formats)  → 30-60 seconds
      ├─── Copy Agent (writes platform-native captions)    → 10-20 seconds
      └─── Video Agent (produces short-form clips)         → 60-120 seconds
      │
      ▼
  Platform Formatter (deterministic resizing + packaging)  → 5-10 seconds
      │
      ▼
  VLM Quality Check (reviews outputs against brief)        → 10-15 seconds
      │
      ▼
  Final Assets (images + videos + copy for 6 platforms)

  Total: 0 people, ~3-5 minutes, ~$1.74

That is a 1,000x cost reduction and a 500x speed improvement. Not hypothetical — achievable today with the tools I will walk through in this post.

The key insight is not that AI replaces creative work. It is that most of what passes for "creative production" is actually mechanical reformatting — resizing images, adjusting aspect ratios, rewriting the same message in six different tones, exporting videos in platform-specific codecs. A well-designed system uses AI for the genuinely creative decisions and programmatic tools for the deterministic engineering.

Use deterministic logic for what you can, reserve LLM reasoning for what you must.

That principle — which I established early in my work with production AI systems — is the foundation of everything in this post.


2. The Architecture: A System of Agents, Not a Single Agent

The most common mistake in building creative AI systems is trying to make one agent do everything. A single agent that generates images, writes copy, produces video, and formats for platforms will be mediocre at all of them.

The correct architecture is a multi-agent system where specialized agents collaborate like a creative team. Each agent has a focused job. Each agent uses the best tool for that job. And an orchestrator coordinates them.

User Prompt: "Launch campaign for summer sale, 30% off, beachwear collection"
    │
    ▼
┌─────────────────────────────────────────┐
│  Creative Director Agent (Orchestrator) │
│  Model: GPT-5.2 / Claude / Gemini 3    │
│                                         │
│  • Interprets the brief                 │
│  • Defines visual direction + mood      │
│  • Sets tone and style guidelines       │
│  • Delegates to specialist agents       │
│  • Reviews final outputs via VLM        │
└──────────────┬──────────────────────────┘
               │
    ┌──────────┼──────────────┐
    ▼          ▼              ▼
┌────────┐ ┌────────────┐ ┌──────────────┐
│ Visual │ │ Copy       │ │ Video        │
│ Agent  │ │ Agent      │ │ Agent        │
│        │ │            │ │              │
│ Image  │ │ LLM for    │ │ Veo 3.1 /   │
│ Gen API│ │ platform-  │ │ Sora 2 /    │
│ + Sharp│ │ native     │ │ FFmpeg      │
│ / PIL  │ │ captions   │ │ / Remotion  │
└───┬────┘ └─────┬──────┘ └──────┬───────┘
    │            │               │
    ▼            ▼               ▼
 Images in    Platform-       Short-form
 all sizes    specific        video clips
 & formats    captions &      with brand
              hashtags        overlays
    │            │               │
    └────────────┼───────────────┘
                 ▼
    ┌─────────────────────────────┐
    │  Platform Formatter Agent   │
    │  (Deterministic — no LLM)   │
    │                             │
    │  Sharp / PIL / FFmpeg       │
    │                             │
    │  Instagram: 1080x1080,      │
    │    1080x1350, 1080x1920     │
    │  TikTok: 1080x1920         │
    │  LinkedIn: 1200x627        │
    │  Twitter/X: 1600x900       │
    │  Facebook: 1200x630        │
    │  YouTube: 1280x720 thumb   │
    └──────────────┬──────────────┘
                   ▼
    ┌─────────────────────────────┐
    │  VLM Quality Gate           │
    │  Gemini 3 Pro / GPT-4o     │
    │                             │
    │  • Brand consistency check  │
    │  • Text legibility check    │
    │  • Composition review       │
    │  • Accept / Regenerate      │
    └──────────────┬──────────────┘
                   ▼
            Content Folder
         (All assets organized
          by platform)

Why Separate Agents?

ConcernSingle AgentMulti-Agent
Model selectionOne model for everythingBest model per task
Prompt complexityMassive, brittle promptFocused, maintainable prompts
ParallelismSequential executionVisual + Copy + Video in parallel
Error isolationOne failure kills everythingOne agent retries independently
Cost optimizationExpensive model for cheap tasksCheap models for simple tasks
IterationRegenerate everythingRegenerate only the failing part

The Creative Director Agent is the orchestrator. It takes the raw brief, expands it into a structured creative specification, and delegates to the specialist agents. It does not generate any content itself — it directs the production. This separation is critical. Orchestration and creation require different reasoning patterns, and mixing them in a single prompt produces mediocre results at both.


3. The Tech Stack: Every Layer, Every Tool

This is the core of the post. Building a creative pipeline requires tools at six layers: LLMs for orchestration and copy, image generation APIs, programmatic image editing, AI video generation, programmatic video editing, and VLMs for quality control.

3.1 LLMs for Orchestration and Copy

The Creative Director Agent and the Copy Agent are powered by LLMs. You need a model that is strong at instruction following, structured output, and creative writing.

ModelBest ForContext WindowPricing (per 1M tokens)Notes
GPT-5.2Orchestration, structured output1M~$2 input / $8 outputState-of-the-art reasoning and structured output
GPT-4oCopy writing, multimodal understanding128K$2.50 input / $10 outputStrong creative writing, cheaper option
Claude Sonnet 4.5Creative writing, nuance200K~$3 input / $15 outputBest creative writing and tone adaptation
Gemini 3 ProMultimodal, long context1M+~$1.25 input / $10 outputNative vision understanding, multimodal
Llama 4 MaverickSelf-hosted, privacy-sensitive128KFree (compute costs only)Open weights, latest from Meta
Qwen 3Self-hosted, multilingual128KFree (compute costs only)Strong multilingual support

My recommendation: Use GPT-5.2 or Gemini 3 Pro for the orchestrator (you need reliable structured output and long context). Use Claude Sonnet 4.5 for the copy agent (it produces the most natural, platform-adapted writing). If cost is a concern, use Llama 4 Maverick self-hosted for orchestration and a smaller model for copy.

The Copy Agent prompt needs platform-specific intelligence baked in:

COPY_AGENT_SYSTEM_PROMPT = """
You are a social media copywriter. Given a creative brief and visual 
description, write platform-native content for each specified platform.

Platform guidelines:
- Instagram: Visual-first language. 20-30 hashtags (mix high-volume and 
  niche). Emoji usage matching brand voice. CTA in last line. Max 2,200 chars.
- LinkedIn: Professional tone. Industry framing. Longer-form storytelling. 
  Line breaks for "see more" optimization. 3-5 hashtags. Max 3,000 chars.
- TikTok: Casual, trend-aware. Hook in first line. Suggest trending sounds. 
  3-5 hashtags optimized for For You Page. Max 2,200 chars.
- Twitter/X: Punchy, under 280 characters. Thread structure for longer 
  content. 1-2 hashtags maximum. No hashtag spam.
- Facebook: Conversational. Question-based engagement. 1-3 hashtags. 
  Max 63,206 chars but optimal under 500.
- YouTube: SEO-optimized description. Timestamps if applicable. 
  3-5 hashtags. Max 5,000 chars.

Output format: JSON with keys for each platform containing "caption", 
"hashtags" (array), and "cta" (string).
"""

3.2 Image Generation APIs

This is where the creative pipeline starts producing visual assets. The image generation landscape as of early 2026 offers models ranging from $0.008 to $0.24 per image, each with different strengths.

ModelDeveloperBest ForResolutionPrice/ImageText RenderingSpeed
GPT Image 1.5OpenAIGeneral quality, prompt adherenceUp to 1536x1024$0.009-$0.20 (tiered: low/medium/high)Good~10s
Nano Banana ProGoogle (Gemini 3 Pro Image)Overall quality, prompt adherenceUp to 4K$0.02-$0.24 (2K/4K)Good~8s
FLUX 2 ProBlack Forest LabsProfessional quality, fidelityUp to 2048x2048$0.05Moderate~8s
FLUX 2 FlexBlack Forest LabsAdjustable inference, recommended variantUp to 2048x2048~$0.03Moderate~5s
FLUX 2 Klein 9BBlack Forest LabsBudget, LoRA customizationUp to 2048x2048$0.01Moderate~3s
Ideogram 3.0IdeogramText in images, logosUp to 1024x1024~$0.03Best~8s
Seedream 5.0ByteDanceWeb search, reasoningUp to 4K$0.04Good~10s
Qwen ImageAlibabaBilingual text (EN/CN)Up to 1536x1536$0.02Excellent (CJK)~6s

GPT Image 1.5 is the current quality leader for general-purpose generation. It uses a tiered pricing model — low quality at $0.009/image is good enough for iteration, medium at $0.034 for review drafts, and high at $0.133 for final assets:

import openai

client = openai.OpenAI()

# Generate a marketing image
response = client.images.generate(
    model="gpt-image-1.5",
    prompt="""Summer beachwear sale campaign image. Bright, warm colors. 
    Lifestyle photography style — person walking on beach at golden hour 
    wearing flowing summer dress. Text overlay area in top-third. 
    Clean composition suitable for Instagram feed.""",
    size="1024x1024",
    quality="high",
    n=1
)

image_url = response.data[0].url

FLUX 2 Klein 9B is the budget champion at $0.01/image with open weights and LoRA support — meaning you can fine-tune it on your brand's visual style:

import requests

# Via fal.ai or Replicate
response = requests.post(
    "https://fal.run/fal-ai/flux-2/klein-9b",
    headers={"Authorization": "Key YOUR_API_KEY"},
    json={
        "prompt": "Modern summer fashion lookbook, beachwear collection, "
                  "clean white background, editorial photography style",
        "image_size": {"width": 1024, "height": 1024},
        "num_images": 1
    }
)

image_url = response.json()["images"][0]["url"]

Ideogram 3.0 is the model you reach for when your image needs text — sale percentages, brand names, event dates. Other models struggle with legible text rendering; Ideogram was designed for it.

Strategy for a creative pipeline: Generate the base creative image with Nano Banana Pro (rated #1 overall quality with best prompt adherence) or GPT Image 1.5 (high quality). Use FLUX 2 Pro (now a 32B parameter model) if you need higher resolution. Use Ideogram 3.0 specifically for any assets that require embedded text. Use FLUX 2 Klein for rapid iteration during prompt refinement — at $0.01/image, you can generate 100 variations for $1.

3.3 Programmatic Image Editing: Why You Need Both AI and Code

Here is the critical insight that most tutorials miss: AI generates the creative content, but programmatic tools handle the deterministic engineering.

An AI image generation model produces a 1024x1024 image. Your campaign needs that image in:

  • 1080x1080 for Instagram feed
  • 1080x1350 for Instagram portrait
  • 1080x1920 for Instagram Stories and Reels
  • 1200x627 for LinkedIn
  • 1600x900 for Twitter/X
  • 1200x630 for Facebook
  • 1280x720 for YouTube thumbnail

You do not ask the AI to regenerate the image seven times at different aspect ratios. That is expensive, slow, and produces inconsistent results. Instead, you generate one high-quality base image and use programmatic tools to resize, crop, add brand overlays, and export in every format needed.

Sharp (Node.js)

Sharp wraps libvips, one of the fastest image processing libraries in existence. It handles resizing in milliseconds, not seconds.

const sharp = require('sharp');
const path = require('path');

const PLATFORM_SPECS = {
  instagram_feed:    { width: 1080, height: 1080, suffix: 'ig-feed' },
  instagram_portrait:{ width: 1080, height: 1350, suffix: 'ig-portrait' },
  instagram_stories: { width: 1080, height: 1920, suffix: 'ig-stories' },
  linkedin:          { width: 1200, height: 627,  suffix: 'linkedin' },
  twitter:           { width: 1600, height: 900,  suffix: 'twitter' },
  facebook:          { width: 1200, height: 630,  suffix: 'facebook' },
  youtube_thumb:     { width: 1280, height: 720,  suffix: 'yt-thumb' }
};

async function generatePlatformImages(inputPath, outputDir, brandLogoPath) {
  const results = {};

  for (const [platform, spec] of Object.entries(PLATFORM_SPECS)) {
    const outputPath = path.join(outputDir, `campaign-${spec.suffix}.webp`);

    let pipeline = sharp(inputPath)
      .resize(spec.width, spec.height, {
        fit: 'cover',
        position: 'attention'  // libvips smart crop — focuses on subject
      });

    // Add brand logo overlay (bottom-right corner)
    if (brandLogoPath) {
      const logo = await sharp(brandLogoPath)
        .resize(Math.round(spec.width * 0.15))  // 15% of image width
        .toBuffer();

      pipeline = pipeline.composite([{
        input: logo,
        gravity: 'southeast',
        blend: 'over'
      }]);
    }

    await pipeline
      .webp({ quality: 85 })
      .toFile(outputPath);

    results[platform] = outputPath;
  }

  return results;
}

// Usage
generatePlatformImages(
  './base-campaign-image.png',
  './output/images',
  './brand/logo-white.png'
);

The position: 'attention' parameter is worth highlighting. Sharp uses libvips' attention-based cropping, which analyzes the image for regions of interest (high contrast, faces, text) and crops around them. This means a landscape beach scene cropped to Instagram Stories portrait will keep the person centered, not cut their head off. This is deterministic image intelligence — no LLM needed.

Pillow / PIL (Python)

Pillow is the Python equivalent. Better for adding text overlays, watermarks, and complex compositions:

from PIL import Image, ImageDraw, ImageFont
import os

PLATFORM_SPECS = {
    "instagram_feed":     (1080, 1080),
    "instagram_portrait": (1080, 1350),
    "instagram_stories":  (1080, 1920),
    "linkedin":           (1200, 627),
    "twitter":            (1600, 900),
    "facebook":           (1200, 630),
    "youtube_thumb":      (1280, 720),
}

def resize_with_padding(img, target_w, target_h, bg_color=(0, 0, 0)):
    """Resize image to fit within target dimensions, pad remaining space."""
    img_ratio = img.width / img.height
    target_ratio = target_w / target_h

    if img_ratio > target_ratio:
        new_w = target_w
        new_h = int(target_w / img_ratio)
    else:
        new_h = target_h
        new_w = int(target_h * img_ratio)

    resized = img.resize((new_w, new_h), Image.LANCZOS)
    canvas = Image.new("RGB", (target_w, target_h), bg_color)
    offset = ((target_w - new_w) // 2, (target_h - new_h) // 2)
    canvas.paste(resized, offset)
    return canvas

def add_sale_overlay(img, text="30% OFF", position="top-center"):
    """Add a sale badge overlay to the image."""
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("fonts/Montserrat-Bold.ttf", size=72)

    bbox = draw.textbbox((0, 0), text, font=font)
    text_w = bbox[2] - bbox[0]
    text_h = bbox[3] - bbox[1]
    padding = 20

    if position == "top-center":
        x = (img.width - text_w) // 2
        y = 40
    
    # Draw background rectangle
    draw.rectangle(
        [x - padding, y - padding, x + text_w + padding, y + text_h + padding],
        fill=(220, 50, 50)
    )
    draw.text((x, y), text, font=font, fill="white")
    return img

def generate_all_platforms(base_image_path, output_dir):
    base = Image.open(base_image_path)
    os.makedirs(output_dir, exist_ok=True)

    for platform, (w, h) in PLATFORM_SPECS.items():
        resized = resize_with_padding(base, w, h)
        resized = add_sale_overlay(resized, "30% OFF")
        output_path = os.path.join(output_dir, f"campaign-{platform}.webp")
        resized.save(output_path, "WEBP", quality=85)
        print(f"  {platform}: {output_path}")

ImageMagick (CLI)

For batch processing in CI/CD pipelines or shell scripts, ImageMagick is the universal tool:

#!/bin/bash
# Batch resize a campaign image for all platforms

INPUT="base-campaign.png"
OUTPUT_DIR="./output/images"
mkdir -p "$OUTPUT_DIR"

# Instagram Feed (1080x1080, center crop)
magick "$INPUT" -resize 1080x1080^ -gravity center -extent 1080x1080 \
  "$OUTPUT_DIR/campaign-ig-feed.webp"

# Instagram Stories (1080x1920, smart crop)
magick "$INPUT" -resize 1080x1920^ -gravity center -extent 1080x1920 \
  "$OUTPUT_DIR/campaign-ig-stories.webp"

# LinkedIn (1200x627, letterbox with brand color background)
magick "$INPUT" -resize 1200x627 -background "#0A66C2" -gravity center \
  -extent 1200x627 "$OUTPUT_DIR/campaign-linkedin.webp"

# Twitter/X (1600x900)
magick "$INPUT" -resize 1600x900^ -gravity center -extent 1600x900 \
  "$OUTPUT_DIR/campaign-twitter.webp"

# Add brand watermark to all
for f in "$OUTPUT_DIR"/*.webp; do
  magick "$f" brand-logo.png -gravity southeast -geometry +20+20 \
    -composite "$f"
done

echo "Generated $(ls "$OUTPUT_DIR"/*.webp | wc -l) platform images"

Tool Comparison

ToolLanguageSpeedBest ForFormatsNPM/PyPI
SharpNode.jsFastest (libvips)Resizing, format conversion, WebP/AVIFJPEG, PNG, WebP, AVIF, TIFF, GIF19M+ weekly downloads
PillowPythonModerateText overlays, compositions, watermarksJPEG, PNG, WebP, TIFF, GIF, BMP80M+ monthly downloads
ImageMagickCLIModerateBatch processing, CI/CD pipelines200+ formatsSystem package
JimpNode.js (pure JS)SlowerZero-dependency environmentsJPEG, PNG, BMP, TIFF, GIF4M+ weekly downloads
Canvas APINode.js (node-canvas)ModerateComplex compositions, chartsPNG, JPEG, PDF1M+ weekly downloads
The pattern is clear: AI generates the creative image once. Programmatic tools handle the mechanical transformation into every platform format. This split — creative AI plus deterministic engineering — is the architecture that scales.

3.4 AI Video Generation

Video is where the creative pipeline gets genuinely futuristic. As of early 2026, multiple APIs can generate short-form video clips from text or image prompts, complete with motion, transitions, and in some cases native audio.

ModelDeveloperResolutionDurationAudioPrice/ClipAPI
Veo 3.1Google720p / 1080p / 4KUp to 8sNative audio (dialogue + sound effects + ambient)$0.40-0.75/s via Vertex AIGemini API / Vertex AI
Sora 2 ProOpenAI720p / 1080pUp to 25sSynced audio (dialogue + sound effects)$0.30-0.50/sVideo API
Sora 2OpenAI480p / 720p / 1080pUp to 25sNo audio$0.10/sVideo API
Runway Gen-4RunwayUp to 4KUp to 10sNo~$0.25/s (credit-based)REST API
Seedance 2.0ByteDanceNative 1080pVariableNo~$0.20/sAPI
Pika 2.2Pika Labs720p / 1080pUp to 5sNo~$0.20/clip via fal.aifal.ai
Kling 3 ProKuaishouUp to 1080pUp to 10sNative audio~$0.15/sAPI
HeyGenHeyGen1080pVariableLip-sync audio~$1.50/minREST API
SynthesiaSynthesia1080pVariableLip-sync audio~$2.00/minREST API

Google Veo 3.1 is the current leader for quality and the only major model with native audio generation — meaning the generated video comes with matching sound effects and ambient audio, not silence.

from google import genai
from google.genai import types
import time
import urllib.request

client = genai.Client()

# Generate a marketing video clip with Veo 3.1
operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",  # Veo 3.1 model identifier
    prompt="""Cinematic slow-motion shot of a woman walking along a 
    tropical beach at golden hour, wearing a flowing white summer dress. 
    Camera tracks alongside her. Warm color grading, lens flare from 
    the setting sun. Sound of gentle waves and soft wind.""",
    config=types.GenerateVideosConfig(
        person_generation="allow_adult",
        aspect_ratio="9:16",       # Vertical for TikTok/Reels
        number_of_videos=1,
    ),
)

# Veo is async — poll for completion
while not operation.done:
    time.sleep(20)
    operation = client.operations.get(operation)

# Download the generated video
for i, video in enumerate(operation.result.generated_videos):
    fname = f"campaign-beach-{i}.mp4"
    urllib.request.urlretrieve(video.video.uri, fname)
    print(f"Saved: {fname}")

OpenAI Sora 2 Pro now supports synced audio (dialogue and sound effects) and uses an async workflow with webhook callbacks — you submit a generation request and receive a webhook when the video is ready:

import openai

client = openai.OpenAI()

# Submit video generation request
response = client.videos.generate(
    model="sora-2-pro",
    prompt="""Product showcase of beachwear collection. Clean white 
    background. Items appear one by one with smooth transitions. 
    Professional e-commerce style. 16:9 aspect ratio.""",
    size="1080p",
    duration=10,
    n=1
)

# Response contains a job ID — poll or use webhooks
job_id = response.id
print(f"Video generation started: {job_id}")

# Poll for completion
import time
while True:
    status = client.videos.retrieve(job_id)
    if status.status == "completed":
        video_url = status.data[0].url
        print(f"Video ready: {video_url}")
        break
    elif status.status == "failed":
        print(f"Generation failed: {status.error}")
        break
    time.sleep(10)

Pika 2.2 via fal.ai is the budget option — fast, cheap, and good enough for social media clips:

import requests

response = requests.post(
    "https://fal.run/fal-ai/pika/v2.2/text-to-video",
    headers={"Authorization": "Key YOUR_FAL_KEY"},
    json={
        "prompt": "Summer sale promo — colorful beach accessories "
                  "arranged in a flat lay, camera slowly zooms out "
                  "to reveal the full collection",
        "aspect_ratio": "9:16",
        "duration": 5
    }
)

video_url = response.json()["video"]["url"]

HeyGen and Synthesia serve a different purpose — they generate talking-head videos with AI avatars. If your campaign needs a spokesperson announcing the sale, these are the tools. They are significantly more expensive per minute but produce a fundamentally different type of content.

Strategy for a creative pipeline: Use Veo 3.1 for hero video clips (highest quality, native audio) or Sora 2 Pro (now also with synced audio). Use Seedance 2.0 for cinematic multi-shot storytelling. Use Pika 2.2 for rapid iteration and B-roll clips. Use Sora 2 (budget tier) when you need longer duration without audio at $0.10/s. AI generates raw 5-10 second clips; programmatic tools handle the rest.

3.5 Programmatic Video Editing: Stitching AI Clips into Finished Content

This is the layer most tutorials skip entirely, and it is the most important one for production-quality output.

AI video models generate raw clips — typically 5-10 seconds, no brand intro, no text overlays, no music, no call-to-action. A finished marketing video needs all of those things. That is where programmatic video editing comes in.

Remotion (React-Based Video)

Remotion lets you define video compositions as React components. This is extremely powerful for templated content — brand intros, text overlays, animated lower thirds, and end cards become reusable components.

// campaign-video.jsx — Remotion composition
import { AbsoluteFill, Video, Img, useCurrentFrame, 
         interpolate, Sequence } from 'remotion';

const BrandIntro = () => {
  const frame = useCurrentFrame();
  const opacity = interpolate(frame, [0, 30], [0, 1], {
    extrapolateRight: 'clamp',
  });
  return (
    <AbsoluteFill style={{
      backgroundColor: '#1a1a2e',
      justifyContent: 'center',
      alignItems: 'center',
      opacity,
    }}>
      <Img src="./brand/logo.png" style={{ width: 300 }} />
    </AbsoluteFill>
  );
};

const SaleOverlay = ({ text }) => {
  const frame = useCurrentFrame();
  const y = interpolate(frame, [0, 20], [100, 0], {
    extrapolateRight: 'clamp',
  });
  return (
    <div style={{
      position: 'absolute', bottom: 120, left: 40,
      transform: `translateY(${y}px)`,
      backgroundColor: 'rgba(220, 50, 50, 0.9)',
      padding: '16px 32px', borderRadius: 8,
      color: 'white', fontSize: 48, fontWeight: 'bold',
    }}>
      {text}
    </div>
  );
};

export const CampaignVideo = ({ aiClipSrc, saleText, ctaText }) => {
  return (
    <AbsoluteFill>
      {/* Brand intro: 0-2 seconds */}
      <Sequence from={0} durationInFrames={60}>
        <BrandIntro />
      </Sequence>

      {/* AI-generated clip: 2-10 seconds */}
      <Sequence from={60} durationInFrames={240}>
        <Video src={aiClipSrc} />
        <SaleOverlay text={saleText} />
      </Sequence>

      {/* CTA end card: 10-13 seconds */}
      <Sequence from={300} durationInFrames={90}>
        <AbsoluteFill style={{
          backgroundColor: '#1a1a2e',
          justifyContent: 'center',
          alignItems: 'center',
        }}>
          <div style={{ color: 'white', fontSize: 36 }}>{ctaText}</div>
        </AbsoluteFill>
      </Sequence>
    </AbsoluteFill>
  );
};

Render via CLI:

npx remotion render campaign-video.jsx CampaignVideo \
  --props='{"aiClipSrc":"./clips/beach-scene.mp4","saleText":"30% OFF","ctaText":"Shop Now → mybrand.com"}' \
  --output="./output/campaign-final.mp4" \
  --width=1080 --height=1920 \
  --fps=30

FFmpeg (Universal Video Engine)

FFmpeg is the bedrock. Every video tool eventually calls FFmpeg under the hood. For creative pipelines, it handles the operations that do not need a framework — stitching clips, adding audio tracks, format conversion, and platform-specific encoding.

#!/bin/bash
# Stitch brand intro + AI clip + end card, add background music

INPUT_CLIP="clips/ai-beach-scene.mp4"
BRAND_INTRO="templates/brand-intro-3s.mp4"
END_CARD="templates/end-card-cta-3s.mp4"
MUSIC="audio/upbeat-summer.mp3"
OUTPUT="output/campaign-final.mp4"

# Step 1: Concatenate clips
cat > concat-list.txt << EOF
file '$BRAND_INTRO'
file '$INPUT_CLIP'
file '$END_CARD'
EOF

ffmpeg -f concat -safe 0 -i concat-list.txt \
  -c:v libx264 -preset medium -crf 23 \
  -c:a aac -b:a 128k \
  "temp-concat.mp4"

# Step 2: Add background music (ducked under any existing audio)
ffmpeg -i temp-concat.mp4 -i "$MUSIC" \
  -filter_complex "[0:a]volume=1.0[a1];[1:a]volume=0.3[a2];[a1][a2]amix=inputs=2:duration=shortest" \
  -c:v copy -c:a aac \
  "$OUTPUT"

# Step 3: Generate platform-specific versions
# TikTok/Reels (9:16, 1080x1920)
ffmpeg -i "$OUTPUT" -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2" \
  -c:v libx264 -preset medium -crf 23 -c:a aac \
  "output/campaign-tiktok.mp4"

# LinkedIn/YouTube (16:9, 1920x1080)
ffmpeg -i "$OUTPUT" -vf "scale=1920:1080:force_original_aspect_ratio=decrease,pad=1920:1080:(ow-iw)/2:(oh-ih)/2" \
  -c:v libx264 -preset medium -crf 23 -c:a aac \
  "output/campaign-linkedin.mp4"

# Twitter/X (16:9, 1280x720, smaller file)
ffmpeg -i "$OUTPUT" -vf "scale=1280:720" \
  -c:v libx264 -preset medium -crf 26 -c:a aac -b:a 96k \
  "output/campaign-twitter.mp4"

rm temp-concat.mp4 concat-list.txt
echo "Generated platform videos"

MoviePy (Python)

MoviePy is the Python-native option. Ideal when the rest of your pipeline is already in Python:

from moviepy import VideoFileClip, TextClip, CompositeVideoClip, concatenate_videoclips, AudioFileClip

def assemble_campaign_video(ai_clip_path, output_path,
                             brand_intro_path="templates/brand-intro.mp4",
                             music_path="audio/background.mp3",
                             sale_text="30% OFF",
                             cta_text="Shop Now"):
    # Load clips
    brand_intro = VideoFileClip(brand_intro_path).with_duration(3)
    ai_clip = VideoFileClip(ai_clip_path)
    
    # Create sale text overlay
    sale_overlay = TextClip(
        text=sale_text,
        font_size=72,
        color='white',
        bg_color='red',
        font='Montserrat-Bold'
    ).with_duration(ai_clip.duration).with_position(('center', 'bottom'))
    
    # Composite AI clip with text overlay
    main_clip = CompositeVideoClip([ai_clip, sale_overlay])
    
    # Create CTA end card
    cta_clip = TextClip(
        text=cta_text,
        font_size=48,
        color='white',
        bg_color='black',
        size=ai_clip.size,
        font='Montserrat'
    ).with_duration(3)
    
    # Concatenate: intro + main + CTA
    final = concatenate_videoclips([brand_intro, main_clip, cta_clip])
    
    # Add background music
    music = AudioFileClip(music_path).with_duration(final.duration)
    music = music.with_effects([afx.MultiplyVolume(0.3)])
    final = final.with_audio(music)
    
    # Export
    final.write_videofile(output_path, fps=30, codec='libx264')

assemble_campaign_video(
    "clips/ai-beach-scene.mp4",
    "output/campaign-final.mp4"
)

Video Tool Comparison

ToolLanguageBest ForLearning CurveTemplatingProduction Use
RemotionReact/JSTemplated brand videos, animationsMediumExcellent (React components)High
FFmpegCLITranscoding, stitching, format conversionHighLow (shell scripts)Universal
MoviePyPythonScripted editing, Python pipelinesLowModerateGood
ShotstackAPICloud rendering, no infrastructureLowGood (JSON templates)High
The pipeline: AI generates raw 5-10 second clips → Remotion or MoviePy adds brand intros, text overlays, and music → FFmpeg handles final encoding and platform-specific format conversion. Creative AI for the content. Deterministic code for the packaging.

3.6 Vision Language Models for Quality Control

This is the layer that closes the loop. After your pipeline generates images and videos, how do you know they are good? You could review them manually — but that defeats the purpose of automation.

VLMs (Vision Language Models) can review generated content against the original brief and flag issues before any human sees the output.

import google.generativeai as genai
import json

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro")

def quality_check_image(image_path, original_brief, brand_guidelines):
    """Use VLM to review a generated image against the brief."""
    
    image = genai.upload_file(image_path)
    
    response = model.generate_content([
        f"""You are a creative quality reviewer. Evaluate this generated 
        image against the original brief and brand guidelines.

        Original brief: {original_brief}
        Brand guidelines: {brand_guidelines}

        Evaluate on these criteria (score 1-10 each):
        1. Brief adherence: Does the image match what was requested?
        2. Visual quality: Is the image high quality, no artifacts?
        3. Brand consistency: Colors, style, and tone match the brand?
        4. Text legibility: Any text in the image readable and correct?
        5. Composition: Is the focal point clear? Good visual hierarchy?
        6. Platform suitability: Will this work well on social media?

        Output JSON:
        {{
            "scores": {{"brief_adherence": N, "visual_quality": N, ...}},
            "overall_score": N,
            "pass": true/false (true if overall >= 7),
            "issues": ["list of specific issues found"],
            "suggestions": ["specific improvements if score < 8"]
        }}""",
        image
    ])
    
    return json.loads(response.text)

# Usage in the pipeline
result = quality_check_image(
    "output/campaign-ig-feed.webp",
    "Summer beachwear sale, 30% off, warm and inviting",
    "Colors: coral, sand, ocean blue. Style: lifestyle photography. No stock photo feel."
)

if result["pass"]:
    print(f"✅ Image passed QC (score: {result['overall_score']})")
else:
    print(f"❌ Image failed QC (score: {result['overall_score']})")
    print(f"Issues: {result['issues']}")
    # Trigger regeneration with adjusted prompt

The quality control loop:

Generate image/video
    │
    ▼
VLM reviews against brief
    │
    ├── Score >= 7 → Accept, move to platform formatting
    │
    └── Score < 7 → Regenerate with adjusted prompt
                    (max 3 attempts, then flag for human review)

This loop is what makes the system production-ready. Without it, you are publishing every first-draft output and hoping for the best.


4. Platform-Specific Output Requirements

Every platform has specific technical requirements. Getting these wrong means your content gets cropped incorrectly, rejected on upload, or displayed at low quality.

Image Specifications

PlatformFormatDimensionsAspect RatioMax File SizeNotes
Instagram FeedJPEG/PNG1080x10801:130 MBSquare is safest; 1080x1350 (4:5) gets more screen real estate
Instagram StoriesJPEG/PNG1080x19209:1630 MBLeave 250px top/bottom for UI elements
Instagram CarouselJPEG/PNG1080x1080 or 1080x13501:1 or 4:530 MB per slideUp to 10 slides
TikTokJPEG/PNG1080x19209:1610 MBVertical only
LinkedInJPEG/PNG1200x6271.91:110 MBWider landscape format
Twitter/XJPEG/PNG1600x90016:95 MB (JPEG), 15 MB (PNG)Preview crops to center
FacebookJPEG/PNG1200x6301.91:110 MBSimilar to LinkedIn but slightly taller
YouTube ThumbnailJPEG/PNG1280x72016:92 MBMust be eye-catching at small sizes

Video Specifications

PlatformResolutionAspect RatioDurationMax SizeCodecAudio
Instagram Reels1080x19209:1615-90s4 GBH.264AAC, 128kbps
TikTok1080x19209:1615-180s287 MB (mobile)H.264AAC
LinkedIn Video1920x108016:93s-10min5 GBH.264AAC
Twitter/X Video1280x720+16:9 or 1:10.5s-140s512 MBH.264AAC
Facebook Video1080x1080+1:1, 4:5, 16:91s-240min10 GBH.264AAC
YouTube Shorts1080x19209:16Up to 60sH.264AAC
YouTube Standard1920x1080+16:9Up to 12h256 GBH.264/H.265AAC

Caption and Hashtag Strategy

PlatformMax Caption LengthOptimal LengthHashtag StrategyTone
Instagram2,200 chars125-150 chars (feed), 100 chars (stories)20-30 (mix of volume tiers)Visual-first, emoji-friendly
TikTok2,200 chars50-100 chars3-5 (trending + niche)Casual, hook-first
LinkedIn3,000 chars150-300 chars3-5 (industry-specific)Professional, storytelling
Twitter/X280 charsFull 2801-2 (or none)Punchy, direct
Facebook63,206 chars40-80 chars1-3Conversational, question-based
YouTube5,000 chars (description)200-500 chars3-5 (SEO-focused)Informative, keyword-rich
These specs change. Build your platform formatter to read from a configuration file, not hard-coded values. When Instagram changes their feed dimensions (and they will), you update one JSON file, not fifty lines of code.

5. The Orchestration Layer: Connecting the Agents

With all the tools defined, you need something to connect them into a coherent pipeline. There are several approaches, ranging from code-first frameworks to visual workflow builders.

Framework Comparison

FrameworkApproachBest ForLearning CurveProduction Readiness
LangGraphCode-first, graph-basedComplex agent workflows with cyclesMediumHigh
CrewAIRole-based agent teamsSimpler multi-agent setupsLowMedium
AutoGenConversation-based agentsResearch, prototypingMediumMedium
n8nVisual workflow builderNon-developers, rapid prototypingLowMedium
TemporalWorkflow-as-code, durable executionProduction systems with retriesHighVery High
CustomDirect API orchestrationFull control, minimal dependenciesVariesDepends on implementation

Full Pipeline Pseudocode

Here is the complete orchestration logic for a creative campaign pipeline:

# pipeline.py — Complete creative campaign orchestration

import asyncio
from dataclasses import dataclass

@dataclass
class CreativeBrief:
    prompt: str
    brand_name: str
    brand_colors: list[str]
    platforms: list[str]
    style: str  # "photorealistic", "illustrated", "abstract"
    logo_path: str | None = None

@dataclass
class CreativeSpec:
    visual_direction: str
    color_palette: list[str]
    mood: str
    image_prompts: list[str]
    video_prompts: list[str]
    copy_guidelines: dict
    target_platforms: list[str]

async def run_creative_pipeline(brief: CreativeBrief) -> dict:
    """Full pipeline: brief → platform-ready assets."""
    
    # Step 1: Creative Director interprets the brief
    spec = await creative_director_agent(brief)
    
    # Step 2: Specialist agents run IN PARALLEL
    image_task = asyncio.create_task(visual_agent(spec))
    copy_task = asyncio.create_task(copy_agent(spec))
    video_task = asyncio.create_task(video_agent(spec))
    
    images, copy, videos = await asyncio.gather(
        image_task, copy_task, video_task
    )
    
    # Step 3: Platform formatting (deterministic — no LLM)
    formatted_images = await platform_format_images(
        images, spec.target_platforms, brief.logo_path
    )
    formatted_videos = await platform_format_videos(
        videos, spec.target_platforms
    )
    
    # Step 4: VLM quality check
    qc_results = await quality_check_all(
        formatted_images, formatted_videos, brief.prompt
    )
    
    # Step 5: Handle failures — regenerate specific assets
    for asset, result in qc_results.items():
        if not result["pass"]:
            if result["attempts"] < 3:
                regenerated = await regenerate_asset(
                    asset, result["suggestions"], spec
                )
                qc_results[asset] = await quality_check_single(
                    regenerated, brief.prompt
                )
            else:
                flag_for_human_review(asset, result)
    
    # Step 6: Package everything
    return {
        "images": formatted_images,
        "videos": formatted_videos,
        "copy": copy,
        "qc_results": qc_results,
        "total_cost": calculate_total_cost(),
        "generation_time": elapsed_time()
    }

async def creative_director_agent(brief: CreativeBrief) -> CreativeSpec:
    """LLM interprets the brief and produces a structured creative spec."""
    response = await llm_call(
        model="gpt-5.2",
        system="You are a creative director. Interpret the brief and "
               "produce a detailed creative specification.",
        user=f"Brief: {brief.prompt}\nBrand: {brief.brand_name}\n"
             f"Colors: {brief.brand_colors}\nStyle: {brief.style}",
        response_format=CreativeSpec
    )
    return response

async def visual_agent(spec: CreativeSpec) -> list[str]:
    """Generate base images from the creative spec."""
    images = []
    for prompt in spec.image_prompts:
        image_url = await generate_image(
            model="gpt-image-1.5",
            prompt=prompt,
            quality="high",
            size="1024x1024"
        )
        # VLM quick check before proceeding
        check = await vlm_quick_check(image_url, prompt)
        if check["score"] >= 7:
            images.append(image_url)
        else:
            # Retry with refined prompt
            refined_prompt = f"{prompt}. Avoid: {', '.join(check['issues'])}"
            image_url = await generate_image(
                model="gpt-image-1.5",
                prompt=refined_prompt,
                quality="high",
                size="1024x1024"
            )
            images.append(image_url)
    return images

async def video_agent(spec: CreativeSpec) -> list[str]:
    """Generate video clips from the creative spec."""
    videos = []
    for prompt in spec.video_prompts:
        video_url = await generate_video(
            model="veo-3.1",
            prompt=prompt,
            aspect_ratio="9:16",
            duration=8
        )
        videos.append(video_url)
    return videos

# Entry point
if __name__ == "__main__":
    brief = CreativeBrief(
        prompt="Launch campaign for summer sale, 30% off, beachwear collection",
        brand_name="CoastalVibe",
        brand_colors=["#FF6B6B", "#FFA07A", "#87CEEB"],
        platforms=["instagram", "tiktok", "linkedin", "twitter", "facebook", "youtube"],
        style="photorealistic",
        logo_path="./brand/coastal-vibe-logo.png"
    )
    
    results = asyncio.run(run_creative_pipeline(brief))
    print(f"Campaign generated in {results['generation_time']:.1f}s")
    print(f"Total cost: ${results['total_cost']:.2f}")

The key insight in this orchestration: the specialist agents run in parallel. Image generation, copy writing, and video generation are independent tasks. Running them concurrently cuts the pipeline time from ~3 minutes (sequential) to ~2 minutes (parallel, bottlenecked by video generation).


6. Cost Engineering: The $1.74 Campaign

Here is a realistic cost breakdown for a single campaign that produces assets for six platforms.

Per-Campaign Cost Breakdown

ComponentOperationUnit CostQuantitySubtotal
Creative DirectorGPT-5.2 orchestration call~$0.021$0.02
Image GenerationGPT Image 1.5 (medium quality)~$0.0346 images$0.20
Image IterationFLUX 2 Klein (draft iterations)$0.014 drafts$0.04
Copy WritingClaude Sonnet 4.5 (captions for 6 platforms)~$0.031$0.03
Video GenerationVeo 3.1 (8s clip via Vertex AI)~$0.45/clip3 clips$1.35
VLM Quality CheckGemini 3 Pro (image review)~$0.015 checks$0.05
Programmatic EditingSharp + FFmpeg (compute)~$0.00$0.00
Total$1.69

Round up for retries and overhead: ~$1.74 per campaign.

Cost Comparison

                    Human Creative Team          AI Creative Pipeline
                    ─────────────────           ─────────────────────
Images:             $400 (designer, 4h)          $0.24
Copy:               $300 (copywriter, 3h)        $0.05
Video:              $800 (editor, 6h)            $1.35
Direction:          $300 (creative dir, 2h)      $0.02
Formatting:         $200 (social manager, 2h)    $0.00
QC/Review:          $0   (included above)        $0.08
──────────────────────────────────────────────────────────────
Total:              ~$2,000                      ~$1.74
Time:               2-5 days                     3-5 minutes
Ratio:              1x                           1,149x cheaper

Scaling Economics

Campaigns/MonthHuman CostAI CostMonthly Savings
10$20,000$17.40$19,982
50$100,000$87.00$99,913
200$400,000$348.00$399,652

The cost structure inverts completely. With human teams, marginal cost is constant — each additional campaign costs roughly the same. With AI pipelines, the marginal cost is almost entirely API calls, and those costs are dropping every quarter as model competition intensifies.

The real savings are not in the API costs. They are in the time. A campaign that takes 3 minutes instead of 3 days means you can iterate, test, and optimize content at a pace that is structurally impossible with human-only workflows.

7. The Quality Problem: AI + Programmatic Split

This section codifies the principle I keep coming back to: AI for creative generation, programmatic tools for deterministic engineering.

What Each Layer Does

TaskAI or Programmatic?Why?
Interpreting a creative briefAIRequires understanding intent, mood, audience
Generating a base campaign imageAIRequires creative composition, style judgment
Resizing to 1080x1080ProgrammaticExact pixel dimensions — deterministic
Writing platform-native captionsAIRequires understanding platform culture
Adding brand logo overlayProgrammaticFixed position, known dimensions
Generating a video clipAIRequires motion, visual storytelling
Adding brand intro/outroProgrammaticPre-built template, concatenation
Encoding to H.264 AACProgrammaticCodec parameters are deterministic
Reviewing image qualityAI (VLM)Requires visual understanding
File naming and folder organizationProgrammaticConvention-based, deterministic

The Hybrid Pattern

┌─────────────────────────────────────────────┐
│              CREATIVE LAYER (AI)            │
│                                             │
│  LLMs:    Brief interpretation, copy        │
│  Image:   Base creative image generation    │
│  Video:   Raw clip generation               │
│  VLMs:    Quality review and feedback       │
│                                             │
│  Nature:  Non-deterministic, creative       │
│  Cost:    Per-API-call                      │
│  Speed:   Seconds to minutes                │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│          ENGINEERING LAYER (Programmatic)   │
│                                             │
│  Sharp/PIL:   Resize, crop, overlay, format │
│  FFmpeg:      Transcode, stitch, encode     │
│  Remotion:    Brand templates, animations   │
│  File I/O:    Organize, name, package       │
│                                             │
│  Nature:  Deterministic, repeatable         │
│  Cost:    Compute only (near-zero)          │
│  Speed:   Milliseconds to seconds           │
└─────────────────────────────────────────────┘

This split is not just a performance optimization. It is a reliability architecture. The AI layer is inherently non-deterministic — the same prompt might produce slightly different outputs each time. The programmatic layer is perfectly deterministic — the same input always produces the exact same output. By isolating the non-deterministic parts and minimizing their surface area, you make the overall system more predictable and debuggable.

When something goes wrong (and it will), you know exactly where to look:

  • Wrong creative direction? → Creative Director Agent prompt needs adjustment
  • Bad image composition? → Image generation prompt or model choice
  • Wrong dimensions? → Platform formatter config (deterministic, easy to fix)
  • Video has no brand intro? → FFmpeg/Remotion template (deterministic, easy to fix)
  • Caption tone is wrong for LinkedIn? → Copy Agent platform guidelines
The AI generates the atoms of content. The programmatic layer assembles the molecules. Never ask AI to do what a for-loop can do perfectly.

8. The Onboarding Problem: From Zero to Campaign-Ready in Five Minutes

Here is the question that determines whether a creative pipeline is a developer tool or a product: can a non-technical user set it up?

Most AI creative tools assume the user already has prompts, API keys, brand guidelines documents, and technical knowledge. That is a wall. The businesses that need this the most — small agencies, solo entrepreneurs, local retailers — are the ones least equipped to configure it.

The solution is a brand onboarding flow that asks for the minimum viable input and derives everything else.

What Minimal Onboarding Looks Like

The user provides three things:

Onboarding Input:
  1. Logo file (PNG/SVG)         → uploaded via drag-and-drop
  2. Brand colors                → picked via color picker or extracted from logo
  3. Business description        → one paragraph of plain text

That is it. Everything else is derived.

From those three inputs, the system constructs a complete Brand Profile that powers every downstream agent:

Brand Profile (auto-generated):

  From the logo:
    ├── Logo in multiple formats (PNG, WebP, SVG)
    ├── Logo variants (light background, dark background, icon-only)
    ├── Logo safe zone (minimum padding calculated from dimensions)
    └── Dominant colors extracted via VLM analysis

  From the brand colors:
    ├── Primary color, secondary color, accent color
    ├── Text-on-dark and text-on-light contrast pairs
    ├── Gradient combinations for video overlays
    └── Platform-specific color mappings
         (LinkedIn → professional tones, TikTok → vibrant variants)

  From the business description (via LLM):
    ├── Industry category (retail, SaaS, hospitality, etc.)
    ├── Target audience profile
    ├── Brand voice (formal, casual, playful, authoritative)
    ├── Suggested visual style (photorealistic, illustrated, minimal)
    ├── Default hashtag pools per platform
    └── Competitor reference points

How the Agents Use the Brand Profile

Once the Brand Profile exists, every agent in the pipeline reads from it automatically. The user never has to specify brand details again.

AgentWhat It Gets from the Brand Profile
Creative DirectorVisual style, brand voice, industry context, audience profile
Visual AgentColor palette injected into image prompts, style direction
Copy AgentBrand voice, tone guidelines, default hashtag pools
Video AgentColor palette for overlays, mood direction
Platform FormatterLogo file + safe zone for watermark overlay, brand colors for padding
VLM Quality GateBrand colors and style to check consistency against

The implementation is straightforward. The Brand Profile is a JSON document stored per user:

brand_profile = {
    "name": "CoastalVibe",
    "logo": {
        "primary": "assets/logos/coastalvibe-logo.png",
        "icon_only": "assets/logos/coastalvibe-icon.png",
        "on_dark": "assets/logos/coastalvibe-white.png",
        "safe_zone_px": 24
    },
    "colors": {
        "primary": "#FF6B6B",
        "secondary": "#FFA07A",
        "accent": "#87CEEB",
        "text_on_dark": "#FFFFFF",
        "text_on_light": "#1A1A2E",
        "gradient": ["#FF6B6B", "#FFA07A"]
    },
    "voice": {
        "tone": "warm, approachable, aspirational",
        "formality": 0.4,  # 0 = casual, 1 = formal
        "emoji_usage": "moderate",
        "industry": "fashion_retail"
    },
    "defaults": {
        "visual_style": "lifestyle_photography",
        "hashtag_pools": {
            "instagram": ["#beachwear", "#summerstyle", "#coastalvibes", ...],
            "linkedin": ["#retailinnovation", "#fashionbusiness", ...],
            "tiktok": ["#summerfit", "#beachoutfit", "#ootd", ...]
        },
        "target_audience": "Women 25-40, coastal lifestyle, mid-premium"
    }
}

Every agent prompt gets this profile injected as context. The Visual Agent does not just receive "generate a summer sale image" — it receives "generate a summer sale image using the CoastalVibe brand palette (#FF6B6B, #FFA07A, #87CEEB), lifestyle photography style, targeting women 25-40." The Copy Agent does not write generic captions — it writes in a warm, approachable tone with moderate emoji usage and pulls from pre-built hashtag pools.

Color Extraction from Logo

The smartest part of the onboarding is extracting brand colors directly from the uploaded logo, so the user does not even need to know their hex codes:

from PIL import Image
from collections import Counter

def extract_brand_colors(logo_path, num_colors=5):
    """Extract dominant colors from a logo image."""
    img = Image.open(logo_path).convert("RGB")
    img = img.resize((150, 150))  # Downsample for speed
    
    pixels = list(img.getdata())
    
    # Filter out near-white and near-black (background/outline)
    filtered = [
        p for p in pixels
        if not (p[0] > 240 and p[1] > 240 and p[2] > 240)  # not white
        and not (p[0] < 15 and p[1] < 15 and p[2] < 15)     # not black
    ]
    
    # Quantize to reduce similar colors
    quantized = [
        (r // 32 * 32, g // 32 * 32, b // 32 * 32)
        for r, g, b in filtered
    ]
    
    most_common = Counter(quantized).most_common(num_colors)
    hex_colors = [f"#{r:02x}{g:02x}{b:02x}" for (r, g, b), _ in most_common]
    
    return hex_colors

# Usage
colors = extract_brand_colors("uploads/coastalvibe-logo.png")
# → ["#e06060", "#e09060", "#80c0e0", "#402020", "#c0a080"]

For more sophisticated extraction, pass the logo to a VLM and ask it to identify the brand colors, suggest complementary palettes, and recommend which color should be primary versus accent. The VLM understands visual hierarchy in a way that pixel counting cannot.

Why This Matters

The onboarding flow transforms the creative pipeline from a developer tool into a self-service product. A bakery owner uploads their logo, picks their pink-and-cream color scheme, types "artisan bakery specializing in sourdough bread and pastries," and the system knows enough to generate on-brand campaigns from a single prompt forever after.

No design brief template. No brand guidelines PDF. No creative agency consultation. Upload, describe, go.

The best onboarding is the one where the user provides the minimum and the system derives the maximum. Three inputs — logo, colors, description — are enough to power an infinite number of on-brand campaigns.

9. What Is Coming Next

The creative AI stack is moving fast. Here is what I expect to reshape this space within the next 12 months.

Real-Time Video Generation

Veo 3.1 and Sora 2 currently take 30-120 seconds per clip. Within a year, we will see generation times drop to under 10 seconds for short clips, enabling real-time creative iteration — generate a clip, review it, adjust the prompt, regenerate, all within a conversational loop.

Consistent Characters Across Assets

One of the hardest problems today: generating multiple images where the same character appears consistently. FLUX 2 from Black Forest Labs now supports native multi-reference consistency with up to 10 reference images, maintaining identity, pose style, and clothing across shots. FLUX Kontext extends this further for scene-level editing and re-contextualization. Pikascenes from Pika Labs enables scene-consistent video generation. With these capabilities already available, a campaign can have a consistent model/character across every image and video asset without needing a real photoshoot.

Audio-Native Video

Veo 3.1, Sora 2 Pro, and Kling 3 Pro all now generate video with native audio — sound effects, ambient noise, and dialogue. The pipeline no longer needs a separate audio layer for most use cases. The frontier is now about improving audio quality and enabling multi-speaker dialogue with distinct voices and natural conversational flow.

Closed-Loop Optimization

The next frontier: agents that do not just create content but post it, measure engagement, and optimize future content based on results. The pipeline becomes:

Generate campaign → Post to platforms → Measure engagement (24-48h)
    ↑                                              │
    └──── Adjust creative direction ◄──────────────┘

An agent monitors click-through rates, engagement ratios, and conversion metrics. Low-performing assets get replaced. High-performing patterns get reinforced. The creative system learns what works for your specific audience, not from general training data but from your actual performance data.

Personalized Campaigns at Scale

Instead of one campaign for everyone, generate personalized variants for audience segments. A summer sale campaign might produce beach imagery for coastal markets, pool imagery for suburban markets, and rooftop imagery for urban markets — all from a single brief, all generated in under 5 minutes total.


10. Builder's Checklist

If you are building a multi-agent creative system, here is the implementation order I recommend:

  • [ ] Choose your LLM for orchestration (GPT-5.2 or Gemini 3 Pro) and copy writing (Claude Sonnet 4.5)
  • [ ] Set up image generation with one primary API (GPT Image 1.5 recommended) and one budget option for iteration (FLUX 2 Klein)
  • [ ] Implement programmatic image resizing for all target platforms using Sharp (Node.js) or Pillow (Python)
  • [ ] Build the platform spec config file — all dimensions, aspect ratios, and format requirements in one JSON
  • [ ] Implement the Creative Director Agent with structured output (JSON creative spec)
  • [ ] Build the Copy Agent with platform-specific system prompts
  • [ ] Add video generation with one provider (Veo 3.1 or Pika 2.2 to start)
  • [ ] Implement programmatic video assembly — brand intro + AI clip + CTA end card using FFmpeg or Remotion
  • [ ] Add VLM quality check loop (Gemini 3 Pro — generate → review → accept/regenerate)
  • [ ] Wire the agents together with async orchestration (parallel execution for visual/copy/video)
  • [ ] Build the brand onboarding flow — logo upload, color extraction, business description → auto-generated Brand Profile
  • [ ] Add brand asset management — logo overlays, color palettes, font files
  • [ ] Implement cost tracking per campaign (log every API call with its cost)
  • [ ] Build the output packager — organized folders per platform with proper naming conventions
  • [ ] Add error handling and retry logic (max 3 regeneration attempts per asset)
  • [ ] Set up monitoring — track generation success rates, average costs, and quality scores over time

11. Closing

I built a system like this — a multi-agent creative pipeline that turns a single text prompt into a complete, platform-optimized marketing campaign. I watched it generate an entire Growth Week's worth of creative assets in minutes instead of days. The technology works. It is not theoretical. It is not a demo.

The tools described in this post are all publicly available today. The architecture is straightforward. The cost is negligible compared to traditional creative production. The quality, with proper VLM review loops, is good enough for production use.

What makes this genuinely transformative is not any single tool — it is the system design. Specialized agents doing focused work. AI for creative decisions, programmatic tools for mechanical engineering. Parallel execution for speed. Quality gates for reliability. The same principles that make any good software system work, applied to creative production.

The creative industry is about to be restructured. Not by replacing humans — the best campaigns will always need human creative direction, human taste, human judgment about what resonates emotionally. But the mechanical production that turns a good idea into platform-ready assets across six social media channels? That can be fully automated, today, for under two dollars.

The future of creative production is not AI versus humans. It is AI handling the mechanics so humans can focus on the meaning.

This is Part 6 of the AI Agent Systems series.


Built by Tamil. Powered by multi-agent orchestration and too many API keys.


Visual Gallery

Multi-agent creative operations war room
Multi-agent creative operations war room
Prompt to campaign automation flow
Prompt to campaign automation flow
Cross-platform social asset generation board
Cross-platform social asset generation board