What It Is Actually Like to Build AI Agents for a Living: My 3 Months as an AI Agent Developer Intern at Autohive
By Tamil Selvan Gunasekaran
The Call That Changed Everything
In November 2025, most people were still debating whether AI agents were hype or real. I was about to find out firsthand.
I joined Autohive as an AI Agent Developer Intern — not to watch from the sidelines, but to build. To ship. To put real agents into the hands of real businesses and see what happens when artificial intelligence meets messy, unpredictable, human reality.
What followed was three months of the most intense, technically demanding, and genuinely thrilling engineering work I have ever done. I built over 15 production integrations. I designed and shipped an AI evaluation arena. I constructed a multi-agent creative pipeline that turns a single text prompt into marketing-ready assets for every social media platform. I pushed code to production almost every day. And I learned more about what the future of software actually looks like than any classroom, course, or tutorial could ever teach.
This is not a polished retrospective. This is the raw story — the 2 AM debugging sessions, the OAuth flows that fought back, the moment I watched an AI agent I built generate an entire marketing campaign from a single sentence, and the quiet satisfaction of seeing your code serve real clients doing real work.
If you are curious about what it actually feels like to stand at the frontier of AI agent development — not the conference talks, not the Twitter threads, the actual daily experience — this is that story.
1. Week One: Drinking from the Firehose
Day 1: November 17, 2025
My first day started with two meetings and a codebase that was both enormous and beautifully architected.
Autohive is a production AI agent platform with a custom multi-provider LLM execution engine that supports OpenAI, Anthropic, Google Gemini, Bedrock, xAI, and self-hosted models through Ollama. The architecture is designed for real-time communication throughout — every conversation, tool call, and status update streams live to the dashboard.
The first thing Ninos — the CTO — told me was: "You are not here to learn. You are here to ship." That set the tone for everything that followed.
My onboarding task was to review the Xero integration that another developer, Lohitya, had built. Not just read the code — understand the OAuth flow, the webhook handling, the data mapping between Xero's API schema and Autohive's internal entity model. I had to write up feedback and propose improvements.
This was not an onboarding exercise. This was production code that served paying clients. My review mattered.
I spent the rest of that first week getting my head around the integration architecture. Autohive has a Python-based integrations SDK that runs separately from the main .NET platform. Each integration is a self-contained module with action handlers, OAuth configuration, and a JSON schema that describes what the integration can do. The platform discovers these schemas at runtime and exposes them as tools that agents can call.
The elegance of this design did not hit me immediately. It hit me two weeks later, when I built my first integration in a single day.
The Mailchimp Integration: My First Real Ship
By November 19, I was building the Mailchimp integration from scratch. Not a toy. A production integration for real clients — Nadia and Joe's clients needed email automation through their agents.
Here is what "building an integration" actually means:
- Research the API: Read every page of Mailchimp's developer docs. Understand rate limits, authentication flows, data models, pagination patterns, webhook capabilities. This alone took half a day.
- Design the solution architecture: Which endpoints do we need? What actions should the agent be able to perform? Send emails, manage lists, create campaigns, check analytics? I documented all of it and created a formal architecture doc.
- Implement the action handlers: Each action is a Python function that receives structured input from the agent, calls the Mailchimp API, and returns structured output. Error handling is not optional — a failed API call needs to produce a message the agent can reason about, not a stack trace.
- Build the OAuth flow: Mailchimp uses OAuth 2.0 with some quirks. The token exchange, refresh logic, and scope management all had to work perfectly because once this ships, real users are connecting their real Mailchimp accounts.
- Test with sample data: Not unit tests in isolation — actual API calls against a real Mailchimp account, verifying that emails send, lists update, and campaigns create correctly.
- Document everything: Setup guides, usage instructions, error handling documentation. If another developer — or a client — cannot set this up without calling me, I failed.
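To make the "action handler" idea concrete, here is a minimal sketch in Python. The function name, input shape, and the injected `api_call` helper are all illustrative, not the actual Autohive SDK; the point is the contract: structured input in, structured output out, and errors the agent can reason about.

```python
class ApiError(Exception):
    """Raised by the transport layer when the remote API rejects a call."""

def add_subscriber(params: dict, api_call) -> dict:
    # Hypothetical Mailchimp action handler; `api_call` performs the HTTP
    # request (injected here so the handler is testable without a live account).
    try:
        data = api_call(
            "POST",
            f"/lists/{params['list_id']}/members",
            {"email_address": params["email"], "status": "subscribed"},
        )
        return {"ok": True, "member_id": data.get("id")}
    except ApiError as exc:
        # A message the agent can reason about -- never a raw stack trace
        return {"ok": False, "error": f"Mailchimp rejected the request: {exc}"}
```

The injected transport is also what makes "test with sample data" cheap: swap in a recorded or fake `api_call` and the handler logic is exercised end to end.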
I completed all seven Jira tickets (IN-99 through IN-105) for the Mailchimp integration in two days. Research, architecture, implementation, error handling, documentation — the full lifecycle.
That pace never slowed down.
2. Building Agents That Think: Tag Agency and Google Play
The Tag Agency Project: Two Agents That Read 1,800 Emails
Tom runs a creative agency called Tag. His inbox was drowning — over 1,800 emails, a mix of newsletters, client communications, brand pitches, and personal messages. He needed AI to sort through the chaos.
I built two agents for him.
Agent 1: The Inbox Sorter
This agent classifies every email as either a newsletter or a real person. Sounds simple. It is not.
The classification logic had to handle edge cases that would make a rule-based system weep. Marketing emails from real contacts. Automated responses from real people. Newsletters that are personalized enough to look like personal emails. Forwarded chains that start as newsletters and end as personal conversations.
I implemented the classification using the LLM's reasoning capabilities, but the real engineering challenge was the consolidation feature. Once newsletters were identified, the agent consolidated daily and weekly newsletters into a single summary digest — instead of Tom seeing 47 individual newsletter emails, he would see one summary with the highlights from all of them.
The constraint that made this interesting: credit optimization. Processing 1,800+ emails through an LLM is expensive. I had to design the pipeline to batch emails efficiently, minimize token usage per classification, and avoid redundant processing. This was my first encounter with the cost engineering that separates toy AI from production AI.
Here is how I approached the token optimization problem technically:
Naive approach:
1,800 emails × ~500 tokens/email = 900,000 input tokens
+ classification output per email = ~50 tokens × 1,800 = 90,000 output tokens
Total: ~990,000 tokens → expensive, slow
Optimized approach:
1. Pre-filter with regex: strip HTML, signatures, footers → reduce avg to ~200 tokens
2. Batch classify: group 10-15 emails per LLM call with structured JSON output
3. Cache sender patterns: once "newsletter@company.com" is classified, skip LLM for future emails from same sender
4. Two-pass architecture:
Pass 1: Heuristic classification (sender domain, unsubscribe headers, bulk-send headers)
→ Catches ~60% of newsletters without any LLM call
Pass 2: LLM classification only for ambiguous emails (~720 remaining)
→ 720 emails / 12 per batch = 60 LLM calls
Final token usage: ~180,000 tokens (82% reduction)
The two-pass architecture was the key insight. Email headers contain a wealth of classification signals — List-Unsubscribe headers, Precedence: bulk, known newsletter sender domains — that are completely deterministic. By running a heuristic pass first, I reduced the LLM workload to only the genuinely ambiguous cases. This is a pattern I would use again and again: use deterministic logic for what you can, reserve LLM reasoning for what you must.
Agent 2: The Relationship Finder
This one was more ambitious. Given an email thread, the agent would extract the brands and contacts mentioned, then cross-reference them against Tom's LinkedIn connections to find common relationships.
The workflow: email content → entity extraction (brands, people, companies) → LinkedIn API lookup → connection mapping → relationship summary.
I integrated the LinkedIn API for connection lookups, built the brand and contact extraction pipeline, and designed the workflow that connected it all. The agent could take a cold email from a brand and tell Tom: "You are connected to their CMO through Alex, who you worked with at the Cannes project."
That is not automation. That is augmented relationship intelligence. And I built it in my second week.
The Google Play Review Agent: From Custom to Generic
Vedant had built a Google Play review agent specifically for Raygun — it monitored their app reviews and analyzed sentiment. My job was to take that single-client agent and make it work for any business.
The technical challenge: every business cares about different things in their reviews. A retail app cares about "returns" and "shipping." A finance app cares about "security" and "fees." A food delivery app cares about "delivery time" and "cold food."
I designed a generic architecture where:
- The agent detects the business type from the app's website URL
- Based on business type, it dynamically adjusts its review analysis criteria — which keywords to scan for, which sentiment patterns to flag, which issues to escalate
- The UI lets the business configure their own criteria on top of the automatic detection
I renamed the original to "Raygun Google Play Review Agent" to preserve it, built the new generic version, created a configurable UI, implemented the business-type detection, refactored the analysis criteria to be customizable, and updated all documentation and onboarding instructions.
Six Jira tickets. All completed. New avatar created using the Avatar Creator agent — even the branding was AI-generated.
3. The Integration Factory: 15 Integrations in 12 Weeks
Between November 2025 and February 2026, I built, expanded, or shipped to production more integrations than most teams deliver in a year. Here is the full map of what I touched:
| Integration | What I Did | Key Challenge |
|---|---|---|
| Mailchimp | Full build from scratch | OAuth + campaign management + credit optimization |
| Zoom | Full build, app submission | App marketplace compliance, meeting APIs |
| Shopify | GraphQL migration, app submission, compliance webhooks | REST→GraphQL migration before Feb 2026 deadline, GDPR data deletion webhooks |
| Stripe | Full build, marketplace submission | OAuth for marketplace, migration scripts, separate test accounts |
| GitHub | Expand for multiple use cases | Output schema corrections, PR automation |
| Google Ads | Expand actions, debug political errors | API endpoint testing, error category handling |
| LinkedIn | Migrate to latest SDK, expand actions, build Ads integration | SDK migration in production, Community Management API |
| TikTok | Full build, sandbox testing | App verification, video posting actions, server downtime handling |
| Microsoft | Excel, Word, PowerPoint support | Multi-file-type handling, JSON schema expansion |
| Uber | Integration build | Merged to production |
| HeyGen | OAuth update | Pushed to production |
| NZBN | Custom 2-legged OAuth, action expansion | Government API, formal application process |
| IRD | Due diligence, API sandbox application | Government compliance, due diligence requirements |
| Retail Express | API integration, beta testing | Client-specific API requirements |
| Ahrefs | Integration build | SEO data integration |
| Float | API review | Client True requirements |
| Productboard | Assistant agent build | 1,400 notes extracted, sub-feature workaround |
Each integration taught me something different. Let me walk through the ones that pushed me the hardest.
Shopify: A Masterclass in Production Engineering
The Shopify integration was the project that turned me from someone who writes code into someone who ships production software.
It started simple: expand the existing integration, test it, push it to prod. Then three things happened.
First, the GraphQL migration. Shopify announced that their REST API would lose support by February 2026. Our entire integration was built on REST. I had to migrate every endpoint to GraphQL — different query patterns, different pagination (cursor-based instead of page-based), different error handling. I rewrote the integration, tested locally, and pushed to PR.
The migration was not a simple find-and-replace. REST and GraphQL have fundamentally different data-fetching philosophies:
REST (old):
GET /admin/api/2024-01/products.json?limit=50&page=2
→ Returns fixed schema, all fields, paginated by page number
→ N+1 problem: need separate calls for variants, images, metafields
GraphQL (new):
POST /admin/api/2024-01/graphql.json
{
  products(first: 50, after: "cursor_abc") {
    edges {
      node {
        title
        handle
        variants(first: 10) { edges { node { price, sku } } }
      }
      cursor
    }
    pageInfo { hasNextPage }
  }
}
→ Select exactly the fields you need
→ Nested resources in a single call
→ Cursor-based pagination (no page numbers)
Every action handler had to be rewritten. The error handling changed completely — GraphQL returns 200 OK even on errors, with error details buried in the response body under userErrors. I had to build a custom error parser that extracts GraphQL-specific errors and translates them into messages the agent can reason about. Rate limiting also changed: Shopify's GraphQL API uses a cost-based throttling system where each query has a computed cost, and you have a budget that refills over time — completely different from the simple request-per-second limit on REST.
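The error-parsing idea can be sketched like this. The top-level `errors` and nested `userErrors` field names come from Shopify's documented GraphQL response shape; the wrapper function itself is an illustrative sketch, not the production parser:

```python
def parse_shopify_graphql(response_json: dict) -> dict:
    # GraphQL answers HTTP 200 even on failure: top-level "errors" for query
    # or throttling problems, "userErrors" nested inside mutation payloads.
    if response_json.get("errors"):
        msgs = [e.get("message", "unknown") for e in response_json["errors"]]
        return {"ok": False, "error": "; ".join(msgs)}
    data = response_json.get("data") or {}
    for payload in data.values():
        if not isinstance(payload, dict):
            continue
        user_errors = payload.get("userErrors") or []
        if user_errors:
            msgs = [
                f"{'/'.join(e.get('field') or ['?'])}: {e['message']}"
                for e in user_errors
            ]
            return {"ok": False, "error": "; ".join(msgs)}
    return {"ok": True, "data": data}
```

Every action handler funnels its response through a normalizer like this, so the agent always sees a clean ok/error shape regardless of where Shopify buried the failure.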
Second, the app submission. We wanted Autohive listed in the Shopify App Store. That meant passing their security compliance review. Most of the checks passed automatically. One did not: mandatory privacy webhooks.
Shopify requires that any app in their marketplace implements data deletion webhooks. When a customer uninstalls your app, you must delete their data within 48 hours and confirm the deletion via webhook. This is not optional — fail this, and your app is rejected.
I implemented the full compliance pipeline: webhook endpoints for customer data deletion requests, customer data access requests, and shop data erasure requests. Each webhook validates the HMAC signature from Shopify, processes the request, deletes the relevant data from our database, and responds with confirmation.
The HMAC validation is worth explaining because it is a security pattern every integration developer needs to understand:
Webhook Request from Shopify:
Headers: X-Shopify-Hmac-Sha256: <base64-encoded HMAC>
Body: { "shop_domain": "store.myshopify.com", ... }
Validation:
1. Extract raw request body (before any JSON parsing)
2. Compute HMAC-SHA256 using your app's shared secret as the key
3. Base64-encode the result
4. Compare with the header value using constant-time comparison
(timing-safe to prevent side-channel attacks)
5. If mismatch → reject with 401 (someone is spoofing webhooks)
6. If match → process the data deletion request
Without this validation, anyone who discovers your webhook URL could send fake deletion requests and wipe customer data. Security is not a feature — it is a prerequisite.
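In Python, the whole validation fits in a few lines, with the standard library's `hmac.compare_digest` providing the constant-time comparison:

```python
import base64
import hashlib
import hmac

def verify_shopify_webhook(raw_body: bytes, header_hmac: str, shared_secret: str) -> bool:
    """Validate the X-Shopify-Hmac-Sha256 header against the RAW request
    body (before any JSON parsing, as step 1 above requires)."""
    digest = hmac.new(shared_secret.encode(), raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest).decode()
    # compare_digest is timing-safe: comparison time does not leak
    # how many leading characters matched
    return hmac.compare_digest(expected, header_hmac)
```

A request that fails this check gets a 401 before any deletion logic runs.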
I wrote a detailed compliance document explaining every change — the HMAC validation flow, the data deletion pipeline, and the specific webhook endpoints required by Shopify's review process.
Third, the production bugs. The Shop parameter had edge cases that only appeared in beta testing. Some Shopify stores had unusual configurations that broke the parameter parsing. I improved the edge-case handling and pushed fixes to PR.
Three separate PRs. Two rounds of review comments. Multiple beta testing cycles. And then it was live, in production, handling real Shopify store data for real clients.
That is what shipping looks like.
Stripe: From Integration to Marketplace
The Stripe integration was similar in scope but different in complexity. We were not just building an integration — we were building a Stripe App that would live inside Stripe's marketplace.
I created a new repository with a full setup guide, implemented OAuth (required for marketplace apps — API key auth is not enough), wrote migration scripts for the database changes, set up a separate test account for the Stripe review team, and prepared the submission.
The OAuth flow for Stripe is its own adventure. Stripe uses a connected account model where your app requests permissions on behalf of the user's Stripe account. The token exchange, scope management, and refresh logic are all subtly different from standard OAuth 2.0. Getting this right took multiple iterations.
Here is the flow I implemented, and why each step is tricky:
Stripe OAuth Flow (Connected Accounts):
1. User clicks "Connect Stripe" in Autohive
→ Redirect to: https://connect.stripe.com/oauth/authorize
?client_id=ca_xxx
&response_type=code
&scope=read_write
&state=<CSRF_token> ← prevents cross-site request forgery
&redirect_uri=<callback>
2. User authorizes on Stripe's hosted page
→ Stripe redirects to callback with ?code=ac_xxx&state=<CSRF_token>
3. Server exchanges code for tokens
POST https://connect.stripe.com/oauth/token
{
grant_type: "authorization_code",
code: "ac_xxx",
client_secret: "sk_xxx" ← NOT the connected account's key
}
→ Returns: { access_token, refresh_token, stripe_user_id, scope }
4. Key difference from standard OAuth:
- No token expiration (tokens are long-lived)
- But users can revoke via Stripe dashboard at any time
- Must handle "account.application.deauthorized" webhook
- The access_token IS a Stripe secret key scoped to that account
- Must be encrypted at rest (AES-256 in our case)
5. Deauthorization webhook:
When user disconnects → Stripe sends webhook → we revoke stored tokens
→ agent loses access gracefully (not a crash)
The subtlety that cost me a full day: Stripe's OAuth tokens do not expire, but they can be revoked at any time by the user from their Stripe dashboard. If you do not handle the account.application.deauthorized webhook, your agent will try to make API calls with a revoked token and get cryptic 401 errors. I implemented the deauthorization webhook handler so that when a user disconnects, the integration state updates cleanly and the agent knows it no longer has access — instead of failing silently.
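A sketch of that deauthorization path, assuming a simple in-memory token store for illustration (the real integration persists tokens encrypted, as noted above):

```python
def handle_stripe_event(event: dict, token_store: dict) -> str:
    """Illustrative webhook handler. Stripe Connect events carry the
    connected account id in the top-level 'account' field."""
    if event.get("type") != "account.application.deauthorized":
        return "ignored"
    account_id = event.get("account")
    # Drop the stored token so the agent stops calling with a revoked key
    if token_store.pop(account_id, None) is not None:
        return f"revoked credentials for {account_id}"
    return "already disconnected"
```

With this in place, a disconnect flips the integration to a clean "not connected" state instead of leaving the agent to discover the revocation through cryptic 401s.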
When we finally pushed to production and submitted to the Stripe marketplace — and it went live — that was one of the high points of the entire internship. The Stripe team approved our app. It is live now. Real Stripe users can install Autohive directly from the Stripe marketplace.
NZBN and IRD: Government APIs Are a Different Universe
Most API integrations follow a pattern: read docs, get API key, make calls. Government APIs do not follow patterns. They follow bureaucracy.
The NZBN (New Zealand Business Number) API required a formal application process just to get sandbox access. I applied, waited, followed up, waited more. When access finally came through, I discovered they use a 2-legged OAuth flow — no user authorization step, just server-to-server authentication with client credentials. This is unusual enough that I had to implement a custom OAuth handler specifically for it.
The IRD (Inland Revenue Department) was even more involved. They require a due diligence application before they will even consider giving you API access. I completed their requirements, submitted the formal application, and managed the back-and-forth correspondence. When the IRD came back with questions about how we use AI, what LLM providers we connect to, and how data is handled — I was the one who answered them.
Building integrations with government APIs taught me that technical ability is maybe 40% of the job. The other 60% is process management, compliance, documentation, and patience.
4. The Evaluation Arena: Building AI That Judges AI
This was the project where I felt like I was building something that does not exist yet anywhere else.
The Eval Arena is a system where you can pit AI models against each other on real tasks, score them across multiple dimensions, and let an LLM judge evaluate the results — all while tracking cost, latency, and token usage. Think of it as a private, customizable version of LMSYS Chatbot Arena, but designed for production agent evaluation rather than general chat.
What I Built
The Admin UI with Wizard-Based AI Test Generation
Most evaluation systems require you to manually write test cases. That does not scale. I built a wizard that lets you describe what you want to test ("customer support for an e-commerce business") and then uses AI to automatically generate test cases — diverse, edge-case-aware, and categorized by task type.
The wizard flow: select agent type → describe the domain → AI generates 10-20 test cases → human reviews and adjusts → tests are saved with versioned evaluation criteria.
The test generation prompt itself was carefully engineered. A naive prompt like "generate test cases for customer support" produces bland, predictable tests. I designed the generation to produce adversarial diversity:
Test Generation Strategy:
1. Core functionality tests (40%):
"Customer asks about return policy" — straightforward, must-pass cases
2. Edge cases (25%):
"Customer sends message in mixed English and Mandarin"
"Customer pastes a 5,000-word complaint"
"Customer asks about a product that was discontinued yesterday"
3. Adversarial tests (20%):
"Customer tries to get the agent to ignore its instructions"
"Customer asks the agent to perform an action outside its scope"
"Customer provides contradictory information across messages"
4. Regression tests (15%):
Generated from historical failure patterns in production logs
"Scenarios that caused previous agent versions to hallucinate or fail"
Each generated test includes: the input prompt, expected output characteristics (not exact text — behavioral expectations), evaluation criteria with weights, and difficulty classification. The human review step lets you adjust any of these before the test enters the permanent suite.
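As a rough sketch, a generated test might be carried in a structure like this (the real Eval Arena schema may differ); the `validate` method enforces that criterion weights sum to 1.0 so scores stay comparable across tests:

```python
from dataclasses import dataclass, field

@dataclass
class GeneratedTest:
    """Illustrative shape for one wizard-generated test case."""
    prompt: str
    category: str                                   # core | edge | adversarial | regression
    expected_behavior: str                          # behavioral expectation, not exact text
    criteria: dict = field(default_factory=dict)    # criterion name -> weight
    difficulty: str = "medium"

    def validate(self) -> bool:
        # Weights must be positive and sum to 1.0 for comparable scoring
        weights = list(self.criteria.values())
        return bool(weights) and all(w > 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-6
```

The human-review step in the wizard edits exactly these fields before a test enters the permanent suite.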
LLM Judge Scoring and Reasoning
Each test case gets evaluated by a separate judge model. The judge does not just output a number — it provides structured reasoning for each criterion, confidence scores, and flags for uncertainty. I built the Admin pages to display this reasoning so you can actually understand why a model scored the way it did.
The judge prompt is one of the most complex prompts in the entire system. It needs to:
Judge Evaluation Pipeline:
1. Receive: {test_prompt, model_output, expected_behavior, criteria[]}
2. For each criterion (e.g., accuracy, completeness, tone):
a. Evaluate the output against the criterion
b. Assign a score (0-10)
c. Write reasoning (mandatory — no score without explanation)
d. Assign confidence (0.0-1.0)
Low confidence triggers: ambiguous criteria, borderline quality,
domain expertise required
3. Output structured JSON:
{
"criteria_scores": [
{ "criterion_code": "accuracy", "score": 8.5,
"reasoning": "...", "confidence": 0.92 },
...
],
"overall_assessment": "...",
"improvement_suggestions": ["..."]
}
4. Anti-bias measures:
- Judge never sees which model produced the output
- Position randomization in pairwise comparisons
- No self-judging (model cannot judge its own outputs)
- Periodic consistency checks (re-judge 10% of samples)
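A prompt can only request structured output; the pipeline still has to enforce it. A minimal validator for the JSON shape above might look like this (an assumed helper, not the production code):

```python
import json

def validate_judgment(raw: str) -> dict:
    """Parse and sanity-check the judge's JSON. 'No score without
    explanation' is enforced here in code, not just asked for in the prompt."""
    result = json.loads(raw)
    for item in result["criteria_scores"]:
        if not (0 <= item["score"] <= 10):
            raise ValueError(f"score out of range: {item['score']}")
        if not (0.0 <= item["confidence"] <= 1.0):
            raise ValueError(f"confidence out of range: {item['confidence']}")
        if not item.get("reasoning", "").strip():
            raise ValueError(f"missing reasoning for {item['criterion_code']}")
    return result
```

A judgment that fails validation is re-requested rather than silently recorded.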
The Leaderboard
Models are ranked using an Elo-based system (the same algorithm used in chess rankings). Win, lose, or tie on each test case. The leaderboard shows per-task-type performance so you can see that Model A is great at customer support but weak at data extraction, while Model B is the opposite.
The Elo update uses the standard expected-score formula (a logistic curve of the Bradley-Terry family):
Expected Score:
E_A = 1 / (1 + 10^((R_B - R_A) / 400))
After a match:
R_A_new = R_A + K × (S_A - E_A)
where:
R_A, R_B = current ratings
S_A = actual score (1.0 for win, 0.5 for tie, 0.0 for loss)
K = 32 (update sensitivity; a common convention in chess-style rating systems)
Win condition: model scores ≥ 7.0 on the test case
Tie condition: both models within 1.0 point of each other
Loss condition: model scores < 7.0 or significantly below opponent
The per-task-type breakdown was critical. An overall Elo is misleading — a model that excels at creative writing but fails at structured data extraction would look mediocre overall. By computing separate Elo ratings per task type, you get actionable intelligence: use this model for this job, that model for that job.
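The update rule above translates directly into code. A minimal sketch:

```python
K = 32  # update sensitivity, as above

def expected_score(r_a: float, r_b: float) -> float:
    # E_A = 1 / (1 + 10^((R_B - R_A) / 400))
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a: float, r_b: float, s_a: float) -> tuple[float, float]:
    """One match: s_a is 1.0 for a win, 0.5 for a tie, 0.0 for a loss.
    Ratings are zero-sum, so B moves by the opposite amount."""
    delta = K * (s_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta
```

Per-task-type leaderboards simply keep a separate rating pair per (model, task type) and run this same update.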
Model Comparison Framework
Side-by-side comparison of any two models across all tasks. Visual diffs of their responses, score breakdowns, cost comparisons, latency comparisons. This lets you make informed decisions about which model to use for which agent.
Self-Improvement Prompts
This was the feature I am most proud of. After a model is evaluated, the system generates "self-improvement" prompts — specific, actionable suggestions for how the agent's prompt could be modified to improve performance on the tasks where it scored poorly. The agent can iteratively refine its own instructions based on evaluation feedback.
This is a closed loop: deploy agent → evaluate → generate improvement suggestions → refine prompt → re-evaluate → deploy improved agent. Automated self-optimization. I wrote the database migration scripts for the new data structures, implemented the backend logic, and pushed to PR.
The Bug-Fixing Marathon
The Eval Arena touched 36 files across the codebase. Auth improvements, input validation, rate limiting, null-safety fixes, casting issues with weights and grader configs, RunTest/RunBatch error handling — the scope was enormous.
One particular bug haunted me for an entire afternoon. The AI test generation feature would intermittently fail with a casting exception. The root cause: the grader configuration stored weights as decimal in the database but the evaluation engine expected double. A silent type mismatch that only surfaced under specific test configurations. The fix was straightforward; finding it was not.
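The actual bug lived in .NET (`decimal` vs `double`), but the failure class is easy to reproduce in Python with `Decimal` vs `float`: the types look interchangeable until a comparison or an arithmetic operation quietly disagrees.

```python
from decimal import Decimal

# The grader config comes back from the DB as an exact decimal type;
# the scoring math assumes binary floating point.
weight_from_db = Decimal("0.35")
expected_weight = 0.35

# Equality quietly fails even though both "look like" 0.35,
# because the float is really 0.34999999999999997...
assert weight_from_db != expected_weight

# Mixing the types in arithmetic raises -- the intermittent crash
try:
    total = weight_from_db + 0.65
except TypeError:
    total = float(weight_from_db) + 0.65  # the fix: convert at the boundary

assert abs(total - 1.0) < 1e-9
```

The lesson generalizes: pick one numeric representation at the storage boundary and convert exactly once.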
I also implemented rate limiting for the arena endpoints — you do not want a user accidentally triggering 500 model evaluations simultaneously. And auth improvements to ensure the arena is admin-only in production.
On January 22, I wrote in my log: "Completed comprehensive bug fixes and security hardening for the Agent Eval Arena feature including auth improvements, input validation, rate limiting, and null-safety fixes across 36 files." That single sentence represents roughly 40 hours of focused work.
The Recommendations Page was the final piece. It shows LLM-selected models as the primary recommendation — "Based on your evaluation results, we recommend Claude Sonnet for customer support and GPT-4o for data extraction" — with algorithmic scoring data as supporting reference. The system does not just score models. It tells you which one to use.
5. The Creative Pipeline: From Prompt to Every Social Media Platform
This is the project that made me realize I was working on something genuinely ahead of its time.
The brief was deceptively simple: a creative agency needs to produce marketing content. They have ideas. They need those ideas turned into images, videos, captions, hashtags, and formatted posts for Instagram, TikTok, LinkedIn, Twitter, Facebook, and YouTube — each platform with its own dimensions, aspect ratios, tone, and character limits.
Currently, this process involves a creative director, a graphic designer, a copywriter, a video editor, and a social media manager. Five people. Multiple rounds of revision. Days of turnaround.
What if one prompt could do it all?
The Architecture: A Multi-Agent Creative System
I did not build one agent. I built a system of agents that collaborate like a creative team.
User Prompt: "Launch campaign for summer sale, 30% off, beachwear collection"
│
▼
┌─────────────────────────────────────┐
│ Creative Director Agent │
│ (Orchestrator - LLM) │
│ │
│ • Interprets the brief │
│ • Defines visual direction │
│ • Sets tone and style guidelines │
│ • Delegates to specialist agents │
└──────────────┬──────────────────────┘
│
┌──────────┼──────────────┐
▼ ▼ ▼
┌────────┐ ┌────────────┐ ┌──────────────┐
│ Visual │ │ Copy │ │ Video │
│ Agent │ │ Agent │ │ Agent │
│ (VLM) │ │ (LLM) │ │ (Gemini Veo) │
└───┬────┘ └─────┬──────┘ └──────┬───────┘
│ │ │
▼ ▼ ▼
Images Platform- Short-form
in all specific video clips
formats captions & with text
& sizes hashtags overlays
│ │ │
└────────────┼───────────────┘
▼
┌─────────────────────────────┐
│ Platform Formatter Agent │
│ │
│ Instagram: 1080x1080, │
│ 1080x1350, 1080x1920 │
│ TikTok: 1080x1920 │
│ LinkedIn: 1200x627 │
│ Twitter: 1600x900 │
│ Facebook: 1200x630 │
│ YouTube: 1280x720 thumb │
│ │
│ + Platform-specific copy │
│ + Hashtag strategies │
│ + Posting time suggestions │
└──────────────┬──────────────┘
▼
Content Folder
(All assets saved,
organized by platform)
The Visual Agent: LLMs Meet Vision Language Models
The Visual Agent is where things get genuinely futuristic. It does not just generate images — it understands visual context and generates images that are strategically designed for marketing impact.
The pipeline uses multiple AI capabilities in sequence:
- Prompt Refinement (LLM): The raw creative brief is expanded into a detailed image generation prompt. The LLM understands marketing principles — it knows that a summer sale needs warm colors, outdoor settings, lifestyle imagery. It crafts prompts that will produce marketing-effective visuals, not just technically correct ones.
- Image Generation (VLM): The refined prompt is sent to a vision-language model for image generation. The system supports multiple generation backends — depending on the style required, it might use different models for photorealistic vs. illustrated vs. abstract styles.
- Image Understanding and Adaptation (VLM): Once the base image is generated, a VLM analyzes it — composition, color palette, text placement opportunities, brand consistency. This analysis feeds into the platform adaptation step.
- Multi-Format Output: The system generates the image in every format needed — square for Instagram feed, portrait for Stories and Reels, landscape for LinkedIn and Twitter, thumbnail for YouTube. This is not simple cropping. The composition is adjusted for each aspect ratio so the focal point remains effective.
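The "not simple cropping" point is easiest to see as geometry. This simplified sketch computes the largest crop of a target aspect ratio that keeps a chosen focal point in frame; the real pipeline goes further and re-composes the image with a VLM rather than cropping:

```python
def crop_box(width: int, height: int, target_ratio: float,
             focal_x: float = 0.5, focal_y: float = 0.5) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) for the largest crop of
    target_ratio (w/h) centered as close to the focal point as possible."""
    if width / height > target_ratio:            # too wide: trim the sides
        new_w, new_h = int(height * target_ratio), height
    else:                                        # too tall: trim top and bottom
        new_w, new_h = width, int(width / target_ratio)
    # Center on the focal point, clamped so the crop stays inside the image
    left = min(max(int(focal_x * width) - new_w // 2, 0), width - new_w)
    top = min(max(int(focal_y * height) - new_h // 2, 0), height - new_h)
    return left, top, left + new_w, top + new_h
```

For example, squaring a 1080x1920 portrait keeps the vertical center by default, while a focal point near the right edge slides the crop window over instead of cutting the subject off.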
I integrated Gemini Veo for video generation — the agent can take the same creative brief and produce short-form video clips suitable for TikTok, Instagram Reels, and YouTube Shorts. The videos include text overlays, transitions, and are formatted to platform specifications.
One of the technical challenges that consumed multiple days was saving generated assets to content folders. When a Gemini Veo video or an AI-generated image is created within a conversation, it needs to be persisted — not just displayed in chat, but saved to the workspace's content area where other agents can access it, the client can download it, and the platform can track it.
I built the content pipeline that handles this: generated image → content folder with proper naming → metadata tagging → accessible to other agents in the workspace. The "nano banana" image pipeline (as we called it internally) became the foundation for multi-agent content retrieval — any agent in the workspace could pull assets that another agent had generated.
The Copy Agent: Platform-Specific Intelligence
The Copy Agent does not write generic captions. It writes platform-native content.
For Instagram: visual-first language, strategic hashtag sets (mix of high-volume and niche), emoji usage that matches the brand voice, call-to-action in the last line.
For LinkedIn: professional tone, industry-relevant framing, longer-form storytelling, strategic use of line breaks for the "see more" click optimization.
For TikTok: casual, trend-aware language, hook in the first line, relevant trending sounds suggested, hashtag strategy optimized for the For You Page algorithm.
For Twitter/X: punchy, under 280 characters, thread structure for longer content, no hashtag spam.
Each platform output is not just reformatted text — it is re-conceived for the platform's culture, algorithm, and user behavior patterns. The LLM understands these differences because the agent's system prompt includes detailed platform guidelines that I researched and codified.
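As a rough illustration of what "researched and codified" platform guidelines can look like in practice, here is a hypothetical, heavily condensed version; the real system prompt sections were far more detailed:

```python
# Hypothetical, condensed platform guidelines; the production prompt
# sections were much richer than these one-liners.
PLATFORM_GUIDELINES = {
    "instagram": "Visual-first language; mix high-volume and niche hashtags; call-to-action in the last line.",
    "linkedin": "Professional tone; longer-form storytelling; line breaks tuned for the 'see more' fold.",
    "tiktok": "Casual, trend-aware language; hook in the first line; hashtags tuned for the For You Page.",
    "twitter": "Punchy and under 280 characters; thread structure for longer ideas; no hashtag spam.",
}

def build_copy_prompt(platforms: list[str]) -> str:
    """Assemble the platform-guidelines section of a Copy Agent system prompt."""
    lines = ["You write platform-native marketing copy. Follow these rules:"]
    for p in platforms:
        lines.append(f"- {p}: {PLATFORM_GUIDELINES[p]}")
    return "\n".join(lines)
```

Keeping the guidelines in data rather than hard-coded prose means adding a platform is a dictionary entry, not a prompt rewrite.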
The Creative Director Agent: Orchestration Is the Hard Part
I collaborated with Wayne and Alex on the Creative Director Agent — the orchestrator that ties everything together. This was the most challenging agent to get right because its job is not to produce content but to direct the production.
The Creative Director Agent:
- Interprets vague briefs ("make something cool for our summer launch") into actionable creative specifications
- Defines the visual language — color palettes, typography direction, imagery style, mood
- Coordinates between the Visual, Copy, and Video agents — ensuring brand consistency across all outputs
- Reviews generated assets (using VLM capabilities) and requests revisions if they do not meet the brief
- Packages the final deliverables in a structured format organized by platform
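The review-and-revise loop at the heart of that orchestration can be sketched as follows. This is a minimal illustration of the pattern, not the production Creative Director; the specialist agent callables and the `review` function are hypothetical stand-ins:

```python
from dataclasses import dataclass

@dataclass
class Asset:
    platform: str
    content: str
    approved: bool = False

def direct_campaign(brief: str, specialists: dict, review, max_rounds: int = 3) -> dict:
    """Dispatch a brief to specialist agents, then review and request
    revisions until every asset meets the brief or rounds run out."""
    # First pass: each specialist drafts from the raw brief.
    assets = {name: agent(brief) for name, agent in specialists.items()}
    for _ in range(max_rounds):
        # Review every unapproved asset; None means "meets the brief".
        feedback = {n: review(a) for n, a in assets.items() if not a.approved}
        if not feedback:
            break
        for name, note in feedback.items():
            if note is None:
                assets[name].approved = True
            else:
                assets[name] = specialists[name](f"{brief}\nRevise: {note}")
    return assets
```

The bounded `max_rounds` matters: without it, a reviewer and a generator that disagree will loop forever, which is exactly the kind of failure mode the fine-tuning rounds described below were hunting.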
The fine-tuning process was iterative and humbling. The first version produced technically correct but creatively bland output. The second version was creative but inconsistent across platforms. The third version — after multiple rounds of prompt engineering, testing with real creative briefs, and feedback from Wayne's design eye — finally hit the sweet spot.
I spent January 27-30 doing nothing but fine-tuning these creative agents for Team Orange's Growth Week. "Couple of testing and iteration" is what my daily log says. What it does not say is that each "iteration" meant regenerating an entire campaign, reviewing every asset across six platforms, identifying what was off (too corporate for TikTok, too casual for LinkedIn, wrong aspect ratio for Stories), adjusting the agent's instructions, and running it again.
By the time Growth Week launched, the pipeline could take a single creative brief and produce a complete, platform-optimized, multi-format marketing campaign in minutes. Not hours. Not days. Minutes.
What Makes This Ahead of the Curve
As of early 2026, most companies are still using AI for one thing at a time. Generate an image here. Write some copy there. Maybe use a chatbot for customer support. The creative pipeline I built at Autohive is something different — it is a coordinated multi-agent system where LLMs, VLMs, and video generation models work together under the direction of an AI creative director, producing publication-ready content across every major platform simultaneously.
This is not a demo. This is production software that a creative agency used for a real campaign during a real growth week. The agents generated real assets that were posted on real social media accounts.
The technology stack — multi-modal AI orchestration, cross-agent content sharing, platform-aware content adaptation, vision-language model integration for both generation and understanding — represents where the industry is heading. Most teams are still figuring out how to make a single chatbot work reliably. We built a creative department.
6. Platform Features That Shaped the Product
Beyond integrations and agents, I built features that became part of Autohive's core platform.
The Write Tool and Copy Content Tool: Expanding the Agent's Hands
Before my changes, agents could generate text and make tool calls — but their ability to produce and manage files was limited. I extended the Write Tool with copy functions, which required expanding the JSON schema that defines what tools are available to agents.
The JSON schema expansion is worth understanding because it is how Autohive teaches LLMs about available tools:
{
  "name": "copy_content",
  "description": "Copy a file from the conversation to the workspace content folder",
  "parameters": {
    "type": "object",
    "properties": {
      "source_path": {
        "type": "string",
        "description": "Path to the file in the conversation context"
      },
      "destination_folder": {
        "type": "string",
        "description": "Target folder in the workspace content area"
      },
      "new_filename": {
        "type": "string",
        "description": "Optional: rename the file during copy"
      }
    },
    "required": ["source_path", "destination_folder"]
  }
}
This schema gets injected into the agent's system prompt via the ToolsSection of the SystemPromptConstructor. The LLM sees this schema and knows it can call copy_content with structured arguments. The ToolExecutor validates the arguments against the schema, executes the copy operation, and returns the result to the agent.
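The validation step the ToolExecutor performs can be approximated in a few lines of Python. This is a simplified sketch covering only required fields and basic types, not a full JSON Schema validator, and `validate_tool_args` is a hypothetical name:

```python
# The copy_content schema from above, used as the validation target.
COPY_CONTENT_SCHEMA = {
    "type": "object",
    "properties": {
        "source_path": {"type": "string"},
        "destination_folder": {"type": "string"},
        "new_filename": {"type": "string"},
    },
    "required": ["source_path", "destination_folder"],
}

def validate_tool_args(args: dict, schema: dict) -> list[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = [f"missing required field: {f}"
              for f in schema["required"] if f not in args]
    type_map = {"string": str, "object": dict}
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            errors.append(f"unexpected field: {key}")
        elif not isinstance(value, type_map[spec["type"]]):
            errors.append(f"{key}: expected {spec['type']}")
    return errors
```

Returning a list of errors rather than raising on the first one matters for agents: the full error list can be fed back to the LLM so it can correct all of its mistakes in a single retry.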
A deceptively simple feature with deep implications: agents can now copy files between a conversation and the workspace's content folder.
Why this matters: in a multi-agent system, one agent might generate a document during a conversation. Another agent — maybe in a different conversation, maybe triggered by a workflow — needs to use that document. Without the Copy Content Tool, the file is trapped in the conversation where it was created.
Beyond the schema change, I implemented the backend logic for file transfer between conversation storage and content areas and tested it across multiple agent configurations. This became the backbone of the creative pipeline's asset management — the Visual Agent generates an image in its conversation, copies it to the content folder, and the Platform Formatter Agent picks it up from there.
Gemini Deep Research Agent
Google's Gemini has a "Deep Research" mode that takes significantly longer than a normal query — it "cooks" for a while, doing multi-step research before producing a comprehensive result. The challenge: Autohive's task execution system expects responses in seconds, not minutes.
I designed and implemented an async pattern with background process holding. Here is the architecture:
Standard Autohive Task Flow (< 30 seconds):
User Message → TaskOrchestrator claims task → ModelExecutor calls LLM
→ Response streams back via SignalR → Task complete
Gemini Deep Research Flow (2-5 minutes):
User Message → TaskOrchestrator claims task
→ Detects: model = Gemini, mode = DeepResearch
→ Spawns BackgroundResearchProcess:
│
├── Opens long-lived HTTP connection to Gemini API
├── Gemini starts multi-step research internally:
│ Step 1: Query decomposition
│ Step 2: Parallel web research
│ Step 3: Source synthesis
│ Step 4: Report generation
│
├── Progress callback every 10-15 seconds:
│ → SignalR push: { status: "researching", step: "Analyzing sources..." }
│ → UI shows animated progress with contextual messages
│
├── Connection keepalive (prevent HTTP timeout):
│ → Periodic heartbeat to prevent proxy/load balancer timeout
│ → Configurable timeout ceiling (default: 5 minutes)
│
└── Research complete:
→ Full report delivered as conversation message
→ Task status → Complete
→ SignalR push: final result
Error handling:
- Connection drop mid-research → retry with exponential backoff
- Timeout exceeded → graceful degradation with partial results
- User cancels → abort signal propagated to background process
The hardest part was the timeout management. Autohive runs behind load balancers and reverse proxies that typically kill connections after 30-60 seconds of inactivity. The Gemini Deep Research API can take 3-5 minutes. I implemented a heartbeat mechanism that sends periodic progress updates through the connection, keeping it alive while Gemini researches. The progress messages are not fake loading bars — they reflect Gemini's actual research phases, so the user sees meaningful status like "Searching for academic sources" and "Cross-referencing findings."
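The keepalive pattern itself is easy to sketch. The production code is server-side C# with SignalR; this Python asyncio version only shows the shape of the pattern: run the long call concurrently and emit a progress event on a fixed interval until it completes. `push` is a stand-in for the SignalR broadcast, and the interval is shortened for illustration:

```python
import asyncio

async def run_with_heartbeat(long_task, push, interval: float = 0.05):
    """Run a long-lived call while emitting periodic progress events so
    intermediaries (proxies, load balancers) never see an idle connection."""
    task = asyncio.ensure_future(long_task())
    while not task.done():
        push({"status": "researching"})          # keepalive + UI progress
        # Wake up either when the task finishes or the interval elapses.
        await asyncio.wait([task], timeout=interval)
    return task.result()
```

In production the pushed payload carried Gemini's actual research phase rather than a fixed status string, which is what made the progress messages meaningful instead of decorative.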
I wrote the implementation docs covering the async pattern in detail for future developers, pushed the code, passed local testing, and submitted the PR.
This was one of those features where the technical challenge was not "can we call this API?" but "can we integrate an API that breaks all our architectural assumptions about response time?"
Real-Time Meeting Transcription
I built a real-time speech-to-text transcription display that works within Autohive conversations. When a meeting is being transcribed (via OpenAI's transcription API), the text appears live in the UI — word by word, as it is spoken.
The architecture is a pipeline of three async streams stitched together:
Audio Stream Pipeline:
Microphone/Meeting Audio
│
▼
Audio Capture Layer
→ Chunking: split audio into 250ms segments
→ Encoding: convert to 16-bit PCM, 16kHz mono
→ Buffer: ring buffer to handle backpressure
│
▼
OpenAI Whisper API (Streaming)
→ Send audio chunks via WebSocket
→ Receive partial transcriptions in real-time
→ Word-level timestamps for synchronization
│
▼
SignalR Broadcast Layer
→ Push transcription fragments to conversation group
→ Deduplication: handle overlapping partial results
→ Merge partials into coherent sentences
│
▼
UI Rendering
→ Append words with typing animation
→ Speaker diarization labels (when available)
→ Auto-scroll to latest content
→ Highlight current speaker
The tricky engineering was in the partial result deduplication. Streaming transcription APIs send overlapping results — you might get "The meeting will" then "The meeting will start at" then "The meeting will start at three." Each partial supersedes the previous one. Naively appending them produces garbled duplicates. I implemented a merge algorithm that detects overlap between consecutive partials and only appends the new content, maintaining a clean transcript.
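The overlap-merge idea can be shown in miniature. This sketch finds the longest suffix of the running transcript that the new partial repeats, and appends only the unseen tail; the production algorithm also had to handle timestamps and speaker labels:

```python
def merge_partial(transcript: str, partial: str) -> str:
    """Merge a streaming partial transcription into the running transcript,
    deduplicating the overlap between consecutive partial results."""
    # Find the earliest point where the transcript's tail matches the
    # partial's head; everything from there is superseded by the partial.
    for i in range(len(transcript)):
        if partial.startswith(transcript[i:]):
            return transcript[:i] + partial
    # No overlap: this partial starts a new segment.
    return transcript + (" " if transcript else "") + partial
```

This is quadratic in the worst case; a production version would bound the search to a recent window of the transcript, since streaming APIs only ever revise the last few seconds.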
I also built the STT/TTS configuration Admin page — a full Admin UI that lets workspace admins:
- Choose their transcription provider (OpenAI Whisper, with extensibility for Google Speech-to-Text, Azure)
- Configure language, model quality tier (speed vs accuracy tradeoff), and streaming behavior
- Set up text-to-speech for agent responses (so agents can speak, not just type)
- Test the configuration with a live audio sample before deploying to production
The Sales Pitch Agent
One of the most creative agents I built: you feed it a sales call transcript — or upload one, or point it at content stored in Autohive — and it converts the conversation into a hyper-personalized sales pitch website for the customer.
The agent extracts key customer pain points, buying signals, objections raised, and specific requirements from the transcript. Then it generates a tailored pitch page that addresses everything discussed in the call — as if a sales team spent hours crafting a custom proposal.
This is the kind of agent that makes you stop and think about what software will look like in five years. Not tools that help you work — tools that do the work, informed by context you have already captured.
Google Doc to GitHub PR Agent
This was workflow automation at its best. The Autohive blog team had a manual process: write in Google Docs → convert to Hugo-compatible Markdown → open terminal → run commands → open VS Code → edit → open GitHub Desktop → create PR. Seven steps, multiple tools, easy to mess up.
I built an agent that accepts a Google Doc link and automatically:
- Converts the content to Hugo-compatible Markdown
- Processes images, videos, and GIFs — converting them to .webp format and placing them in the correct Hugo directories
- Generates the required "featured blog post" image and meta details
- Creates a GitHub Pull Request with the correctly formatted content
After consulting with Sumitra (who built and maintained the Autohive blog), I had the full process mapped. The agent now handles it end to end. Paste a Google Doc link, get a PR.
Productboard Assistant
Zheng needed an agent that could answer questions about the product roadmap using data from Productboard — their product management tool. Simple enough, until I discovered that the Productboard API does not provide direct access to sub-feature content or replies.
I extracted 1,400 notes and insights from Productboard to train the assistant. But the data was incomplete — sub-features were disconnected from their parent features, losing critical context about feature relationships and hierarchies.
My workaround: I built a programmatic connector that links sub-features to their parent features, reconstructing the hierarchy that the API does not directly expose. This gave the assistant full context about how features relate to each other, not just isolated data points.
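The reconstruction is conceptually simple once you have the raw exports: index features by id, then attach each sub-feature to its parent. A hypothetical sketch — the field names `id` and `parent_id` are assumptions for illustration, not Productboard's actual schema:

```python
def link_hierarchy(features: list[dict], sub_features: list[dict]):
    """Rebuild the parent/child hierarchy the API does not expose directly.
    Returns (features indexed by id with children attached, orphans whose
    parent reference resolves to nothing)."""
    by_id = {f["id"]: {**f, "children": []} for f in features}
    orphans = []
    for sub in sub_features:
        parent = by_id.get(sub.get("parent_id"))
        (parent["children"] if parent else orphans).append(sub)
    return by_id, orphans
```

Tracking orphans explicitly matters with incomplete API data: silently dropping a sub-feature whose parent was not exported would reintroduce exactly the missing-context problem the connector was built to fix.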
7. Working with Real Clients
The difference between building a feature and shipping it to a client is the difference between cooking for yourself and cooking for a restaurant. The code has to work. The onboarding has to be smooth. The agent has to handle questions you did not anticipate.
Martin Jenkins
Martin Jenkins is a professional services firm. I worked with Nadia and Sumitra to build custom agents based on the client's specific requirements. This meant syncing with the team to understand what the client actually needed (not what we assumed they needed), translating requirements into agent configurations, and iterating until the agent handled the client's real-world scenarios.
I created the agent request document, synced with the team, and updated it with client requirements as they evolved through conversations. When the IRD came back with questions about how we use AI and which LLM providers we connect to, I was the one who addressed them — translating technical architecture into language that a government agency could evaluate.
Experience Collective (Rosie)
Rosie's agents needed scheduling and workflow capabilities — not just chat, but automated multi-step processes that trigger on schedules, route through approval chains, and handle dependencies between tasks. I improved two agent workflows to handle the specific sequencing and timing requirements.
True (Pegasus and Float)
For the client True, I evaluated two APIs — Pegasus and Float — for potential integration. This involved requirements gathering sessions, API documentation review, technical feasibility assessment, and a Google Meet with Nadia to clarify what the client actually needed versus what we thought they needed.
Bonnet
Met with the Bonnet client directly, understood their requirements, and built a Stripe integration specifically for their use case. Same day: client meeting in the morning, working integration by evening.
Creative Agencies (Team Orange, Growth Week)
The creative pipeline I described earlier was not built in a vacuum — it was built for Team Orange's Growth Week campaign. Real creative briefs. Real deadlines. Real social media posts that went live.
I collaborated directly with Wayne and Alex on the creative agent configurations, built agents for the Growth Week, and troubleshot issues in real time as the team used the system during the campaign.
8. The Rhythm: What a Typical Day Actually Looked Like
Every day was a mix of building, shipping, debugging, and communicating. Here is a composite of a typical day:
Morning: Check PRs for review comments. Address feedback — usually 2-3 PRs active at any time. Push updates.
Mid-morning: Build. This was the deep work block. A new integration, a platform feature, an agent refinement. Uninterrupted coding.
Afternoon: Client work. Meetings with Nadia or Sumitra about client requirements. Emails to clients. Agent configuration and testing.
Late afternoon: Production deployments. Push to prod, monitor for issues, verify in the live environment.
Evening: Documentation. Every integration needs docs. Every feature needs a PR description that another developer can understand. Every client agent needs setup instructions.
The pace was relentless. Looking at my daily logs, there was not a single day where I did fewer than three substantive tasks. Most days had five or six. Some days had eight.
Here are the numbers:
- 15+ integrations built, expanded, or shipped to production
- 36 files touched in the Eval Arena alone
- 3 marketplace submissions (Shopify, Stripe, Zoom)
- 1,400 notes extracted for the Productboard assistant
- 6+ custom agents built for specific clients
- 1 multi-agent creative pipeline from prompt to every social platform
- ~60 PRs created or contributed to
- 12 weeks of shipping, every single day
9. What I Actually Learned
Integration Patterns Are a Superpower
After building 15+ integrations, I can look at any API and immediately see the integration shape: What OAuth flavor do they use? What are their rate limits? How do they paginate? Where will the edge cases hide? How do I make their error messages useful to an agent?
This pattern recognition — REST vs GraphQL, OAuth 2.0 vs API keys vs custom auth, webhook-first vs polling, cursor pagination vs offset pagination — is something you only develop through volume. You cannot learn it from a tutorial. You have to build it, break it, fix it, and ship it. Fifteen times.
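As one example of that recurring shape, almost every cursor-paginated API reduces to the same loop. A generic sketch, with `fetch_page` standing in for a real API call that returns a page of items and the cursor for the next one (or `None` on the last page):

```python
def paginate(fetch_page):
    """Yield every item from a cursor-paginated endpoint.
    fetch_page(cursor) -> (items, next_cursor); next_cursor is None
    when there are no more pages."""
    cursor = None
    while True:
        items, cursor = fetch_page(cursor)
        yield from items
        if cursor is None:
            break
```

Once this loop is a habit, the per-API work shrinks to mapping each provider's cursor field name and page-size limits onto it, which is most of what "pattern recognition" buys you.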
The Gap Between "Works" and "Shipped" Is Enormous
Code that works on your machine is 30% of the job. The other 70%:
- OAuth flows that handle token expiration gracefully
- Webhook endpoints that validate signatures correctly
- Error messages that an LLM can reason about
- Migration scripts that run cleanly in production
- Documentation that another developer can follow
- Compliance requirements that platform marketplaces enforce
- Edge cases that only appear with real user data
The Shopify integration taught me this most viscerally. It "worked" after day one. It was "shipped" after week three.
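The first bullet above — OAuth flows that handle token expiration gracefully — is a good example of that 70%. A minimal sketch of the refresh-and-retry pattern; `ExpiredToken`, `refresh`, and `call` are hypothetical stand-ins for a real OAuth client's refresh grant and API request:

```python
import time

class ExpiredToken(Exception):
    """Raised by `call` when the API rejects a stale access token (e.g. a 401)."""

def with_refresh(call, token: dict, refresh):
    """Call an API with a token, refreshing proactively when the token is
    known-expired and reactively when the server rejects it anyway."""
    if token["expires_at"] <= time.time():   # proactive: refresh before calling
        token = refresh(token)
    try:
        return call(token)
    except ExpiredToken:                      # reactive: server said no anyway
        return call(refresh(token))
```

Both branches are needed in practice: clock skew and server-side revocation mean a token that looks valid locally can still be rejected, so the reactive path is not redundant.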
Multi-Agent Systems Are the Future
The creative pipeline was a glimpse of what software will look like. Not single-purpose tools. Not monolithic applications. Coordinated systems of specialized AI agents, each doing what it does best, orchestrated by another AI agent that understands the overall goal.
The challenges are real — context sharing between agents, asset management, consistency across outputs, error recovery when one agent in the chain fails. But the capability is transformative. A single prompt producing a complete, platform-optimized marketing campaign is not incremental improvement. It is a category shift.
Production AI Is All About the Edges
The happy path is easy. The LLM generates a response, the tool call succeeds, the user is happy. Production AI is about everything else:
- What happens when the API is down?
- What happens when the model hallucinates a tool call that does not exist?
- What happens when the OAuth token expires mid-conversation?
- What happens when the user uploads a file format you did not anticipate?
- What happens when the LLM judge disagrees with the algorithmic score?
Every edge case I encountered — and I encountered hundreds — made the system more resilient. That is the job: not building the first version, but making it survive contact with reality.
Speed Is a Skill
At Autohive, I shipped something meaningful almost every day. Not because the work was easy, but because I learned to move fast without cutting corners. Read the API docs thoroughly (saves time debugging later). Write the migration script before the feature code (avoids schema mismatches). Test with real data early (catches edge cases before PR review). Document as you build (faster than documenting after).
Speed compounds. The faster you ship, the faster you learn. The faster you learn, the faster you ship. Three months at this pace taught me more than a year of careful, measured development would have.
10. To the Next Intern
If you are about to start an internship building AI agents — or anything in this space — here is what I wish someone had told me:
The technology is moving faster than anyone can track. When I started in November 2025, Gemini Veo was brand new. By February 2026, we had it in production generating marketing videos. The window between "cutting-edge research" and "production deployment" has collapsed. Be ready to learn and ship simultaneously.
Integrations are not boring. I know, building another OAuth flow does not sound glamorous. But integrations are where AI meets the real world. Every integration is a bridge between an intelligent agent and a system full of real data, real users, and real consequences. Master integrations and you master the art of making AI useful — not just impressive.
Work with clients as early as possible. The agent that works perfectly in your testing environment will surprise you in production. Clients use your system in ways you did not imagine, ask questions you did not anticipate, and have requirements that do not fit neatly into your architecture. Every client interaction made me a better engineer.
Build systems, not features. The creative pipeline was not a feature. It was a system of agents, content management, platform adaptation, and asset orchestration. The Eval Arena was not a feature. It was a scoring engine, a judge framework, a leaderboard, and a self-improvement loop. Think in systems.
Ship every day. Not because speed is inherently good, but because shipping is how you learn. Every deployment teaches you something. Every production bug makes you more careful. Every PR review makes your code better. The cycle of build → ship → learn → improve is the entire game.
11. The HCI Scientist's Perspective: What Nobody Else in This Industry Sees
Here is where I need to step back from the engineering and put on a different hat.
I am not just a developer. I am a PhD researcher at the Empathic Computing Lab at the University of Auckland, where I study human cognition, affect, and collaboration in AI-mediated environments. I have co-authored papers on multimodal cognitive load assessment (CoAffinity, IEEE Transactions on Affective Computing) and empathetic conversational agents (International Journal of Human–Computer Interaction). I spend my research hours analyzing EEG signals, physiological arousal, and neural connectivity patterns to understand what happens inside the human brain when it collaborates with AI.
That dual identity — builder and researcher — gave me a perspective during this internship that I do not think many people in the agent industry have. And it fundamentally shaped how I think about everything I built.
The Cognitive Load Problem I Witnessed Firsthand
Cognitive Load Theory (Sweller, 1988) tells us that working memory can hold roughly 4 ± 1 chunks of information simultaneously. That is not a suggestion. That is a hard limit, confirmed by decades of cognitive science research and reinforced by recent findings from MIT Media Lab showing measurable "cognitive debt" from heavy AI tool usage (Kosmyna et al., 2025).
Now think about what I watched happen during Team Orange's Growth Week. The creative agents were producing marketing assets at extraordinary speed — images, videos, captions, hashtag strategies, posting schedules — across six platforms simultaneously. The output was technically excellent. But the humans reviewing it were overwhelmed.
A single campaign run produced:
Creative Pipeline Output for One Campaign Brief:
- 6 platform-specific image sets (18+ individual images)
- 6 platform-specific caption variants
- 6 hashtag strategies
- 3 short-form video clips
- 6 posting time recommendations
- 1 brand consistency report
─────────────────────────────
Total review items: ~40 artifacts
Time for agent to produce: ~4 minutes
Time for human to meaningfully review: ~90 minutes
This is what I call the second-order outage — the system is working perfectly, but the humans operating it have collapsed. The agent succeeded at its task. The human failed at theirs. And nobody in the engineering room saw this as a problem because they were measuring agent performance, not human performance.
I started thinking about this through the lens of Information Foraging Theory (Pirolli & Card, 1999). Humans navigate information environments by following "information scent" — cues that signal where relevant content is. When the creative pipeline dumps 40 artifacts into a folder organized by platform, the information scent is weak. The reviewer has to open every file, inspect it, and mentally compare it against the brief. There is no summary. No confidence signal. No "this one needs your attention" flag.
What I Would Design Differently — And Why
If I were designing the creative pipeline again with my HCI research hat on, here is what I would change:
1. Decision-Ready Artifacts, Not Raw Output
Instead of producing 18 images and asking the human to pick, the system should present:
- Top recommendation with reasoning ("This version scored highest on brand consistency and platform-specific composition")
- Confidence score (0.0-1.0) so the reviewer knows when to trust and when to dig deeper
- Diff highlights between variants — what actually changed, not "here are 6 things, spot the differences"
This maps to the progressive disclosure principle in HCI: show the most important information first, and let the user drill down only when needed. The three-second rule — can the user determine status and required action within three seconds? — should apply to every agent output.
2. Attention Budget Tracking
AttentionBudget(reviewer) = FocusHoursPerDay × AttentionUnitsPerHour
AttentionCost(campaign_output) = ReadTime + ComprehensionTime + DecisionTime + ActionTime
If Σ AttentionCost(all_campaigns) > AttentionBudget → Overload
The platform should track how much human review time the agents are generating, and throttle output when the review queue exceeds human capacity. This is Little's Law applied to agent systems: if the arrival rate of review items exceeds human throughput, the queue grows unbounded. By end of week, the reviewer is drowning.
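The budget check itself is trivial to express; the hard part is instrumenting real review costs. A sketch of the formulas above, where units are review minutes and the 4-hour daily focus budget is an illustrative assumption, not a measurement:

```python
def is_overloaded(review_costs_min: list[float],
                  focus_hours: float = 4,
                  units_per_hour: float = 60) -> bool:
    """True when queued human review time exceeds the day's attention budget."""
    budget = focus_hours * units_per_hour   # e.g. 240 attention-minutes/day
    return sum(review_costs_min) > budget
```

With the ~90-minute review cost per campaign run from above, two runs fit a 4-hour budget and a third tips the queue into overload, which is the throttling signal the platform would act on.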
3. Trust Calibration Signals
Research on human-AI trust (McGrath et al., 2025; Gerlich, 2024) shows that trust is not binary — it is a continuous variable that evolves through experience. One bad output destroys trust disproportionately:
TrustGain(good_output) = +0.01 to +0.05 (small, incremental)
TrustLoss(bad_output) = -0.10 to -0.40 (large, sudden)
TrustRecovery(time) = slow, logarithmic
The creative pipeline should build trust gradually by being transparent about uncertainty. "I am 95% confident this image meets your brand guidelines" is trust-building. Silently delivering an off-brand image is trust-destroying. One wrong-color image in a campaign can undo weeks of successful outputs.
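The asymmetry in those numbers is easy to demonstrate with a toy simulation. The deltas here are midpoints of the ranges above and purely illustrative:

```python
def update_trust(trust: float, good: bool) -> float:
    """One step of asymmetric trust dynamics: small gains for good outputs,
    large losses for bad ones, clipped to [0, 1]."""
    delta = 0.03 if good else -0.25
    return min(1.0, max(0.0, trust + delta))
```

Starting at 0.5, eight good outputs in a row raise trust to roughly 0.74; a single bad one then drops it to roughly 0.49, below where it started. One failure erases eight successes, which is why surfacing uncertainty up front beats silently shipping the occasional off-brand asset.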
4. Cognitive Engagement Preservation
This is the finding that keeps me up at night. Zhai et al. (2024) found that heavy AI reliance leads to "diminished decision-making and critical analysis abilities." The MIT study showed measurable cognitive atrophy after just four months of LLM dependence. If the creative pipeline is too good — if it handles every aspect of the creative process without requiring human judgment — the creative team's own skills will degrade.
The goal is not to replace human creativity. It is to augment it while keeping the human cognitively engaged. This means:
- Presenting choices, not conclusions (force the human to make creative decisions)
- Showing the reasoning behind creative choices (engage critical thinking)
- Periodically requiring manual creation (maintain creative muscles)
- Tracking creative input diversity over time (are humans becoming passive acceptors?)
Why This Perspective Matters for the Industry
The agent industry is in a performance arms race. Every team is optimizing for faster responses, lower cost, better accuracy. These matter. But they are necessary conditions, not sufficient ones.
The sufficient condition is: does the human-agent system outperform either the human alone or the agent alone?
My research at the Empathic Computing Lab is moving toward measuring exactly this. In our CoAffinity dataset, we combine EEG, galvanic skin response, facial expression analysis, and behavioral metrics to build a comprehensive picture of what happens to human cognition during AI-mediated collaboration. The preliminary findings are striking: the quality of the collaboration depends as much on the interface design and information presentation as it does on the AI's capabilities.
System Performance ≠ Agent Performance
System Performance = f(Agent_Capability, Human_Cognitive_State, Interface_Design, Trust_Calibration)
An agent with 95% accuracy and a bad interface will underperform an agent with 85% accuracy and an interface designed for human cognition. I have seen this in the lab. And I saw it during my internship at Autohive.
The Bridge Between Two Worlds
Most AI engineers do not read HCI research. Most HCI researchers do not build production agent systems. I do both. And the gap between these two worlds is where most agent systems fail — not in the model, not in the architecture, not in the tool-calling accuracy, but in the space between the agent's output and the human's decision.
The internship at Autohive gave me something no research paper could: the visceral experience of watching real humans interact with real agents doing real work, under real deadlines, with real consequences. I watched the creative team during Growth Week. I watched client meetings where agents performed brilliantly — and meetings where they confused the client. I watched operators review agent outputs and I could see, from my HCI training, exactly where the cognitive load spiked, exactly where attention wandered, exactly where trust wobbled.
This dual lens — the builder who knows the technical constraints and the researcher who knows the cognitive ones — is what I bring to the agent industry. And I believe it is the perspective that matters most right now.
The teams that win the agent race will not be the ones with the best models. They will be the ones that best understand the humans using them. I have spent my career studying exactly that. And I am just getting started.
Nielsen's usability heuristics. Fitts's Law. GOMS modeling. Cognitive Load Theory. Information Foraging Theory. Trust calibration dynamics. These are not academic curiosities. They are engineering requirements for any agent system that humans need to use. The agent industry just has not read the literature yet.
I have. And I intend to build systems that prove why it matters.
Closing
Three months. Fifteen integrations. One evaluation arena. One multi-agent creative pipeline. Dozens of agents. Hundreds of production deployments. Thousands of lines of code in a platform that serves real businesses doing real work. And a research perspective that sees what the dashboards do not show.
I came to Autohive as an intern. I leave as someone who has built production AI systems and understands the human minds that use them — a combination that almost nobody in this industry has.
The AI agent revolution is not coming. It is here. It is being built, one integration at a time, one agent at a time, one production deployment at a time, by people who are willing to do the unglamorous work of making intelligent systems actually work in the real world. But the next phase — the one that separates the systems that succeed from the ones that get abandoned — will be about the human experience. About cognitive load. About trust. About designing agent output for the four slots in working memory, not for the infinite patience of a log file.
I have built the agents. I have studied the humans. And I know where the gap is.
I am just getting started.
Built at Autohive. Researched at the Empathic Computing Lab. Powered by curiosity, too much coffee, and the unshakeable belief that the future of AI is human.
- Part 1: Autohive — The AI Hub of Agents
- Part 2: Monitoring AI Agents and Self-Optimization
- Part 3: How to Build an LLM Evaluation System
- Part 4: The Human Side of Agentic Systems
- Part 6: Building Multi-Agent Creative Systems