Claude-Powered AI Assistants in Production: What We Built Into Our Operations Platform
Three Claude-powered AI assistants are in daily production use inside NOW Media's operations command center: client health summarizer, follow-up email drafter, and suggest-next-step. Here's exactly what each does, the prompt structures behind them, why we chose Claude over GPT-4 for these specific tasks, and what we learned about deploying AI inside a real business workflow (not a demo).
**Three Claude-powered AI assistants are in daily production use inside NOW Media's operations command center. They are not chat interfaces. They are not "ask me anything" widgets. They are workflow-embedded assistants that draft a single specific output at the exact moment a human needs it, client health summary on the client detail view, follow-up email when a lead goes stale, suggest-next-step when a project transitions phases.**
This is the design pattern that actually works for AI in business software in 2026: not chatbots, not autonomous agents, not generative-AI dashboards. **Inline assistants that produce one specific output, embedded at the decision point where a human would otherwise spend 5 minutes thinking and writing.**
Here's what each does, the prompt structures behind them, why we picked Claude over GPT-4 for these tasks, and what we learned about deploying AI inside a real business workflow rather than building another demo.
— — —
The three assistants
1. Client health summarizer
Where it lives: every client detail view inside the command center.
What it produces: a single-paragraph status read (~80 words) summarizing the client relationship's current state, last touch, active projects, payment status, sentiment from recent email threads, suggested risk level.
**What it costs to run**: ~$0.003 per generation. Generated on-demand, not pre-cached.
Example output (anonymized):
Client X has been engaged since March 2025 across two projects, brand identity (delivered, paid) and ongoing content retainer (active, current month invoice unpaid 8 days). Last touch was a Slack message from their founder on May 12 asking about retainer scope expansion. Sentiment across the last 5 emails has been positive, with one neutral exchange about timeline slippage on a content batch. Risk level: low. Suggested action: confirm scope expansion conversation, follow up on outstanding invoice gently in same thread.
Why it works: the ops lead opens the client view 15+ times a day across active clients. Without the summarizer, they reconstruct context from email threads, Slack history, and project notes, roughly 3 minutes per client. With the summarizer, context loads in 4 seconds and the ops lead reads/scans for 30 seconds before acting. Net savings: ~37 minutes a day for one role.
2. Follow-up email drafter
Where it lives: triggered on a stale lead (no contact in N days) inside the sales pipeline view.
What it produces: a complete contextual follow-up email, not a template. Personalized to the specific lead, references the last conversation, suggests a concrete next step.
**What it costs to run**: ~$0.008 per generation. Generated on-demand.
Example output (anonymized):
Subject: Re: Brand strategy + website scope
>
Hi [Name],
>
Following up on our March 30 call about the brand strategy + website combined scope. You mentioned wanting to wait until the Q1 results came in before deciding direction, happy to find out where that landed.
>
No pressure if the timing has shifted. If it's still on the table, I have 30 min open on Tuesday or Thursday next week, would that work for either?
>
Best, [Sender]
Why it works: the BD lead has roughly 8 stale leads warranting follow-up per week. Without the drafter, each follow-up takes 5-8 minutes (open the lead, re-read context, write the email, second-guess tone, send). With the drafter, the BD lead reads the draft (60 seconds), edits one or two phrases (60 seconds), sends. Net time per follow-up: 2 minutes vs 7. Net savings: ~40 minutes per week per BD lead.
**Critical design choice**: the drafter never sends. It only drafts. The send button is human-gated. This single design constraint is the difference between an AI tool that gets used and an AI tool that gets disabled after one bad outbound.
3. Suggest-next-step
Where it lives: every project detail view + every won-deal handoff state.
What it produces: a single concrete action recommendation, "send the phase 2 invoice for project Y," "schedule the kickoff call with client Z," "spin up the handoff doc from BD to delivery for deal A."
**What it costs to run**: ~$0.002 per generation. Triggered on view, cached for 4 hours per project.
Example output:
Project: NovaMart Retail brand identity (Phase 2 of 4)
>
Next step: Send Phase 2 invoice (₹6L, due on phase-2 deliverable acceptance)
>
**Why now**: Phase 2 deliverables marked complete by the design lead on May 14. Client acknowledged receipt in Slack on May 15 ("looks great"). Invoice not yet generated. Phase 3 design work is currently 3 days into start, billing should not lag delivery by more than 1 week per our policy.
Why it works: project state transitions are where money leaks. Phase 2 finishes, phase 3 starts, the invoice for phase 2 sits in someone's mental queue for two weeks. Suggest-next-step surfaces the missed action at the moment the ops lead opens the project view. It's not autonomous, it doesn't generate the invoice, but it surfaces the gap precisely when attention is on the project.
— — —
Why Claude over GPT-4 for these specific tasks
We tested all three assistants against both Anthropic's Claude (Sonnet 4.5 + Opus 4.6 + Opus 4.7 + Sonnet 4.6) and OpenAI's GPT-4 series (GPT-4o + GPT-4.1 + o1 variants) over six weeks during the build. Final choice: Claude (current production: Opus 4.7 for the client health summarizer and follow-up drafter, Sonnet 4.6 for suggest-next-step).
**Why Claude won on judgment-heavy tasks**:
For the client health summarizer specifically, the inputs are messy: 6+ email threads, Slack history, project notes, payment status, sentiment cues from informal language. The output is a single coherent paragraph that has to weigh competing signals and land on a defensible read.
Claude produced more nuanced and useful reads on judgment-heavy edge cases:
- A client who's late on payments but has high engagement (Claude correctly flagged "follow up gently in next touch, do not lead with the invoice", GPT-4 defaulted to "remind about overdue invoice")
- A formerly active client who's gone quiet for 6 weeks but the last touch was positive (Claude correctly flagged "warm but cooling, suggested re-engagement now", GPT-4 defaulted to "stale lead, mark for cleanup")
- A complaint that was resolved but never explicitly closed (Claude correctly flagged "complaint resolved per follow-up thread, no further action needed", GPT-4 included it as an active issue)
The pattern: when the judgment required weighing competing signals where the right answer is contextual, Claude's outputs were more useful and required less editing.
**Where GPT-4 was equivalent or better**:
For pure data extraction and classification, "given this email, extract the date proposed for the next meeting" or "classify this lead as enterprise/SMB/individual", both models performed equivalently. For tasks requiring deep knowledge of specific Indian business contexts (GST nuances, regional payment patterns), neither was particularly strong; both produced surface-level answers that required validation.
**Cost-quality ratio at production scale**:
| Model | $/1M input tokens | $/1M output tokens | Use case |
| Claude Opus 4.7 | ~$15 | ~$75 | Client health summary, follow-up drafter |
| Claude Sonnet 4.6 | ~$3 | ~$15 | Suggest-next-step |
| GPT-4.1 | ~$10 | ~$30 | Tested, not in production |
For NOW Media's workload (~2,000 assistant invocations per month across all three flows), the total Claude monthly cost is roughly $40. The judgment-quality difference justifies the higher Opus rate for the two heavy flows.
Why we re-evaluate quarterly: model performance is a moving target. Anthropic and OpenAI both ship new model versions every 2-3 months. We re-run the eval suite quarterly and switch if the cost-quality math changes. Standing recommendation: don't lock to one model. Build your prompt layer such that swapping providers takes one config change.
— — —
The prompt structures (production templates)
Client health summarizer
The full system prompt is ~600 words. Excerpted structure:
You are the client health summarizer for NOW Media, a creative
studio in Bangalore. You produce a single-paragraph status read
of a client relationship based on structured data + recent
communication.
Output format: one paragraph, 70-100 words. No bullet points.
No headers. Plain prose.
Input you'll receive:
1. Client metadata (name, engagement start date, active projects,
payment status)
2. Last 6 email threads (subject, date, summary, sentiment)
3. Last 30 days of Slack messages from/to the client's channel
4. Notes from the last 3 internal sync meetings
Output structure (always in this order):
- Engagement length + active projects/products in one sentence
- Last meaningful touch (with date)
- Current state assessment (positive / neutral / cooling / at-risk)
- One concrete suggested action
Tone: factual, slightly clinical. Useful for someone scanning
quickly between client views. No filler ("This client has been...").
Start with the substantive content.
Edge cases:
- If the client has no recent activity, lead with "Inactive since
[date]" not "Currently active", never sugar-coat.
- If payment is overdue, mention it without leading with it unless
it's the dominant signal.
- If sentiment is mixed across threads, weigh by recency.
Do not hedge. Do not include disclaimers about your inability to
know things. If a signal is ambiguous, name the ambiguity in one
clause.
The key design: **specific format constraints + explicit edge case handling + tone instruction**. The output prompt isn't generated, it's structured to match a specific UI slot where the ops lead expects 80 words of plain prose every time.
Follow-up email drafter
Production prompt is ~450 words. Key elements:
- **Persona definition**: "You are drafting a follow-up email on behalf of [BD lead name] at NOW Media. Match their writing voice from the example threads provided. Indian English conventions. Slightly informal warmth."
- **Hard constraints**: "Subject line: under 60 characters. Body: 4 sentences maximum. One clear ask. No links unless one is provided in input. Sign off with the sender's first name only."
- Edge case rules: "If the lead's last response indicated 'not now', do not propose a meeting in the next 2 weeks, propose checking in next quarter. If the lead has never responded, write a shorter re-engagement note, not a longer follow-up."
The drafter has access to: lead metadata, last 3 communication touches, NOW Media service summary, calibration call slots (from calendar integration).
Suggest-next-step
Production prompt is ~300 words. Key elements:
- Output format: "One next step. One sentence. Plus 2-3 sentences explaining why now, including any deadline or trigger event."
- **Decision frame**: "You are deciding what the ops lead should do *next* on this project. Choose the action that, if skipped, has the biggest downstream cost or compounds delay. Default to revenue-protecting actions over relationship-maintaining actions when in conflict."
- **Available actions are constrained** to a list of 12 operationally meaningful options (send invoice, schedule call, draft contract, send deliverable for review, follow up on approval, etc.), the assistant doesn't invent novel actions.
The constraint to a fixed action list is critical. Open-ended next-step suggestions tend to drift into recommendations like "consider scheduling a check-in to align on roadmap" that aren't actionable. Constraining to specific verbs makes the output useful.
— — —
What we learned about deploying AI inside business workflows
Lesson 1: Inline beats chat
The strongest design pattern is embedding AI at the moment a human needs the output, not adding a chat interface where they have to think to ask. The ops lead doesn't think "I should ask the AI about Client X's health", they think "I need to know what's going on with Client X." The summarizer is right there on the client view; the ops lead reads it as naturally as they'd read any other UI element.
A chat interface would have required: open the chat, type the query, wait for response, parse response. Friction at every step. Adoption would have died.
Lesson 2: AI drafts, humans send
Every assistant in production produces *a draft*, not an action. The follow-up email drafter never sends the email, the human clicks send. Suggest-next-step never executes the action, the human triggers it. The client health summarizer is read-only.
This boundary is non-negotiable in our setup. The moment AI gets autonomous send/execute power, two things happen: (a) the team starts treating it with the seriousness of a human action and second-guesses every draft, slowing the workflow rather than speeding it up; and (b) any one bad outbound (wrong tone, wrong recipient, wrong amount) causes immediate AI rollback regardless of average quality.
The draft + human-send pattern keeps both speed and accountability.
Lesson 3: Specific format constraints beat open-ended prompts
The single biggest quality lift came from constraining outputs to specific formats. The client health summarizer produces exactly 70-100 words. Always. Plain prose. No bullets. No headers. Output that violates the format gets re-prompted automatically.
This sounds restrictive but it's not. The output gets read in one specific UI slot. The slot has a specific size. Variable-length outputs created visual noise and broke the scanning pattern. The constrained format made the assistant feel reliable, every time you look, you see exactly what you expected.
Lesson 4: Cache aggressively, regenerate only on real change
Client health summaries don't change every 5 minutes. We cache for 4 hours, invalidate on any new email/Slack/project activity for that client. Suggest-next-step caches similarly. Net effect: ~80% of assistant invocations hit cache, the actual model calls are ~400 per month not 2,000.
Lesson 5: Measure what you saved, not what you generated
We track two metrics:
- Time saved per ops role (estimated weekly via short survey, validated quarterly)
- **Actions completed within 24h of suggestion** (concrete proxy for whether suggest-next-step is actually useful)
We do not track "AI invocations per day" or "prompts run." Those are vanity. The question that matters: would the team revert to manual if the assistants disappeared tomorrow? Today, the answer is no, they'd struggle. That's the signal we built something useful.
Lesson 6: Quarterly model re-eval
Model performance shifts every 2-3 months. New Opus releases, new GPT-4 versions, new Gemini variants. We run the eval suite quarterly: same 30 representative inputs, run through current production model + 2-3 alternatives, score by ops lead blind review. Swap providers if a meaningfully better cost-quality option emerges.
The eval suite is the most valuable asset in the AI stack, more valuable than any single model choice. It lets you swap providers without losing operational continuity.
— — —
What we deliberately did NOT build
A few patterns we considered and rejected:
- **General-purpose CRM chatbot.** "Ask me anything about your pipeline." Felt useful in demos. In testing, the ops team asked 2 questions in week one and stopped. Without a specific workflow trigger, AI tools don't get used.
- **Auto-categorization of incoming leads.** Sounded helpful. In practice, lead categorization is high-stakes (mis-categorized leads get the wrong follow-up flow) and humans were faster + more accurate than AI on our specific category taxonomy.
- **Predictive churn scoring.** ML model that predicts which clients will lapse. Tested for 8 weeks. Predictive accuracy was decent (AUC ~0.74) but the actions the team took were the same regardless of the score, they reached out to clients showing the signals the model already surfaced. Net effect on retention: zero. Dropped.
- **Auto-generated case study drafts.** Plausible idea. In practice, case studies are the most public-facing surface of the brand and the team rejected AI-drafted versions in every test cycle as "off" in subtle ways. We use Claude to outline case studies but the writing is human.
The pattern across what didn't work: AI tools that require humans to *trust* the output to act on it tend to fail (because humans don't trust AI judgment yet), while AI tools that produce drafts humans review and send tend to succeed.
— — —
What this means for clients
We build AI assistants for clients as part of the AI Automation service. The same patterns apply:
- **Inline at the decision point, not in a chat**
- **AI drafts, humans send/execute**
- **Constrained output format**
- **Aggressive caching**
- **Eval suite for quarterly provider re-evaluation**
- **Measure time saved or actions completed, not invocations**
Typical client builds: a single domain-specific assistant (e.g., client-call follow-up drafter for a sales team, or invoice-generation drafter for accounting, or brand-voice checker for content teams). Engagement runs ₹5L to ₹30L over 4 to 16 weeks depending on integration complexity.
— — —
FAQ
Why three assistants and not one general-purpose one?
Each assistant solves one specific operational problem at the exact moment a human encounters it. A general-purpose assistant ("ask anything about your business") sounds powerful in demos but doesn't get used because there's no specific workflow trigger. Three specialized assistants embedded in three specific UI moments get used 20-40 times per day per role. One general assistant got used 2 times in week one of testing and then stopped.
What's the data we send to Claude?
Each assistant sends the minimum data needed for its specific task. Client health summarizer: client metadata + last 6 email subjects + summaries + sentiment scores (not full email body) + project status + payment status. Follow-up drafter: lead metadata + last 3 communication touches + service summary. Suggest-next-step: project metadata + recent activity log + deadline calendar. No full email content, no Slack content beyond summarized signals, no personally identifying client data beyond names and company.
What's your fallback when Claude is down?
The assistants degrade gracefully. If the API is unavailable, the UI shows the underlying data without the assistant overlay, same as the pre-assistant version of the workflow. We've had ~3 Claude API outages in the past 12 months, each under 90 minutes. The ops team noticed but operations continued. Critical: assistants should be productivity enhancers, not workflow dependencies.
How do you handle the assistants getting things wrong?
Two mechanisms. First, every assistant output has a "this was wrong" thumbs-down button that logs the input + output + reason to a feedback table we review weekly. Second, the team has been trained that AI drafts are starting points, never assume the draft is correct, always read it before sending. We measure "edit rate" (how often the human modifies the draft before sending) as a signal: high edit rate on a specific prompt means the prompt needs work.
Why not use Anthropic's Workflows or OpenAI's Assistants API directly?
We tested both. The platform APIs are great for one-off chatbot builds but have hidden lock-in: their thread/state model makes provider swaps painful. We built a thin internal abstraction layer that handles prompt construction + context assembly + caching + provider routing. Net effect: ~150 lines of TypeScript. Swap a provider with one config change. Worth it for production.
How much engineering effort to build something equivalent?
Per assistant, roughly: 1 week to scope the workflow + decide constraints, 1 week to write/test/iterate prompts, 1 week to wire into the UI + caching layer, 1 week of internal beta + iteration. Total ~1 person-month per production assistant. The first one is the heaviest (because you're building the abstraction layer). Subsequent ones drop to ~2-3 weeks.
What models do you recommend for similar setups in 2026?
Default recommendation: Claude Opus 4.7 for judgment-heavy tasks (summarization with weighted signals, drafting that requires tone matching, recommendation with competing factors), Claude Sonnet 4.6 for high-volume lower-stakes tasks (classification, extraction, formatting). Re-evaluate quarterly. Build provider-agnostic so swaps are cheap.
— — —
*The operations command center NOW Media runs on, including the three Claude-powered assistants described here, is being productized for external availability. Available today as a custom build under the AI Automation service (₹15L to ₹30L, 8 to 16 weeks). Start your scope or read about Brandauditor.ai's first ChatGPT citation in 2 days. NOW Media is a Bangalore creative studio founded in 2019 by Nithin Koshy and Divya Maben, a brand of Bleep Design Private Limited.*
Internal links
- Pillar: /services/ai-automation
- Related: /products
- Related: /methodology
- Related article:
/thinking/why-we-built-bleep-instead-of-buying-a-crm - Related article:
/thinking/how-brandauditor-hit-first-chatgpt-citation-in-2-days - Related article:
/thinking/indian-fiscal-year-problem-with-generic-saas - Scope CTA: /#scope
- Author byline: /about#nithin-koshy