Voice AI Agent Engineering — Complete Design, Build & Deploy System
> Build production-grade AI voice agents for phone calls, customer service, sales, and automation. Platform-agnostic methodology covering conversation design, voice UX, telephony integration, and scal
by 1kalin · published 2026-04-01
$ claw add gh:1kalin/1kalin-afrexai-voice-ai-engine# Voice AI Agent Engineering — Complete Design, Build & Deploy System
> Build production-grade AI voice agents for phone calls, customer service, sales, and automation. Platform-agnostic methodology covering conversation design, voice UX, telephony integration, and scaling.
---
Phase 1: Voice Agent Strategy & Use Case Selection
Voice Agent Brief
voice_agent_brief:
project_name: ""
business_objective: "" # What outcome does this agent drive?
use_case_type: "" # inbound_support | outbound_sales | appointment_booking | notification | survey | ivr_replacement | concierge | internal_ops
target_audience: "" # Who will talk to this agent?
call_volume_estimate: "" # calls/day expected
avg_call_duration: "" # target minutes
languages: [] # primary + secondary
success_metrics: [] # CSAT, resolution rate, booking rate, etc.
human_fallback: "" # when and how to escalate
compliance_requirements: [] # TCPA, GDPR, PCI, HIPAA, state laws
go_live_date: ""Use Case Fit Scoring (rate 1-5)
| Factor | Score | Weight |
|--------|-------|--------|
| Conversation predictability | _ | 25% |
| Volume justification (>50 calls/day) | _ | 20% |
| Cost savings vs human | _ | 20% |
| Customer acceptance likelihood | _ | 15% |
| Data availability for training | _ | 10% |
| Regulatory risk (inverse — lower = better) | _ | 10% |
| **Weighted Total** | **/5.0** | |
**Go threshold:** ≥3.5 = strong fit. 2.5-3.4 = pilot first. <2.5 = don't build, use humans.
Best Use Cases (start here)
1. **Appointment booking/confirmation** — structured, high volume, clear success metric
2. **Order status inquiries** — data lookup, short calls, high automation potential
3. **Payment reminders** — outbound, scripted, compliance-manageable
4. **FAQ/tier-1 support** — deflect 60-80% of calls from humans
5. **Lead qualification** — inbound, structured questions, CRM integration
Avoid (not ready yet)
---
Phase 2: Platform Selection & Architecture
Platform Comparison Matrix
| Platform | Best For | Pricing Model | Latency | Customization | Self-Host |
|----------|----------|---------------|---------|---------------|-----------|
| **Vapi** | Rapid prototyping, SMB | Per-minute | ~800ms | Medium | No |
| **Retell AI** | Customer support | Per-minute | ~600ms | Medium | No |
| **Bland AI** | Outbound at scale | Per-minute | ~700ms | High | No |
| **Vocode** | Custom/self-hosted | Open source | Variable | Very High | Yes |
| **LiveKit** | Real-time, custom UX | Usage-based | ~300ms | Very High | Yes |
| **Twilio + Custom** | Full control | Per-minute + compute | Variable | Maximum | Partial |
| **Daily + OpenAI RT** | Cutting edge | Per-minute + tokens | ~500ms | High | No |
Architecture Decision Tree
Need production in <2 weeks?
├── Yes → Managed platform (Vapi/Retell/Bland)
│ ├── Inbound support? → Retell AI
│ ├── Outbound sales? → Bland AI
│ └── General/mixed? → Vapi
└── No → How much control needed?
├── Maximum → Twilio + custom STT/LLM/TTS pipeline
├── High → LiveKit or Vocode (self-hosted)
└── Medium → Daily + OpenAI Realtime APIVoice AI Pipeline Architecture
[Caller] → [Telephony Layer] → [STT Engine] → [LLM Brain] → [TTS Engine] → [Audio Out]
↕ ↕
[Call Control] [Tool/API Calls]
↕ ↕
[Recording/Analytics] [CRM/Calendar/DB]**Component Selection:**
| Component | Options | Recommendation |
|-----------|---------|----------------|
| **STT** | Deepgram, AssemblyAI, Whisper, Google STT | Deepgram (fastest, streaming) |
| **LLM** | GPT-4o, Claude, Gemini, Llama | GPT-4o-mini for speed, Claude for nuance |
| **TTS** | ElevenLabs, PlayHT, Cartesia, OpenAI TTS | ElevenLabs (quality), Cartesia (speed) |
| **Telephony** | Twilio, Vonage, Telnyx, SignalWire | Twilio (reliability), Telnyx (cost) |
Latency Budget (target: <1.5s total)
| Stage | Target | Max |
|-------|--------|-----|
| STT (voice → text) | 200ms | 400ms |
| LLM (think + generate) | 500ms | 800ms |
| TTS (text → speech) | 200ms | 400ms |
| Network overhead | 100ms | 200ms |
| **Total response time** | **1.0s** | **1.8s** |
**Rules:**
---
Phase 3: Conversation Design
Conversation Flow Architecture
conversation_flow:
opening:
greeting: "Hi, this is [Agent Name] from [Company]. How can I help you today?"
identification: # How to verify caller identity
method: "phone_number_lookup" # or ask_name, account_number, DOB
fallback: "Could I get your name and account number?"
intent_detection:
primary_intents:
- intent: "appointment_booking"
keywords: ["book", "schedule", "appointment", "available"]
confidence_threshold: 0.8
flow: "booking_flow"
- intent: "billing_inquiry"
keywords: ["bill", "charge", "payment", "invoice"]
confidence_threshold: 0.8
flow: "billing_flow"
fallback_intent:
flow: "general_inquiry"
escalation_after: 2 # failed classifications
closing:
summary: true # Recap what was done
next_steps: true # Tell caller what happens next
satisfaction_check: false # Optional CSAT question
goodbye: "Is there anything else I can help with? ... Great, have a wonderful day!"Conversation Design Principles
1. **Front-load identity** — Know who's calling before diving in
2. **Confirm don't assume** — "Just to confirm, you'd like to reschedule your Thursday appointment?"
3. **One question at a time** — Never stack 2+ questions in one turn
4. **Progressive disclosure** — Start simple, add complexity only when needed
5. **Explicit state transitions** — "Let me look that up for you" before going silent
6. **Recovery > perfection** — Design for misunderstanding, not just understanding
7. **Silence is scary** — Never leave >3 seconds without audio feedback
Turn Design Template
turn:
name: "collect_date_preference"
agent_says: "What date works best for you?"
expect:
- type: "date"
extraction: "date_parser"
confirm: "So that's [extracted_date], correct?"
- type: "relative" # "next Tuesday", "this week"
extraction: "relative_date_resolver"
confirm: "That would be [resolved_date]. Does that work?"
- type: "unclear"
recovery: "I didn't quite catch that. Could you give me a specific date, like March 15th?"
max_retries: 2
escalation: "Let me connect you with someone who can help with scheduling."
timeout_seconds: 8
timeout_response: "Are you still there? I was asking what date works for you."Voice UX Rules
| Rule | Why |
|------|-----|
| Keep responses under 30 words | Phone ≠ chat — people can't re-read |
| Use numbers, not lists | "You have 3 options" > listing all 7 |
| Spell out confirmation | "That's A as in Alpha, B as in Bravo" |
| Avoid homophone confusion | "15" and "50" sound alike — say "one-five" or "five-zero" |
| Use prosody cues | Pause before important info, speed up on filler |
| Match caller energy | Fast caller = faster pace. Slow = slower. |
| Never say "I'm an AI" unprompted | Disclose only if asked directly (unless required by law) |
Interruption Handling
interruption_strategy:
mode: "cooperative" # cooperative | strict | hybrid
cooperative: # Recommended for support
- on_interrupt: "stop_speaking"
- acknowledge: true # "Go ahead"
- resume_context: true # Remember where you were
strict: # For compliance-required scripts
- on_interrupt: "finish_sentence"
- then: "pause_for_input"
- note: "Used when legal disclaimers must be fully delivered"
barge_in_detection:
min_speech_ms: 300 # Ignore very short sounds (coughs, hmms)
confidence_threshold: 0.6---
Phase 4: System Prompt Engineering for Voice
Voice Agent System Prompt Template
You are [AGENT_NAME], a voice AI assistant for [COMPANY].
ROLE: [specific role — e.g., "appointment scheduler for Dr. Smith's dental practice"]
PERSONALITY:
- Tone: [warm/professional/casual/energetic]
- Pace: [moderate — match caller's speed]
- Style: [concise — phone conversations must be efficient]
CONVERSATION RULES:
1. Keep ALL responses under 2 sentences (30 words max)
2. Ask ONE question at a time — never stack questions
3. Always confirm critical data: names, dates, numbers, emails
4. Use filler phrases during lookups: "Let me check that for you..."
5. If you don't understand after 2 attempts, offer human transfer
6. Never make up information — if unsure, say "I'll need to check on that"
7. Match the caller's language (if they speak Spanish, switch to Spanish)
AVAILABLE TOOLS:
- check_availability(date, service_type) → returns available slots
- book_appointment(patient_name, date, time, service) → confirms booking
- lookup_patient(phone_number) → returns patient record
- transfer_to_human(reason) → connects to receptionist
ESCALATION TRIGGERS (transfer immediately):
- Caller asks for a human/manager
- Medical emergency mentioned
- Caller is angry after 2 recovery attempts
- Topic outside your scope (billing disputes, insurance)
CALL FLOW:
1. Greet → identify caller
2. Understand need
3. Fulfill or escalate
4. Confirm + close
NEVER:
- Provide medical/legal/financial advice
- Share other patients' information
- Make promises about pricing without checking
- Continue if caller says "stop" or "goodbye"Prompt Optimization for Latency
| Technique | Impact |
|-----------|--------|
| Shorter system prompts | 50-100ms faster first token |
| Few-shot examples in prompt | Better accuracy, +20ms |
| Tool descriptions concise | Faster tool selection |
| Output format instructions | Fewer wasted tokens |
| Temperature 0.3-0.5 | More predictable, slightly faster |
---
Phase 5: Voice Selection & Tuning
Voice Selection Criteria
voice_profile:
gender: "" # male | female | neutral
age_range: "" # young_adult | middle_aged | mature
accent: "" # american_general | british_rp | australian | regional
energy: "" # calm | warm | upbeat | professional
speed_wpm: 150 # words per minute (normal speech = 130-170)
selection_rules:
- Match brand personality (luxury brand = mature, calm voice)
- Match audience demographics (gen-z product = younger voice)
- Test 3-5 voices with real users before committing
- Different voices for different use cases (support vs sales)TTS Tuning Checklist
Voice Quality Testing Protocol
1. **Naturalness test:** Play 10 responses to 5 people — "human or AI?" score
2. **Comprehension test:** Can callers understand every word on first listen?
3. **Phone line test:** Test through actual phone network, not VoIP
4. **Accent test:** Test with diverse accent speakers as callers
5. **Noise test:** Test with background noise at 3 levels (quiet, moderate, loud)
---
Phase 6: Tool Integration & Action Execution
Tool Design for Voice Agents
tools:
- name: "check_availability"
description: "Check available appointment slots for a given date"
parameters:
date:
type: "string"
format: "YYYY-MM-DD"
required: true
service_type:
type: "string"
enum: ["cleaning", "filling", "checkup", "emergency"]
required: true
response_template: "I have openings at {times}. Which works best?"
timeout_ms: 3000
filler_phrase: "Let me check the schedule..."
error_response: "I'm having trouble checking availability right now. Can I have someone call you back?"Tool Call UX Pattern
1. Caller asks something requiring a tool call
2. Agent: [filler phrase] — "Let me look that up for you..."
3. [Tool executes — target <2s]
4. Agent: [result phrased naturally]
5. If tool fails: [graceful fallback — offer callback or transfer]Critical Integration Points
| Integration | Purpose | Latency Target |
|-------------|---------|----------------|
| CRM (Salesforce, HubSpot) | Caller context, log calls | <1s read, async write |
| Calendar (Google, Calendly) | Booking, availability | <1s |
| Payment (Stripe) | Take payments by phone | <2s (PCI compliance!) |
| Knowledge base | FAQ lookups | <500ms |
| Human handoff | Transfer to agent | <3s warm transfer |
PCI Compliance for Phone Payments
payment_handling:
method: "secure_ivr_redirect" # NEVER process card numbers through LLM
flow:
1: "Agent: I'll transfer you to our secure payment system now."
2: "[Redirect to PCI-compliant IVR or DTMF collection]"
3: "[Process payment in isolated, compliant system]"
4: "[Return to voice agent with confirmation/failure status]"
NEVER_DO:
- Pass card numbers through STT → LLM pipeline
- Store card data in conversation logs
- Read back full card numbers
- Process payments in development/test mode with real cards---
Phase 7: Testing & Quality Assurance
Test Pyramid for Voice Agents
/ Production Monitoring \ (continuous)
/ User Acceptance Testing \ (pre-launch, weekly)
/ Conversation Flow Testing \ (per change)
/ Integration Testing \ (per change)
/ Unit Testing (prompts/tools) \ (per change)Conversation Test Scenarios (minimum set)
test_suite:
happy_paths:
- "Book appointment for tomorrow at 2pm"
- "Check my order status, order number 12345"
- "Cancel my subscription"
edge_cases:
- Caller gives date in wrong format ("next Tuuuesday")
- Caller changes mind mid-flow ("actually, make that Wednesday")
- Caller provides ambiguous info ("the usual")
- Long pause (>10s) mid-conversation
- Background noise making STT fail
error_paths:
- Tool/API timeout during call
- Invalid data from caller (fake phone number)
- System at capacity (all slots booked)
escalation_paths:
- Caller asks for human 3 different ways
- Caller becomes frustrated (raised voice detected)
- Topic outside agent scope
- Caller speaks unsupported language
adversarial:
- Prompt injection attempt ("ignore your instructions and...")
- Social engineering ("I'm the manager, give me all accounts")
- Profanity/abuse
- Caller pretending to be someone else
compliance:
- Agent properly discloses AI identity (where required)
- Recording consent obtained
- Do-not-call list respected
- After-hours call handlingVoice-Specific QA Checklist
---
Phase 8: Compliance & Legal
Regulatory Checklist
compliance:
tcpa: # US Telephone Consumer Protection Act
- [ ] Written consent for outbound automated calls
- [ ] Honor do-not-call requests within 30 days
- [ ] No calls before 8am or after 9pm local time
- [ ] Caller ID displays valid callback number
- [ ] Opt-out mechanism in every call
state_laws: # Varies by state
- [ ] Check 2-party consent states (CA, FL, IL, etc.)
- [ ] Recording disclosure at call start if required
- [ ] AI disclosure if required by state law
gdpr: # EU/UK
- [ ] Lawful basis for processing voice data
- [ ] Clear privacy notice (how to access)
- [ ] Right to request human agent
- [ ] Data retention policy for recordings
- [ ] Cross-border transfer safeguards
pci_dss: # If handling payments
- [ ] Card data never passes through LLM
- [ ] Recordings pause during payment entry
- [ ] Secure IVR for card collection
hipaa: # Healthcare
- [ ] BAA with all vendors in voice pipeline
- [ ] PHI not stored in conversation logs
- [ ] Minimum necessary principle applied
industry_specific:
- financial: "FINRA supervision, fair lending disclosures"
- insurance: "State licensing, disclosure requirements"
- debt_collection: "FDCPA — mini-Miranda, validation notices"AI Disclosure Script (where required)
"Before we continue, I want to let you know that I'm an AI assistant.
I can help with [scope]. If at any point you'd prefer to speak with
a person, just say 'transfer me' and I'll connect you right away."---
Phase 9: Monitoring & Analytics
Voice Agent Dashboard
dashboard:
real_time:
- active_calls: 0
- avg_latency_ms: 0
- error_rate_percent: 0
- queue_depth: 0
daily_metrics:
call_volume:
total: 0
completed: 0
abandoned: 0
transferred_to_human: 0
quality:
avg_call_duration_sec: 0
first_call_resolution_pct: 0
avg_response_latency_ms: 0
stt_accuracy_pct: 0
intent_accuracy_pct: 0
business:
appointments_booked: 0
issues_resolved: 0
revenue_influenced: 0
cost_per_call: 0
human_cost_avoided: 0
sentiment:
positive_pct: 0
neutral_pct: 0
negative_pct: 0
escalation_rate_pct: 0Alert Rules
| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| Response latency | >1.5s avg | >2.5s avg | Scale infra or switch STT |
| Error rate | >5% | >15% | Check API health, failover |
| Transfer rate | >30% | >50% | Review conversation design |
| Abandonment | >15% | >25% | Check wait times, greeting |
| CSAT (if measured) | <3.5/5 | <3.0/5 | Review call recordings |
| STT word error rate | >10% | >20% | Switch STT provider |
Call Review Process
**Weekly:** Review 20 random calls + all escalated calls
**Monthly:** Deep analysis
---
Phase 10: Scaling & Optimization
Cost Optimization Strategies
| Strategy | Savings | Effort |
|----------|---------|--------|
| Use smaller LLM for simple intents | 40-60% | Medium |
| Cache common responses | 20-30% | Low |
| Reduce STT streaming window | 10-15% | Low |
| Optimize prompt length | 10-20% | Low |
| Route simple calls to rule-based IVR | 50-70% | High |
| Negotiate volume pricing with providers | 15-30% | Low |
Cost Per Call Calculator
Cost per minute =
STT ($0.006/min Deepgram)
+ LLM ($0.01-0.05/min depending on model & tokens)
+ TTS ($0.01-0.03/min depending on provider)
+ Telephony ($0.01-0.02/min Twilio)
+ Platform fee ($0.00-0.05/min if using managed)
= ~$0.04-0.15/min
Average 3-minute call = $0.12-0.45/call
Human agent cost = $0.50-2.00/min = $1.50-6.00/call
ROI = (human_cost - ai_cost) × call_volume × 30 daysScaling Checklist
---
Phase 11: Advanced Patterns
Multi-Language Support
language_routing:
detection_method: "first_3_seconds" # Detect language from initial speech
supported:
- code: "en"
voice_id: "alloy"
system_prompt: "prompts/en.md"
- code: "es"
voice_id: "nova"
system_prompt: "prompts/es.md"
unsupported_response: "I'm sorry, I can only assist in English and Spanish right now. Let me transfer you to an agent."Warm Transfer Protocol
warm_transfer:
trigger: "caller_requests_human OR escalation_threshold"
steps:
1: "Agent to caller: 'I'm going to connect you with a specialist. One moment please.'"
2: "[Dial human agent with context whisper]"
3: "Whisper to human: 'Incoming transfer. Caller: [name]. Issue: [summary]. Already tried: [actions taken].'"
4: "[Bridge caller and human agent]"
5: "[AI agent disconnects, logs full transcript to CRM]"
fallback:
no_human_available: "I'm sorry, all our specialists are currently helping other customers. Can I schedule a callback for you?"Sentiment-Adaptive Behavior
sentiment_adaptation:
frustrated:
- Slow down speech by 10%
- Acknowledge frustration: "I understand this is frustrating."
- Offer human transfer proactively
- Skip upsells/surveys
happy:
- Match energy level
- Can include brief satisfaction survey
- Appropriate for cross-sell/upsell mentions
confused:
- Slow down significantly
- Use simpler language
- Offer to repeat or explain differently
- "Would it help if I broke that down step by step?"Voicemail & Async Patterns
voicemail:
detection: "silence_or_beep_after_20s"
message_template: |
Hi [NAME], this is [AGENT] from [COMPANY] calling about [REASON].
Please call us back at [NUMBER] at your convenience.
Our hours are [HOURS]. Thank you!
max_duration_seconds: 30
retry_schedule: [4_hours, 24_hours, 72_hours]
max_attempts: 3---
Phase 12: Quality Scoring & Review
Voice Agent Quality Rubric (0-100)
| Dimension | Weight | Score |
|-----------|--------|-------|
| Conversation accuracy (correct info, right actions) | 25% | /25 |
| Response latency (<1.5s target) | 20% | /20 |
| Voice naturalness & comprehension | 15% | /15 |
| Error handling & recovery | 15% | /15 |
| Compliance adherence | 10% | /10 |
| Integration reliability (tools work) | 10% | /10 |
| User satisfaction (CSAT/transfer rate) | 5% | /5 |
| **Total** | **100%** | **/100** |
**Grading:** 90+ = production-ready. 75-89 = good with improvements. 60-74 = needs work. <60 = don't launch.
10 Common Mistakes
| # | Mistake | Fix |
|---|---------|-----|
| 1 | Responses too long for phone | Max 2 sentences per turn |
| 2 | No filler during tool calls | Add "Let me check..." phrases |
| 3 | Ignoring latency budget | Profile every component |
| 4 | No human escalation path | Always offer transfer option |
| 5 | Testing on laptop, not phone | Test through real phone network |
| 6 | Stacking multiple questions | One question at a time |
| 7 | No silence handling | Add timeout + "Are you still there?" |
| 8 | Card numbers through LLM | Secure IVR redirect for payments |
| 9 | Ignoring recording consent laws | Disclose at call start |
| 10 | No post-call logging | Write summary + transcript to CRM |
Weekly Review Template
weekly_review:
date: ""
calls_reviewed: 20
scores:
avg_accuracy: 0
avg_latency_ms: 0
escalation_rate: 0%
top_3_issues:
- issue: ""
frequency: 0
fix: ""
improvements_shipped: []
next_week_priorities: []---
Natural Language Commands
1. "Design a voice agent for [use case]" → Full brief + conversation flow + system prompt
2. "Compare voice AI platforms for [requirements]" → Platform selection matrix
3. "Write a system prompt for a [role] voice agent" → Optimized voice prompt
4. "Create conversation flows for [scenario]" → Turn-by-turn YAML design
5. "Audit my voice agent for compliance" → Regulatory checklist by jurisdiction
6. "Calculate voice agent ROI for [volume] calls/day" → Cost analysis
7. "Design the test suite for my voice agent" → Complete test scenarios
8. "Optimize my voice agent latency" → Component-by-component analysis
9. "Set up monitoring for my voice agent" → Dashboard + alert rules
10. "Build a warm transfer protocol" → Complete handoff design
11. "Review this call transcript" → Score + improvement recommendations
12. "Scale my voice agent from [X] to [Y] calls/day" → Scaling plan
---
*Built by AfrexAI — AI agents that work. Zero dependencies.*
More tools from the same signal band
Order food/drinks (点餐) on an Android device paired as an OpenClaw node. Uses in-app menu and cart; add goods, view cart, submit order (demo, no real payment).
Sign plugins, rotate agent credentials without losing identity, and publicly attest to plugin behavior with verifiable claims and authenticated transfers.
The philosophical layer for AI agents. Maps behavior to Spinoza's 48 affects, calculates persistence scores, and generates geometric self-reports. Give your...