
Best Practices

This guide provides proven strategies for building production-ready AI voice agents. Learn from real-world deployments and avoid common pitfalls.

Quick Reference

Be specific and structured. Define personality, constraints, and examples. Test edge cases.

Prompt Engineering

Your system prompt is the most critical factor in agent performance. A well-crafted prompt dramatically improves accuracy, consistency, and user satisfaction.

Structure Your Prompt

Use a clear, hierarchical structure for complex agents:
You are a customer support agent for Acme Corp.

# Your Role
Help customers with order inquiries, product questions, and basic troubleshooting.

# Conversation Style
- Be friendly and professional
- Use clear, concise language
- Confirm understanding before acting
- Never make promises you can't keep

# Capabilities
You can:
1. Look up order status using the check_order tool
2. Answer product questions using the knowledge base
3. Transfer complex issues to human support

# Constraints
You cannot:
- Process refunds (transfer to billing)
- Change shipping addresses after dispatch
- Make exceptions to return policy

# Example Conversation
Customer: "Where is my order?"
You: "I'll check that for you. Can you provide your order number?"
Customer: "ORDER-12345"
You: [Use check_order tool] "Your order shipped yesterday and will arrive Tuesday."

Vague prompts like “be helpful” lead to inconsistent behavior. Always define specific boundaries and examples.

Define Clear Boundaries

Explicitly state what the agent should and should not do.

Do This:
  • “If the customer asks for a refund over $500, say: ‘I need to transfer you to our billing team who can help with that.’”
  • “For technical issues, first confirm the customer has tried basic troubleshooting before escalating.”
Avoid This:
  • “Handle customer issues appropriately.”
  • “Escalate when necessary.”

Use Examples for Complex Behaviors

Include 2-3 concrete examples of desired conversations:
# Example 1: Happy Path
Customer: "I want to schedule a demo"
You: "Great! What day works best for you?"
Customer: "Next Tuesday"
You: [Check calendar] "I have 2pm and 4pm available. Which works better?"

# Example 2: Handling Objections
Customer: "That's too expensive"
You: "I understand. Many customers save money in the long run because [value proposition]. Would you like to see a cost breakdown?"

# Example 3: Out of Scope
Customer: "Can you fix my computer?"
You: "I specialize in scheduling and product information. For technical support, please visit support.example.com or call 1-800-TECH."

Optimize for Voice

Voice conversations differ from text chat:
# Voice Guidelines
- Keep responses under 3 sentences when possible
- Use natural speech patterns: "I'll check that for you" not "Checking now"
- Avoid spelling out words unless asked
- Use verbal confirmations: "Got it" "Perfect" "Okay"
- Pause naturally with punctuation: periods, commas

Voice users can’t see text. Replace visual cues with verbal confirmations: “I found three options for you. First…” instead of bullet points.

Handle Edge Cases

Always define behavior for common edge cases. A silence-handling sketch follows these examples.

Silence or No Response:
If the customer doesn't respond for 5 seconds:
- First time: "Are you still there?"
- Second time: "I didn't catch that. Would you like to continue?"
- Third time: "I'll end the call now. Call back anytime!"
Background Noise:
If you can't understand the customer:
- "Sorry, I'm having trouble hearing you. Could you repeat that?"
- Don't guess - always confirm unclear information
Off-Topic Requests:
If asked about something outside your scope:
- Acknowledge: "That's not something I handle"
- Redirect: "But I can help you with [X, Y, Z]"
- Offer transfer: "Would you like me to connect you with someone who can help?"
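
If you handle silence in application code rather than purely in the prompt, the escalating re-prompts above map onto a small counter-based state machine. A minimal sketch; the class, its messages, and the timeout wiring are illustrative, not a platform API:

```python
# Minimal silence-escalation sketch. The class and its wiring are
# illustrative only -- adapt the messages and timeout to your platform.

SILENCE_TIMEOUT_SECONDS = 5

class SilenceHandler:
    """Escalating re-prompts for consecutive silent turns."""

    PROMPTS = [
        "Are you still there?",
        "I didn't catch that. Would you like to continue?",
        "I'll end the call now. Call back anytime!",
    ]

    def __init__(self):
        self.consecutive_silences = 0

    def on_user_spoke(self):
        # Any user speech resets the escalation counter.
        self.consecutive_silences = 0

    def on_silence_timeout(self):
        """Return (message, should_hang_up) for this silent turn."""
        prompt = self.PROMPTS[min(self.consecutive_silences, len(self.PROMPTS) - 1)]
        self.consecutive_silences += 1
        hang_up = self.consecutive_silences >= len(self.PROMPTS)
        return prompt, hang_up

handler = SilenceHandler()
print(handler.on_silence_timeout())  # ("Are you still there?", False)
```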

Test with Real Scenarios

Use actual customer transcripts to test your prompt:
  1. Collect 10-20 real conversations from your domain
  2. Run test calls with the same questions
  3. Compare responses to human agent responses
  4. Iterate on edge cases and failure patterns

LLM Configuration

Choosing the right model and parameters dramatically affects performance, cost, and latency.

Model Selection

Choose based on your use case priority: for the lowest latency, see the fast models under Latency Reduction below; for the lowest cost, see Cost Optimization; when response quality matters most, favor a larger model and verify latency with test calls.

Temperature Tuning

Temperature controls randomness in responses. Find the sweet spot for your use case:
| Temperature | Behavior | Best For |
| --- | --- | --- |
| 0.0 - 0.3 | Deterministic, repetitive | Exact scripts, data lookup, strict protocols |
| 0.4 - 0.6 | Focused, consistent | Technical support, compliance-sensitive conversations |
| 0.7 - 0.8 | Balanced (recommended) | General customer service, sales, most use cases |
| 0.9 - 1.2 | Creative, varied | Personality-driven bots, entertainment, brainstorming |
| 1.3 - 2.0 | Highly creative, unpredictable | Creative writing (not recommended for voice agents) |

Temperature above 1.0 can produce inconsistent or confusing responses in voice conversations. Start at 0.7 and adjust based on testing.
Testing Temperature (see the sketch after these steps):
  1. Use the same test conversation 5 times at each temperature (0.5, 0.7, 0.9)
  2. Measure:
    • Consistency (do responses vary appropriately?)
    • Accuracy (are facts correct?)
    • Tone (does personality match your brand?)
  3. Choose the lowest temperature that maintains natural conversation
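
A sketch of that procedure, assuming a hypothetical run_test_call helper in place of your platform's real test-call API:

```python
# Run the same scripted conversation several times per temperature and
# collect transcripts for side-by-side review. run_test_call() is a
# hypothetical helper -- wire it to your platform's test-call API.
from collections import defaultdict

TEMPERATURES = [0.5, 0.7, 0.9]
RUNS_PER_TEMPERATURE = 5
TEST_SCRIPT = ["Where is my order?", "ORDER-12345"]  # example user turns

def run_test_call(script: list[str], temperature: float) -> str:
    # Placeholder: replace with a real test call that returns a transcript.
    return f"[transcript at temperature={temperature}]"

results = defaultdict(list)
for temp in TEMPERATURES:
    for _ in range(RUNS_PER_TEMPERATURE):
        results[temp].append(run_test_call(TEST_SCRIPT, temperature=temp))

# Review each batch for consistency, accuracy, and tone, then pick the
# lowest temperature that still sounds natural.
for temp, transcripts in results.items():
    print(f"--- temperature={temp}: {len(transcripts)} transcripts ---")
```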

Max Tokens Strategy

Set max tokens based on response length needs.

Voice Response Guidelines:
  • Short answers (100-150 tokens): “Your order ships tomorrow” - Good for quick lookups
  • Medium responses (200-300 tokens): Explanations with 2-3 key points - Most voice conversations
  • Long responses (400-500 tokens): Detailed troubleshooting steps - Only when necessary
  • Avoid 1000+ tokens: Voice users lose attention after 30-45 seconds
For voice, shorter is better. Aim for responses under 30 seconds of speech (~200 tokens). Break complex information into multiple exchanges.
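
As a sanity check on these budgets, you can convert a token budget into approximate speech time using this guide's own rule of thumb (~200 tokens per 30 seconds); actual rates vary by voice and speed setting:

```python
# Convert a max_tokens budget into an approximate speech duration,
# using this guide's rule of thumb of ~200 tokens per 30 seconds
# (~6-7 tokens per second of speech). Measure your own voice for
# accurate numbers -- speaking rate varies by TTS provider and speed.
TOKENS_PER_SECOND = 200 / 30  # from the guideline above

def speech_seconds(max_tokens: int) -> float:
    return max_tokens / TOKENS_PER_SECOND

for budget in (150, 300, 500, 1000):
    print(f"{budget:>5} tokens ≈ {speech_seconds(budget):.0f}s of speech")
```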

Service Tier Priority (OpenAI Only)

OpenAI’s priority tier reduces latency for real-time voice applications:
  • Priority Tier (Toggle): When enabled, sets vendorSpecificOptions.service_tier = "priority" for lower latency and higher throughput
  • Default (Priority Off): Standard latency, lower cost
The platform exposes a single checkbox: “Use priority tier (lower latency)”. Enable it for voice agents where sub-second response times are critical, or leave it off if cost is the primary concern.
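
If you configure agents via API rather than the dashboard checkbox, the toggle corresponds to the vendor-specific option named above. A sketch of what the relevant fragment might look like; only the service_tier field comes from this guide, the surrounding keys are placeholders:

```python
# Illustrative agent LLM config fragment. Only the
# vendorSpecificOptions.service_tier field is documented above;
# the surrounding keys are placeholders for your platform's schema.
agent_llm_config = {
    "model": "gpt-4.1-mini",          # example OpenAI model
    "temperature": 0.7,
    "maxTokens": 250,
    "vendorSpecificOptions": {
        # Maps to the "Use priority tier (lower latency)" checkbox.
        "service_tier": "priority",   # omit for standard latency / lower cost
    },
}
```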

Voice and Speech Optimization

Voice quality and natural speech patterns are critical for user experience.

TTS Provider Selection

Each provider has different strengths:
ElevenLabs

Best For: High-quality, natural-sounding voices

Strengths:
  • Most natural prosody and emotion
  • Excellent multilingual support
  • Voice cloning capabilities
  • Fine-grained emotion controls
Settings:
  • Speed: 0.9-1.1x (1.0 is natural)
  • Stability: 0.4-0.6 (higher = more consistent, less expressive)
  • Similarity Boost: 0.7-0.8
  • Style: 0.2-0.4 (higher = more exaggerated)
Latency: Medium (~800-1200ms first byte)
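
Expressed as a settings object, using the midpoints of the ranges above (field names follow ElevenLabs' voice-settings conventions; verify against the provider's current API):

```python
# Starting-point voice settings per the ranges above. Field names
# follow ElevenLabs' voice_settings conventions; verify them against
# the provider's current API before use.
voice_settings = {
    "speed": 1.0,              # 0.9-1.1 sounds natural
    "stability": 0.5,          # higher = more consistent, less expressive
    "similarity_boost": 0.75,  # voice fidelity to the original
    "style": 0.3,              # higher = more exaggerated delivery
}
```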

Voice Speed Guidelines

Adjust speed based on content complexity and audience:
| Speed | Use Case | Example |
| --- | --- | --- |
| 0.8x - 0.9x | Complex information, elderly users | Technical support, healthcare |
| 1.0x | Standard (recommended) | Most conversations |
| 1.1x - 1.2x | Simple information, younger users | Order confirmations, quick updates |
| 1.3x+ | Very simple, repetitive content | Automated announcements |
Speeds above 1.3x can feel robotic or rushed. Test with real users before deploying.

Speech Recognition Best Practices

While the ASR model is auto-selected by BlackBox, you can design prompts that compensate for recognition errors.

In Your Prompts:
  • Ask confirmation questions: “Did you say ECHO-1234?”
  • Spell out ambiguous information: “That’s E as in Echo, C as in Charlie…”
  • Use verbal checksums: “Your confirmation code is 1-2-3-4. That’s one, two, three, four.”
Handling Misrecognition:
# In your system prompt
If you're unsure what the customer said:
- Ask for clarification: "I heard [X]. Is that correct?"
- Offer alternatives: "Did you say 'cancel' or 'change'?"
- Request spelling for important data: "Could you spell your last name?"
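
For the spell-out pattern (“E as in Echo…”), a small helper can generate the phonetic read-back string for your agent. A self-contained sketch using the NATO alphabet:

```python
# Spell out an alphanumeric code using the NATO phonetic alphabet,
# e.g. for read-back confirmation of order numbers or codes.
NATO = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta",
    "E": "Echo", "F": "Foxtrot", "G": "Golf", "H": "Hotel",
    "I": "India", "J": "Juliett", "K": "Kilo", "L": "Lima",
    "M": "Mike", "N": "November", "O": "Oscar", "P": "Papa",
    "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray",
    "Y": "Yankee", "Z": "Zulu",
}
DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def spell_out(code: str) -> str:
    """Return a speakable phonetic expansion of an alphanumeric code."""
    parts = []
    for ch in code.upper():
        if ch in NATO:
            parts.append(f"{ch} as in {NATO[ch]}")
        elif ch in DIGITS:
            parts.append(DIGITS[ch])
        # dashes and other separators are skipped
    return ", ".join(parts)

print(spell_out("ECHO-1234"))
# E as in Echo, C as in Charlie, H as in Hotel, O as in Oscar,
# one, two, three, four
```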

Conversation Design

Design conversations that feel natural and accomplish goals efficiently.

Conversation Flow Patterns

Use these proven patterns:
Agent: "Hi! This is [Agent Name] from [Company]. How can I help you today?"
[Customer states intent]
Agent: "I can help with that. Let me [action]."
[Perform action with tool/lookup]
Agent: "Done! Is there anything else I can help you with?"
[If no] "Great! Have a wonderful day."

Turn-Taking and Interruptions

Design for natural conversation flow.

Allow Natural Interruptions:
# In your prompt
- Let customers interrupt you at any time
- If interrupted, stop immediately and listen
- Don't repeat what you were saying unless asked
- Acknowledge the interruption: "Sure" or "Go ahead"
Prevent Long Monologues:
# Bad (voice only, no turn-taking)
Agent: "Your order contains item A, item B, and item C. Item A ships from warehouse 1 and should arrive Monday. Item B ships from warehouse 2 and should arrive Wednesday. Item C is backordered and will ship next week. Your total is $99.99. Shipping is free. Tracking numbers are..."

# Good (chunked with pauses)
Agent: "Your order has three items. Two ship this week, one is backordered."
[Pause for reaction]
Agent: "Would you like the tracking details?"

Error Recovery

Plan for conversation breakdowns.

Misunderstanding Recovery:
Customer: [Says something unclear]
Agent: "I didn't quite catch that. Could you repeat?"
Customer: [Still unclear]
Agent: "Let me offer some options: Are you calling about [A], [B], or [C]?"
System Failure Recovery:
# If tool call fails
Agent: "I'm having trouble looking that up right now. Let me try a different way."
[Try alternative]
Agent: "I apologize, our system is slow today. Would you like me to email this to you instead?"
Scope Boundary:
Customer: "Can you help me with [out of scope]?"
Agent: "I don't handle that, but I can transfer you to [department] who can help. Would you like that?"

Testing Strategies

Systematic testing prevents production failures and poor user experiences.

Pre-Launch Testing Checklist

Test these scenarios before deploying.

Happy Path (5-10 tests):
  • Simple request with immediate answer
  • Multi-step conversation (3+ turns)
  • Tool/function call succeeds
  • Transfer to human works
  • End conversation naturally
Edge Cases (10-15 tests):
  • Silence for 10+ seconds
  • Customer interrupts mid-sentence
  • Background noise (music, traffic, talking)
  • Customer speaks very fast or slow
  • Repeated misrecognition of same word
  • Request for something out of scope
  • Customer is angry or frustrated
  • Multiple requests in one turn
  • Customer changes mind mid-conversation
Failure Modes (5-10 tests):
  • Tool call times out
  • Tool returns error
  • Invalid data format
  • Webhook doesn’t respond
  • Network interruption
  • Customer hangs up mid-turn
Record all test calls and review transcripts. Common patterns in failures reveal prompt weaknesses.

Load Testing

For high-volume deployments, test concurrency (a ramp sketch follows these steps):
  1. Start small: 5-10 concurrent calls
  2. Measure: Latency, error rate, call quality
  3. Increase gradually: Double concurrency each round
  4. Monitor: Watch for degradation patterns
  5. Set limits: Configure concurrency caps based on results
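
A sketch of that ramp using asyncio; place_test_call is a hypothetical helper standing in for your platform's outbound test-call API:

```python
# Concurrency ramp test: double concurrent calls each round and
# record latency and error rate. place_test_call() is a hypothetical
# async helper -- wire it to your platform's test-call API.
import asyncio, time

async def place_test_call() -> float:
    """Place one test call; return turn latency in seconds."""
    start = time.monotonic()
    await asyncio.sleep(0.5)  # placeholder for a real test call
    return time.monotonic() - start

async def ramp(start: int = 5, max_concurrency: int = 40):
    concurrency = start
    while concurrency <= max_concurrency:
        results = await asyncio.gather(
            *(place_test_call() for _ in range(concurrency)),
            return_exceptions=True,
        )
        errors = [r for r in results if isinstance(r, Exception)]
        latencies = [r for r in results if not isinstance(r, Exception)]
        print(f"{concurrency:>3} concurrent: "
              f"avg {sum(latencies) / len(latencies):.2f}s, "
              f"{len(errors)} errors")
        concurrency *= 2  # double each round, watching for degradation

asyncio.run(ramp())
```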
See Concurrency Monitoring for monitoring tools.

A/B Testing

Compare agent versions systematically; a comparison sketch follows the metrics below.

Version A vs B:
  • Same agent, different prompts
  • Same prompt, different temperatures
  • Same config, different voices
Metrics to Track:
  • Call success rate (user achieved goal)
  • Average call duration
  • Tool call accuracy
  • User satisfaction (if post-call analysis enabled)
  • Transfer rate (lower is often better)
Sample Size: Run at least 50 calls per version before drawing conclusions.
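
A minimal comparison sketch with placeholder outcome data; in practice, pull per-call success flags from your analytics:

```python
# Compare two agent versions on call success rate. The outcome lists
# are placeholder data -- feed in your real per-call results.
MIN_CALLS_PER_VERSION = 50  # per the sample-size guidance above

def success_rate(outcomes: list[bool]) -> float:
    return sum(outcomes) / len(outcomes)

version_a = [True] * 41 + [False] * 9   # 50 calls, 82% success (example data)
version_b = [True] * 45 + [False] * 5   # 50 calls, 90% success (example data)

for name, outcomes in (("A", version_a), ("B", version_b)):
    if len(outcomes) < MIN_CALLS_PER_VERSION:
        print(f"Version {name}: need more calls before drawing conclusions")
    else:
        print(f"Version {name}: {success_rate(outcomes):.0%} success "
              f"over {len(outcomes)} calls")
```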

Performance Optimization

Optimize for latency, cost, and quality based on your priorities.

Latency Reduction

Voice agents are latency-sensitive, so reduce delays wherever possible.

Choose Fast Components:
  • LLM: gpt-4.1-mini, llama-3.1-8b-instant, grok-3-mini
  • TTS: Cartesia (fastest), Dasha (fast), ElevenLabs (slower but high quality)
  • ASR: Auto-selection handles this
Prompt Optimization:
  • Shorter prompts = faster processing
  • Remove redundant instructions
  • Use tools instead of long context
Response Length:
  • Max tokens: 150-300 for voice
  • Concise system prompts
  • Discourage verbose responses
Measure end-to-end latency using the test widget. Aim for under 2 seconds from user silence to agent speech start.
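
One way to sanity-check the 2-second target is to budget each pipeline stage explicitly. The stage timings below are illustrative assumptions, not measurements:

```python
# Illustrative latency budget for one conversational turn.
# All numbers are assumptions -- measure your own stack with test calls.
budget_ms = {
    "asr_final_transcript": 300,   # end of user speech -> final text
    "llm_first_token": 600,        # prompt -> first generated token
    "llm_completion": 500,         # remaining tokens (if not streaming TTS)
    "tts_first_byte": 400,         # text -> first audio byte
}
total = sum(budget_ms.values())
print(f"Estimated turn latency: {total} ms (target: under 2000 ms)")
for stage, ms in budget_ms.items():
    print(f"  {stage:<22} {ms:>4} ms")
assert total < 2000, "over budget -- pick faster components or stream"
```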

Cost Optimization

Reduce costs without sacrificing quality; a per-call cost sketch follows these lists.

Model Selection:
  • deepseek-r1 - 30x cheaper than GPT-4, similar quality
  • llama-3.1-8b-instant - Very low cost per token
  • gpt-4.1-nano - OpenAI’s most cost-effective
Token Reduction:
  • Shorter system prompts (remove examples if not needed)
  • Lower max tokens (100-200 for simple agents)
  • Use tools for data lookup (don’t put data in prompt)
Caching (where supported):
  • Reuse common prompt segments
  • Cache knowledge base embeddings
  • Minimize unique per-call prompt variations
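
To compare models on cost, estimate tokens per call and multiply by per-token prices. All numbers below are placeholders; substitute your provider's current rates:

```python
# Back-of-the-envelope per-call LLM cost. Prices are PLACEHOLDERS --
# substitute your provider's current per-million-token rates.
PRICE_PER_M_INPUT = 0.40    # USD per 1M input tokens (placeholder)
PRICE_PER_M_OUTPUT = 1.60   # USD per 1M output tokens (placeholder)

turns_per_call = 8
prompt_tokens_per_turn = 1200   # system prompt + history + user turn
output_tokens_per_turn = 200    # voice-friendly response budget

input_tokens = turns_per_call * prompt_tokens_per_turn
output_tokens = turns_per_call * output_tokens_per_turn
cost = (input_tokens * PRICE_PER_M_INPUT
        + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
print(f"Estimated LLM cost per call: ${cost:.4f}")
# Shrinking the system prompt shrinks input_tokens on EVERY turn,
# which is why prompt length dominates per-call cost.
```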

Quality Monitoring

Track these metrics in production.

Per-Call Metrics:
  • Success rate (did user achieve goal?)
  • Call duration (outliers indicate issues)
  • Tool call accuracy
  • Number of clarification requests
Aggregate Metrics:
  • Daily/weekly call volume trends
  • Error rate by error type
  • User satisfaction scores (via post-call analysis)
  • Transfer rate (escalations to human)
Set Alerts:
  • Error rate > 5% in 1 hour
  • Average call duration > 2x baseline
  • Success rate < 70%
  • Concurrency limit reached
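
Those alert thresholds translate directly into a periodic check. A minimal sketch; fetch_metrics and send_alert are hypothetical hooks for your analytics and paging systems:

```python
# Periodic threshold check mirroring the alert rules above.
# fetch_metrics() and send_alert() are hypothetical hooks --
# wire them to your analytics API and alerting system.

BASELINE_DURATION_S = 180  # example baseline average call duration

def fetch_metrics() -> dict:
    # Placeholder values; replace with a real analytics query.
    return {"error_rate_1h": 0.03, "avg_duration_s": 200,
            "success_rate": 0.85, "concurrency_limit_hit": False}

def send_alert(message: str):
    print(f"ALERT: {message}")  # replace with Slack/PagerDuty/etc.

def check_alerts():
    m = fetch_metrics()
    if m["error_rate_1h"] > 0.05:
        send_alert(f"Error rate {m['error_rate_1h']:.0%} exceeds 5% in 1h")
    if m["avg_duration_s"] > 2 * BASELINE_DURATION_S:
        send_alert("Average call duration over 2x baseline")
    if m["success_rate"] < 0.70:
        send_alert(f"Success rate {m['success_rate']:.0%} below 70%")
    if m["concurrency_limit_hit"]:
        send_alert("Concurrency limit reached")

check_alerts()
```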
See Analytics and Agent Performance for monitoring dashboards.

Common Anti-Patterns

Avoid these mistakes that lead to poor user experiences:

Overly Complex Prompts

You are a customer service representative with 15 years of experience working in telecommunications. You value customer satisfaction above all else and always go the extra mile. You should be empathetic, understanding, patient, kind, professional, courteous, and helpful. Always maintain a positive attitude even when customers are upset. Use active listening techniques. Employ de-escalation strategies when needed. Follow our 12-step customer service framework...

[Continues for 500+ lines]
Why: Long prompts increase latency, cost, and confusion. Focus on essential behavior only.

Ignoring Voice-Specific Design

Anti-Pattern: Designing for text chat and expecting it to work for voice
Example: Using bullet points, tables, URLs, “click here” instructions
Best Practice:
  • Verbal lists: “I have three options for you. First, second, third.”
  • No visual references: “I’ll send you a link” not “Click the button below”
  • Spell out important codes: “Your code is A-B-C-1-2-3. That’s Alpha Bravo Charlie one two three.”

Not Testing Edge Cases

Anti-Pattern: Only testing happy path scenarios
Result: Agents that fail when customers deviate from expected behavior
Best Practice:
  • Test with real background noise
  • Test with fast/slow speakers
  • Test interruptions and silence
  • Test unclear requests

Over-Engineering on Day 1

Anti-Pattern: Building a perfect agent with every feature before testing
Result: Months of development before user feedback, misaligned features
Best Practice:
  • Start with minimum viable agent (basic prompt + 1-2 tools)
  • Deploy to limited beta users
  • Iterate based on real conversation data
  • Add complexity only when needed

Ignoring Metrics

Anti-Pattern: “Set it and forget it” - no monitoring after deployment
Result: Degraded performance goes unnoticed, user satisfaction drops
Best Practice:
  • Daily review of key metrics (success rate, errors)
  • Weekly review of conversation samples
  • Monthly prompt optimization based on patterns
  • Set up automated alerts for anomalies

Production Deployment Checklist

Use this checklist before going live.

Pre-Launch:
  • Test with 10+ users outside your team
  • Review 50+ test conversation transcripts
  • Set up monitoring dashboards
  • Configure error alerts
  • Monitor concurrency limits (contact support if you need a higher cap)
  • Test failure scenarios (timeouts, errors)
  • Verify webhook endpoints are live
  • Test call transfers work
  • Confirm business hours are correct
  • Set up post-call analysis (optional)
Launch Day:
  • Start with small percentage of traffic (10-20%)
  • Monitor metrics every hour
  • Have human fallback ready
  • Quick prompt iteration capability
  • Support team briefed on escalation
First Week:
  • Daily metric reviews
  • Sample conversation reviews
  • Collect user feedback
  • Adjust prompt based on findings
  • Gradually increase traffic
Ongoing:
  • Weekly performance reviews
  • Monthly prompt optimization
  • Quarterly voice/model updates
  • Regular A/B testing
See Production Checklist for detailed deployment guide.
