Voice & Speech Configuration
Your agent’s voice is crucial to creating engaging, natural conversations. BlackBox supports five production-ready TTS providers, each with unique characteristics, voices, and configuration options.
Overview
Voice configuration involves two main components:
- Text-to-Speech (TTS): Converts your agent’s responses into natural-sounding audio
- Speech-to-Text (STT/ASR): Transcribes user speech into text for the LLM to process
What You’ll Configure
- TTS Provider: Choose from ElevenLabs, Cartesia, Dasha, Inworld, or LMNT
- Voice Selection: Pick from hundreds of voices across languages and accents
- Voice Model: Select the synthesis model (quality vs. speed tradeoff)
- Speech Speed: Adjust playback rate (provider-dependent)
- Provider-Specific Options: Fine-tune voice characteristics
ASR Configuration
The platform dynamically chooses the best ASR provider based on:
- User’s detected language and accent
- Network conditions and latency
- Call quality metrics
- Provider availability
TTS Provider Comparison
Quick Reference Table
| Provider | Best For | Speed Range | Emotions | Voice Cloning | Latency |
|---|---|---|---|---|---|
| ElevenLabs | Natural quality, customization | 0.70x - 1.20x | Via style parameter | Yes | Medium |
| Cartesia | Ultra-low latency, emotions | 0x - 2.0x (speeds below 0.25x sent as 0.25x) | 20 emotion levels | No | Ultra-low |
| Dasha | Platform-native, widest speed range | 0.25x - 4.0x | No | No | Low |
| Inworld | Character voices, gaming | 0.80x - 1.50x | Via temperature/pitch | No | Medium |
| LMNT | Consistent, lightweight | Fixed (1.0x) | No | No | Low |
Provider Deep Dive
- ElevenLabs
- Cartesia
- Dasha
- Inworld
- LMNT
ElevenLabs
Strengths:
- Industry-leading voice quality
- Extensive voice library (1000+ voices)
- Advanced customization options
- Voice cloning support
- Multilingual capabilities
Models:
- eleven_multilingual_v2 - Multilingual V2 (best quality)
- eleven_turbo_v2_5 - Turbo V2.5 (balanced)
- eleven_flash_v2_5 - Flash V2.5 (fastest)
Provider-Specific Options:
- Similarity Boost (0.0 - 1.0, default: 0.75)
  - Controls voice consistency with original
  - Higher = more similar to base voice
  - Lower = more variation allowed
- Stability (0.0 - 1.0, default: 0.5)
  - Controls voice stability across generations
  - Higher = more consistent output
  - Lower = more expressive/variable
- Style (0.0 - 1.0, default: 0.3)
  - Controls speaker style exaggeration
  - Higher = more stylized delivery
  - Lower = more neutral tone
- Use Speaker Boost (boolean, default: true)
  - Enhances voice clarity and quality
  - Recommended for most use cases
- Optimize Streaming Latency (0-4, default: 4)
  - Trades quality for lower latency
  - 0 = highest quality, highest latency
  - 4 = lowest latency, acceptable quality
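The ranges and defaults listed for the ElevenLabs options can be captured in a small validating container. This is only a sketch: the class name and field names are illustrative assumptions, not the BlackBox configuration schema (only `optimize_streaming_latency` appears verbatim elsewhere in this guide).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElevenLabsOptions:
    """Hypothetical holder for the ElevenLabs options; names are illustrative."""

    similarity_boost: float = 0.75
    stability: float = 0.5
    style: float = 0.3
    use_speaker_boost: bool = True
    optimize_streaming_latency: int = 4

    def __post_init__(self) -> None:
        # The three float options share the documented 0.0 - 1.0 range.
        for name in ("similarity_boost", "stability", "style"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
        # Latency optimization is an integer level from 0 to 4.
        level = self.optimize_streaming_latency
        if not isinstance(level, int) or not 0 <= level <= 4:
            raise ValueError("optimize_streaming_latency must be an integer from 0 to 4")


# Defaults as documented, then a more expressive variant.
defaults = ElevenLabsOptions()
expressive = ElevenLabsOptions(stability=0.3, style=0.6)
```

Validating on construction surfaces an out-of-range value before it reaches the provider, which is easier to debug than a rejected synthesis request.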
Voice Selection
Browsing Available Voices
Access all provider voices through the Voice & Speech tab:
- Navigate: Go to agent creation/editing → Voice & Speech tab
- Select Provider: Choose your TTS provider first
- Browse Voices: Search and filter available voices
- Preview: Listen to voice samples before selecting
- Select: Choose your preferred voice

Voice Attributes
Each voice includes metadata:
- Name: Voice identifier (e.g., “Rachel”, “Mark”)
- Language: Primary language and locale (e.g., “en-US”, “es-ES”)
- Gender: Male, Female, or Neutral
- Description: Voice characteristics and use cases
- Provider: TTS service (ElevenLabs, Cartesia, etc.)
Custom Voice IDs
You can use voice IDs not listed in the default picker:
- Find the voice ID from your TTS provider’s documentation
- Click “Use custom voice ID” button below the voice selector
- Enter the custom voice ID in the dialog that appears
- Preview to verify the voice works correctly
- Save your agent configuration
Voice Preview
Using Voice Preview
Before committing to a voice, test it with your actual content:
- Enter Preview Text: Type or paste sample text (up to 1000 characters)
- Configure Settings: Adjust speed, options as needed
- Click Preview: Generate and listen to audio sample
- Iterate: Try different voices, speeds, and settings
- Select Best Match: Choose the configuration that sounds best

Preview Best Practices
Text Selection:
- Use representative samples from your agent’s actual responses
- Include questions, statements, and conversational phrases
- Test punctuation handling (commas, periods, exclamation points)
- Try names, numbers, and special terms your agent will use
What to Listen For:
- Natural pronunciation of domain-specific terms
- Appropriate pacing for your use case
- Emotional tone matches your brand
- Clarity at different speeds
- Consistent quality across phrases
Character Limit
Voice preview supports up to 1000 characters per preview.
- Real-time character counter shows remaining characters
- Warning displays when approaching limit
- Limit enforced by the VOICE_PREVIEW_TEXT_LIMIT constant
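The counter behavior described above can be sketched in a few lines. The 1000-character limit and the constant name come from this guide; the 90% warning threshold and the function name are assumptions, since the guide does not say when the warning appears.

```python
VOICE_PREVIEW_TEXT_LIMIT = 1000  # character limit stated in this guide
WARNING_RATIO = 0.9              # assumption: warn once 90% of the limit is used


def preview_status(text: str) -> dict:
    """Report remaining characters and whether to warn or reject the text."""
    used = len(text)
    remaining = VOICE_PREVIEW_TEXT_LIMIT - used
    over_limit = remaining < 0
    return {
        "remaining": max(remaining, 0),
        "warn": not over_limit and used >= VOICE_PREVIEW_TEXT_LIMIT * WARNING_RATIO,
        "over_limit": over_limit,
    }
```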
Technical Detail: Preview uses the same /api/v1/voice/synthesize endpoint as production calls, ensuring accuracy.
Speed Configuration
Speed Ranges by Provider
Different providers support different speed ranges:
| Provider | Min Speed | Max Speed | Default | Recommended Range |
|---|---|---|---|---|
| ElevenLabs | 0.70x | 1.20x | 1.0x | 0.9x - 1.2x |
| Cartesia | 0x (sent as 0.25x) | 2.0x | 1.0x | 0.8x - 1.3x |
| Dasha | 0.25x | 4.0x | 1.0x | 0.8x - 1.5x |
| Inworld | 0.80x | 1.50x | 1.0x | 0.9x - 1.1x |
| LMNT | Fixed 1.0x | Fixed 1.0x | 1.0x | 1.0x (only) |
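To see what a requested speed turns into for each provider, the table can be encoded as a lookup. This is a sketch under one assumption: out-of-range values are clamped to the nearest supported speed. The guide only confirms that behavior for Cartesia (values below 0.25x are sent as 0.25x); other providers might reject such values instead. The function name is illustrative.

```python
# (min, max) playback speed per provider, taken from the speed table.
SPEED_RANGES = {
    "elevenlabs": (0.70, 1.20),
    "cartesia": (0.25, 2.0),  # 0x is accepted, but anything below 0.25x is sent as 0.25x
    "dasha": (0.25, 4.0),
    "inworld": (0.80, 1.50),
    "lmnt": (1.0, 1.0),       # fixed speed
}


def effective_speed(provider: str, requested: float) -> float:
    """Clamp a requested speed into the provider's supported range (assumed behavior)."""
    low, high = SPEED_RANGES[provider.lower()]
    return min(max(requested, low), high)
```

For example, `effective_speed("Cartesia", 0.1)` yields 0.25, and any request to LMNT yields 1.0.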
Choosing the Right Speed
Slower Speeds (0.7x - 0.9x):
- Accessibility needs
- Complex information delivery
- Educational content
- Non-native speakers
- Legal/compliance disclosures
Normal Speed (1.0x):
- General conversation
- Customer support
- Sales calls
- Most use cases
Faster Speeds (above 1.0x):
- Time-sensitive scenarios
- Familiar/repetitive content
- High-volume information
- Experienced users
Voice Cloning
Create custom voices that match your brand identity with ElevenLabs voice cloning.
Cloning Process
- Prepare Audio: Record 1-5 minutes of clean voice samples
- Upload: Use the voice cloning interface or API
- Configure: Set name, description, language
- Clone: ElevenLabs processes your samples
- Use: Select cloned voice in agent configuration
API-Based Voice Cloning
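Before sending samples to POST /api/v1/voice/clone, it can help to check them against the 1-5 minute guideline from the cloning steps. The sketch below does only that local check, using Python's standard `wave` module. It assumes WAV input and assumes the guideline applies to the combined length of all samples; the request body for the clone endpoint is not shown in this guide, so no upload call is attempted here.

```python
import io
import wave

MIN_SECONDS = 60   # 1 minute, per the cloning steps
MAX_SECONDS = 300  # 5 minutes


def wav_duration_seconds(data: bytes) -> float:
    """Return the length of a WAV recording held in memory."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def samples_within_guideline(samples: list[bytes]) -> bool:
    """Check that the combined sample length falls within 1-5 minutes."""
    total = sum(wav_duration_seconds(sample) for sample in samples)
    return MIN_SECONDS <= total <= MAX_SECONDS
```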
Managing Cloned Voices
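Cloned voices are updated and deleted through PATCH and DELETE on /api/v1/voice/clone/{voiceId}. The following sketch builds those requests with the standard library. The host, the bearer-token header, and the JSON field names are assumptions made for illustration; only the paths and HTTP methods come from this guide.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://blackbox.example.com"  # placeholder host, an assumption


def _voice_url(voice_id: str) -> str:
    # Quote the ID so unusual characters cannot alter the path.
    return f"{BASE_URL}/api/v1/voice/clone/{urllib.parse.quote(voice_id, safe='')}"


def update_voice_request(voice_id: str, api_key: str, **fields) -> urllib.request.Request:
    """Build a PATCH request; field names such as name/description are assumed."""
    return urllib.request.Request(
        _voice_url(voice_id),
        data=json.dumps(fields).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="PATCH",
    )


def delete_voice_request(voice_id: str, api_key: str) -> urllib.request.Request:
    """Build a DELETE request for a cloned voice."""
    return urllib.request.Request(
        _voice_url(voice_id),
        headers={"Authorization": f"Bearer {api_key}"},
        method="DELETE",
    )
```

Either request can be sent with `urllib.request.urlopen(request)` once the real host and credentials are filled in.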
Speech Recognition (ASR)
How ASR Works in BlackBox
Speech-to-Text (STT), also called Automatic Speech Recognition (ASR), converts user audio into text for your agent’s LLM to process.
Automatic Provider Selection: BlackBox automatically selects the optimal ASR provider based on:
- Language Detection: Identifies user’s spoken language
- Accent Recognition: Adjusts for regional variations
- Network Quality: Adapts to connection conditions
- Provider Performance: Routes to best-performing provider
- Real-Time Optimization: Switches providers if quality degrades
Supported ASR Providers
BlackBox integrates with multiple ASR providers for redundancy and quality:
- Deepgram: High-accuracy, low-latency transcription
- Microsoft Speech Services: Enterprise-grade recognition
- Auto (Platform-Managed): Automatic provider selection (recommended)
No Configuration Required: You don’t need to select an ASR provider. The platform handles this automatically for optimal results.
ASR Quality Factors
Several factors affect transcription accuracy:
Audio Quality:
- Clear microphone input
- Minimal background noise
- Good network connection
- Proper audio levels
User Factors:
- Speech clarity and pace
- Accent and pronunciation
- Use of domain-specific terms
- Speaking patterns
Platform Optimizations:
- Automatic noise cancellation
- Echo suppression
- Acoustic model adaptation
- Real-time quality monitoring
Improving ASR Accuracy
While ASR is automatic, you can help improve accuracy:
- Agent Prompting: Guide users to speak clearly in your agent’s greeting
- Confirmation: Have agent repeat understood information for verification
- Clarification: Prompt for clarification when confidence is low
- Domain Terms: Provide common terms in your agent’s context (coming soon)
Configuration Examples
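The four setups covered in this section might look roughly like the dictionaries below. Treat this as a sketch of shape only: apart from the model IDs, `primaryLanguage`, and `optimize_streaming_latency`, which appear in this guide, every key name is a hypothetical placeholder, and the angle-bracket values must be replaced with real voice IDs and options.

```python
# Basic: provider, voice, and default speed. Key names are assumptions.
basic = {"provider": "elevenlabs", "voiceId": "<voice-id>", "speed": 1.0}

# ElevenLabs with fine-tuned parameters (ranges as documented for ElevenLabs).
elevenlabs_custom = {
    "provider": "elevenlabs",
    "voiceId": "<voice-id>",
    "model": "eleven_turbo_v2_5",
    "speed": 1.0,
    "options": {
        "similarity_boost": 0.75,
        "stability": 0.5,
        "style": 0.3,
        "use_speaker_boost": True,
        "optimize_streaming_latency": 4,
    },
}

# Cartesia with emotions; the emotion format is not specified in this guide.
cartesia_emotive = {
    "provider": "cartesia",
    "voiceId": "<voice-id>",
    "speed": 1.0,
    "options": {"emotion": "<emotion-setting>"},
}

# Multilingual agent; "languageSwitching" is a hypothetical flag name.
multilingual = {
    "provider": "elevenlabs",
    "voiceId": "<voice-id>",
    "model": "eleven_multilingual_v2",
    "primaryLanguage": "en-US",
    "languageSwitching": True,
}
```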
Basic Configuration (Recommended)
Simple, production-ready voice setup.
ElevenLabs with Customization
High-quality voice with fine-tuned parameters.
Cartesia with Emotions
Low-latency with emotional expressiveness.
Multilingual Configuration
Agent supporting multiple languages.
Testing Voice Configuration
Dashboard Testing
Use the built-in test widget to verify voice quality:
- Save Agent: Save your voice configuration
- Open Test Widget: Click “Test Agent” in dashboard
- Start Conversation: Begin voice interaction
- Listen Carefully: Evaluate voice quality, speed, clarity
- Iterate: Adjust settings and re-test as needed

Voice Synthesis API Testing
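A call to POST /api/v1/voice/synthesize can be prepared as in the sketch below. Only the path and method come from this guide; the host, the authorization header, and the payload field names (`text`, `provider`, `voiceId`, `speed`) are assumptions for illustration.

```python
import json
import urllib.request

BASE_URL = "https://blackbox.example.com"  # placeholder host, an assumption


def synthesize_request(
    text: str, provider: str, voice_id: str, api_key: str, speed: float = 1.0
) -> urllib.request.Request:
    """Build a synthesis request; payload field names are hypothetical."""
    payload = {"text": text, "provider": provider, "voiceId": voice_id, "speed": speed}
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/voice/synthesize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; how the audio is returned (raw bytes, a URL, or a stream) is not described here, so the response handling is left out.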
Test TTS synthesis without creating an agent.
What to Test
Quality Checklist:
- Voice sounds natural and professional
- Speed is comfortable for target audience
- Pronunciation of key terms is correct
- Emotional tone matches use case
- No audio artifacts or glitches
- Consistent quality across phrases
- Latency is acceptable for real-time conversation
Best Practices
Voice Selection
- Match Your Brand: Choose voices that align with your brand identity
- Consider Audience: Select voices that suit your audience’s demographics
- Test Multiple Options: Preview 3-5 voices before deciding
- Get Feedback: Test with representative users
- Document Choice: Note why you selected specific voices
Speed Settings
- Start at 1.0x: Use default speed as baseline
- Test Incrementally: Adjust in 0.05x increments
- Context Matters: Different content may need different speeds
- A/B Test: Compare speeds with real users
- Monitor Feedback: Track user satisfaction metrics
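Testing in 0.05x increments means stepping through a fixed grid of speeds. A minimal helper for generating that grid is sketched below; the function name is illustrative, and rounding to two decimals is used to avoid floating-point drift such as 0.9500000000000001.

```python
def speed_candidates(low: float, high: float, step: float = 0.05) -> list[float]:
    """Return speeds from low to high (inclusive) in fixed increments."""
    count = round((high - low) / step)
    return [round(low + i * step, 2) for i in range(count + 1)]


# Grid across the recommended ElevenLabs range (0.9x - 1.2x).
elevenlabs_grid = speed_candidates(0.9, 1.2)
```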
Provider Selection
Choose ElevenLabs if:
- You need highest voice quality
- Brand-specific voice cloning is important
- Advanced customization is required
- You can tolerate slightly higher latency
Choose Cartesia if:
- Real-time responsiveness is critical
- Emotional expression is important
- You need ultra-low latency (< 250ms)
- Conversational feel matters more than absolute quality
Choose Dasha if:
- You want simple, reliable configuration
- Platform integration is a priority
- You need widest speed adjustment range
- Quick setup is important
Choose Inworld if:
- You’re building character-driven experiences
- Gaming or interactive media is your use case
- Voice expressiveness is critical
Choose LMNT if:
- You need consistent, predictable output
- Minimal configuration is desired
- Fixed speed (1.0x) works for your needs
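The hard constraints from the provider comparison table (speed range, voice cloning, latency class) can be turned into a quick shortlist before weighing the softer criteria above. This is a sketch: the data mirrors the table in this guide, while the function and its parameters are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    min_speed: float
    max_speed: float
    voice_cloning: bool
    latency: str


# Values taken from the comparison table (Cartesia floor is its effective 0.25x).
PROVIDERS = [
    Provider("ElevenLabs", 0.70, 1.20, True, "medium"),
    Provider("Cartesia", 0.25, 2.0, False, "ultra-low"),
    Provider("Dasha", 0.25, 4.0, False, "low"),
    Provider("Inworld", 0.80, 1.50, False, "medium"),
    Provider("LMNT", 1.0, 1.0, False, "low"),
]
LATENCY_ORDER = ["ultra-low", "low", "medium"]


def shortlist(speed: float = 1.0, needs_cloning: bool = False,
              max_latency: str = "medium") -> list[str]:
    """Return providers that satisfy all hard constraints, in table order."""
    limit = LATENCY_ORDER.index(max_latency)
    return [
        p.name
        for p in PROVIDERS
        if p.min_speed <= speed <= p.max_speed
        and (p.voice_cloning or not needs_cloning)
        and LATENCY_ORDER.index(p.latency) <= limit
    ]
```

For instance, requiring voice cloning leaves only ElevenLabs, and a 3.0x speed leaves only Dasha.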
Common Mistakes to Avoid
Troubleshooting
Voice Issues
Problem: Voice sounds robotic or unnatural
- Solution: Try a different voice from the same provider
- Solution: Adjust stability/temperature parameters (ElevenLabs/Inworld)
- Solution: Switch to ElevenLabs for highest quality
Problem: Speech is too fast or too slow
- Solution: Adjust speed setting incrementally
- Solution: Test with representative users
- Solution: Consider accessibility needs (slower may be better)
Problem: Words are mispronounced
- Solution: Use phonetic spelling in system prompt
- Solution: Try a different voice from the same provider
- Solution: Consider voice cloning with correct pronunciation
Problem: High latency
- Solution: Switch to Cartesia for lowest latency
- Solution: Use ElevenLabs with optimize_streaming_latency: 4
- Solution: Check network conditions
ASR Issues
Problem: Poor transcription accuracy
- Solution: No action needed - ASR auto-optimizes
- Solution: Prompt users to speak clearly in agent greeting
- Solution: Add confirmation/verification in conversation flow
Problem: Wrong language detected
- Solution: Ensure agent primaryLanguage matches user language
- Solution: Enable language switching if supporting multiple languages
Next Steps
- Test Voice Settings - Detailed voice testing guide
- Voice Cloning - Create custom branded voices
- Advanced Features - Language switching and more
- Tools & Functions - Add capabilities to your agent
API Cross-References
- GET /api/v1/voice - List all available voices
- POST /api/v1/voice/synthesize - Synthesize speech for testing
- POST /api/v1/voice/clone - Clone custom voices
- PATCH /api/v1/voice/clone/{voiceId} - Update cloned voice
- DELETE /api/v1/voice/clone/{voiceId} - Delete cloned voice