
Voice & Speech Configuration

Your agent’s voice is crucial to creating engaging, natural conversations. BlackBox supports five production-ready TTS providers, each with unique characteristics, voices, and configuration options.
Quick Start: New agents default to ElevenLabs Flash V2.5 (voice ID zmcVlqmyk3Jpn5AVYcAL). You can preview and change voices anytime without affecting existing calls.

Overview

Voice configuration involves two main components:
  1. Text-to-Speech (TTS): Converts your agent’s responses into natural-sounding audio
  2. Speech-to-Text (STT/ASR): Transcribes user speech into text for the LLM to process

What You’ll Configure

  • TTS Provider: Choose from ElevenLabs, Cartesia, Dasha, Inworld, or LMNT
  • Voice Selection: Pick from hundreds of voices across languages and accents
  • Voice Model: Select the synthesis model (quality vs. speed tradeoff)
  • Speech Speed: Adjust playback rate (provider-dependent)
  • Provider-Specific Options: Fine-tune voice characteristics
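These settings correspond to the fields of the agent’s ttsConfig object. A minimal sketch of that shape, using the defaults mentioned above (complete examples appear under Configuration Examples below):
ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",            // TTS provider
  voiceId: "zmcVlqmyk3Jpn5AVYcAL", // selected voice
  model: "eleven_flash_v2_5",      // synthesis model (quality vs. speed)
  speed: 1.0,                      // playback rate (provider-dependent range)
  vendorSpecificOptions: {}        // provider-specific fine-tuning
}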

ASR Configuration

ASR is Automatic: Speech recognition (ASR/STT) is automatically selected and managed by the BlackBox platform in real-time. There is no user-visible toggle or manual configuration required.
The platform dynamically chooses the best ASR provider based on:
  • User’s detected language and accent
  • Network conditions and latency
  • Call quality metrics
  • Provider availability
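Because selection is automatic, the only STT setting you pass when creating an agent via the API is the auto vendor, as in this minimal sketch (the full agent payload appears under Configuration Examples):
sttConfig: {
  version: "v1",
  vendor: "Auto" // platform selects and manages the ASR provider in real time
}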

TTS Provider Comparison

Quick Reference Table

| Provider | Best For | Speed Range | Emotions | Voice Cloning | Latency |
|---|---|---|---|---|---|
| ElevenLabs | Natural quality, customization | 0.70x - 1.20x | Via style parameter | Yes | Medium |
| Cartesia | Ultra-low latency, emotions | 0x - 2.0x (speeds below 0.25x sent as 0.25x) | 20 emotion levels | No | Ultra-low |
| Dasha | Platform-native, widest speed range | 0.25x - 4.0x | No | No | Low |
| Inworld | Character voices, gaming | 0.80x - 1.50x | Via temperature/pitch | No | Medium |
| LMNT | Consistent, lightweight | Fixed (1.0x) | No | No | Low |

Provider Deep Dive

ElevenLabs

Strengths:
  • Industry-leading voice quality
  • Extensive voice library (1000+ voices)
  • Advanced customization options
  • Voice cloning support
  • Multilingual capabilities
Available Models:
  • eleven_multilingual_v2 - Multilingual V2 (best quality)
  • eleven_turbo_v2_5 - Turbo V2.5 (balanced)
  • eleven_flash_v2_5 - Flash V2.5 (fastest)
Speed Range: 0.70x to 1.20x (default: 1.0x)
Customization Options:
  • Similarity Boost (0.0 - 1.0, default: 0.75)
    • Controls voice consistency with original
    • Higher = more similar to base voice
    • Lower = more variation allowed
  • Stability (0.0 - 1.0, default: 0.5)
    • Controls voice stability across generations
    • Higher = more consistent output
    • Lower = more expressive/variable
  • Style (0.0 - 1.0, default: 0.3)
    • Controls speaker style exaggeration
    • Higher = more stylized delivery
    • Lower = more neutral tone
  • Use Speaker Boost (boolean, default: true)
    • Enhances voice clarity and quality
    • Recommended for most use cases
  • Optimize Streaming Latency (0-4, default: 4)
    • Trades quality for lower latency
    • 0 = highest quality, highest latency
    • 4 = lowest latency, acceptable quality
Best For: Customer support, professional services, brand-specific voices
Example Configuration:
ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel
  model: "eleven_turbo_v2_5",
  speed: 1.0,
  vendorSpecificOptions: {
    similarity_boost: 0.8,
    stability: 0.6,
    style: 0.4,
    use_speaker_boost: true,
    optimize_streaming_latency: 3
  }
}

Voice Selection

Browsing Available Voices

Access all provider voices through the Voice & Speech tab:
  1. Navigate: Go to agent creation/editing → Voice & Speech tab
  2. Select Provider: Choose your TTS provider first
  3. Browse Voices: Search and filter available voices
  4. Preview: Listen to voice samples before selecting
  5. Select: Choose your preferred voice
Screenshot: Voice selection interface with provider picker, search, and voice preview.

Voice Attributes

Each voice includes metadata:
  • Name: Voice identifier (e.g., “Rachel”, “Mark”)
  • Language: Primary language and locale (e.g., “en-US”, “es-ES”)
  • Gender: Male, Female, or Neutral
  • Description: Voice characteristics and use cases
  • Provider: TTS service (ElevenLabs, Cartesia, etc.)
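As a rough illustration of this metadata (the field names here are assumptions for the sketch, not a documented API response shape):
// Illustrative only: field names are assumptions, not a documented API shape
const voice = {
  name: "Rachel",
  language: "en-US",
  gender: "Female",
  description: "Warm, professional voice suited to support calls",
  provider: "ElevenLabs"
};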

Custom Voice IDs

You can use voice IDs not listed in the default picker:
  1. Find the voice ID from your TTS provider’s documentation
  2. Click “Use custom voice ID” button below the voice selector
  3. Enter the custom voice ID in the dialog that appears
  4. Preview to verify the voice works correctly
  5. Save your agent configuration
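If you manage agents through the API instead of the dashboard, a custom voice ID goes directly into ttsConfig.voiceId. A minimal sketch, assuming an ElevenLabs voice (the ID below is a placeholder):
ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "YOUR_CUSTOM_VOICE_ID", // ID copied from your provider account
  model: "eleven_turbo_v2_5",
  speed: 1.0
}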

Voice Preview

Using Voice Preview

Before committing to a voice, test it with your actual content:
  1. Enter Preview Text: Type or paste sample text (up to 1000 characters)
  2. Configure Settings: Adjust speed, options as needed
  3. Click Preview: Generate and listen to audio sample
  4. Iterate: Try different voices, speeds, and settings
  5. Select Best Match: Choose the configuration that sounds best
Screenshot: Voice preview widget with text input and play controls.

Preview Best Practices

Text Selection:
  • Use representative samples from your agent’s actual responses
  • Include questions, statements, and conversational phrases
  • Test punctuation handling (commas, periods, exclamation points)
  • Try names, numbers, and special terms your agent will use
What to Listen For:
  • Natural pronunciation of domain-specific terms
  • Appropriate pacing for your use case
  • Emotional tone matches your brand
  • Clarity at different speeds
  • Consistent quality across phrases

Character Limit

Voice preview supports up to 1000 characters per preview.
  • Real-time character counter shows remaining characters
  • Warning displays when approaching limit
  • Limit enforced by VOICE_PREVIEW_TEXT_LIMIT constant
Technical Detail: Preview uses the same /api/v1/voice/synthesize endpoint as production calls, ensuring accuracy.
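A minimal preview request from code might look like this; the endpoint and body mirror the synthesis testing example later on this page, and the length check simply enforces the 1000-character limit client-side:
// Client-side length check against the 1000-character preview limit
const VOICE_PREVIEW_TEXT_LIMIT = 1000;
const text = "Hi, thanks for calling. How can I help you today?";

if (text.length > VOICE_PREVIEW_TEXT_LIMIT) {
  throw new Error(`Preview text exceeds ${VOICE_PREVIEW_TEXT_LIMIT} characters`);
}

const response = await fetch('https://blackbox.dasha.ai/api/v1/voice/synthesize', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    text,
    provider: "ElevenLabs",
    voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel (public voice)
    model: "eleven_turbo_v2_5",
    language: "en-US",
    speed: 1.0
  })
});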

Speed Configuration

Speed Ranges by Provider

Different providers support different speed ranges:
| Provider | Min Speed | Max Speed | Default | Recommended Range |
|---|---|---|---|---|
| ElevenLabs | 0.70x | 1.20x | 1.0x | 0.9x - 1.2x |
| Cartesia | 0x (sent as 0.25x) | 2.0x | 1.0x | 0.8x - 1.3x |
| Dasha | 0.25x | 4.0x | 1.0x | 0.8x - 1.5x |
| Inworld | 0.80x | 1.50x | 1.0x | 0.9x - 1.1x |
| LMNT | Fixed 1.0x | Fixed 1.0x | 1.0x | 1.0x (only) |
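If you set speed programmatically, it can help to clamp the value to the provider’s supported range before saving. A small sketch based on the table above:
// Speed limits taken from the table above
const SPEED_LIMITS = {
  ElevenLabs: { min: 0.70, max: 1.20 },
  Cartesia:   { min: 0.25, max: 2.00 }, // values below 0.25x are sent as 0.25x
  Dasha:      { min: 0.25, max: 4.00 },
  Inworld:    { min: 0.80, max: 1.50 },
  LMNT:       { min: 1.00, max: 1.00 }  // fixed speed
};

function clampSpeed(vendor, speed) {
  const { min, max } = SPEED_LIMITS[vendor];
  return Math.min(max, Math.max(min, speed));
}

clampSpeed("ElevenLabs", 1.5); // => 1.2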

Choosing the Right Speed

Slower Speeds (0.7x - 0.9x):
  • Accessibility needs
  • Complex information delivery
  • Educational content
  • Non-native speakers
  • Legal/compliance disclosures
Normal Speed (1.0x):
  • General conversation
  • Customer support
  • Sales calls
  • Most use cases
Faster Speeds (1.1x - 1.5x):
  • Time-sensitive scenarios
  • Familiar/repetitive content
  • High-volume information
  • Experienced users
Speed Extremes: Avoid speeds below 0.7x (too slow, unnatural) or above 1.5x (too fast, hard to understand). Test thoroughly before production.

Voice Cloning

Create custom voices that match your brand identity with ElevenLabs voice cloning.

Cloning Process

  1. Prepare Audio: Record 1-5 minutes of clean voice samples
  2. Upload: Use the voice cloning interface or API
  3. Configure: Set name, description, language
  4. Clone: ElevenLabs processes your samples
  5. Use: Select cloned voice in agent configuration

API-Based Voice Cloning

// Clone a new voice
const formData = new FormData();
formData.append('Name', 'My Brand Voice');
formData.append('Description', 'Custom voice for customer support');
formData.append('Language', 'en-US');
formData.append('Provider', 'ElevenLabs');
formData.append('audioFiles', audioFile); // File object

const response = await fetch('https://blackbox.dasha.ai/api/v1/voice/clone', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: formData
});

const clonedVoice = await response.json();
console.log('Cloned voice ID:', clonedVoice.voiceId);

Managing Cloned Voices

// Update cloned voice description
const voiceId = 'your-cloned-voice-id';
await fetch(`https://blackbox.dasha.ai/api/v1/voice/clone/${voiceId}`, {
  method: 'PATCH',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    description: 'Updated voice description'
  })
});

// Delete cloned voice
await fetch(`https://blackbox.dasha.ai/api/v1/voice/clone/${voiceId}`, {
  method: 'DELETE',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  }
});
For detailed voice cloning instructions, see Voice Cloning.

Speech Recognition (ASR)

How ASR Works in BlackBox

Speech-to-Text (STT), also called Automatic Speech Recognition (ASR), converts user audio into text for your agent’s LLM to process.
Automatic Provider Selection: BlackBox automatically selects the optimal ASR provider based on:
  1. Language Detection: Identifies user’s spoken language
  2. Accent Recognition: Adjusts for regional variations
  3. Network Quality: Adapts to connection conditions
  4. Provider Performance: Routes to best-performing provider
  5. Real-Time Optimization: Switches providers if quality degrades

Supported ASR Providers

BlackBox integrates with multiple ASR providers for redundancy and quality:
  • Deepgram: High-accuracy, low-latency transcription
  • Microsoft Speech Services: Enterprise-grade recognition
  • Auto (Platform-Managed): Automatic provider selection (recommended)
No Configuration Required: You don’t need to select an ASR provider. The platform handles this automatically for optimal results.

ASR Quality Factors

Several factors affect transcription accuracy:
Audio Quality:
  • Clear microphone input
  • Minimal background noise
  • Good network connection
  • Proper audio levels
User Factors:
  • Speech clarity and pace
  • Accent and pronunciation
  • Use of domain-specific terms
  • Speaking patterns
System Optimization:
  • Automatic noise cancellation
  • Echo suppression
  • Acoustic model adaptation
  • Real-time quality monitoring

Improving ASR Accuracy

While ASR is automatic, you can help improve accuracy:
  1. Agent Prompting: Guide users to speak clearly in your agent’s greeting
  2. Confirmation: Have agent repeat understood information for verification
  3. Clarification: Prompt for clarification when confidence is low
  4. Domain Terms: Provide common terms in your agent’s context (coming soon)
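In practice, the confirmation and clarification steps live in your agent’s prompt rather than in any ASR setting. A minimal sketch (the prompt wording is illustrative):
llmConfig: {
  version: "v1",
  vendor: "openai",
  model: "gpt-4.1-mini",
  prompt: "You are a helpful support agent. Repeat back names, numbers, and " +
          "email addresses to confirm them, and ask the caller to clarify " +
          "whenever you are unsure what was said."
}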

Configuration Examples

Basic Configuration

Simple, production-ready voice setup:
// Via Dashboard: Use defaults in Voice & Speech tab
// Via API:
const agent = await fetch('https://blackbox.dasha.ai/api/v1/agents', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    name: "Support Agent",
    config: {
      version: "v1",
      primaryLanguage: "en-US",
      ttsConfig: {
        version: "v1",
        vendor: "ElevenLabs",
        voiceId: "zmcVlqmyk3Jpn5AVYcAL",
        model: "eleven_flash_v2_5",
        speed: 1.0
      },
      sttConfig: {
        version: "v1",
        vendor: "Auto" // Platform manages ASR automatically
      },
      llmConfig: {
        version: "v1",
        vendor: "openai",
        model: "gpt-4.1-mini",
        prompt: "You are a helpful assistant."
      },
      features: {
        version: "v1",
        languageSwitching: {
          version: "v1",
          isEnabled: false
        },
        rag: {
          version: "v1",
          isEnabled: false,
          kbLinks: []
        }
      }
    }
  })
});

ElevenLabs with Customization

High-quality voice with fine-tuned parameters:
ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel
  model: "eleven_turbo_v2_5",
  speed: 1.1,
  vendorSpecificOptions: {
    similarity_boost: 0.85,
    stability: 0.7,
    style: 0.4,
    use_speaker_boost: true,
    optimize_streaming_latency: 3
  }
}

Cartesia with Emotions

Low-latency with emotional expressiveness:
ttsConfig: {
  version: "v1",
  vendor: "Cartesia",
  voiceId: "cartesia-friendly-voice",
  model: "sonic",
  speed: 1.2,
  vendorSpecificOptions: {
    emotions: [
      "positivity:high",
      "curiosity:low"
    ]
  }
}

Multilingual Configuration

Agent supporting multiple languages:
config: {
  version: "v1",
  primaryLanguage: "en-US", // Default language
  ttsConfig: {
    version: "v1",
    vendor: "ElevenLabs",
    voiceId: "multilingual-voice-id",
    model: "eleven_multilingual_v2", // Supports 29+ languages
    speed: 1.0
  },
  features: {
    version: "v1",
    languageSwitching: {
      version: "v1",
      isEnabled: true
    }
  }
}

Testing Voice Configuration

Dashboard Testing

Use the built-in test widget to verify voice quality:
  1. Save Agent: Save your voice configuration
  2. Open Test Widget: Click “Test Agent” in dashboard
  3. Start Conversation: Begin voice interaction
  4. Listen Carefully: Evaluate voice quality, speed, clarity
  5. Iterate: Adjust settings and re-test as needed
Screenshot: Dashboard test widget with voice interaction controls.

Voice Synthesis API Testing

Test TTS synthesis without creating an agent:
// API endpoint: http://localhost:8080 (development) or your production URL
// Synthesize speech using a public ElevenLabs voice
const response = await fetch('https://blackbox.dasha.ai/api/v1/voice/synthesize', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    text: "Hello world, this is a test of the voice synthesis system.",
    provider: "ElevenLabs",
    voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel (public voice)
    model: "eleven_turbo_v2_5",
    language: "en-US",
    speed: 1.0
  })
});

if (response.ok) {
  const audioBlob = await response.blob();
  const audioUrl = URL.createObjectURL(audioBlob);
  const audio = new Audio(audioUrl);
  audio.play();
}

What to Test

Quality Checklist:
  • Voice sounds natural and professional
  • Speed is comfortable for target audience
  • Pronunciation of key terms is correct
  • Emotional tone matches use case
  • No audio artifacts or glitches
  • Consistent quality across phrases
  • Latency is acceptable for real-time conversation

Best Practices

Voice Selection

  1. Match Your Brand: Choose voices that align with your brand identity
  2. Consider Audience: Select demographics-appropriate voices
  3. Test Multiple Options: Preview 3-5 voices before deciding
  4. Get Feedback: Test with representative users
  5. Document Choice: Note why you selected specific voices

Speed Settings

  1. Start at 1.0x: Use default speed as baseline
  2. Test Incrementally: Adjust in 0.05x increments
  3. Context Matters: Different content may need different speeds
  4. A/B Test: Compare speeds with real users
  5. Monitor Feedback: Track user satisfaction metrics
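One lightweight way to combine incremental adjustment with A/B listening is to preview the same text at several speeds in 0.05x steps. A sketch using the synthesis endpoint shown in the testing section:
// Preview identical text at several speeds (0.05x steps around the default)
const speeds = [0.90, 0.95, 1.00, 1.05, 1.10];

for (const speed of speeds) {
  const res = await fetch('https://blackbox.dasha.ai/api/v1/voice/synthesize', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      text: "Your order number is 4 8 2 7. Delivery takes 3 to 5 business days.",
      provider: "ElevenLabs",
      voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel (public voice)
      model: "eleven_turbo_v2_5",
      language: "en-US",
      speed
    })
  });
  console.log(`speed ${speed}: HTTP ${res.status}`);
}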

Provider Selection

Choose ElevenLabs if:
  • You need highest voice quality
  • Brand-specific voice cloning is important
  • Advanced customization is required
  • You can tolerate slightly higher latency
Choose Cartesia if:
  • Real-time responsiveness is critical
  • Emotional expression is important
  • You need ultra-low latency (< 250ms)
  • Conversational feel matters more than absolute quality
Choose Dasha if:
  • You want simple, reliable configuration
  • Platform integration is a priority
  • You need widest speed adjustment range
  • Quick setup is important
Choose Inworld if:
  • You’re building character-driven experiences
  • Gaming or interactive media is your use case
  • Voice expressiveness is critical
Choose LMNT if:
  • You need consistent, predictable output
  • Minimal configuration is desired
  • Fixed speed (1.0x) works for your needs

Common Mistakes to Avoid

Avoid These Mistakes:
  1. Not Previewing: Always preview before deploying
  2. Extreme Speeds: Don’t use < 0.7x or > 1.5x without extensive testing
  3. Mismatched Languages: Ensure voice language matches agent language
  4. Over-Optimization: Don’t sacrifice quality for marginal latency gains
  5. Ignoring Feedback: Listen to user complaints about voice quality
  6. Single Voice Testing: Test multiple voices before committing

Troubleshooting

Voice Issues

Problem: Voice sounds robotic or unnatural
  • Solution: Try a different voice from the same provider
  • Solution: Adjust stability/temperature parameters (ElevenLabs/Inworld)
  • Solution: Switch to ElevenLabs for highest quality
Problem: Speech is too fast/slow
  • Solution: Adjust speed setting incrementally
  • Solution: Test with representative users
  • Solution: Consider accessibility needs (slower may be better)
Problem: Pronunciation errors
  • Solution: Use phonetic spelling in system prompt
  • Solution: Try different voice from same provider
  • Solution: Consider voice cloning with correct pronunciation
Problem: High latency before speech starts
  • Solution: Switch to Cartesia for lowest latency
  • Solution: Use ElevenLabs with optimize_streaming_latency: 4
  • Solution: Check network conditions
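For the latency case, the relevant ElevenLabs setting looks like this (voice and model shown are illustrative):
ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "21m00Tcm4TlvDq8ikWAM",
  model: "eleven_flash_v2_5",      // fastest ElevenLabs model
  speed: 1.0,
  vendorSpecificOptions: {
    optimize_streaming_latency: 4  // 4 = lowest latency, acceptable quality
  }
}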

ASR Issues

Problem: Poor transcription accuracy
  • Solution: No action needed - ASR auto-optimizes
  • Solution: Prompt users to speak clearly in agent greeting
  • Solution: Add confirmation/verification in conversation flow
Problem: Wrong language detected
  • Solution: Ensure agent primaryLanguage matches user language
  • Solution: Enable language switching if supporting multiple languages
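Both fixes live in the agent config rather than in any ASR setting. A minimal sketch (the language shown is just an example):
config: {
  version: "v1",
  primaryLanguage: "es-ES", // match the language your callers actually speak
  features: {
    version: "v1",
    languageSwitching: {
      version: "v1",
      isEnabled: true       // allow mid-call switching for multilingual callers
    }
  }
}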
