Voice & Speech Configuration

Your agent’s voice is crucial to creating engaging, natural conversations. Dasha BlackBox supports five production-ready TTS providers, each with unique characteristics, voices, and configuration options.

Quick Start: New agents default to ElevenLabs Flash V2.5 (voice ID zmcVlqmyk3Jpn5AVYcAL). You can preview and change voices anytime without affecting existing calls.

Overview

Voice configuration involves two main components:

Text-to-Speech (TTS): Converts your agent’s responses into natural-sounding audio
Speech-to-Text (STT/ASR): Transcribes user speech into text for the LLM to process

What You’ll Configure

TTS Provider: Choose from ElevenLabs, Cartesia, Dasha, Inworld, or LMNT
Voice Selection: Pick from hundreds of voices across languages and accents
Voice Model: Select the synthesis model (quality vs. speed tradeoff)
Speech Speed: Adjust playback rate (provider-dependent)
Provider-Specific Options: Fine-tune voice characteristics

ASR Configuration

ASR is Automatic: Speech recognition (ASR/STT) is selected and managed by the Dasha BlackBox platform in real time. There is no user-visible toggle, and no manual provider configuration is required. The only optional tuning is STT keywords, described under Speech Recognition (ASR).

The platform dynamically chooses the best ASR provider based on:

User’s detected language and accent
Network conditions and latency
Call quality metrics
Provider availability

TTS Provider Comparison

Quick Reference Table

Provider	Best For	Speed Range	Emotions	Voice Cloning	Latency
ElevenLabs	Natural quality, customization	0.70x - 1.20x	Via style parameter	Yes	Medium
Cartesia	Ultra-low latency, emotions	0x - 2.0x (speeds below 0.25x sent as 0.25x)	5 emotions × 4 intensity levels	No	Ultra-low
Dasha	Platform-native, widest speed range	0.25x - 4.0x	No	No	Low
Inworld	Character voices, gaming	0.80x - 1.50x	Via temperature/pitch	No	Medium
LMNT	Consistent, lightweight	Fixed (1.0x)	No	No	Low

Provider Deep Dive


ElevenLabs

Strengths:

Industry-leading voice quality
Extensive voice library (1000+ voices)
Advanced customization options
Voice cloning support
Multilingual capabilities

Available Models:

eleven_multilingual_v2 - Multilingual V2 (best quality)
eleven_turbo_v2_5 - Turbo V2.5 (balanced)
eleven_flash_v2_5 - Flash V2.5 (fastest)

Speed Range: 0.70x to 1.20x (default: 1.0x)

Customization Options:

Similarity Boost (0.0 - 1.0, default: 0.75)
- Controls voice consistency with original
- Higher = more similar to base voice
- Lower = more variation allowed
Stability (0.0 - 1.0, default: 0.5)
- Controls voice stability across generations
- Higher = more consistent output
- Lower = more expressive/variable
Style (0.0 - 1.0, default: 0.3)
- Controls speaker style exaggeration
- Higher = more stylized delivery
- Lower = more neutral tone
Use Speaker Boost (boolean, default: true)
- Enhances voice clarity and quality
- Recommended for most use cases
Optimize Streaming Latency (0-4, default: 4)
- Trades quality for lower latency
- 0 = highest quality, highest latency
- 4 = lowest latency, acceptable quality

Best For: Customer support, professional services, brand-specific voices

Example Configuration:

ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel
  model: "eleven_turbo_v2_5",
  speed: 1.0,
  vendorSpecificOptions: {
    similarity_boost: 0.8,
    stability: 0.6,
    style: 0.4,
    use_speaker_boost: true,
    optimize_streaming_latency: 3
  }
}
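
The ElevenLabs option ranges listed above can be checked before a configuration is saved. The following is a minimal sketch, not part of the Dasha BlackBox API: `validateElevenLabsOptions` and `ELEVENLABS_OPTION_RANGES` are hypothetical names, the ranges are copied from the list above, and treating `optimize_streaming_latency` as a whole number is an assumption.

```javascript
// Hypothetical helper: ranges copied from the ElevenLabs option list above.
const ELEVENLABS_OPTION_RANGES = {
  similarity_boost: [0.0, 1.0],
  stability: [0.0, 1.0],
  style: [0.0, 1.0],
  optimize_streaming_latency: [0, 4],
};

// Returns a list of human-readable problems; an empty list means the options look valid.
function validateElevenLabsOptions(options) {
  const problems = [];
  for (const [name, [min, max]] of Object.entries(ELEVENLABS_OPTION_RANGES)) {
    const value = options[name];
    if (value === undefined) continue; // omitted options are left to their documented defaults
    if (typeof value !== "number" || Number.isNaN(value) || value < min || value > max) {
      problems.push(`${name} must be a number between ${min} and ${max}`);
    }
  }
  // Assumption: the latency setting takes whole-number steps (0, 1, 2, 3, 4).
  const latency = options.optimize_streaming_latency;
  if (typeof latency === "number" && !Number.isInteger(latency)) {
    problems.push("optimize_streaming_latency must be a whole number");
  }
  if (options.use_speaker_boost !== undefined && typeof options.use_speaker_boost !== "boolean") {
    problems.push("use_speaker_boost must be a boolean");
  }
  return problems;
}
```

Calling `validateElevenLabsOptions` on the `vendorSpecificOptions` object from the example configuration returns an empty array.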

Cartesia

Strengths:

Ultra-low latency (< 250ms)
Emotion control system
High-speed inference
Great for real-time conversations
Natural conversational flow

Available Models:

sonic - Sonic (only model, optimized for speed)

Speed Range: 0x to 2.0x (default: 1.0x)

Note: Speeds below 0.25x are automatically adjusted to 0.25x due to server limits

Emotion System: Cartesia offers granular control over 5 emotion dimensions, each with 4 intensity levels:

Anger: anger:lowest, anger:low, anger:high, anger:highest
Positivity: positivity:lowest, positivity:low, positivity:high, positivity:highest
Surprise: surprise:lowest, surprise:low, surprise:high, surprise:highest
Sadness: sadness:lowest, sadness:low, sadness:high, sadness:highest
Curiosity: curiosity:lowest, curiosity:low, curiosity:high, curiosity:highest

You can combine multiple emotions for nuanced delivery:

vendorSpecificOptions: {
  emotions: [
    "positivity:high",
    "curiosity:low"
  ]
}
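
Because each entry is a plain `emotion:level` string, a typo can easily slip through. As an illustration only, the hypothetical helper below (`buildCartesiaEmotions` is not a platform function) builds the array from an object and rejects names or levels that are not in the lists above. Keying by emotion also means each emotion appears at most once, which is an assumption about what is sensible rather than a documented rule.

```javascript
// Emotion names and intensity levels as listed above.
const CARTESIA_EMOTIONS = ["anger", "positivity", "surprise", "sadness", "curiosity"];
const CARTESIA_LEVELS = ["lowest", "low", "high", "highest"];

// Builds the "emotion:level" strings for vendorSpecificOptions.emotions
// from an object such as { positivity: "high", curiosity: "low" }.
function buildCartesiaEmotions(levelsByEmotion) {
  return Object.entries(levelsByEmotion).map(([emotion, level]) => {
    if (!CARTESIA_EMOTIONS.includes(emotion)) {
      throw new Error(`Unknown emotion: ${emotion}`);
    }
    if (!CARTESIA_LEVELS.includes(level)) {
      throw new Error(`Unknown intensity level: ${level}`);
    }
    return `${emotion}:${level}`;
  });
}

const cartesiaOptions = {
  emotions: buildCartesiaEmotions({ positivity: "high", curiosity: "low" }),
};
```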

Best For: Real-time applications, emotional responses, conversational AI, gaming

Example Configuration:

ttsConfig: {
  version: "v1",
  vendor: "Cartesia",
  voiceId: "cartesia-voice-id",
  model: "sonic",
  speed: 1.2,
  vendorSpecificOptions: {
    emotions: [
      "positivity:high",
      "curiosity:low"
    ]
  }
}

Emotion Tips: Start with subtle emotions (low levels) and adjust based on testing. Combining too many high-intensity emotions can sound unnatural.

Dasha

Strengths:

Platform-native integration
Optimized for Dasha BlackBox infrastructure
Widest speed range (0.25x - 4.0x)
Consistent performance
No additional configuration needed

Available Models:

common - Common (only model)

Speed Range: 0.25x to 4.0x (default: 1.0x)

Widest range among all providers
Useful for accessibility (slower) or time-constrained scenarios (faster)

Customization Options:

No additional options available
Simple, straightforward configuration

Best For: General-purpose agents, quick setup, platform consistency

Example Configuration:

ttsConfig: {
  version: "v1",
  vendor: "Dasha",
  voiceId: "dasha-voice-id",
  model: "common",
  speed: 1.0
}

Platform Integration: Dasha offers reliable performance with minimal configuration and platform-native integration.

Inworld

Strengths:

Character-focused voices
Gaming and interactive media optimized
Temperature and pitch controls
Expressive character delivery

Available Models:

inworld-tts-1 - Inworld TTS 1 (only model)

Speed Range: 0.80x to 1.50x (default: 1.0x)

Customization Options:

Temperature (slider, default: 0.8)
- Controls voice expressiveness
- Higher = more expressive/varied
- Lower = more consistent/neutral
Pitch (slider, default: 0.0)
- Adjusts voice pitch
- Positive values = higher pitch
- Negative values = lower pitch

Best For: Gaming NPCs, character-driven experiences, interactive storytelling

Example Configuration:

ttsConfig: {
  version: "v1",
  vendor: "Inworld",
  voiceId: "inworld-voice-id",
  model: "inworld-tts-1",
  speed: 1.0,
  vendorSpecificOptions: {
    temperature: 0.9,
    pitch: 0.2
  }
}

LMNT

Strengths:

Consistent, reliable synthesis
Lightweight implementation
Simple configuration
Good quality-to-performance ratio

Available Models:

blizzard - Blizzard (only model)

Speed Range: Fixed at 1.0x (speed control not supported)

Customization Options:

No additional options available

Best For: Straightforward deployments, consistent voice output, minimal configuration needs

Example Configuration:

ttsConfig: {
  version: "v1",
  vendor: "Lmnt",
  voiceId: "lmnt-voice-id",
  model: "blizzard"
  // Note: No speed control available
}

No Speed Control: LMNT does not support speed adjustment. The voice will always play at 1.0x speed.

TTS Vendor Options Summary

The vendor field in TtsConfig accepts the following values:

Vendor	Value	Description
ElevenLabs	`"ElevenLabs"`	Industry-leading quality, extensive customization
Cartesia	`"Cartesia"`	Ultra-low latency, emotion control
Dasha	`"Dasha"`	Platform-native, widest speed range
Inworld	`"Inworld"`	Character voices, gaming-focused
LMNT	`"Lmnt"`	Consistent, lightweight synthesis
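
Since the `vendor` values, model names, and supported fields differ per provider, it can help to assemble `ttsConfig` through a single function. The sketch below is an assumption-laden illustration: `TTS_VENDORS` and `buildTtsConfig` are hypothetical names, and the capability flags are transcribed from the provider sections of this guide rather than fetched from the platform.

```javascript
// Models and capabilities per vendor, transcribed from the provider sections of this guide.
const TTS_VENDORS = new Map([
  ["ElevenLabs", { models: ["eleven_multilingual_v2", "eleven_turbo_v2_5", "eleven_flash_v2_5"], speed: true, options: true }],
  ["Cartesia", { models: ["sonic"], speed: true, options: true }],
  ["Dasha", { models: ["common"], speed: true, options: false }],
  ["Inworld", { models: ["inworld-tts-1"], speed: true, options: true }],
  ["Lmnt", { models: ["blizzard"], speed: false, options: false }],
]);

// Builds a ttsConfig object, leaving out fields the chosen vendor does not use.
function buildTtsConfig({ vendor, voiceId, model, speed, vendorSpecificOptions }) {
  const info = TTS_VENDORS.get(vendor);
  if (!info) {
    throw new Error(`Unknown vendor "${vendor}"; use the exact values from the table, e.g. "Lmnt"`);
  }
  if (!info.models.includes(model)) {
    throw new Error(`Model "${model}" is not listed for ${vendor}`);
  }
  const config = { version: "v1", vendor, voiceId, model };
  if (info.speed && speed !== undefined) {
    config.speed = speed; // LMNT has no speed control, so speed is dropped for it
  }
  if (info.options && vendorSpecificOptions !== undefined) {
    config.vendorSpecificOptions = vendorSpecificOptions;
  }
  return config;
}
```

For example, `buildTtsConfig({ vendor: "Lmnt", voiceId: "lmnt-voice-id", model: "blizzard", speed: 1.2 })` returns a configuration without a `speed` field.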

Vendor-Specific Options Reference

Each TTS provider supports different configuration options in vendorSpecificOptions:

ElevenLabs:

vendorSpecificOptions: {
  similarity_boost: 0.75,    // 0.0-1.0: Voice consistency
  stability: 0.5,            // 0.0-1.0: Voice stability
  style: 0.3,                // 0.0-1.0: Style exaggeration
  use_speaker_boost: true,   // Boolean: Enhanced clarity
  optimize_streaming_latency: 4  // 0-4: Latency optimization
}

Cartesia:

vendorSpecificOptions: {
  emotions: [
    "positivity:high",    // Emotion:intensity pairs
    "curiosity:low"
  ]
}

Emotion Dimensions: anger, positivity, surprise, sadness, curiosity
Intensity Levels: lowest, low, high, highest

Inworld:

vendorSpecificOptions: {
  temperature: 0.8,  // Voice expressiveness
  pitch: 0.0         // Pitch adjustment
}

Dasha / LMNT:

No vendor-specific options available. Use default configuration:

// vendorSpecificOptions not needed
ttsConfig: {
  version: "v1",
  vendor: "Dasha", // or "Lmnt"
  voiceId: "voice-id",
  model: "common"  // or "blizzard" for LMNT
}

Voice Selection

Browsing Available Voices

Access all provider voices through the Voice & Speech tab:

Navigate: Go to agent creation/editing → Voice & Speech tab
Select Provider: Choose your TTS provider first
Browse Voices: Search and filter available voices
Preview: Listen to voice samples before selecting
Select: Choose your preferred voice

Figure: Voice selection interface showing the provider and voice picker, with search and preview functionality.

Voice Attributes

Each voice includes metadata:

Name: Voice identifier (e.g., “Rachel”, “Mark”)
Language: Primary language and locale (e.g., “en-US”, “es-ES”)
Gender: Male, Female, or Neutral
Description: Voice characteristics and use cases
Provider: TTS service (ElevenLabs, Cartesia, etc.)

Custom Voice IDs

You can use voice IDs not listed in the default picker:

Find the voice ID from your TTS provider’s documentation
Click “Use custom voice ID” button below the voice selector
Enter the custom voice ID in the dialog that appears
Preview to verify the voice works correctly
Save your agent configuration

Provider Documentation:

ElevenLabs Voice Library
Cartesia, Dasha, Inworld, LMNT: Contact support for voice catalogs

Voice Preview

Using Voice Preview

Before committing to a voice, test it with your actual content:

Enter Preview Text: Type or paste sample text (up to 1000 characters)
Configure Settings: Adjust speed, options as needed
Click Preview: Generate and listen to audio sample
Iterate: Try different voices, speeds, and settings
Select Best Match: Choose the configuration that sounds best

Figure: Voice preview widget with text input and play controls, used to preview voices with custom text before saving.

Preview Best Practices

Text Selection:

Use representative samples from your agent’s actual responses
Include questions, statements, and conversational phrases
Test punctuation handling (commas, periods, exclamation points)
Try names, numbers, and special terms your agent will use

What to Listen For:

Natural pronunciation of domain-specific terms
Appropriate pacing for your use case
Emotional tone matches your brand
Clarity at different speeds
Consistent quality across phrases

Character Limit

Voice preview supports up to 1000 characters per preview.

Real-time character counter shows remaining characters
Warning displays when approaching limit
Limit enforced by VOICE_PREVIEW_TEXT_LIMIT constant

Technical Detail: Preview uses the same /api/v1/voice/synthesize endpoint as production calls, ensuring accuracy.
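
If you generate preview text programmatically, the same limit can be checked client-side before sending a request. This is a minimal sketch: the constant name comes from the text above, but its value is restated here rather than imported from the platform, `checkPreviewText` is a hypothetical helper, and counting with `String.length` (UTF-16 code units) is an assumption about how the platform counts characters.

```javascript
// Restates the documented 1000-character preview limit.
const VOICE_PREVIEW_TEXT_LIMIT = 1000;

// Reports how much of the preview budget a text uses.
function checkPreviewText(text) {
  const remaining = VOICE_PREVIEW_TEXT_LIMIT - text.length;
  return { length: text.length, remaining, withinLimit: remaining >= 0 };
}
```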

TTS Responsiveness

What is Responsiveness?

Responsiveness controls how quickly your agent begins speaking after the user finishes talking. This parameter directly affects the delay before the agent’s response.

Location in Configuration: ttsConfig.responsiveness
Type: Number (0 to 1, optional)
Default: Platform-managed

How Responsiveness Works

Important: The responsiveness value controls the delay before the agent responds. A value of 1.0 provides the most responsive agent with minimal delay. Lower values add artificial delay before the agent begins speaking.

Value	Behavior	Response Delay
1.0	Most responsive (recommended)	Minimal delay before response
0.7	Slightly delayed	Small delay added before response
0.5	Moderately delayed	Moderate delay added before response
0.3	Significantly delayed	Longer delay added before response
0.0	Maximum delay	Longest delay before response

Recommended Configuration

For most use cases, 1.0 is the recommended value as it provides the most natural, responsive conversation experience with minimal delay.

ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "your-voice-id",
  model: "eleven_turbo_v2_5",
  responsiveness: 1.0 // Recommended: most responsive
}

When to Use Lower Values

Lower responsiveness values (adding delay) may be useful in specific scenarios:

IVR-style systems where a slight pause feels more natural
Accessibility requirements where users need extra processing time
Compliance scenarios where pacing requirements exist

Technical Detail: Values below 1.0 add an artificial delay before the agent begins responding. This does not affect speech speed (controlled by the speed parameter) or voice quality—it only delays when the response starts.
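
Because the field is optional and limited to the 0 to 1 range, a small guard keeps invalid values out of saved configurations. The helper below is a hypothetical sketch (`withResponsiveness` is not a platform function); it assumes that leaving the field out is how you get the platform-managed default.

```javascript
// Returns a copy of ttsConfig with a validated responsiveness value,
// or without the field when no value is given.
function withResponsiveness(ttsConfig, responsiveness) {
  if (responsiveness === undefined) {
    const { responsiveness: _omitted, ...rest } = ttsConfig;
    return rest; // assumption: an absent field means the platform-managed default applies
  }
  const valid =
    typeof responsiveness === "number" &&
    !Number.isNaN(responsiveness) &&
    responsiveness >= 0 &&
    responsiveness <= 1;
  if (!valid) {
    throw new RangeError("responsiveness must be a number between 0 and 1");
  }
  return { ...ttsConfig, responsiveness };
}
```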

Example Configurations

Optimal (1.0)

Recommended for most use cases:

Customer support
Sales calls
Appointment scheduling
General conversational scenarios

ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "your-voice-id",
  model: "eleven_turbo_v2_5",
  responsiveness: 1.0 // Most responsive, minimal delay
}

With Delay (0.7)

When a slight pause is desired:

IVR-style interactions
Specific accessibility needs

ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "your-voice-id",
  model: "eleven_turbo_v2_5",
  responsiveness: 0.7 // Adds slight delay before response
}

Best Practice: Start with 1.0 (the most responsive setting) and only reduce it if you have a specific requirement for adding delay before responses. Most conversational AI applications perform best with minimal response delay.

Speed Configuration

Speed Ranges by Provider

Different providers support different speed ranges:

Provider	Min Speed	Max Speed	Default	Recommended Range
ElevenLabs	0.70x	1.20x	1.0x	0.9x - 1.2x
Cartesia	0x (sent as 0.25x)	2.0x	1.0x	0.8x - 1.3x
Dasha	0.25x	4.0x	1.0x	0.8x - 1.5x
Inworld	0.80x	1.50x	1.0x	0.9x - 1.1x
LMNT	Fixed 1.0x	Fixed 1.0x	1.0x	1.0x (only)
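
When speed comes from user input or a shared template, it can be clamped to the selected provider's range before saving. The sketch below is illustrative: `SPEED_RANGES` and `clampSpeed` are hypothetical names, the limits are copied from the table above, and apart from the Cartesia note, how the platform itself treats out-of-range values is not documented here.

```javascript
// Min/max speeds per vendor value, copied from the table above.
const SPEED_RANGES = new Map([
  ["ElevenLabs", { min: 0.7, max: 1.2 }],
  ["Cartesia", { min: 0.25, max: 2.0 }], // values below 0.25x are sent as 0.25x anyway
  ["Dasha", { min: 0.25, max: 4.0 }],
  ["Inworld", { min: 0.8, max: 1.5 }],
  ["Lmnt", { min: 1.0, max: 1.0 }], // fixed speed
]);

// Returns the nearest speed the given vendor supports.
function clampSpeed(vendor, speed) {
  const range = SPEED_RANGES.get(vendor);
  if (!range) {
    throw new Error(`Unknown vendor "${vendor}"`);
  }
  return Math.min(range.max, Math.max(range.min, speed));
}
```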

Choosing the Right Speed

Slower Speeds (0.7x - 0.9x):

Accessibility needs
Complex information delivery
Educational content
Non-native speakers
Legal/compliance disclosures

Normal Speed (1.0x):

General conversation
Customer support
Sales calls
Most use cases

Faster Speeds (1.1x - 1.5x):

Time-sensitive scenarios
Familiar/repetitive content
High-volume information
Experienced users

Speed Extremes: Avoid speeds below 0.7x (too slow, unnatural) or above 1.5x (too fast, hard to understand). Test thoroughly before production.

Dynamic Speed Adjustment

Dasha BlackBox supports dynamic speed adjustment during conversations, allowing the agent to adapt its speaking pace based on user requests (e.g., “Can you speak more slowly?”).

Location in Configuration: ttsConfig.speedAdjustment

SpeedAdjustmentSettings

Field	Type	Default	Description
`version`	string	"v1"	Configuration version
`strategy`	SpeedAdjustment	"OnRequest"	Speed adjustment strategy

SpeedAdjustment Strategies

Strategy	Description
`"OnRequest"`	(Default) Agent can adjust speed when user requests it
`"Disabled"`	Speed remains fixed at the configured value

Relationship to Base Speed

The speedAdjustment setting works in conjunction with the base speed parameter:

Base speed: Sets the initial/default speech rate (e.g., 1.0x)
speedAdjustment.strategy: Controls whether the agent can deviate from the base speed during conversation

Example Configuration:

ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "your-voice-id",
  model: "eleven_turbo_v2_5",
  speed: 1.0, // Base speed
  speedAdjustment: {
    version: "v1",
    strategy: "OnRequest" // Allow user to request speed changes
  }
}

Disabled Speed Adjustment:

ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "your-voice-id",
  model: "eleven_turbo_v2_5",
  speed: 1.1, // Fixed at 1.1x
  speedAdjustment: {
    version: "v1",
    strategy: "Disabled" // Always use base speed
  }
}

When to Disable: Consider disabling speed adjustment for compliance-focused applications where consistent delivery speed is required for legal or regulatory reasons.
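
The relationship between the base speed and the strategy can be summarized in a few lines. This sketch only models the documented semantics; the platform applies this logic itself, and both `effectiveSpeed` and the `requestedSpeed` argument are hypothetical stand-ins for whatever pace a caller asks for mid-conversation.

```javascript
// Models which speed applies: the base speed, or a speed requested during the call.
function effectiveSpeed(ttsConfig, requestedSpeed) {
  const strategy = ttsConfig.speedAdjustment?.strategy ?? "OnRequest"; // documented default
  const baseSpeed = ttsConfig.speed ?? 1.0; // documented default speed
  if (strategy === "Disabled" || requestedSpeed === undefined) {
    return baseSpeed;
  }
  return requestedSpeed;
}
```

With `strategy: "Disabled"` and `speed: 1.1`, the function returns 1.1 regardless of what the caller requests.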

Voice Cloning

Create custom voices that match your brand identity with ElevenLabs voice cloning.

Cloning Process

Prepare Audio: Record 1-5 minutes of clean voice samples
Upload: Use the voice cloning interface or API
Configure: Set name, description, language
Clone: ElevenLabs processes your samples
Use: Select cloned voice in agent configuration

API-Based Voice Cloning

// Clone a new voice
const formData = new FormData();
formData.append('Name', 'My Brand Voice');
formData.append('Description', 'Custom voice for customer support');
formData.append('Language', 'en-US');
formData.append('Provider', 'ElevenLabs');
formData.append('audioFiles', audioFile); // File object

const response = await fetch('https://blackbox.dasha.ai/api/v1/voice/clone', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: formData
});

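// Fail early on HTTP errors instead of parsing an error body as a voice
if (!response.ok) {
  throw new Error(`Voice cloning failed with status ${response.status}`);
}

const clonedVoice = await response.json();
console.log('Cloned voice ID:', clonedVoice.voiceId);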

Managing Cloned Voices

// Update cloned voice description
const voiceId = 'your-cloned-voice-id';
await fetch(`https://blackbox.dasha.ai/api/v1/voice/clone/${voiceId}`, {
  method: 'PATCH',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    description: 'Updated voice description'
  })
});

// Delete cloned voice
await fetch(`https://blackbox.dasha.ai/api/v1/voice/clone/${voiceId}`, {
  method: 'DELETE',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  }
});

For detailed voice cloning instructions, see Voice Cloning.

Pronunciation Dictionary

Pronunciation dictionaries allow you to customize how your agent pronounces specific words or phrases. This is useful for brand names, technical terms, acronyms, or any words the TTS provider might mispronounce.

Location in Configuration: ttsConfig.pronunciationDictionary

PronunciationDictionaryReference

The pronunciationDictionary field in TtsConfig references a pre-created pronunciation dictionary by its ID.

Field	Type	Required	Description
`id`	string	Yes	Unique identifier of the pronunciation dictionary
`hash`	string	Yes	Content hash for version tracking

Supported Rule Types

Pronunciation dictionaries support two types of rules:

Rule Type	Description	Use Case
Alias	Replace one word/phrase with another	Acronyms, abbreviations, alternate spellings
Phoneme	Specify exact phonetic pronunciation	Precise control over difficult words

Supported Providers

Pronunciation dictionaries are supported by:

ElevenLabs: Full support for alias rules
Cartesia: Full support for alias and phoneme rules

Other TTS providers (Dasha, Inworld, LMNT) do not currently support pronunciation dictionaries. The configuration will be ignored for these providers.
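
Before attaching a dictionary to an agent, you can check its rules against the support list above. The sketch below is hypothetical (`DICTIONARY_RULE_SUPPORT` and `unsupportedRules` are not platform names). It reads the list conservatively, treating phoneme rules as unsupported on ElevenLabs because only alias support is stated, and it assumes rules carry a lowercase `type` field as in the creation example in this guide.

```javascript
// Rule types per provider, read conservatively from the support list above.
const DICTIONARY_RULE_SUPPORT = new Map([
  ["ElevenLabs", ["alias"]],
  ["Cartesia", ["alias", "phoneme"]],
]);

// Returns the rules that the given provider is not documented to support.
function unsupportedRules(provider, rules) {
  const supported = DICTIONARY_RULE_SUPPORT.get(provider) ?? [];
  return rules.filter((rule) => !supported.includes(rule.type));
}
```

An empty result means every rule is covered. For Dasha, Inworld, or LMNT every rule is returned, matching the note that the dictionary is ignored for those providers.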

Example Configuration

Referencing a Pronunciation Dictionary:

ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "your-voice-id",
  model: "eleven_turbo_v2_5",
  speed: 1.0,
  pronunciationDictionary: {
    id: "pd_abc123def456",
    hash: "a1b2c3d4e5f6"
  }
}

Creating Pronunciation Dictionaries

Pronunciation dictionaries are created and managed through the API:

// Create a pronunciation dictionary
const response = await fetch('https://blackbox.dasha.ai/api/v1/pronunciation-dictionaries', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    name: "Company Terms",
    provider: "Cartesia",
    rules: [
      {
        type: "alias",
        stringToReplace: "API",
        replacement: "A P I"
      },
      {
        type: "alias",
        stringToReplace: "SQL",
        replacement: "sequel"
      },
      {
        type: "phoneme",
        stringToReplace: "Dasha",
        phoneme: "ˈdɑːʃə",
        alphabet: "ipa"
      }
    ]
  })
});

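// Fail early on HTTP errors before reading the response body
if (!response.ok) {
  throw new Error(`Dictionary creation failed with status ${response.status}`);
}

const dictionary = await response.json();
console.log('Dictionary ID:', dictionary.id);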

Common Use Cases

Acronyms and Abbreviations:

“API” → “A P I” (spell out) or “ay pee eye”
“SQL” → “sequel” or “S Q L”
“CEO” → “C E O”

Brand Names:

“Nike” → phoneme for correct pronunciation
“Dasha” → phoneme “ˈdɑːʃə”

Technical Terms:

“kubectl” → “kube control” or “kube C T L”
“nginx” → “engine X”
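
To sanity-check alias rules before uploading them, you can apply them locally to sample text and read the result. This is only an approximation for review purposes: `applyAliasRules` is a hypothetical helper, the actual substitution happens on the provider side, and the whole-word, case-sensitive matching used here is an assumption.

```javascript
// Escapes characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Applies alias rules in order, replacing whole-word matches only.
// Phoneme rules are skipped because they have no plain-text equivalent.
function applyAliasRules(text, rules) {
  return rules
    .filter((rule) => rule.type === "alias")
    .reduce((result, rule) => {
      const pattern = new RegExp(`\\b${escapeRegExp(rule.stringToReplace)}\\b`, "g");
      // A replacer function keeps "$" in the replacement text literal.
      return result.replace(pattern, () => rule.replacement);
    }, text);
}
```

Running it with the `API` and `SQL` alias rules from the creation example on "Our API uses SQL." yields "Our A P I uses sequel."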

Best Practice: Create a single, comprehensive pronunciation dictionary for your organization and reference it across all agents. This ensures consistent pronunciation across your entire voice AI deployment.

API Cross-References

POST /api/v1/pronunciation-dictionaries - Create new dictionary
GET /api/v1/pronunciation-dictionaries - List all dictionaries
GET /api/v1/pronunciation-dictionaries/{id} - Get dictionary details
DELETE /api/v1/pronunciation-dictionaries/{id} - Delete dictionary

Speech Recognition (ASR)

How ASR Works in Dasha BlackBox

Speech-to-Text (STT), also called Automatic Speech Recognition (ASR), converts user audio into text for your agent’s LLM to process.

Automatic Provider Selection: Dasha BlackBox automatically selects the optimal ASR provider based on:

Language Detection: Identifies user’s spoken language
Accent Recognition: Adjusts for regional variations
Network Quality: Adapts to connection conditions
Provider Performance: Routes to best-performing provider
Real-Time Optimization: Switches providers if quality degrades

Supported ASR Providers

Dasha BlackBox integrates with multiple ASR providers for redundancy and quality:

Deepgram: High-accuracy, low-latency transcription
Microsoft Speech Services: Enterprise-grade recognition
Auto (Platform-Managed): Automatic provider selection (recommended)

No Configuration Required: You don’t need to select an ASR provider. The platform handles this automatically for optimal results.

ASR Quality Factors

Several factors affect transcription accuracy:

Audio Quality:

Clear microphone input
Minimal background noise
Good network connection
Proper audio levels

User Factors:

Speech clarity and pace
Accent and pronunciation
Use of domain-specific terms
Speaking patterns

System Optimization:

Automatic noise cancellation
Echo suppression
Acoustic model adaptation
Real-time quality monitoring

Improving ASR Accuracy

While ASR is automatic, you can help improve accuracy:

Agent Prompting: Guide users to speak clearly in your agent’s greeting
Confirmation: Have agent repeat understood information for verification
Clarification: Prompt for clarification when confidence is low
STT Keywords: Configure keywords to boost recognition of domain-specific terms (see below)

STT Keywords

STT Keywords allow you to improve speech recognition accuracy for domain-specific terms, product names, proper nouns, and industry jargon. By providing a list of keywords, you help the ASR system prioritize recognition of these specific terms.

Location in Configuration: sttConfig.keywords
Type: Array of SttKeyword objects (optional)

SttKeyword Structure

Field	Type	Required	Description
`keyword`	string	Yes	The word or phrase to boost recognition for
`weight`	number	No	Recognition priority boost (higher = more likely to be recognized)

How Weight Works

The weight parameter influences how the ASR system prioritizes recognizing certain words:

No weight specified: Standard boost for the keyword
Higher weight values: Increased priority for recognition (e.g., 0.7, 1.0)
Lower weight values: Slight boost, less aggressive prioritization

Choosing Weights: Start without weights for most keywords. Add weights (0.7 - 1.0) only for critical terms that are frequently misrecognized.

Use Cases for STT Keywords

Use Case	Example Keywords
Product Names	“Dasha BlackBox”, “iPhone Pro Max”, “Model S Plaid”
Medical Terms	“hypertension”, “metformin”, “echocardiogram”
Proper Nouns	“Dasha AI”, “Anthropic”, “OpenAI”
Industry Jargon	“SaaS”, “ARR”, “churn rate”, “LTV”
Company-Specific	Internal project names, employee names, location names
Technical Terms	“API endpoint”, “webhook”, “OAuth”

Example Configurations

Basic Keywords (No Weights):

sttConfig: {
  version: "v1",
  vendor: "Auto",
  keywords: [
    { keyword: "Dasha BlackBox" },
    { keyword: "Dasha AI" },
    { keyword: "webhook" }
  ]
}

Keywords with Weights:

sttConfig: {
  version: "v1",
  vendor: "Auto",
  keywords: [
    { keyword: "Dasha BlackBox", weight: 0.7 },
    { keyword: "Dasha AI", weight: 0.7 },
    { keyword: "metformin", weight: 1.0 },
    { keyword: "hypertension" },
    { keyword: "API endpoint" }
  ]
}

Medical/Healthcare Example:

sttConfig: {
  version: "v1",
  vendor: "Auto",
  keywords: [
    { keyword: "lisinopril", weight: 1.0 },
    { keyword: "metoprolol", weight: 1.0 },
    { keyword: "echocardiogram", weight: 0.7 },
    { keyword: "systolic", weight: 0.7 },
    { keyword: "diastolic", weight: 0.7 },
    { keyword: "blood pressure" }
  ]
}

Financial Services Example:

sttConfig: {
  version: "v1",
  vendor: "Auto",
  keywords: [
    { keyword: "401k", weight: 0.7 },
    { keyword: "Roth IRA", weight: 0.7 },
    { keyword: "rollover" },
    { keyword: "beneficiary" },
    { keyword: "contribution limit" }
  ]
}

Keyword Limits: While there’s no strict limit, using too many keywords (50+) may reduce their effectiveness. Focus on the most critical terms that are frequently misrecognized.
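
If keyword lists are merged from several sources (product catalog, glossary, and so on), a small normalization step keeps them tidy. The following is a sketch under assumptions: `buildSttKeywords` is a hypothetical helper, case-insensitive de-duplication is a choice made here rather than documented platform behavior, and the threshold of 50 comes from the guidance above, not from a hard limit.

```javascript
const RECOMMENDED_MAX_KEYWORDS = 50; // guidance from the note above, not a hard limit

// Accepts strings or { keyword, weight } objects and returns SttKeyword objects,
// trimmed and de-duplicated (first occurrence wins).
function buildSttKeywords(entries) {
  const byKeyword = new Map();
  for (const entry of entries) {
    const item = typeof entry === "string" ? { keyword: entry } : entry;
    const keyword = item.keyword.trim();
    if (keyword === "") continue; // skip blank entries
    const key = keyword.toLowerCase();
    if (byKeyword.has(key)) continue;
    byKeyword.set(key, item.weight === undefined ? { keyword } : { keyword, weight: item.weight });
  }
  const keywords = [...byKeyword.values()];
  if (keywords.length >= RECOMMENDED_MAX_KEYWORDS) {
    console.warn(`${keywords.length} keywords configured; large lists may be less effective`);
  }
  return keywords;
}
```

The returned array can be assigned directly to `sttConfig.keywords`.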

Configuration Examples

Basic Configuration (Recommended)

Simple, production-ready voice setup:

// Via Dashboard: Use defaults in Voice & Speech tab
// Via API:
const response = await fetch('https://blackbox.dasha.ai/api/v1/agents', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    name: "Support Agent",
    config: {
      version: "v1",
      primaryLanguage: "en-US",
      ttsConfig: {
        version: "v1",
        vendor: "ElevenLabs",
        voiceId: "zmcVlqmyk3Jpn5AVYcAL",
        model: "eleven_flash_v2_5",
        speed: 1.0
      },
      sttConfig: {
        version: "v1",
        vendor: "Auto" // Platform manages ASR automatically
      },
      llmConfig: {
        version: "v1",
        vendor: "openai",
        model: "gpt-4.1-mini",
        prompt: "You are a helpful assistant."
      },
      features: {
        version: "v1",
        languageSwitching: {
          version: "v1",
          isEnabled: false
        },
        rag: {
          version: "v1",
          isEnabled: false,
          kbLinks: []
        }
      }
    }
  })
});

ElevenLabs with Customization

High-quality voice with fine-tuned parameters:

ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel
  model: "eleven_turbo_v2_5",
  speed: 1.1,
  vendorSpecificOptions: {
    similarity_boost: 0.85,
    stability: 0.7,
    style: 0.4,
    use_speaker_boost: true,
    optimize_streaming_latency: 3
  }
}

Cartesia with Emotions

Low-latency with emotional expressiveness:

ttsConfig: {
  version: "v1",
  vendor: "Cartesia",
  voiceId: "cartesia-friendly-voice",
  model: "sonic",
  speed: 1.2,
  vendorSpecificOptions: {
    emotions: [
      "positivity:high",
      "curiosity:low"
    ]
  }
}

Multilingual Configuration

Agent supporting multiple languages:

config: {
  version: "v1",
  primaryLanguage: "en-US", // Default language
  ttsConfig: {
    version: "v1",
    vendor: "ElevenLabs",
    voiceId: "multilingual-voice-id",
    model: "eleven_multilingual_v2", // Supports 29+ languages
    speed: 1.0
  },
  features: {
    version: "v1",
    languageSwitching: {
      version: "v1",
      isEnabled: true
    }
  }
}

Testing Voice Configuration

Dashboard Testing

Use the built-in test widget to verify voice quality:

Save Agent: Save your voice configuration
Open Test Widget: Click “Test Agent” in dashboard
Start Conversation: Begin voice interaction
Listen Carefully: Evaluate voice quality, speed, clarity
Iterate: Adjust settings and re-test as needed

Figure: Dashboard test widget with voice interaction controls for testing your agent’s voice directly from the dashboard.

Voice Synthesis API Testing

Test TTS synthesis without creating an agent:

// This example targets https://blackbox.dasha.ai; use http://localhost:8080 for local development
// Synthesize speech using a public ElevenLabs voice
const response = await fetch('https://blackbox.dasha.ai/api/v1/voice/synthesize', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    text: "Hello world, this is a test of the voice synthesis system.",
    provider: "ElevenLabs",
    voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel (public voice)
    model: "eleven_turbo_v2_5",
    language: "en-US",
    speed: 1.0
  })
});

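if (response.ok) {
  // Play the synthesized audio in the browser
  const audioBlob = await response.blob();
  const audioUrl = URL.createObjectURL(audioBlob);
  const audio = new Audio(audioUrl);
  await audio.play();
} else {
  console.error('Synthesis failed with status', response.status);
}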

What to Test

Quality Checklist:

Voice sounds natural and professional
Speed is comfortable for target audience
Pronunciation of key terms is correct
Emotional tone matches use case
No audio artifacts or glitches
Consistent quality across phrases
Latency is acceptable for real-time conversation

Best Practices

Voice Selection

Match Your Brand: Choose voices that align with your brand identity
Consider Audience: Select voices that suit your audience’s demographics
Test Multiple Options: Preview 3-5 voices before deciding
Get Feedback: Test with representative users
Document Choice: Note why you selected specific voices

Speed Settings

Start at 1.0x: Use default speed as baseline
Test Incrementally: Adjust in 0.05x increments
Context Matters: Different content may need different speeds
A/B Test: Compare speeds with real users
Monitor Feedback: Track user satisfaction metrics

Provider Selection

Choose ElevenLabs if:

You need highest voice quality
Brand-specific voice cloning is important
Advanced customization is required
You can tolerate slightly higher latency

Choose Cartesia if:

Real-time responsiveness is critical
Emotional expression is important
You need ultra-low latency (< 250ms)
Conversational feel matters more than absolute voice quality

Choose Dasha if:

You want simple, reliable configuration
Platform integration is a priority
You need widest speed adjustment range
Quick setup is important

Choose Inworld if:

You’re building character-driven experiences
Gaming or interactive media is your use case
Voice expressiveness is critical

Choose LMNT if:

You need consistent, predictable output
Minimal configuration is desired
Fixed speed (1.0x) works for your needs
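
The criteria above can be condensed into a rough decision helper. This is a hypothetical sketch, not a platform feature: `suggestProvider` and its flag names are invented for illustration, and the order of the checks is just one possible priority when several needs apply at once.

```javascript
// Maps a set of requirements to a vendor value, following the criteria above.
function suggestProvider(needs = {}) {
  if (needs.voiceCloning) return "ElevenLabs"; // only provider listed with voice cloning
  if (needs.emotionControl || needs.ultraLowLatency) return "Cartesia";
  if (needs.characterVoices) return "Inworld";
  if (needs.wideSpeedRange) return "Dasha"; // 0.25x - 4.0x
  if (needs.fixedSpeedOnly) return "Lmnt"; // speed is always 1.0x
  return "ElevenLabs"; // documented default for new agents
}
```

For instance, `suggestProvider({ emotionControl: true })` returns "Cartesia". Treat the result as a starting point for previewing voices, not as a final choice.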

Common Mistakes to Avoid

Avoid These Mistakes:

Not Previewing: Always preview before deploying
Extreme Speeds: Don’t use < 0.7x or > 1.5x without extensive testing
Mismatched Languages: Ensure voice language matches agent language
Over-Optimization: Don’t sacrifice quality for marginal latency gains
Ignoring Feedback: Listen to user complaints about voice quality
Single Voice Testing: Test multiple voices before committing

Troubleshooting

Voice Issues

Problem: Voice sounds robotic or unnatural

Solution: Try a different voice from the same provider
Solution: Adjust stability/temperature parameters (ElevenLabs/Inworld)
Solution: Switch to ElevenLabs for highest quality

Problem: Speech is too fast/slow

Solution: Adjust speed setting incrementally
Solution: Test with representative users
Solution: Consider accessibility needs (slower may be better)

Problem: Pronunciation errors

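Solution: Add the term to a pronunciation dictionary (supported for ElevenLabs and Cartesia)
Solution: Use phonetic spelling in the system prompt
Solution: Try a different voice from the same provider
Solution: Consider voice cloning with correct pronunciation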

Problem: High latency before speech starts

Solution: Switch to Cartesia for lowest latency
Solution: Use ElevenLabs with optimize_streaming_latency: 4
Solution: Check network conditions

ASR Issues

Problem: Poor transcription accuracy

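Solution: Provider selection is automatic, so there is no ASR provider setting to change
Solution: Add STT keywords for domain-specific terms that are frequently misrecognized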
Solution: Prompt users to speak clearly in agent greeting
Solution: Add confirmation/verification in conversation flow

Problem: Wrong language detected

Solution: Ensure the agent’s primaryLanguage matches the user’s language
Solution: Enable language switching if supporting multiple languages

Next Steps

Test Voice Settings - Detailed voice testing guide
Voice Cloning - Create custom branded voices
Advanced Features - Language switching and more
Tools & Functions - Add capabilities to your agent

API Cross-References

GET /api/v1/voice - List all available voices
POST /api/v1/voice/synthesize - Synthesize speech for testing
POST /api/v1/voice/clone - Clone custom voices
PATCH /api/v1/voice/clone/{voiceId} - Update cloned voice
DELETE /api/v1/voice/clone/{voiceId} - Delete cloned voice
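
A minimal sketch of preparing requests for the endpoints listed above, using only the Python standard library. The base URL, the API key placeholder, and the Bearer authorization scheme are assumptions made for illustration; check your account's API settings for the real values. Request bodies for the synthesize and clone endpoints are not shown because their schemas are not covered here.

```python
import urllib.request


def voice_api_request(base_url: str, api_key: str, method: str, path: str) -> urllib.request.Request:
    """Build (but do not send) a request for one of the voice endpoints."""
    url = base_url.rstrip("/") + path
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Accept": "application/json",
    }
    return urllib.request.Request(url, method=method, headers=headers)


# Hypothetical values for illustration only.
BASE_URL = "https://blackbox.example.com"
API_KEY = "YOUR_API_KEY"

list_voices = voice_api_request(BASE_URL, API_KEY, "GET", "/api/v1/voice")
delete_clone = voice_api_request(
    BASE_URL, API_KEY, "DELETE", "/api/v1/voice/clone/my-voice-id"
)
```

Pass a prepared request to `urllib.request.urlopen` to send it once the placeholders have been replaced with real values.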
