Voice & Speech Configuration
Your agent’s voice is crucial to creating engaging, natural conversations. Dasha BlackBox supports five production-ready TTS providers, each with unique characteristics, voices, and configuration options.
Quick Start: New agents default to ElevenLabs Flash V2.5 (voice ID zmcVlqmyk3Jpn5AVYcAL). You can preview and change voices anytime without affecting existing calls.
Overview
Voice configuration involves two main components:
- Text-to-Speech (TTS): Converts your agent’s responses into natural-sounding audio
- Speech-to-Text (STT/ASR): Transcribes user speech into text for the LLM to process

For TTS, you configure:
- TTS Provider: Choose from ElevenLabs, Cartesia, Dasha, Inworld, or LMNT
- Voice Selection: Pick from hundreds of voices across languages and accents
- Voice Model: Select the synthesis model (quality vs. speed tradeoff)
- Speech Speed: Adjust the playback rate (provider-dependent)
- Provider-Specific Options: Fine-tune voice characteristics
ASR Configuration
ASR is Automatic: Speech recognition (ASR/STT) is automatically selected and managed by the Dasha BlackBox platform in real-time. There is no user-visible toggle or manual configuration required.
The platform dynamically chooses the best ASR provider based on:
- User’s detected language and accent
- Network conditions and latency
- Call quality metrics
- Provider availability
TTS Provider Comparison
Quick Reference Table
| Provider | Best For | Speed Range | Emotions | Voice Cloning | Latency |
|---|---|---|---|---|---|
| ElevenLabs | Natural quality, customization | 0.70x - 1.20x | Via style parameter | Yes | Medium |
| Cartesia | Ultra-low latency, emotions | 0x - 2.0x (speeds below 0.25x sent as 0.25x) | 20 emotion levels | No | Ultra-low |
| Dasha | Platform-native, widest speed range | 0.25x - 4.0x | No | No | Low |
| Inworld | Character voices, gaming | 0.80x - 1.50x | Via temperature/pitch | No | Medium |
| LMNT | Consistent, lightweight | Fixed (1.0x) | No | No | Low |
Provider Deep Dive
ElevenLabs
Strengths:
- Industry-leading voice quality
- Extensive voice library (1000+ voices)
- Advanced customization options
- Voice cloning support
- Multilingual capabilities
Available Models:
eleven_multilingual_v2 - Multilingual V2 (best quality)
eleven_turbo_v2_5 - Turbo V2.5 (balanced)
eleven_flash_v2_5 - Flash V2.5 (fastest)
Speed Range: 0.70x to 1.20x (default: 1.0x)

Customization Options:
- Similarity Boost (0.0 - 1.0, default: 0.75)
  - Controls voice consistency with the original
  - Higher = more similar to base voice
  - Lower = more variation allowed
- Stability (0.0 - 1.0, default: 0.5)
  - Controls voice stability across generations
  - Higher = more consistent output
  - Lower = more expressive/variable
- Style (0.0 - 1.0, default: 0.3)
  - Controls speaker style exaggeration
  - Higher = more stylized delivery
  - Lower = more neutral tone
- Use Speaker Boost (boolean, default: true)
  - Enhances voice clarity and quality
  - Recommended for most use cases
- Optimize Streaming Latency (0-4, default: 4)
  - Trades quality for lower latency
  - 0 = highest quality, highest latency
  - 4 = lowest latency, acceptable quality
Best For: Customer support, professional services, brand-specific voices

Example Configuration:

ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel
  model: "eleven_turbo_v2_5",
  speed: 1.0,
  vendorSpecificOptions: {
    similarity_boost: 0.8,
    stability: 0.6,
    style: 0.4,
    use_speaker_boost: true,
    optimize_streaming_latency: 3
  }
}
Cartesia
Strengths:
- Ultra-low latency (< 250ms)
- Emotion control system
- High-speed inference
- Great for real-time conversations
- Natural conversational flow
Available Models:
sonic - Sonic (only model, optimized for speed)
Speed Range: 0x to 2.0x (default: 1.0x)
- Note: Speeds below 0.25x are automatically adjusted to 0.25x due to server limits
Emotion System:
Cartesia offers granular control over 5 emotion dimensions, each with 4 intensity levels:
- Anger:
anger:lowest, anger:low, anger:high, anger:highest
- Positivity:
positivity:lowest, positivity:low, positivity:high, positivity:highest
- Surprise:
surprise:lowest, surprise:low, surprise:high, surprise:highest
- Sadness:
sadness:lowest, sadness:low, sadness:high, sadness:highest
- Curiosity:
curiosity:lowest, curiosity:low, curiosity:high, curiosity:highest
You can combine multiple emotions for nuanced delivery:

vendorSpecificOptions: {
  emotions: [
    "positivity:high",
    "curiosity:low"
  ]
}
Best For: Real-time applications, emotional responses, conversational AI, gaming

Example Configuration:

ttsConfig: {
  version: "v1",
  vendor: "Cartesia",
  voiceId: "cartesia-voice-id",
  model: "sonic",
  speed: 1.2,
  vendorSpecificOptions: {
    emotions: [
      "positivity:high",
      "curiosity:low"
    ]
  }
}
Emotion Tips: Start with subtle emotions (low levels) and adjust based on testing. Combining too many high-intensity emotions can sound unnatural.
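As a convenience, emotion tags can be validated client-side before they are sent. The helper below is an illustrative sketch (not part of the Dasha BlackBox API) that checks tags against the five dimensions and four intensity levels listed above.

```javascript
// Illustrative helper (assumption, not a platform API): validate Cartesia
// emotion strings of the form "dimension:intensity" before placing them
// in vendorSpecificOptions.emotions.
const EMOTION_DIMENSIONS = ["anger", "positivity", "surprise", "sadness", "curiosity"];
const INTENSITY_LEVELS = ["lowest", "low", "high", "highest"];

function isValidCartesiaEmotion(tag) {
  const parts = tag.split(":");
  if (parts.length !== 2) return false;
  const [dimension, intensity] = parts;
  return EMOTION_DIMENSIONS.includes(dimension) && INTENSITY_LEVELS.includes(intensity);
}

function validateEmotions(emotions) {
  const invalid = emotions.filter((e) => !isValidCartesiaEmotion(e));
  if (invalid.length > 0) {
    throw new Error(`Unknown emotion tags: ${invalid.join(", ")}`);
  }
  return emotions;
}
```

Running `validateEmotions(["positivity:high", "curiosity:low"])` passes the array through unchanged, while a typo like `"anger:medium"` fails fast instead of reaching the TTS provider.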
Dasha
Strengths:
- Platform-native integration
- Optimized for Dasha BlackBox infrastructure
- Widest speed range (0.25x - 4.0x)
- Consistent performance
- No additional configuration needed
Available Models:
common - Common (only model)
Speed Range: 0.25x to 4.0x (default: 1.0x)
- Widest range among all providers
- Useful for accessibility (slower) or time-constrained scenarios (faster)
Customization Options:
- No additional options available
- Simple, straightforward configuration
Best For: General-purpose agents, quick setup, platform consistency

Example Configuration:

ttsConfig: {
  version: "v1",
  vendor: "Dasha",
  voiceId: "dasha-voice-id",
  model: "common",
  speed: 1.0
}
Platform Integration: Dasha offers reliable performance with minimal configuration and platform-native integration.
Inworld
Strengths:
- Character-focused voices
- Gaming and interactive media optimized
- Temperature and pitch controls
- Expressive character delivery
Available Models:
inworld-tts-1 - Inworld TTS 1 (only model)
Speed Range: 0.80x to 1.50x (default: 1.0x)

Customization Options:
- Temperature (slider, default: 0.8)
  - Controls voice expressiveness
  - Higher = more expressive/varied
  - Lower = more consistent/neutral
- Pitch (slider, default: 0.0)
  - Adjusts voice pitch
  - Positive values = higher pitch
  - Negative values = lower pitch
Best For: Gaming NPCs, character-driven experiences, interactive storytelling

Example Configuration:

ttsConfig: {
  version: "v1",
  vendor: "Inworld",
  voiceId: "inworld-voice-id",
  model: "inworld-tts-1",
  speed: 1.0,
  vendorSpecificOptions: {
    temperature: 0.9,
    pitch: 0.2
  }
}
LMNT
Strengths:
- Consistent, reliable synthesis
- Lightweight implementation
- Simple configuration
- Good quality-to-performance ratio
Available Models:
blizzard - Blizzard (only model)
Speed Range: Fixed at 1.0x (speed control not supported)

Customization Options:
- No additional options available
Best For: Straightforward deployments, consistent voice output, minimal configuration needs

Example Configuration:

ttsConfig: {
  version: "v1",
  vendor: "Lmnt",
  voiceId: "lmnt-voice-id",
  model: "blizzard"
  // Note: No speed control available
}
No Speed Control: LMNT does not support speed adjustment. The voice will always play at 1.0x speed.
TTS Vendor Options Summary
The vendor field in TtsConfig accepts the following values:
| Vendor | Value | Description |
|---|---|---|
| ElevenLabs | "ElevenLabs" | Industry-leading quality, extensive customization |
| Cartesia | "Cartesia" | Ultra-low latency, emotion control |
| Dasha | "Dasha" | Platform-native, widest speed range |
| Inworld | "Inworld" | Character voices, gaming-focused |
| LMNT | "Lmnt" | Consistent, lightweight synthesis |
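A client-side sanity check can catch vendor/model mismatches before a config is saved. The sketch below is an assumption (not a platform API) built from the model lists documented in this page; the service remains the source of truth for accepted pairs.

```javascript
// Hypothetical client-side check: verify that a ttsConfig names a known
// vendor and a model that vendor actually offers (per the docs above).
const VENDOR_MODELS = {
  ElevenLabs: ["eleven_multilingual_v2", "eleven_turbo_v2_5", "eleven_flash_v2_5"],
  Cartesia: ["sonic"],
  Dasha: ["common"],
  Inworld: ["inworld-tts-1"],
  Lmnt: ["blizzard"],
};

function checkTtsConfig(ttsConfig) {
  const models = VENDOR_MODELS[ttsConfig.vendor];
  if (!models) throw new Error(`Unknown TTS vendor: ${ttsConfig.vendor}`);
  if (!models.includes(ttsConfig.model)) {
    throw new Error(`Model ${ttsConfig.model} is not available for ${ttsConfig.vendor}`);
  }
  return ttsConfig;
}
```

For example, `checkTtsConfig({ vendor: "Cartesia", model: "sonic" })` passes, while pairing `"Lmnt"` with `"sonic"` throws before the config ever reaches the API.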
Vendor-Specific Options Reference
Each TTS provider supports different configuration options in vendorSpecificOptions:
ElevenLabs:

vendorSpecificOptions: {
  similarity_boost: 0.75, // 0.0-1.0: Voice consistency
  stability: 0.5, // 0.0-1.0: Voice stability
  style: 0.3, // 0.0-1.0: Style exaggeration
  use_speaker_boost: true, // Boolean: Enhanced clarity
  optimize_streaming_latency: 4 // 0-4: Latency optimization
}

Cartesia:

vendorSpecificOptions: {
  emotions: [
    "positivity:high", // Emotion:intensity pairs
    "curiosity:low"
  ]
}

Emotion Dimensions: anger, positivity, surprise, sadness, curiosity
Intensity Levels: lowest, low, high, highest

Inworld:

vendorSpecificOptions: {
  temperature: 0.8, // Voice expressiveness
  pitch: 0.0 // Pitch adjustment
}

Dasha / LMNT:

No vendor-specific options available. Use default configuration:

// vendorSpecificOptions not needed
ttsConfig: {
  version: "v1",
  vendor: "Dasha", // or "Lmnt"
  voiceId: "voice-id",
  model: "common" // or "blizzard" for LMNT
}
Voice Selection
Browsing Available Voices
Access all provider voices through the Voice & Speech tab:
- Navigate: Go to agent creation/editing → Voice & Speech tab
- Select Provider: Choose your TTS provider first
- Browse Voices: Search and filter available voices
- Preview: Listen to voice samples before selecting
- Select: Choose your preferred voice
Voice selection with search and preview functionality
Voice Attributes
Each voice includes metadata:
- Name: Voice identifier (e.g., “Rachel”, “Mark”)
- Language: Primary language and locale (e.g., “en-US”, “es-ES”)
- Gender: Male, Female, or Neutral
- Description: Voice characteristics and use cases
- Provider: TTS service (ElevenLabs, Cartesia, etc.)
Custom Voice IDs
You can use voice IDs not listed in the default picker:
- Find the voice ID from your TTS provider’s documentation
- Click “Use custom voice ID” button below the voice selector
- Enter the custom voice ID in the dialog that appears
- Preview to verify the voice works correctly
- Save your agent configuration
Voice Preview
Using Voice Preview
Before committing to a voice, test it with your actual content:
- Enter Preview Text: Type or paste sample text (up to 1000 characters)
- Configure Settings: Adjust speed, options as needed
- Click Preview: Generate and listen to audio sample
- Iterate: Try different voices, speeds, and settings
- Select Best Match: Choose the configuration that sounds best
Preview voices with custom text before saving
Preview Best Practices
Text Selection:
- Use representative samples from your agent’s actual responses
- Include questions, statements, and conversational phrases
- Test punctuation handling (commas, periods, exclamation points)
- Try names, numbers, and special terms your agent will use
What to Listen For:
- Natural pronunciation of domain-specific terms
- Appropriate pacing for your use case
- Emotional tone matches your brand
- Clarity at different speeds
- Consistent quality across phrases
Character Limit
Voice preview supports up to 1000 characters per preview.
- Real-time character counter shows remaining characters
- Warning displays when approaching limit
- Limit enforced by the VOICE_PREVIEW_TEXT_LIMIT constant
Technical Detail: Preview uses the same /api/v1/voice/synthesize endpoint as production calls, ensuring accuracy.
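The counter behavior described above can be sketched as a small helper. Only the 1000-character cap comes from the documentation; the warning threshold (`warnAt`) is an assumption for illustration.

```javascript
// Sketch of the preview character counter. The 1000-character cap mirrors
// the documented VOICE_PREVIEW_TEXT_LIMIT; the 90% warning threshold is
// an assumption, not a documented value.
const VOICE_PREVIEW_TEXT_LIMIT = 1000;

function previewCounter(text, warnAt = 0.9) {
  const remaining = VOICE_PREVIEW_TEXT_LIMIT - text.length;
  return {
    remaining,                                              // characters left
    nearLimit: text.length >= VOICE_PREVIEW_TEXT_LIMIT * warnAt, // show warning
    overLimit: remaining < 0,                               // block submission
  };
}
```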
TTS Responsiveness
What is Responsiveness?
Responsiveness controls how quickly your agent begins speaking after the user finishes talking. This parameter directly affects the delay before the agent’s response.
Location in Configuration: ttsConfig.responsiveness
Type: Number (0 to 1, optional)
Default: Platform-managed
How Responsiveness Works
Important: The responsiveness value controls the delay before the agent responds. A value of 1.0 provides the most responsive agent with minimal delay. Lower values add artificial delay before the agent begins speaking.
| Value | Behavior | Response Delay |
|---|---|---|
| 1.0 | Most responsive (recommended) | Minimal delay before response |
| 0.7 | Slightly delayed | Small delay added before response |
| 0.5 | Moderately delayed | Moderate delay added before response |
| 0.3 | Significantly delayed | Longer delay added before response |
| 0.0 | Maximum delay | Longest delay before response |
Recommended Configuration
For most use cases, 1.0 is the recommended value as it provides the most natural, responsive conversation experience with minimal delay.
ttsConfig: {
version: "v1",
vendor: "ElevenLabs",
voiceId: "your-voice-id",
model: "eleven_turbo_v2_5",
responsiveness: 1.0 // Recommended: most responsive
}
When to Use Lower Values
Lower responsiveness values (adding delay) may be useful in specific scenarios:
- IVR-style systems where a slight pause feels more natural
- Accessibility requirements where users need extra processing time
- Compliance scenarios where pacing requirements exist
Technical Detail: Values below 1.0 add an artificial delay before the agent begins responding. This does not affect speech speed (controlled by the speed parameter) or voice quality—it only delays when the response starts.
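One way to picture this relationship is as a delay that grows as responsiveness drops. Both the linear shape and the MAX_ADDED_DELAY_MS ceiling below are invented for illustration; the platform's actual delay curve is internal and not documented.

```javascript
// Purely illustrative mapping from responsiveness to added delay.
// The linear curve and the 800 ms ceiling are assumptions for this
// sketch, not documented Dasha BlackBox behavior.
const MAX_ADDED_DELAY_MS = 800; // hypothetical ceiling

function addedDelayMs(responsiveness) {
  const r = Math.min(1, Math.max(0, responsiveness)); // clamp to [0, 1]
  return Math.round((1 - r) * MAX_ADDED_DELAY_MS);
}
```

Under this sketch, `responsiveness: 1.0` adds no delay and lower values add progressively more, which matches the ordering in the table above.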
Example Configurations
Optimal (1.0)

Recommended for most use cases:
- Customer support
- Sales calls
- Appointment scheduling
- General conversational scenarios

ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "your-voice-id",
  model: "eleven_turbo_v2_5",
  responsiveness: 1.0 // Most responsive, minimal delay
}

With Delay (0.7)

When a slight pause is desired:
- IVR-style interactions
- Specific accessibility needs

ttsConfig: {
  version: "v1",
  vendor: "ElevenLabs",
  voiceId: "your-voice-id",
  model: "eleven_turbo_v2_5",
  responsiveness: 0.7 // Adds slight delay before response
}
Best Practice: Start with 1.0 (the most responsive setting) and only reduce it if you have a specific requirement for adding delay before responses. Most conversational AI applications perform best with minimal response delay.
Speed Configuration
Speed Ranges by Provider
Different providers support different speed ranges:
| Provider | Min Speed | Max Speed | Default | Recommended Range |
|---|---|---|---|---|
| ElevenLabs | 0.70x | 1.20x | 1.0x | 0.9x - 1.2x |
| Cartesia | 0x (sent as 0.25x) | 2.0x | 1.0x | 0.8x - 1.3x |
| Dasha | 0.25x | 4.0x | 1.0x | 0.8x - 1.5x |
| Inworld | 0.80x | 1.50x | 1.0x | 0.9x - 1.1x |
| LMNT | Fixed 1.0x | Fixed 1.0x | 1.0x | 1.0x (only) |
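If speeds come from user input or another system, clamping them into the documented per-provider ranges avoids out-of-range values. This helper is a client-side convenience sketch built from the table above; the service enforces its own limits regardless.

```javascript
// Clamp a requested speed into the per-provider ranges documented above.
// Client-side convenience only; the platform applies its own limits too.
const SPEED_RANGES = {
  ElevenLabs: { min: 0.7, max: 1.2 },
  Cartesia: { min: 0.25, max: 2.0 }, // speeds below 0.25x are sent as 0.25x
  Dasha: { min: 0.25, max: 4.0 },
  Inworld: { min: 0.8, max: 1.5 },
  Lmnt: { min: 1.0, max: 1.0 }, // speed control not supported
};

function clampSpeed(vendor, speed) {
  const range = SPEED_RANGES[vendor];
  if (!range) throw new Error(`Unknown TTS vendor: ${vendor}`);
  return Math.min(range.max, Math.max(range.min, speed));
}
```

For example, `clampSpeed("Cartesia", 0.1)` yields 0.25, and any value passed for LMNT collapses to 1.0.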
Choosing the Right Speed
Slower Speeds (0.7x - 0.9x):
- Accessibility needs
- Complex information delivery
- Educational content
- Non-native speakers
- Legal/compliance disclosures
Normal Speed (1.0x):
- General conversation
- Customer support
- Sales calls
- Most use cases
Faster Speeds (1.1x - 1.5x):
- Time-sensitive scenarios
- Familiar/repetitive content
- High-volume information
- Experienced users
Speed Extremes: Avoid speeds below 0.7x (too slow, unnatural) or above 1.5x (too fast, hard to understand). Test thoroughly before production.
Dynamic Speed Adjustment
Dasha BlackBox supports dynamic speed adjustment during conversations, allowing the agent to adapt its speaking pace based on user requests (e.g., “Can you speak more slowly?”).
Location in Configuration: ttsConfig.speedAdjustment
SpeedAdjustmentSettings
| Field | Type | Default | Description |
|---|---|---|---|
| version | string | "v1" | Configuration version |
| strategy | SpeedAdjustment | "OnRequest" | Speed adjustment strategy |
SpeedAdjustment Strategies
| Strategy | Description |
|---|---|
| "OnRequest" | (Default) Agent can adjust speed when user requests it |
| "Disabled" | Speed remains fixed at the configured value |
Relationship to Base Speed
The speedAdjustment setting works in conjunction with the base speed parameter:
- Base speed: Sets the initial/default speech rate (e.g., 1.0x)
- speedAdjustment.strategy: Controls whether the agent can deviate from the base speed during conversation
Example Configuration:
ttsConfig: {
version: "v1",
vendor: "ElevenLabs",
voiceId: "your-voice-id",
model: "eleven_turbo_v2_5",
speed: 1.0, // Base speed
speedAdjustment: {
version: "v1",
strategy: "OnRequest" // Allow user to request speed changes
}
}
Disabled Speed Adjustment:
ttsConfig: {
version: "v1",
vendor: "ElevenLabs",
voiceId: "your-voice-id",
model: "eleven_turbo_v2_5",
speed: 1.1, // Fixed at 1.1x
speedAdjustment: {
version: "v1",
strategy: "Disabled" // Always use base speed
}
}
When to Disable: Consider disabling speed adjustment for compliance-focused applications where consistent delivery speed is required for legal or regulatory reasons.
Voice Cloning
Create custom voices that match your brand identity with ElevenLabs voice cloning.
Cloning Process
- Prepare Audio: Record 1-5 minutes of clean voice samples
- Upload: Use the voice cloning interface or API
- Configure: Set name, description, language
- Clone: ElevenLabs processes your samples
- Use: Select cloned voice in agent configuration
API-Based Voice Cloning
// Clone a new voice
const formData = new FormData();
formData.append('Name', 'My Brand Voice');
formData.append('Description', 'Custom voice for customer support');
formData.append('Language', 'en-US');
formData.append('Provider', 'ElevenLabs');
formData.append('audioFiles', audioFile); // File object
const response = await fetch('https://blackbox.dasha.ai/api/v1/voice/clone', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY'
},
body: formData
});
const clonedVoice = await response.json();
console.log('Cloned voice ID:', clonedVoice.voiceId);
Managing Cloned Voices
// Update cloned voice description
const voiceId = 'your-cloned-voice-id';
await fetch(`https://blackbox.dasha.ai/api/v1/voice/clone/${voiceId}`, {
method: 'PATCH',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
description: 'Updated voice description'
})
});
// Delete cloned voice
await fetch(`https://blackbox.dasha.ai/api/v1/voice/clone/${voiceId}`, {
method: 'DELETE',
headers: {
'Authorization': 'Bearer YOUR_API_KEY'
}
});
For detailed voice cloning instructions, see Voice Cloning.
Pronunciation Dictionary
Pronunciation dictionaries allow you to customize how your agent pronounces specific words or phrases. This is useful for brand names, technical terms, acronyms, or any words the TTS provider might mispronounce.
Location in Configuration: ttsConfig.pronunciationDictionary
PronunciationDictionaryReference
The pronunciationDictionary field in TtsConfig references a pre-created pronunciation dictionary by its ID.
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique identifier of the pronunciation dictionary |
| hash | string | Yes | Content hash for version tracking |
Supported Rule Types
Pronunciation dictionaries support two types of rules:
| Rule Type | Description | Use Case |
|---|---|---|
| Alias | Replace one word/phrase with another | Acronyms, abbreviations, alternate spellings |
| Phoneme | Specify exact phonetic pronunciation | Precise control over difficult words |
Supported Providers
Pronunciation dictionaries are supported by:
- ElevenLabs: Full support for alias rules
- Cartesia: Full support for alias and phoneme rules
Other TTS providers (Dasha, Inworld, LMNT) do not currently support pronunciation dictionaries. The configuration will be ignored for these providers.
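Since unsupported providers silently ignore the dictionary, you may want to strip the reference client-side so a stored config reflects what will actually take effect. The helper below is a hypothetical convenience, not a platform API.

```javascript
// Hypothetical guard: remove the pronunciationDictionary reference for
// vendors that ignore it (only ElevenLabs and Cartesia support it).
const DICTIONARY_VENDORS = ["ElevenLabs", "Cartesia"];

function normalizePronunciation(ttsConfig) {
  if (ttsConfig.pronunciationDictionary && !DICTIONARY_VENDORS.includes(ttsConfig.vendor)) {
    const { pronunciationDictionary, ...rest } = ttsConfig; // drop the ignored field
    return rest;
  }
  return ttsConfig;
}
```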
Example Configuration
Referencing a Pronunciation Dictionary:
ttsConfig: {
version: "v1",
vendor: "ElevenLabs",
voiceId: "your-voice-id",
model: "eleven_turbo_v2_5",
speed: 1.0,
pronunciationDictionary: {
id: "pd_abc123def456",
hash: "a1b2c3d4e5f6"
}
}
Creating Pronunciation Dictionaries
Pronunciation dictionaries are created and managed through the API:
// Create a pronunciation dictionary
const response = await fetch('https://blackbox.dasha.ai/api/v1/pronunciation-dictionaries', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
name: "Company Terms",
provider: "Cartesia",
rules: [
{
type: "alias",
stringToReplace: "API",
replacement: "A P I"
},
{
type: "alias",
stringToReplace: "SQL",
replacement: "sequel"
},
{
type: "phoneme",
stringToReplace: "Dasha",
phoneme: "ˈdɑːʃə",
alphabet: "ipa"
}
]
})
});
const dictionary = await response.json();
console.log('Dictionary ID:', dictionary.id);
Common Use Cases
Acronyms and Abbreviations:
- “API” → “A P I” (spell out) or “ay pee eye”
- “SQL” → “sequel” or “S Q L”
- “CEO” → “C E O”
Brand Names:
- “Nike” → phoneme for correct pronunciation
- “Dasha” → phoneme “ˈdɑːʃə”
Technical Terms:
- “kubectl” → “kube control” or “kube C T L”
- “nginx” → “engine X”
Best Practice: Create a single, comprehensive pronunciation dictionary for your organization and reference it across all agents. This ensures consistent pronunciation across your entire voice AI deployment.
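For a rough local preview of how alias rules will read aloud, you can apply them to sample text before creating a dictionary. This is an approximation: the actual substitution happens inside the TTS provider, which may match text differently (case handling, word boundaries).

```javascript
// Approximate local preview of alias rules. Uses whole-word,
// case-sensitive matching for this sketch; provider behavior may differ.
function previewAliases(text, rules) {
  let out = text;
  for (const rule of rules) {
    if (rule.type !== "alias") continue; // phoneme rules can't be previewed as text
    // Escape regex metacharacters in the search string.
    const escaped = rule.stringToReplace.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    out = out.replace(new RegExp(`\\b${escaped}\\b`, "g"), rule.replacement);
  }
  return out;
}
```

For example, previewing "Query the API with SQL" against the alias rules shown earlier renders as "Query the A P I with sequel".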
Speech Recognition (ASR)
How ASR Works in Dasha BlackBox
Speech-to-Text (STT), also called Automatic Speech Recognition (ASR), converts user audio into text for your agent’s LLM to process.
Automatic Provider Selection:
Dasha BlackBox automatically selects the optimal ASR provider based on:
- Language Detection: Identifies user’s spoken language
- Accent Recognition: Adjusts for regional variations
- Network Quality: Adapts to connection conditions
- Provider Performance: Routes to best-performing provider
- Real-Time Optimization: Switches providers if quality degrades
Supported ASR Providers
Dasha BlackBox integrates with multiple ASR providers for redundancy and quality:
- Deepgram: High-accuracy, low-latency transcription
- Microsoft Speech Services: Enterprise-grade recognition
- Auto (Platform-Managed): Automatic provider selection (recommended)
No Configuration Required: You don’t need to select an ASR provider. The platform handles this automatically for optimal results.
ASR Quality Factors
Several factors affect transcription accuracy:
Audio Quality:
- Clear microphone input
- Minimal background noise
- Good network connection
- Proper audio levels
User Factors:
- Speech clarity and pace
- Accent and pronunciation
- Use of domain-specific terms
- Speaking patterns
System Optimization:
- Automatic noise cancellation
- Echo suppression
- Acoustic model adaptation
- Real-time quality monitoring
Improving ASR Accuracy
While ASR is automatic, you can help improve accuracy:
- Agent Prompting: Guide users to speak clearly in your agent’s greeting
- Confirmation: Have agent repeat understood information for verification
- Clarification: Prompt for clarification when confidence is low
- STT Keywords: Configure keywords to boost recognition of domain-specific terms (see below)
STT Keywords
STT Keywords allow you to improve speech recognition accuracy for domain-specific terms, product names, proper nouns, and industry jargon. By providing a list of keywords, you help the ASR system prioritize recognition of these specific terms.
Location in Configuration: sttConfig.keywords
Type: Array of SttKeyword objects (optional)
SttKeyword Structure
| Field | Type | Required | Description |
|---|---|---|---|
| keyword | string | Yes | The word or phrase to boost recognition for |
| weight | number | No | Recognition priority boost (higher = more likely to be recognized) |
How Weight Works
The weight parameter influences how the ASR system prioritizes recognizing certain words:
- No weight specified: Standard boost for the keyword
- Higher weight values: Increased priority for recognition (e.g., 0.7, 1.0)
- Lower weight values: Slight boost, less aggressive prioritization
Choosing Weights: Start without weights for most keywords. Add weights (0.7 - 1.0) only for critical terms that are frequently misrecognized.
Use Cases for STT Keywords
| Use Case | Example Keywords |
|---|---|
| Product Names | “Dasha BlackBox”, “iPhone Pro Max”, “Model S Plaid” |
| Medical Terms | “hypertension”, “metformin”, “echocardiogram” |
| Proper Nouns | “Dasha AI”, “Anthropic”, “OpenAI” |
| Industry Jargon | “SaaS”, “ARR”, “churn rate”, “LTV” |
| Company-Specific | Internal project names, employee names, location names |
| Technical Terms | “API endpoint”, “webhook”, “OAuth” |
Example Configurations
Basic Keywords (No Weights):
sttConfig: {
version: "v1",
vendor: "Auto",
keywords: [
{ keyword: "Dasha BlackBox" },
{ keyword: "Dasha AI" },
{ keyword: "webhook" }
]
}
Keywords with Weights:
sttConfig: {
version: "v1",
vendor: "Auto",
keywords: [
{ keyword: "Dasha BlackBox", weight: 0.7 },
{ keyword: "Dasha AI", weight: 0.7 },
{ keyword: "metformin", weight: 1.0 },
{ keyword: "hypertension" },
{ keyword: "API endpoint" }
]
}
Medical/Healthcare Example:
sttConfig: {
version: "v1",
vendor: "Auto",
keywords: [
{ keyword: "lisinopril", weight: 1.0 },
{ keyword: "metoprolol", weight: 1.0 },
{ keyword: "echocardiogram", weight: 0.7 },
{ keyword: "systolic", weight: 0.7 },
{ keyword: "diastolic", weight: 0.7 },
{ keyword: "blood pressure" }
]
}
Financial Services Example:
sttConfig: {
version: "v1",
vendor: "Auto",
keywords: [
{ keyword: "401k", weight: 0.7 },
{ keyword: "Roth IRA", weight: 0.7 },
{ keyword: "rollover" },
{ keyword: "beneficiary" },
{ keyword: "contribution limit" }
]
}
Keyword Limits: While there’s no strict limit, using too many keywords (50+) may reduce their effectiveness. Focus on the most critical terms that are frequently misrecognized.
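If keyword lists are assembled in code, a small builder keeps plain and weighted terms consistent. The buildKeywords helper below is hypothetical (not part of any platform SDK); it simply produces the sttConfig.keywords shape shown in the examples above.

```javascript
// Hypothetical helper that builds the sttConfig.keywords array:
// plain terms get no weight, critical terms get an explicit boost.
function buildKeywords(terms, critical = {}) {
  const plain = terms.map((keyword) => ({ keyword }));
  const weighted = Object.entries(critical).map(([keyword, weight]) => ({ keyword, weight }));
  return [...plain, ...weighted];
}
```

Usage: `buildKeywords(["webhook", "rollover"], { metformin: 1.0 })` yields two unweighted entries plus `{ keyword: "metformin", weight: 1.0 }`.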
Configuration Examples
Basic Configuration (Recommended)
Simple, production-ready voice setup:
// Via Dashboard: Use defaults in Voice & Speech tab
// Via API:
const agent = await fetch('https://blackbox.dasha.ai/api/v1/agents', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
name: "Support Agent",
config: {
version: "v1",
primaryLanguage: "en-US",
ttsConfig: {
version: "v1",
vendor: "ElevenLabs",
voiceId: "zmcVlqmyk3Jpn5AVYcAL",
model: "eleven_flash_v2_5",
speed: 1.0
},
sttConfig: {
version: "v1",
vendor: "Auto" // Platform manages ASR automatically
},
llmConfig: {
version: "v1",
vendor: "openai",
model: "gpt-4.1-mini",
prompt: "You are a helpful assistant."
},
features: {
version: "v1",
languageSwitching: {
version: "v1",
isEnabled: false
},
rag: {
version: "v1",
isEnabled: false,
kbLinks: []
}
}
}
})
});
ElevenLabs with Customization
High-quality voice with fine-tuned parameters:
ttsConfig: {
version: "v1",
vendor: "ElevenLabs",
voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel
model: "eleven_turbo_v2_5",
speed: 1.1,
vendorSpecificOptions: {
similarity_boost: 0.85,
stability: 0.7,
style: 0.4,
use_speaker_boost: true,
optimize_streaming_latency: 3
}
}
Cartesia with Emotions
Low-latency with emotional expressiveness:
ttsConfig: {
version: "v1",
vendor: "Cartesia",
voiceId: "cartesia-friendly-voice",
model: "sonic",
speed: 1.2,
vendorSpecificOptions: {
emotions: [
"positivity:high",
"curiosity:low"
]
}
}
Multilingual Configuration
Agent supporting multiple languages:
config: {
version: "v1",
primaryLanguage: "en-US", // Default language
ttsConfig: {
version: "v1",
vendor: "ElevenLabs",
voiceId: "multilingual-voice-id",
model: "eleven_multilingual_v2", // Supports 29+ languages
speed: 1.0
},
features: {
version: "v1",
languageSwitching: {
version: "v1",
isEnabled: true
}
}
}
Testing Voice Configuration
Dashboard Testing
Use the built-in test widget to verify voice quality:
- Save Agent: Save your voice configuration
- Open Test Widget: Click “Test Agent” in dashboard
- Start Conversation: Begin voice interaction
- Listen Carefully: Evaluate voice quality, speed, clarity
- Iterate: Adjust settings and re-test as needed
Test your agent’s voice directly from the dashboard
Voice Synthesis API Testing
Test TTS synthesis without creating an agent:
// Use your environment's base URL (e.g., http://localhost:8080 in development) in place of the production URL below
// Synthesize speech using a public ElevenLabs voice
const response = await fetch('https://blackbox.dasha.ai/api/v1/voice/synthesize', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: "Hello world, this is a test of the voice synthesis system.",
provider: "ElevenLabs",
voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel (public voice)
model: "eleven_turbo_v2_5",
language: "en-US",
speed: 1.0
})
});
if (response.ok) {
const audioBlob = await response.blob();
const audioUrl = URL.createObjectURL(audioBlob);
const audio = new Audio(audioUrl);
audio.play();
}
What to Test
Quality Checklist:
- Voice quality, clarity, and naturalness
- Appropriate speaking speed for your content
- Pronunciation of domain-specific terms
- Latency before speech begins
- Consistent output across phrases
Best Practices
Voice Selection
- Match Your Brand: Choose voices that align with your brand identity
- Consider Audience: Select demographics-appropriate voices
- Test Multiple Options: Preview 3-5 voices before deciding
- Get Feedback: Test with representative users
- Document Choice: Note why you selected specific voices
Speed Settings
- Start at 1.0x: Use default speed as baseline
- Test Incrementally: Adjust in 0.05x increments
- Context Matters: Different content may need different speeds
- A/B Test: Compare speeds with real users
- Monitor Feedback: Track user satisfaction metrics
Provider Selection
Choose ElevenLabs if:
- You need highest voice quality
- Brand-specific voice cloning is important
- Advanced customization is required
- You can tolerate slightly higher latency
Choose Cartesia if:
- Real-time responsiveness is critical
- Emotional expression is important
- You need ultra-low latency (< 250ms)
- Conversation feels more important than absolute quality
Choose Dasha if:
- You want simple, reliable configuration
- Platform integration is a priority
- You need widest speed adjustment range
- Quick setup is important
Choose Inworld if:
- You’re building character-driven experiences
- Gaming or interactive media is your use case
- Voice expressiveness is critical
Choose LMNT if:
- You need consistent, predictable output
- Minimal configuration is desired
- Fixed speed (1.0x) works for your needs
Common Mistakes to Avoid
Avoid These Mistakes:
- Not Previewing: Always preview before deploying
- Extreme Speeds: Don’t use < 0.7x or > 1.5x without extensive testing
- Mismatched Languages: Ensure voice language matches agent language
- Over-Optimization: Don’t sacrifice quality for marginal latency gains
- Ignoring Feedback: Listen to user complaints about voice quality
- Single Voice Testing: Test multiple voices before committing
Troubleshooting
Voice Issues
Problem: Voice sounds robotic or unnatural
- Solution: Try a different voice from the same provider
- Solution: Adjust stability/temperature parameters (ElevenLabs/Inworld)
- Solution: Switch to ElevenLabs for highest quality
Problem: Speech is too fast/slow
- Solution: Adjust speed setting incrementally
- Solution: Test with representative users
- Solution: Consider accessibility needs (slower may be better)
Problem: Pronunciation errors
- Solution: Use phonetic spelling in system prompt
- Solution: Try different voice from same provider
- Solution: Consider voice cloning with correct pronunciation
Problem: High latency before speech starts
- Solution: Switch to Cartesia for lowest latency
- Solution: Use ElevenLabs with optimize_streaming_latency: 4
- Solution: Check network conditions
ASR Issues
Problem: Poor transcription accuracy
- Solution: No action needed - ASR auto-optimizes
- Solution: Prompt users to speak clearly in agent greeting
- Solution: Add confirmation/verification in conversation flow
Problem: Wrong language detected
- Solution: Ensure agent primaryLanguage matches user language
- Solution: Enable language switching if supporting multiple languages