POST /api/v1/voice/clone
Clone voice from audio
// Build the multipart body. Replace the placeholder values before running.
const form = new FormData();
form.append('Name', '<string>');
form.append('Description', '<string>');
form.append('Language', '<string>');
form.append('Provider', '<string>');
form.append('ProviderSpecific.ElevenLabs.RemoveBackgroundNoise', 'true');
form.append('ProviderSpecific.Cartesia.Mode', '<string>');
form.append('ProviderSpecific.Cartesia.Enhance', 'true');
form.append('ProviderSpecific.Cartesia.Transcript', '<string>');
form.append('Labels', '{}');

// audioFiles is a file[] field, so each sample is sent as a file part.
// audioBytes is a placeholder: replace it with the contents of the audio file.
const audioBytes = new Uint8Array();
form.append('audioFiles', new Blob([audioBytes]), 'example-file');

// Do not set Content-Type by hand; fetch adds the multipart boundary itself.
const options = {
  method: 'POST',
  headers: {Authorization: 'Bearer <token>'},
  body: form,
};

fetch('https://blackbox.dasha.ai/api/v1/voice/clone', options)
  .then(res => res.json())
  .then(res => console.log(res))
  .catch(err => console.error(err));
{
  "id": "<string>",
  "provider": "<string>",
  "category": "Cloned",
  "name": "<string>",
  "voiceId": "<string>",
  "description": "<string>",
  "language": "<string>",
  "labels": {},
  "previewUrl": "<string>",
  "createdTime": "2023-11-07T05:31:56Z",
  "lastUpdatedTime": "2023-11-07T05:31:56Z"
}

Audio Requirements

  • Format: WAV, MP3, or FLAC
  • Quality: Minimum 16 kHz sample rate (44.1 kHz recommended)
  • Duration: 30 seconds to 10 minutes of clear speech
  • Content: Clean speech without background noise or music
  • Speaker: Single speaker with consistent tone
  • Total size: Maximum 15 MB for all files combined
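
The format and size limits above can be checked on the client before uploading. The following is a minimal sketch, not part of the API: `checkAudioFiles` is a hypothetical helper, it judges the format by file extension only, and it assumes that 15 MB means 15 × 1024 × 1024 bytes. Sample rate, duration, and speech quality cannot be verified this way.

```javascript
// Limits copied from the Audio Requirements list.
const ALLOWED_EXTENSIONS = ['wav', 'mp3', 'flac'];
const MAX_TOTAL_BYTES = 15 * 1024 * 1024; // assumption: 15 MB = 15 MiB

// files: array of { name: string, size: number }; browser File objects fit this shape.
// Returns a list of human-readable problems; an empty list means no problem was found.
function checkAudioFiles(files) {
  const errors = [];
  for (const file of files) {
    const ext = file.name.split('.').pop().toLowerCase();
    if (!ALLOWED_EXTENSIONS.includes(ext)) {
      errors.push(`${file.name}: unsupported format`);
    }
  }
  const totalBytes = files.reduce((sum, file) => sum + file.size, 0);
  if (totalBytes > MAX_TOTAL_BYTES) {
    errors.push('combined size exceeds 15 MB');
  }
  return errors;
}
```

Calling `checkAudioFiles` with the files selected for upload and refusing to send the request when the returned list is non-empty avoids a round trip for uploads that are likely to be rejected.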

Voice Cloning Process

  1. Audio analysis: Extracts vocal characteristics and speech patterns
  2. Model training: Creates custom voice model
  3. Quality validation: Ensures voice meets quality standards
  4. Library storage: Voice added to organization’s library

Use Cases

  • Branded voices for customer service
  • Personalized voice assistants
  • Multilingual deployment with consistent identity
  • Executive voice replication for announcements

Body

multipart/form-data
Name
string
required

Display name for the cloned voice

Required string length: 1 - 100

Description
string
required

Description of voice characteristics and intended use

Required string length: 1 - 1000

Language
string
required

Primary language for the voice model. The cloned voice will be optimized for this language.

Provider
string
required

Voice cloning provider to use for creating the voice model. Supported providers: ElevenLabs, Cartesia, Lmnt.
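
The constraints on the four required fields can be sketched as a small validator. `validateCloneFields` is a hypothetical helper, not part of the API. It counts length with JavaScript's `String.length`, which may differ from how the server counts characters, and it only checks that `Language` is a non-empty string because the accepted language format is not stated here.

```javascript
// Provider names copied from the Provider field description.
const SUPPORTED_PROVIDERS = ['ElevenLabs', 'Cartesia', 'Lmnt'];

// Returns a list of problems with the required fields; empty means all checks passed.
function validateCloneFields({ Name, Description, Language, Provider }) {
  const errors = [];
  const lengthBetween = (value, min, max) =>
    typeof value === 'string' && value.length >= min && value.length <= max;

  if (!lengthBetween(Name, 1, 100)) {
    errors.push('Name must be 1 to 100 characters');
  }
  if (!lengthBetween(Description, 1, 1000)) {
    errors.push('Description must be 1 to 1000 characters');
  }
  if (typeof Language !== 'string' || Language.length === 0) {
    errors.push('Language is required');
  }
  if (!SUPPORTED_PROVIDERS.includes(Provider)) {
    errors.push('Provider must be one of: ' + SUPPORTED_PROVIDERS.join(', '));
  }
  return errors;
}
```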

ProviderSpecific.ElevenLabs.RemoveBackgroundNoise
boolean

Whether to remove background noise from the audio sample during cloning for cleaner voice reproduction

ProviderSpecific.Cartesia.Mode
string

Cloning mode controlling the balance between voice stability and similarity. Supported values: "Stability", "Similarity".

ProviderSpecific.Cartesia.Enhance
boolean

Whether to apply audio enhancement during cloning for improved quality

ProviderSpecific.Cartesia.Transcript
string

Text transcript of the audio sample being cloned. Helps improve cloning accuracy.
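
Since the `ProviderSpecific.*` fields are scoped to a single provider, one way to keep requests tidy is to send only the fields that belong to the selected provider. The sketch below assumes that approach; how the API treats fields for a provider other than the selected one is not documented here. `providerSpecificFields` and its option names (`removeBackgroundNoise`, `mode`, `enhance`, `transcript`) are hypothetical.

```javascript
// Mode values copied from the ProviderSpecific.Cartesia.Mode description.
const CARTESIA_MODES = ['Stability', 'Similarity'];

// Returns [fieldName, value] pairs for the selected provider only.
// Options left undefined are omitted; values are stringified for multipart/form-data.
function providerSpecificFields(provider, options = {}) {
  const fields = [];
  const add = (name, value) => {
    if (value !== undefined) fields.push([name, String(value)]);
  };

  if (provider === 'ElevenLabs') {
    add('ProviderSpecific.ElevenLabs.RemoveBackgroundNoise', options.removeBackgroundNoise);
  } else if (provider === 'Cartesia') {
    if (options.mode !== undefined && !CARTESIA_MODES.includes(options.mode)) {
      throw new Error(`Unsupported Cartesia mode: ${options.mode}`);
    }
    add('ProviderSpecific.Cartesia.Mode', options.mode);
    add('ProviderSpecific.Cartesia.Enhance', options.enhance);
    add('ProviderSpecific.Cartesia.Transcript', options.transcript);
  }
  return fields;
}
```

Each returned pair can be passed to `FormData.append`, for example `for (const [name, value] of providerSpecificFields('Cartesia', { mode: 'Similarity' })) form.append(name, value);`.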

Labels
object

Custom metadata labels for categorizing and organizing voices. Useful for filtering and searching cloned voices.
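
The request sample sends `Labels` as the string `'{}'`, which suggests the object is passed as JSON text in a form field. The sketch below follows that reading; it is an assumption, as are the label keys and values shown, and `appendLabels` is a hypothetical helper.

```javascript
// Serializes a labels object into the Labels form field as JSON text.
// Assumption: the API parses this field as JSON, as the '{}' in the request sample suggests.
function appendLabels(form, labels) {
  form.append('Labels', JSON.stringify(labels));
  return form;
}

const labelledForm = appendLabels(new FormData(), { brand: 'acme', purpose: 'support' });
console.log(labelledForm.get('Labels'));
```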

audioFiles
file[]

Audio files to clone the voice from. See Audio Requirements for the accepted formats, duration, and combined size limit.
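
A `file[]` field in multipart/form-data is commonly sent by repeating the same field name once per file. The sketch below assumes this endpoint follows that convention; `appendAudioFiles` and the shape of its `samples` argument are hypothetical.

```javascript
// samples: array of { fileName: string, bytes: Uint8Array | ArrayBuffer, mimeType: string }
// Appends one 'audioFiles' part per sample, keeping each file name.
function appendAudioFiles(form, samples) {
  for (const { fileName, bytes, mimeType } of samples) {
    form.append('audioFiles', new Blob([bytes], { type: mimeType }), fileName);
  }
  return form;
}
```

In a browser, `File` objects from an `<input type="file" multiple>` element can be appended directly with `form.append('audioFiles', file)` instead of wrapping bytes in a `Blob`.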

Response

Cloned voice created successfully

Response DTO for TTS voice cloning operations

id
string
required

Unique identifier for the voice

Minimum string length: 1

provider
string
required

TTS provider used for this voice. Supported providers: ElevenLabs, Cartesia, Lmnt.

Minimum string length: 1

category
enum<string>
required

Voice category. Public voices are provided by the TTS provider, Cloned voices are custom voices created through voice cloning.

Available options: Public, Cloned

name
string | null

Display name of the voice

voiceId
string | null

Voice ID used for synthesis

description
string | null

Description of voice characteristics

language
string | null

Primary language for the voice

labels
object

Custom metadata labels

previewUrl
string | null

URL for voice preview audio

createdTime
string<date-time> | null

Timestamp when voice was created

lastUpdatedTime
string<date-time> | null

Timestamp when voice was last updated
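
Because only `id`, `provider`, and `category` are required and most other fields may be null, client code benefits from normalizing the response before use. The following is a minimal sketch of one way to do that; `parseVoiceResponse` is a hypothetical helper, and converting the timestamps to `Date` objects is a design choice rather than something the API requires.

```javascript
// Checks the required fields and converts nullable timestamps to Date objects (or null).
function parseVoiceResponse(json) {
  for (const key of ['id', 'provider', 'category']) {
    if (typeof json[key] !== 'string' || json[key].length === 0) {
      throw new Error(`Missing required field: ${key}`);
    }
  }
  if (!['Public', 'Cloned'].includes(json.category)) {
    throw new Error(`Unexpected category: ${json.category}`);
  }
  const toDate = (value) => (value == null ? null : new Date(value));
  return {
    ...json,
    name: json.name ?? null,
    voiceId: json.voiceId ?? null,
    previewUrl: json.previewUrl ?? null,
    createdTime: toDate(json.createdTime),
    lastUpdatedTime: toDate(json.lastUpdatedTime),
  };
}
```

A caller would typically pass the parsed body from `res.json()` to this function and check `voiceId` for null before using the voice for synthesis.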