Generative Audio API: TTS, Voice Cloning & Speech Recognition

Generative Audio API Overview

Hypereal provides a comprehensive suite of audio generation APIs for text-to-speech, voice cloning, and speech recognition. All audio models are accessible through a unified API with competitive pricing.

Available Audio Models

Model	Slug	Description	Pricing
Text to Speech	`audio-tts`	High-quality TTS with 64+ emotions	$0.015/1000 chars
Voice Clone	`audio-clone`	Zero-shot voice cloning	$0.015/1000 chars
Speech Recognition	`audio-asr`	Transcribe audio to text	$0.006/minute
Minimax Voice Clone	`minimax-voice-clone`	Premium voice cloning	$0.50/generation
Speech Turbo	`minimax-speech-02`	Fast TTS with emotion control	$0.003/generation
Music Generation	`minimax-music-02`	AI music with vocals	$0.045/song
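
To see how the per-unit prices in the table add up, the sketch below estimates cost for a given amount of usage. It assumes billing is strictly proportional to usage, with no rounding and no minimum charge, which the table does not confirm. `PRICING` and `estimateCost` are illustrative names, not part of the Hypereal API.

```javascript
// Prices copied from the table above, in USD.
// Assumption: cost scales linearly with usage (no rounding, no minimums).
const PRICING = {
  'audio-tts': { unit: 'chars', usdPerUnit: 0.015 / 1000 },
  'audio-clone': { unit: 'chars', usdPerUnit: 0.015 / 1000 },
  'audio-asr': { unit: 'minutes', usdPerUnit: 0.006 },
  'minimax-voice-clone': { unit: 'generations', usdPerUnit: 0.5 },
  'minimax-speech-02': { unit: 'generations', usdPerUnit: 0.003 },
  'minimax-music-02': { unit: 'songs', usdPerUnit: 0.045 },
};

// Hypothetical helper: estimate the cost of `quantity` units for a model slug.
function estimateCost(slug, quantity) {
  if (!Object.hasOwn(PRICING, slug)) {
    throw new Error(`Unknown model slug: ${slug}`);
  }
  if (!Number.isFinite(quantity) || quantity < 0) {
    throw new Error('quantity must be a non-negative number');
  }
  return quantity * PRICING[slug].usdPerUnit;
}

// A 5,000-character script through audio-tts,
// and a 30-minute recording through audio-asr.
console.log(estimateCost('audio-tts', 5000).toFixed(3)); // "0.075"
console.log(estimateCost('audio-asr', 30).toFixed(3));   // "0.180"
```

The per-generation models are priced per call, so their estimates depend on the number of requests rather than the length of the text.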

Emotional Text-to-Speech

The TTS models support emotion control through 64+ expressions, organized into four categories: basic emotions, advanced emotions, tone markers, and audio effects.

Basic Emotions (24)

Core emotional states for natural speech:

happy, sad, angry, excited, calm, nervous, confident, surprised, satisfied, delighted, scared, worried, upset, frustrated, depressed, empathetic, embarrassed, disgusted, moved, proud, relaxed, grateful, curious, sarcastic

Advanced Emotions (25)

More nuanced expressions:

disdainful, unhappy, anxious, hysterical, indifferent, uncertain, doubtful, confused, disappointed, regretful, guilty, ashamed, jealous, envious, hopeful, optimistic, pessimistic, nostalgic, lonely, bored, contemptuous, sympathetic, compassionate, determined, resigned

Tone Markers (5)

Delivery style modifiers:

in a hurry tone - Urgent delivery
shouting - Loud, emphatic
screaming - Intense, high volume
whispering - Soft, intimate
soft tone - Gentle delivery

Audio Effects (10)

Sound effects and vocalizations:

laughing, chuckling, sobbing, crying loudly, sighing, groaning, panting, gasping, yawning, snoring

Special effects are also available: audience laughter, crowd laughter, and pause breaks.
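
The four categories can be collected into a lookup so that tags are checked on the client before a request is sent. This is only a sketch: `EMOTION_TAGS` and `categoryOf` are hypothetical names, the special effects are left out because their exact tag spelling is not given here, and the API itself may accept or reject tags differently.

```javascript
// Tag lists copied from the four categories above (24 + 25 + 5 + 10 = 64).
const EMOTION_TAGS = {
  basic: [
    'happy', 'sad', 'angry', 'excited', 'calm', 'nervous', 'confident',
    'surprised', 'satisfied', 'delighted', 'scared', 'worried', 'upset',
    'frustrated', 'depressed', 'empathetic', 'embarrassed', 'disgusted',
    'moved', 'proud', 'relaxed', 'grateful', 'curious', 'sarcastic',
  ],
  advanced: [
    'disdainful', 'unhappy', 'anxious', 'hysterical', 'indifferent',
    'uncertain', 'doubtful', 'confused', 'disappointed', 'regretful',
    'guilty', 'ashamed', 'jealous', 'envious', 'hopeful', 'optimistic',
    'pessimistic', 'nostalgic', 'lonely', 'bored', 'contemptuous',
    'sympathetic', 'compassionate', 'determined', 'resigned',
  ],
  tone: ['in a hurry tone', 'shouting', 'screaming', 'whispering', 'soft tone'],
  effect: [
    'laughing', 'chuckling', 'sobbing', 'crying loudly', 'sighing',
    'groaning', 'panting', 'gasping', 'yawning', 'snoring',
  ],
};

// Hypothetical helper: return the category of a tag, or null if it is not listed.
// Matching ignores case and surrounding whitespace (a client-side convenience).
function categoryOf(tag) {
  const normalized = tag.trim().toLowerCase();
  for (const [category, tags] of Object.entries(EMOTION_TAGS)) {
    if (tags.includes(normalized)) return category;
  }
  return null;
}

console.log(categoryOf('whispering')); // "tone"
console.log(categoryOf('Sarcastic'));  // "basic"
console.log(categoryOf('sleepy'));     // null
```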

Emotion Syntax

Wrap emotions in parentheses at the start of your text:

(happy) What a beautiful day!
(sad) I'm sorry for your loss.
(excited) I can't believe we won!

Combining Emotions

Stack multiple tags for complex expressions:

(sad)(whispering) I'll miss you.
(excited)(laughing) This is amazing!
(nervous)(in a hurry tone) We need to go now!
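
Building the prefix programmatically avoids mismatched parentheses when tags come from user input or configuration. The helpers below are a sketch that only formats and reads strings in the shape shown above; `withTags` and `leadingTags` are illustrative names, not API functions.

```javascript
// Hypothetical helper: prefix `text` with one "(tag)" group per entry in `tags`.
function withTags(tags, text) {
  const prefix = tags
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0)
    .map((tag) => `(${tag})`)
    .join('');
  return prefix ? `${prefix} ${text}` : text;
}

// Hypothetical helper: read back the tags stacked at the start of a string.
function leadingTags(text) {
  const match = text.match(/^\s*((?:\([^()]+\)\s*)+)/);
  if (!match) return [];
  return [...match[1].matchAll(/\(([^()]+)\)/g)].map((m) => m[1].trim());
}

console.log(withTags(['sad', 'whispering'], "I'll miss you."));
// (sad)(whispering) I'll miss you.
console.log(leadingTags('(nervous)(in a hurry tone) We need to go now!'));
// [ 'nervous', 'in a hurry tone' ]
```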

API Examples

Text-to-Speech with Emotion

const response = await fetch('https://api.hypereal.com/v1/audio/generate', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'audio-tts',
    text: '(excited) Welcome to our platform! We are so happy to have you here.',
    format: 'mp3',
    temperature: 0.7
  })
});

Voice Cloning

const response = await fetch('https://api.hypereal.com/v1/audio/generate', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'audio-clone',
    text: '(confident) This is my cloned voice speaking.',
    audio: 'https://example.com/my-voice-sample.mp3',
    format: 'mp3'
  })
});

Speech Recognition

const response = await fetch('https://api.hypereal.com/v1/audio/generate', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'audio-asr',
    audio: 'https://example.com/speech-recording.mp3',
    language: 'en',
    ignore_timestamps: false
  })
});

// Response includes text and timestamps
// { text: "...", duration: 5.2, segments: [...] }
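
The three examples above share the same endpoint, method, and headers, so the common parts can be factored into a small wrapper. The sketch below assumes that failures are signalled through the HTTP status code and that the API key lives in an environment variable named `HYPEREAL_API_KEY`; both are assumptions, and `buildAudioRequest` and `callAudioApi` are illustrative names.

```javascript
const ENDPOINT = 'https://api.hypereal.com/v1/audio/generate';

// Build the fetch arguments shared by the three examples above.
function buildAudioRequest(apiKey, body) {
  return [
    ENDPOINT,
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    },
  ];
}

// Send the request and fail loudly on a non-2xx status.
// Assumption: errors are reported via the HTTP status code.
async function callAudioApi(apiKey, body) {
  const response = await fetch(...buildAudioRequest(apiKey, body));
  if (!response.ok) {
    throw new Error(`Audio API request failed with HTTP ${response.status}`);
  }
  return response;
}

// Usage (not executed here): transcribe a recording and read the JSON
// fields mentioned in the comment above.
// const res = await callAudioApi(process.env.HYPEREAL_API_KEY, {
//   model: 'audio-asr',
//   audio: 'https://example.com/speech-recording.mp3',
//   language: 'en',
// });
// const { text, segments } = await res.json();
```

Returning the raw `Response` leaves it to the caller to decide how to read the body, since the shape of the TTS and cloning responses is not described here.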

Best Practices

Emotion Usage

Start simple - Test with basic emotions before combining
Match content - Align emotions with your text meaning
Don't overuse - Avoid too many emotion tags in short text
Test variations - Different voices may express emotions differently

Voice Cloning

Quality reference - Use clean, noise-free audio (10-30 seconds)
Clear speech - Reference should have clear pronunciation
Enable enhancement - Use enhance_audio_quality: true for noisy samples
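
As a sketch of the enhancement tip, the helper below builds an `audio-clone` request body and turns on `enhance_audio_quality` only for noisy references. It assumes the flag is a top-level boolean alongside `format`, which is not shown in the cloning example; `buildCloneBody` and its option names are hypothetical.

```javascript
// Hypothetical helper: build an `audio-clone` request body.
// Assumption: `enhance_audio_quality` is a top-level boolean field, like `format`.
function buildCloneBody({ text, referenceUrl, noisyReference = false, format = 'mp3' }) {
  const body = {
    model: 'audio-clone',
    text,
    audio: referenceUrl,
    format,
  };
  // Only request enhancement when the reference sample is known to be noisy.
  if (noisyReference) {
    body.enhance_audio_quality = true;
  }
  return body;
}

console.log(buildCloneBody({
  text: '(confident) This is my cloned voice speaking.',
  referenceUrl: 'https://example.com/my-voice-sample.mp3',
  noisyReference: true,
}));
```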

Speech Recognition

Specify language - Improves accuracy significantly
Quality audio - Clear recordings produce better results
Use timestamps - Enable for subtitle/caption generation
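
For subtitle generation, the timestamped segments can be converted to SRT on the client. The sketch below assumes each entry in `segments` has `start` and `end` times in seconds and a `text` string; the response example above does not spell out the segment shape, so adjust the field names to match the real payload. `toSrtTime` and `segmentsToSrt` are illustrative names.

```javascript
// Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  const pad = (n, width) => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Assumption: each segment looks like { start: 0.0, end: 2.5, text: "..." }.
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}

const example = [
  { start: 0, end: 2.5, text: 'Hello there.' },
  { start: 2.5, end: 5.2, text: 'Welcome back.' },
];
console.log(segmentsToSrt(example));
// 1
// 00:00:00,000 --> 00:00:02,500
// Hello there.
//
// 2
// 00:00:02,500 --> 00:00:05,200
// Welcome back.
```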

Output Formats

All TTS and voice cloning models support the following output formats:

MP3 - Best for general use (default)
WAV - Uncompressed, best for editing
PCM - Raw audio data
Opus - Efficient for streaming
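
When saving or serving the generated audio, each format needs a matching file extension and content type. The mapping below is a sketch under several assumptions: only `'mp3'` appears as a `format` value in the examples, so the lowercase names `'wav'`, `'pcm'`, and `'opus'` are guesses; Opus output is assumed to arrive in an Ogg container; and raw PCM carries no header, so its sample rate and bit depth must be known separately. `FORMAT_INFO` and `outputFileName` are hypothetical names.

```javascript
// Assumed `format` values and typical file metadata for each.
const FORMAT_INFO = {
  mp3: { extension: '.mp3', mimeType: 'audio/mpeg' },
  wav: { extension: '.wav', mimeType: 'audio/wav' },
  // Raw PCM has no container, so a generic binary type is used here.
  pcm: { extension: '.pcm', mimeType: 'application/octet-stream' },
  // Assumption: Opus is delivered inside an Ogg container.
  opus: { extension: '.opus', mimeType: 'audio/ogg; codecs=opus' },
};

// Hypothetical helper: pick a file name for a given output format.
function outputFileName(baseName, format = 'mp3') {
  const key = format.toLowerCase();
  if (!Object.hasOwn(FORMAT_INFO, key)) {
    throw new Error(`Unsupported format: ${format}`);
  }
  return `${baseName}${FORMAT_INFO[key].extension}`;
}

console.log(outputFileName('welcome'));        // "welcome.mp3"
console.log(outputFileName('welcome', 'WAV')); // "welcome.wav"
```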

Supported Languages

Supported languages include:

English (en)
Chinese (zh)
Japanese (ja)
Spanish (es)
French (fr)
German (de)
And many more

Why Choose Hypereal for Audio?

Unified API - One endpoint for TTS, cloning, and ASR
Competitive pricing - Up to 80% cheaper than alternatives
64+ emotions - Industry-leading expression control
Zero-shot cloning - Clone any voice from a short sample
Fast processing - Optimized for low latency

Get Started Free - No credit card required.
