Generative Audio API: TTS, Voice Cloning & Speech Recognition
Generative Audio API Overview
Hypereal provides a comprehensive suite of audio generation APIs for text-to-speech, voice cloning, and speech recognition. All audio models are accessible through a unified API with competitive pricing.
Available Audio Models
| Model | Slug | Description | Pricing |
|---|---|---|---|
| Text to Speech | audio-tts | High-quality TTS with 64+ emotions | $0.015/1000 chars |
| Voice Clone | audio-clone | Zero-shot voice cloning | $0.015/1000 chars |
| Speech Recognition | audio-asr | Transcribe audio to text | $0.006/minute |
| Minimax Voice Clone | minimax-voice-clone | Premium voice cloning | $0.50/generation |
| Speech Turbo | minimax-speech-02 | Fast TTS with emotion control | $0.003/generation |
| Music Generation | minimax-music-02 | AI music with vocals | $0.045/song |
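
To make the pricing table above concrete, here is a rough cost estimator. The prices are copied from the table; the helper name `estimateCost` and the assumption that usage is billed strictly proportionally (no rounding up to whole 1,000-character blocks or whole minutes) are illustrative only, so treat the results as estimates rather than invoice figures.

```javascript
// Prices copied from the table above (USD).
const PRICING = {
  'audio-tts': { per: 1000, price: 0.015 },          // per 1000 characters
  'audio-clone': { per: 1000, price: 0.015 },        // per 1000 characters
  'audio-asr': { per: 1, price: 0.006 },             // per minute
  'minimax-voice-clone': { per: 1, price: 0.5 },     // per generation
  'minimax-speech-02': { per: 1, price: 0.003 },     // per generation
  'minimax-music-02': { per: 1, price: 0.045 },      // per song
};

// Hypothetical helper: estimate cost assuming strictly proportional billing.
function estimateCost(slug, quantity) {
  const entry = PRICING[slug];
  if (!entry) throw new Error(`Unknown model slug: ${slug}`);
  return (quantity / entry.per) * entry.price;
}

// A 12,000-character script through audio-tts, then 90 minutes of transcription.
console.log(estimateCost('audio-tts', 12000).toFixed(3)); // 0.180
console.log(estimateCost('audio-asr', 90).toFixed(3));    // 0.540
```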
Emotional Text-to-Speech
Emotion control is one of the most powerful features: 64+ expressions organized into four categories (basic emotions, advanced emotions, tone markers, and audio effects).
Basic Emotions (24)
Core emotional states for natural speech:
happy, sad, angry, excited, calm, nervous, confident, surprised, satisfied, delighted, scared, worried, upset, frustrated, depressed, empathetic, embarrassed, disgusted, moved, proud, relaxed, grateful, curious, sarcastic
Advanced Emotions (25)
More nuanced expressions:
disdainful, unhappy, anxious, hysterical, indifferent, uncertain, doubtful, confused, disappointed, regretful, guilty, ashamed, jealous, envious, hopeful, optimistic, pessimistic, nostalgic, lonely, bored, contemptuous, sympathetic, compassionate, determined, resigned
Tone Markers (5)
Delivery style modifiers:
- in a hurry tone - Urgent delivery
- shouting - Loud, emphatic
- screaming - Intense, high volume
- whispering - Soft, intimate
- soft tone - Gentle delivery
Audio Effects (10)
Sound effects and vocalizations:
laughing, chuckling, sobbing, crying loudly, sighing, groaning, panting, gasping, yawning, snoring
Plus special effects: audience laughter, crowd laughter, pause breaks.
Emotion Syntax
Wrap emotions in parentheses at the start of your text:
(happy) What a beautiful day!
(sad) I'm sorry for your loss.
(excited) I can't believe we won!
Combining Emotions
Stack multiple tags for complex expressions:
(sad)(whispering) I'll miss you.
(excited)(laughing) This is amazing!
(nervous)(in a hurry tone) We need to go now!
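
If you build the text programmatically, a small helper can keep the tag syntax consistent. This is a minimal sketch; `withEmotions` is a hypothetical name, not part of the Hypereal API, and it only formats the string in the way the examples above show.

```javascript
// Hypothetical helper: prefix text with one or more parenthesized tags.
function withEmotions(text, ...tags) {
  const prefix = tags
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0)   // skip empty tags
    .map((tag) => `(${tag})`)
    .join('');
  return prefix ? `${prefix} ${text}` : text;
}

console.log(withEmotions("I'll miss you.", 'sad', 'whispering'));
// (sad)(whispering) I'll miss you.
```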
API Examples
Text-to-Speech with Emotion
const response = await fetch('https://api.hypereal.com/v1/audio/generate', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'audio-tts',
    text: '(excited) Welcome to our platform! We are so happy to have you here.',
    format: 'mp3',
    temperature: 0.7
  })
});
Voice Cloning
const response = await fetch('https://api.hypereal.com/v1/audio/generate', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'audio-clone',
    text: '(confident) This is my cloned voice speaking.',
    audio: 'https://example.com/my-voice-sample.mp3',
    format: 'mp3'
  })
});
Speech Recognition
const response = await fetch('https://api.hypereal.com/v1/audio/generate', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'audio-asr',
    audio: 'https://example.com/speech-recording.mp3',
    language: 'en',
    ignore_timestamps: false
  })
});
// Response includes text and timestamps
// { text: "...", duration: 5.2, segments: [...] }
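
The three examples above share the same endpoint and headers, so the call can be wrapped once. The sketch below is one way to do that; `generateAudio` is a hypothetical name, and it assumes the API answers with JSON and signals failures with non-2xx status codes, neither of which is confirmed here. The optional `fetchImpl` parameter exists only so the function can be exercised without a network call.

```javascript
const ENDPOINT = 'https://api.hypereal.com/v1/audio/generate';

// Hypothetical wrapper around the request shape used in the examples above.
// Assumes a JSON response body and non-2xx statuses for errors.
async function generateAudio(apiKey, payload, fetchImpl = fetch) {
  const response = await fetchImpl(ENDPOINT, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(payload),
  });
  if (!response.ok) {
    // Surface the status and raw body so failures are easier to diagnose.
    const detail = await response.text();
    throw new Error(`Request failed with status ${response.status}: ${detail}`);
  }
  return response.json();
}
```

With this in place, each request reduces to its payload, for example `await generateAudio(apiKey, { model: 'audio-asr', audio: 'https://example.com/speech-recording.mp3', language: 'en' })`.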
Best Practices
Emotion Usage
- Start simple - Test with basic emotions before combining
- Match content - Align emotions with your text meaning
- Don't overuse - Avoid too many emotion tags in short text
- Test variations - Different voices may express emotions differently
Voice Cloning
- Quality reference - Use clean, noise-free audio (10-30 seconds)
- Clear speech - Reference should have clear pronunciation
- Enable enhancement - Use enhance_audio_quality: true for noisy samples
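
As a sketch of how the enhancement flag might be applied, the builder below adds it only when the reference sample is known to be noisy. `buildCloneRequest` is a hypothetical helper, and placing `enhance_audio_quality` at the top level of the request body, next to the fields from the Voice Cloning example, is an assumption rather than documented behavior.

```javascript
// Hypothetical builder for an audio-clone request body.
// Assumes enhance_audio_quality sits at the top level of the body,
// alongside the fields shown in the Voice Cloning example.
function buildCloneRequest({ text, referenceUrl, noisy = false, format = 'mp3' }) {
  const body = { model: 'audio-clone', text, audio: referenceUrl, format };
  if (noisy) body.enhance_audio_quality = true;
  return body;
}

console.log(JSON.stringify(buildCloneRequest({
  text: '(confident) This is my cloned voice speaking.',
  referenceUrl: 'https://example.com/my-voice-sample.mp3',
  noisy: true,
}), null, 2));
```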
Speech Recognition
- Specify language - Improves accuracy significantly
- Quality audio - Clear recordings produce better results
- Use timestamps - Keep them enabled (ignore_timestamps: false) for subtitle/caption generation
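
For subtitle generation, the `segments` array in the speech recognition response can be turned into an SRT file. The shape of each segment is not shown in this article, so the sketch below assumes objects of the form `{ start, end, text }` with times in seconds; adjust the field names to whatever the API actually returns. The function names are illustrative.

```javascript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Assumed segment shape: { start, end, text }, with times in seconds.
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}

const sample = [
  { start: 0, end: 2.5, text: 'Hello there.' },
  { start: 2.5, end: 5.2, text: 'Welcome to the show.' },
];
console.log(segmentsToSrt(sample));
```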
Output Formats
All TTS and voice cloning models support:
- MP3 - Best for general use (default)
- WAV - Uncompressed, best for editing
- PCM - Raw audio data
- Opus - Efficient for streaming
Supported Languages
Multi-language support including:
- English (en)
- Chinese (zh)
- Japanese (ja)
- Spanish (es)
- French (fr)
- German (de)
- And many more
Why Choose Hypereal for Audio?
- Unified API - One endpoint for TTS, cloning, and ASR
- Competitive pricing - Up to 80% cheaper than alternatives
- 64+ emotions - Industry-leading expression control
- Zero-shot cloning - Clone any voice from a short sample
- Fast processing - Optimized for low latency
Get Started Free - No credit card required.
