MiniMax Voice Maker
name: mm-voice-maker
by blue-coconut · published 2026-03-22
$ claw add gh:blue-coconut/blue-coconut-mm-voice-maker

---
name: mm-voice-maker
description: Enables voice synthesis, voice cloning, voice design, and audio post-processing using MiniMax Voice API and FFmpeg. Use when converting text to speech, creating custom voices, or processing/merging audio.
---
# MiniMax Voice Maker
Professional text-to-speech skill with emotion detection, voice cloning, and audio processing capabilities powered by MiniMax Voice API and FFmpeg.
## Capabilities
| Area | Features |
|------|----------|
| **TTS** | Sync (HTTP/WebSocket), async (long text), streaming |
| **Segment-based** | Multi-voice, multi-emotion synthesis from segments.json, auto merge |
| **Voice** | Cloning (10s–5min), design (text prompt), management |
| **Audio** | Format conversion, merge, normalize, trim, remove silence (FFmpeg) |
## File structure

```
mmVoice_Maker/
├── SKILL.md                  # This overview
├── mmvoice.py                # CLI tool (recommended for Agents)
├── check_environment.py      # Environment verification
├── requirements.txt
├── scripts/                  # Entry: scripts/__init__.py
│   ├── utils.py              # Config, data classes
│   ├── sync_tts.py           # HTTP/WebSocket TTS
│   ├── async_tts.py          # Long text TTS
│   ├── segment_tts.py        # Segment-based TTS (multi-voice, multi-emotion)
│   ├── voice_clone.py        # Voice cloning
│   ├── voice_design.py       # Voice design
│   ├── voice_management.py   # List/delete voices
│   └── audio_processing.py   # FFmpeg audio tools
└── reference/                # Load as needed
    ├── cli-guide.md          # CLI usage guide
    ├── getting-started.md    # Setup and quick test
    ├── tts-guide.md          # Sync/async TTS workflows
    ├── voice-guide.md        # Clone/design/manage
    ├── audio-guide.md        # Audio processing
    ├── script-examples.md    # Runnable code snippets
    ├── troubleshooting.md    # Common issues
    ├── api_documentation.md  # Complete API reference
    └── voice_catalog.md      # Voice selection guide
```

## Main Workflow Guideline (Text to Speech)
**6-step workflow:**

1. Verify environment
2. Process text into script → `<cwd>/audio/segments.json`. ⚠️ Before processing the text, read [voice_catalog.md](reference/voice_catalog.md) for voice selection. Step 2.4 is critical: check it twice before presenting the script to the user.
   - Step 2.5: ⚠️ Generate a preview for user confirmation (highly recommended for multi-voice content)
3. Present plan to user for confirmation
4. Validate segments.json
5. Generate and merge audio → intermediate files in `<cwd>/audio/tmp/`, final output in `<cwd>/audio/output.mp3`
6. ⚠️ **CRITICAL**: User confirms audio quality FIRST, THEN clean up temp files (only after the user is satisfied)

> `<cwd>` is Claude's current working directory (not the skill directory). Audio files are saved relative to where Claude is running commands.
### Step 1: Verify environment

```shell
python check_environment.py
```

If the API key is not set, ask the user for the key and set it:

```shell
export MINIMAX_VOICE_API_KEY="your-api-key-here"
```

### Step 2: Decision and Pre-processing
**⚠️ MOST IMPORTANT PRINCIPLE: Gender Matching First**
Before selecting voices, you MUST always match gender first. This is non-negotiable.
**Golden Rule:**
> **If a character is male → use male voice**
> **If a character is female → use female voice**
> **If a character is neutral/other → choose appropriate neutral voice**
**Why this matters:** a voice whose gender contradicts the character is immediately jarring to listeners and undermines the credibility of the entire production.
**Examples:**
| Character | Wrong Voice | Correct Voice |
|-----------|-------------|---------------|
| 唐三藏 (male monk) | `female-yujie` ❌ | `Chinese (Mandarin)_Gentleman` ✅ |
| 林黛玉 (female) | `male-qn-badao` ❌ | `female-shaonv` ✅ |
| 曹操 (male warlord) | `female-chengshu` ❌ | `Chinese (Mandarin)_Unrestrained_Young_Man` ✅ |
**Decision guide:** evaluate the content's use case to choose a segmentation and voice strategy.
**Use case scenarios:**
| Scenario | Description | Segments | Voice Selection |
|----------|-------------|----------|-----------------|
| **Single Voice** | User needs one voice for the entire content. Segment only by length (≤1,000,000 chars per segment). | Split by length only | One voice_id for all segments |
| **Multi-Voice** | Multiple characters/speakers, each with different voice. Segment by speaker/role changes. | Split by logical unit (speaker, dialogue, etc.) | Different voice_id per role |
| **Podcast/Interview** | Host and guest speakers with distinct voices. | Split by speaker | Voice per host/guest |
| **Audiobook/Fiction** | Narrator and character voices. | Split by narration vs. dialogue | Voice per narrator/character |
| **Documentary** | Mostly narration with occasional quotes. | Keep as one segment | Single narrator voice |
| **Report/Announcement** | Formal content with consistent tone. | Keep as one segment | Professional voice |
**Processing Workflow (4 sub-steps):**
**Step 2.1: Text Segmentation and Role Analysis**
First, segment your text into logical units and identify the role/character for each segment.
**Key principle (Important!): Split by logical unit, NOT simply by sentence**
**Whether to split (Important!) depends on the use case:**
| Use case | Example | Split strategy |
|----------|---------|----------------|
| **Single Voice** | Long article, news piece, announcement | Split by length (≤1,000,000 chars), same voice for all |
| **Podcast/Interview** | "Host: Welcome to the show. Guest: Thank you for having me." | Split by speaker |
| **Documentary narration** | "The scientist explained, 'The results are promising.'" | Keep as one segment (narrator voice) |
| **Audiobook/Fiction** | "'Who's there?' she whispered." | Split: "'Who's there?'" should be in character voice, while "she whispered." should be in narrator's voice |
| **Report** | "According to the report, the economy is growing." | Keep as one segment |
**Example 1: Single Voice (speech-2.8)**
For single-voice content (e.g., news, announcements, articles), segment only by length while maintaining the same voice:
```json
[
  {"text": "First part of the article (under 1,000,000 chars)...", "role": "narrator", "voice_id": "female-shaonv", "emotion": ""},
  {"text": "Second part of the article (under 1,000,000 chars)...", "role": "narrator", "voice_id": "female-shaonv", "emotion": ""},
  {"text": "Third part of the article (under 1,000,000 chars)...", "role": "narrator", "voice_id": "female-shaonv", "emotion": ""}
]
```

**Example 2: Audiobook with characters (speech-2.8)**
In audiobooks (multi-voice fiction), split when narration and dialogue mix in the same sentence:
```json
[
  {"text": "The detective entered the room.", "role": "narrator", "voice_id": "", "emotion": ""},
  {"text": "\"Who's there?\"", "role": "female_character", "voice_id": "", "emotion": ""},
  {"text": "she whispered.", "role": "narrator", "voice_id": "", "emotion": ""},
  {"text": "\"It's me,\"", "role": "male_character", "voice_id": "", "emotion": ""},
  {"text": "he replied calmly.", "role": "narrator", "voice_id": "", "emotion": ""}
]
```

**Example 3: Documentary/podcast narration (speech-2.8)**
Quoted speech in narration stays in narrator's voice (no need to split):
```json
[
  {
    "text": "The scientist explained, \"The results show significant improvement in all test groups.\"",
    "role": "narrator",
    "voice_id": "",
    "emotion": ""
  },
  {
    "text": "According to the latest report, the economy has grown by 3% this quarter.",
    "role": "narrator",
    "voice_id": "",
    "emotion": ""
  }
]
```
**Note:** In the preliminary `segments.json`:
- Fill in the `text` field with segment content
- Fill in the `role` field to identify the character (narrator, male_character, female_character, host, guest, etc.)
- Leave `voice_id` empty (to be filled in Step 2.2)
- Leave `emotion` empty for speech-2.8 models
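The preliminary file described above can be sketched in a few lines of Python. Note that `make_segment` and `save_segments` are hypothetical helpers for illustration, not part of the skill's scripts:

```python
import json
import tempfile
from pathlib import Path

def make_segment(text: str, role: str) -> dict:
    """Preliminary segment: voice_id and emotion stay empty until Steps 2.2/2.3."""
    return {"text": text, "role": role, "voice_id": "", "emotion": ""}

def save_segments(segments: list, cwd: str = ".") -> Path:
    """Write segments.json under <cwd>/audio/, creating the folder if needed."""
    out = Path(cwd) / "audio" / "segments.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(segments, ensure_ascii=False, indent=2),
                   encoding="utf-8")
    return out

# Example: the audiobook split from Example 2 above (temp dir for illustration).
segments = [
    make_segment("The detective entered the room.", "narrator"),
    make_segment("\"Who's there?\"", "female_character"),
]
path = save_segments(segments, cwd=tempfile.mkdtemp())
```

`ensure_ascii=False` keeps non-ASCII text (e.g. Chinese) readable in the file rather than escaped.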
**Step 2.2: Voice Selection**
After segmenting and labeling roles, analyze all detected characters in your text. Consult [voice_catalog.md](reference/voice_catalog.md) **Section 1 "How to Choose a Voice"** to match voices to characters.
**⚠️ CRITICAL: Follow the two-step selection process below**
**Path A — Professional domains (Story/Narration, News/Announcements, Documentary):**
If the content belongs to one of these three professional domains, prioritize selecting from the recommended voices in **voice_catalog.md Section 2.1** (filter by scenario + gender). These voices are specifically optimized for their professional use cases.
**Path B — All other scenarios:**
Select from **voice_catalog.md Section 2.2**, following this strict priority hierarchy:
1. **First: Match Gender** (non-negotiable) — Male characters MUST use male voices, female characters MUST use female voices
2. **Second: Match Language** — The voice MUST match the content language (Chinese content → Chinese voice, Korean content → Korean voice, English content → English voice, etc.). Never assign a voice from the wrong language.
3. **Third: Match Age** — Determine the age group (Children / Youth / Adult / Elderly / Professional) and select from the corresponding subsection in Section 2.2
4. **Fourth: Match Personality & Role** — Choose the best fit based on personality traits, tone, and character role
**Voice Selection Decision Tree:**

```
Is this a professional domain (Story/News/Documentary)?
├── YES → Select from voice_catalog Section 2.1 (filter by scenario + gender)
└── NO  → Select from voice_catalog Section 2.2:
    Step 1: Match Gender
      ├── Male character → Male voices only
      └── Female character → Female voices only
    Step 2: Match Language
      └── Filter to voices matching the content language
    Step 3: Match Age Group
      └── Children / Youth / Adult / Elderly / Professional
    Step 4: Match Personality & Role
      └── Choose best fit by tone, personality, character role
```
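The hard filters in this hierarchy amount to successive narrowing of the candidate pool. A minimal sketch, where the `Voice` dataclass is hypothetical, the voice_ids are borrowed from examples in this document, and the age labels are guesses for illustration only:

```python
from dataclasses import dataclass

@dataclass
class Voice:
    voice_id: str
    gender: str    # "male" / "female" / "neutral"
    language: str  # e.g. "Chinese", "English"
    age: str       # "Children" / "Youth" / "Adult" / "Elderly" / "Professional"

# Illustrative stand-in for voice_catalog.md Section 2.2 (not real catalog data).
CATALOG = [
    Voice("female-shaonv", "female", "Chinese", "Youth"),
    Voice("male-qn-badao", "male", "Chinese", "Youth"),
    Voice("Chinese (Mandarin)_Gentleman", "male", "Chinese", "Adult"),
]

def candidates(gender: str, language: str, age: str) -> list:
    """Apply the hard filters in priority order; personality/role (Step 4)
    is the final human judgment over whatever remains."""
    pool = [v for v in CATALOG if v.gender == gender]    # 1. gender (non-negotiable)
    pool = [v for v in pool if v.language == language]   # 2. language
    pool = [v for v in pool if v.age == age]             # 3. age group
    return pool

print([v.voice_id for v in candidates("male", "Chinese", "Adult")])
# → ['Chinese (Mandarin)_Gentleman']
```

Because gender is applied first, a male character can never end up with a female voice no matter how well later criteria match.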
**Step 2.3: Emotions Segmentation** *(For non-2.8 series models only)*
For models other than speech-2.8 series, analyze emotions in your segments:
- For **long segments**, split further based on **emotional transitions**
- Add appropriate **emotion tags** to each segment
- Refer to Section 3 in [text-processing.md](reference/text-processing.md) for emotion tags and examples
- Skip this step for speech-2.8 models (emotion is auto-matched)
**Emotion Tags:**
- For speech-2.6 series (speech-2.6-hd and speech-2.6-turbo): happy, sad, angry, fearful, disgusted, surprised, calm, fluent, whisper
- For older models: happy, sad, angry, fearful, disgusted, surprised, calm (7 emotions)
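The per-model constraints above can be captured in a small check. This sketch assumes speech-2.8 segments leave the tag empty for auto-matching; `emotion_ok` is a hypothetical helper, not the skill's validator:

```python
# Emotion tags per model series, as listed above.
EMOTIONS_2_6 = {"happy", "sad", "angry", "fearful", "disgusted",
                "surprised", "calm", "fluent", "whisper"}
EMOTIONS_OLDER = EMOTIONS_2_6 - {"fluent", "whisper"}  # 7 emotions

def emotion_ok(model: str, emotion: str) -> bool:
    """Return True if `emotion` is valid for `model`. Sketch assumption:
    2.8-series models require an empty tag (emotion is auto-matched)."""
    if model.startswith("speech-2.8"):
        return emotion == ""
    if model.startswith("speech-2.6"):
        return emotion in EMOTIONS_2_6
    return emotion in EMOTIONS_OLDER
```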
**Step 2.4: Check and Post-processing**
Finally, review and optimize your script:
- Verify segment length limits (async TTS ≤1,000,000 characters)
- Clean up conversational text (remove speaker names if needed)
- Ensure consistency in voice and emotion tags
- **Critical check for multi-voice content**: For audiobooks, multi-voice fiction, or content where dialogue is presented from a first-person perspective, verify that narration and dialogue mixed in the same sentence are properly split.
**When splitting IS needed (first-person dialogue in fiction/audiobooks):**
Example: `"John asked, 'Where are you going?'"` should be split into:
- Segment 1: `"John asked, "` - uses narrator voice (describes who is speaking)
- Segment 2: `"Where are you going?"` - uses the character's voice (actual dialogue in first-person)
This ensures proper voice differentiation: descriptive narration uses the narrator's voice, while the character's spoken words use the character's designated voice.
**When splitting is NOT needed (third-person quotes in podcast/documentary/news):**
In podcasts, documentaries, or news reports, quoted speech is typically presented in third-person narrative style - the speaker's words are being reported, not performed. Keep these as one segment with the narrator's voice and remove the speaker's name at the beginning:
- `"Welcome to our show."` → narrator voice, remove the speaker's name (like "The host said:") at the beginning
- `"According to experts, 'This technology represents a significant breakthrough.'"` → keep as one segment (narrator voice)
- `"Scientists noted, 'The experimental results exceeded our expectations.'"` → keep as one segment (narrator voice)
- **If the split is missing**: Go back to Step 2.1 and ensure dialogue portions are separated from narration with appropriate role labels.
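The first-person split described in Step 2.4 can be approximated with a regex. This is an illustrative sketch only: it handles the `Tag, 'quote'` order shown above, not quote-first or interleaved forms, and `split_dialogue` is a hypothetical helper:

```python
import re

def split_dialogue(sentence: str, character_role: str) -> list:
    """Split e.g. "John asked, 'Where are you going?'" into a narrator
    segment (the speech tag) and a character segment (the quote)."""
    m = re.match(r'^(.*?,\s*)[\'"](.+?)[\'"]\s*$', sentence)
    if not m:
        # No trailing quote found: treat the whole sentence as narration.
        return [{"text": sentence, "role": "narrator"}]
    tag, quote = m.groups()
    return [
        {"text": tag.strip(), "role": "narrator"},
        {"text": quote, "role": character_role},
    ]

print(split_dialogue("John asked, 'Where are you going?'", "male_character"))
```

Real fiction needs more careful handling (nested quotes, mid-sentence tags), so treat this as a starting point, not a complete splitter.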
**Create segments.json:**
After completing all 4 sub-steps, save the final `segments.json` to `<cwd>/audio/segments.json`.
### Step 2.5: Generate Preview for User Confirmation (Highly Recommended)
**For multi-voice content (audiobooks, dramas, etc.), always generate a preview first.**
This saves time and prevents waste when voice selections need adjustment.
**How to generate a preview:**
1. Create a smaller segments file with 10-20 representative segments (include all characters)
2. Generate the preview audio
3. Ask user to listen and confirm voice choices
**Preview segments.json example:**

```json
[
  {"text": "Narration opening...", "role": "narrator", "voice_id": "...", "emotion": ""},
  {"text": "Male character speaks...", "role": "male_character", "voice_id": "...", "emotion": ""},
  {"text": "Female character speaks...", "role": "female_character", "voice_id": "...", "emotion": ""},
  {"text": "More dialogue...", "role": "...", "voice_id": "...", "emotion": ""}
]
```
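One way to pick the 10-20 representative segments is to guarantee role coverage first, then top up with early segments. A sketch with a hypothetical `preview_segments` helper (the cap of 20 is this example's choice):

```python
def preview_segments(segments: list, cap: int = 20) -> list:
    """Select up to `cap` segments such that every role appears at least once."""
    seen_roles = set()
    picked = []
    # First pass: one segment per role, so every voice is heard in the preview.
    for seg in segments:
        if seg["role"] not in seen_roles:
            seen_roles.add(seg["role"])
            picked.append(seg)
    # Second pass: top up with early segments until the cap is reached.
    for seg in segments:
        if len(picked) >= cap:
            break
        if seg not in picked:
            picked.append(seg)
    return picked[:cap]
```

Write the result to a separate file (e.g. `segments_preview.json`) so the full `segments.json` stays untouched.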
**Preview command:**

```shell
python mmvoice.py generate segments_preview.json -o preview.mp3
```
**When user confirms preview:**
- Use the same voice selections for the full segments.json
- No need to re-select voices
---
### Step 3: Present plan to user for confirmation
Before proceeding to validation and generation, present the segmentation plan to the user and wait for confirmation:
**Present to the user:**
- **Roles identified**: List all characters/speakers in the text
- **Voice assignments**: Show which voice_id is assigned to each role (include voice characteristics from voice_catalog.md)
- **Model being used**: Explain why this model was selected
- **Language**: Confirm the primary language of the content
- **Emotion approach**: Auto-matched (speech-2.8) or manual tags (older models)
**Example confirmation message:**

```
I've analyzed the text and created a segmentation plan:

**Roles and Voices:**
**Model:** speech-2.8-hd (recommended - automatic emotion matching)
**Language:** Chinese
**Segments:** 8 segments total

Please review and confirm:
1. ⚠️ **Gender Verification**: Do the voice genders match the character genders?
   - [Narrator: Male ✓] [Protagonist: Female ✓] [Antagonist: Male ✓]
2. ⚠️ **Language Verification**: Do the voice languages match the content language?
   - [All voices: Chinese ✓]
3. Are the voice assignments appropriate for each character (age, personality)?
4. Should any segments be combined or split differently?
5. Any other changes you'd like to make?

Reply "confirm" to proceed, or let me know what to adjust.
```
**Wait for user response:**
- If user confirms → Proceed to Step 4 (validate)
- If user suggests changes → Update `segments.json` and present the plan again for confirmation
### Step 4: Validate segments.json (model, emotion, voice_id validation)
Before generating audio, validate the segments file:
```shell
# Default: speech-2.8-hd (auto emotion matching)
python mmvoice.py validate <cwd>/audio/segments.json

# Specify model for context-specific validation
python mmvoice.py validate <cwd>/audio/segments.json --model speech-2.6-hd

# Validate voice_ids against available voices (slower, requires API call)
python mmvoice.py validate <cwd>/audio/segments.json --validate-voices

# Combined options (recommended)
python mmvoice.py validate <cwd>/audio/segments.json --model speech-2.6-hd --validate-voices

# Use --verbose to see segment details
python mmvoice.py validate <cwd>/audio/segments.json --model speech-2.6-hd --validate-voices --verbose
```
**Emotion Validation checks:**
| Model | Emotion Validation |
|-------|-------------------|
| **speech-2.8-hd/turbo** | Emotion can be empty (auto emotion matching) |
| **speech-2.6-hd/turbo** | All 9 emotions supported |
| **Older models** | happy, sad, angry, fearful, disgusted, surprised, calm (7 emotions) |
**Voice ID validation (with `--validate-voices`):**
- Calls the API once to get all available voices
- Validates each voice_id against the list
- Reports invalid voice_ids as errors, which cause validation to fail
### Step 5: Generate and merge audio
Generate audio for all segments and merge into final output.
**File placement (default behavior if user doesn't specify):**
```
<cwd>/                           # Claude's current working directory
└── audio/                       # Created automatically
    ├── tmp/                     # Intermediate segment files
    │   ├── segment_0000.mp3
    │   ├── segment_0001.mp3
    │   └── ...
    └── <custom_audio_name>.mp3  # Final merged audio; name can be customized
```

Here `<cwd>` is Claude's current working directory (where commands are executed).
- If `-o` is not specified, output goes to `<cwd>/audio/output.mp3`
- Intermediate files go to `<cwd>/audio/tmp/`
- After user confirms the final audio, ask whether to delete `<cwd>/audio/tmp/`
**Basic usage:**

```shell
# Default: speech-2.8-hd, output to <cwd>/audio/output.mp3
python mmvoice.py generate <cwd>/audio/segments.json

# Specify output path
python mmvoice.py generate <cwd>/audio/segments.json -o <cwd>/audio/<custom_audio_name>.mp3

# Specify model if needed
python mmvoice.py generate <cwd>/audio/segments.json --model speech-2.6-hd
```

**Skip existing segments (for rate limit retries):**

```shell
# Only generate segments that don't exist yet - skips already-generated files
python mmvoice.py generate <cwd>/audio/segments.json --skip-existing
```
**Error handling:**
- If a segment fails, the script reports which segment and why
- Use `--continue-on-error` to generate remaining segments despite failures
- Use `--skip-existing` to skip already successfully generated segments (recommended for retries after rate limit)
- The script automatically uses fallback merging if FFmpeg filter_complex fails
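The retry guidance above can be wrapped in a small loop around the CLI. A sketch using the `--skip-existing` and `--continue-on-error` flags documented in this section; the retry count and backoff are arbitrary choices, and `generate_with_retries` is a hypothetical helper:

```python
import subprocess
import sys
import time

def generate_with_retries(segments_path: str, retries: int = 3,
                          wait_s: float = 30.0) -> bool:
    """Re-run generation until it succeeds or retries are exhausted.
    --skip-existing makes each retry pick up only the failed segments."""
    cmd = [sys.executable, "mmvoice.py", "generate", segments_path,
           "--skip-existing", "--continue-on-error"]
    for _ in range(retries):
        if subprocess.run(cmd).returncode == 0:
            return True
        time.sleep(wait_s)  # back off before retrying (e.g. after a rate limit)
    return False
```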
### Step 6: Confirm and cleanup
**⚠️ CRITICAL: Never delete temp files until user confirms!**
After generation completes, you MUST follow this exact sequence:
**Step 6.1: Report generation result to user**

```
✓ Audio saved to: <output_path>
Generated: X/Y segments
Intermediate files in: <cwd>/audio/tmp/
```
**Step 6.2: Ask user to confirm audio quality**
Ask the user to listen to the audio and confirm:
1. Is the audio quality satisfactory?
2. Are all voices appropriate?
3. Any adjustments needed?
**Step 6.3: Wait for user response**
**Step 6.4: Only after user confirms, offer cleanup**

After confirming audio quality, temporary files can be deleted with:

```shell
rm -rf <cwd>/audio/tmp/
```