// Skill profile

Voice Assistant

name: voice-assistant

by charantejmandali18 · published 2026-03-22

Data processing · API integration
Total installs: 0 · Stars: ★ 0 · Last updated: 2026-03
// Install command
$ claw add gh:charantejmandali18/charantejmandali18-voice-assistant
View on GitHub
// Full documentation

---

name: voice-assistant
description: Real-time voice assistant for OpenClaw. Streams mic audio through configurable STT (Deepgram or ElevenLabs) into your OpenClaw agent, then speaks the response via configurable TTS (Deepgram Aura or ElevenLabs). Sub-2s time-to-first-audio with full streaming at every stage.
metadata:
  openclaw:
    emoji: "🎙️"
    requires:
      bins: ["uv"]
      env: []
    primaryEnv: "VOICE_STT_PROVIDER"
    install:
      - id: "uv"
        kind: "brew"
        formula: "uv"
        bins: ["uv"]
        label: "Install uv (brew)"
---

# Voice Assistant

Real-time voice interface for your OpenClaw agent. Talk to your agent and hear it respond — with configurable STT and TTS providers, full streaming at every stage, and sub-2 second time-to-first-audio.

## Architecture

```
Browser Mic → WebSocket → STT (Deepgram / ElevenLabs) → Text
  → OpenClaw Gateway (/v1/chat/completions, streaming) → Response Text
  → TTS (Deepgram Aura / ElevenLabs) → Audio chunks → Browser Speaker
```

The voice interface connects to your running OpenClaw gateway's OpenAI-compatible endpoint. It's the same agent with all its context, tools, and memory — just with a voice.
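For illustration, a minimal sketch of that call using `httpx` and the variables documented under Configuration below; the helper name `stream_reply` is an assumption, not part of the skill:

```python
# Minimal sketch: stream one turn through the gateway's
# OpenAI-compatible /chat/completions endpoint (SSE).
import json
import os

import httpx

GATEWAY = os.environ["OPENCLAW_GATEWAY_URL"]  # e.g. http://localhost:4141/v1
MODEL = os.environ["OPENCLAW_MODEL"]

def stream_reply(user_text: str):
    """Yield response tokens as the gateway streams them."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "stream": True,
    }
    with httpx.stream("POST", f"{GATEWAY}/chat/completions",
                      json=payload, timeout=None) as resp:
        for line in resp.iter_lines():
            if not line.startswith("data: "):
                continue  # skip keep-alives and blank lines
            data = line[len("data: "):]
            if data.strip() == "[DONE]":
                break
            delta = json.loads(data)["choices"][0].get("delta", {})
            if delta.get("content"):
                yield delta["content"]
```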

## Quick Start

```bash
cd {baseDir}
cp .env.example .env
# Fill in your API keys and gateway URL
uv run scripts/server.py
# Open http://localhost:7860 and click the mic
```

## Supported Providers

### STT (Speech-to-Text)

| Provider   | Model              | Latency    | Notes                                     |
|------------|--------------------|------------|-------------------------------------------|
| Deepgram   | nova-2 (streaming) | ~200-300ms | WebSocket streaming, best accuracy/speed  |
| ElevenLabs | Scribe v1          | ~300-500ms | REST-based, good multilingual             |

### TTS (Text-to-Speech)

| Provider   | Model      | Latency | Notes                         |
|------------|------------|---------|-------------------------------|
| Deepgram   | aura-2     | ~200ms  | WebSocket streaming, low cost |
| ElevenLabs | Turbo v2.5 | ~300ms  | Best voice quality, streaming |
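Both providers stream audio in chunks rather than returning a single file. For illustration, a hedged sketch of a streaming request against ElevenLabs' public text-to-speech endpoint (the voice ID and chunk handling here are simplified assumptions, not the skill's exact code):

```python
# Hedged sketch: fetch streaming audio for one sentence from
# ElevenLabs Turbo v2.5. The real server forwards these chunks
# straight to the browser over the existing WebSocket.
import os

import httpx

def tts_stream(text: str, voice_id: str):
    """Yield audio chunks as ElevenLabs produces them."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream"
    headers = {"xi-api-key": os.environ["ELEVENLABS_API_KEY"]}
    body = {"text": text, "model_id": "eleven_turbo_v2_5"}
    with httpx.stream("POST", url, headers=headers, json=body,
                      timeout=None) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_bytes():
            yield chunk
```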

## Configuration

All configuration is via environment variables in `.env`:

```bash
# === Required ===
OPENCLAW_GATEWAY_URL=http://localhost:4141/v1    # Your OpenClaw gateway
OPENCLAW_MODEL=claude-sonnet-4-5-20250929        # Model your gateway routes to

# === STT Provider (pick one) ===
VOICE_STT_PROVIDER=deepgram                      # "deepgram" or "elevenlabs"
DEEPGRAM_API_KEY=your-key-here                   # Required if STT=deepgram
ELEVENLABS_API_KEY=your-key-here                 # Required if STT=elevenlabs

# === TTS Provider (pick one) ===
VOICE_TTS_PROVIDER=elevenlabs                    # "deepgram" or "elevenlabs"
# Uses the same API keys as above

# === Optional Tuning ===
VOICE_TTS_VOICE=rachel                           # ElevenLabs voice name/ID
VOICE_TTS_VOICE_DG=aura-2-theia-en               # Deepgram Aura voice
VOICE_VAD_SILENCE_MS=400                         # Silence before end-of-turn (ms)
VOICE_SAMPLE_RATE=16000                          # Audio sample rate
VOICE_SERVER_PORT=7860                           # Server port
VOICE_SYSTEM_PROMPT=""                           # Optional system prompt override
```
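As a rough illustration of how these variables gate startup, a hedged sketch of provider validation (the real wiring lives in `scripts/server.py`; the names below are illustrative):

```python
# Illustrative startup check: fail fast if the chosen provider
# has no matching API key in the environment.
import os

STT = os.getenv("VOICE_STT_PROVIDER", "deepgram")
TTS = os.getenv("VOICE_TTS_PROVIDER", "elevenlabs")

KEY_FOR = {"deepgram": "DEEPGRAM_API_KEY", "elevenlabs": "ELEVENLABS_API_KEY"}

for role, provider in (("STT", STT), ("TTS", TTS)):
    if provider not in KEY_FOR:
        raise SystemExit(f"{role}: unknown provider {provider!r}")
    if not os.getenv(KEY_FOR[provider]):
        raise SystemExit(f"{role}: set {KEY_FOR[provider]} in .env")
```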

## Provider Combinations

| Setup                           | Best For                      |
|---------------------------------|-------------------------------|
| Deepgram STT + ElevenLabs TTS   | Best quality voice output     |
| Deepgram STT + Deepgram TTS     | Lowest latency, single vendor |
| ElevenLabs STT + ElevenLabs TTS | Best multilingual support     |

## How It Works

1. **Browser captures mic audio** via Web Audio API and streams raw PCM over a WebSocket

2. **Server receives audio** and pipes it to the configured STT provider's streaming endpoint

3. **STT returns partial transcripts** in real-time; on end-of-utterance the full text is sent to the OpenClaw gateway

4. **OpenClaw gateway streams** the LLM response token-by-token via SSE (Server-Sent Events)

5. **Tokens are accumulated** into sentence-sized chunks and streamed to the TTS provider (see the chunking sketch after this list)

6. **TTS returns audio chunks** that are immediately forwarded to the browser over the same WebSocket

7. **Browser plays audio** using the Web Audio API with a jitter buffer for smooth playback
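Steps 5-6 are where most of the perceived latency is saved. A minimal sketch of one way to do the sentence-sized chunking (illustrative, not the skill's exact code):

```python
# Illustrative: accumulate streamed tokens into sentence-sized
# chunks so TTS can start before the full reply has arrived.
import re

SENTENCE_END = re.compile(r"[.!?]\s")

def sentences(token_stream):
    """Yield sentence-sized chunks from an incremental token stream."""
    buf = ""
    for token in token_stream:
        buf += token
        while (m := SENTENCE_END.search(buf)):
            yield buf[: m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():              # flush the trailing fragment
        yield buf.strip()
```

Each chunk yielded here can be handed to the TTS provider immediately, so speech begins after the first sentence rather than after the full reply.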

## Interruption Handling (Barge-In)

When the user starts speaking while the agent is still talking (see the sketch below):

- Current TTS audio is immediately cancelled
- The agent stops its current response
- A new STT session begins capturing the user's interruption
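A hedged sketch of that cancellation logic with asyncio (class and method names are illustrative, not the skill's API):

```python
# Illustrative barge-in: cancel the in-flight speak task the
# moment VAD reports fresh user speech.
import asyncio

class VoiceTurn:
    def __init__(self):
        self.speak_task = None  # asyncio.Task for the current TTS playback

    def start_speaking(self, audio_chunks, send):
        """Kick off streaming the agent's audio to the browser."""
        self.speak_task = asyncio.ensure_future(self._speak(audio_chunks, send))

    async def _speak(self, audio_chunks, send):
        try:
            async for chunk in audio_chunks:   # TTS audio stream
                await send(chunk)              # forward to the browser
        except asyncio.CancelledError:
            pass                               # barge-in: drop remaining audio

    def on_user_speech(self):
        """Called when VAD detects the user talking mid-response."""
        if self.speak_task and not self.speak_task.done():
            self.speak_task.cancel()           # stop the agent mid-sentence
```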
## Usage Examples

    User: "Hey, set up my voice assistant"
    → OpenClaw runs: cd {baseDir} && cp .env.example .env
    → Opens .env for the user to fill in API keys
    → Runs: uv run scripts/server.py
    
    User: "Start a voice chat"
    → Opens http://localhost:7860 in the browser
    
    User: "Switch TTS to Deepgram"
    → Updates VOICE_TTS_PROVIDER=deepgram in .env
    → Restarts the server

## Troubleshooting

- **No audio output?** Check that your TTS API key is valid and the provider is set correctly
- **High latency?** Use Deepgram for both STT and TTS; ensure your gateway is on the same network
- **Cuts off speech?** Increase `VOICE_VAD_SILENCE_MS` to 600-800ms
- **Echo/feedback?** Use headphones, or enable the built-in echo cancellation in the browser UI
## Latency Budget

| Stage                    | Target    | Actual (typical) |
|--------------------------|-----------|------------------|
| Audio capture + VAD      | <200ms    | ~100-150ms       |
| STT transcription        | <400ms    | ~200-400ms       |
| OpenClaw LLM first token | <1500ms   | ~500-1500ms      |
| TTS first audio chunk    | <400ms    | ~200-400ms       |
| **Total first audio**    | **<2.5s** | **~1.0-2.5s**    |
