Audio Transcription with Sber Salute Speech
name: salute-speech
by chorus12 · published 2026-03-22
$ claw add gh:chorus12/chorus12-salute-speech

---
name: salute-speech
description: >
  Transcribe audio files using Sber Salute Speech async API.
  Russian-first STT with support for ru-RU, en-US, kk-KZ, ky-KG, uz-UZ.
metadata: { "openclaw": { "requires": { "bins": ["uv"], "env": ["SALUTE_AUTH_DATA"] }, "primaryEnv": "SALUTE_AUTH_DATA" } }
---
# Audio Transcription with Sber Salute Speech
Transcribe audio/video files to text with timestamps via Salute Speech async REST API.
## Requirements

- `uv` available on the host (the script is run via `uv run --with requests`).
- `SALUTE_AUTH_DATA` set in the host environment (Sber authorization data used to obtain an access token).
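The script exchanges `SALUTE_AUTH_DATA` for a short-lived access token before calling the API. A minimal sketch of that exchange, assuming the standard Salute OAuth endpoint and the `SALUTE_SPEECH_PERS` scope (both are assumptions; verify against Sber's current documentation):

```python
import os
import uuid

import requests

def get_access_token() -> str:
    # SALUTE_AUTH_DATA is the base64 client-credentials string from Sber.
    auth_data = os.environ["SALUTE_AUTH_DATA"]
    resp = requests.post(
        "https://ngw.devices.sberbank.ru:9443/api/v2/oauth",  # assumed endpoint
        headers={
            "Authorization": f"Basic {auth_data}",
            "RqUID": str(uuid.uuid4()),  # per-request identifier
            "Content-Type": "application/x-www-form-urlencoded",
        },
        data={"scope": "SALUTE_SPEECH_PERS"},  # assumed personal-tier scope
    )
    resp.raise_for_status()
    return resp.json()["access_token"]
```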
## Supported formats & encodings
| Audio encoding | Content-Type | Typical extensions |
|---------------|-------------|--------------------|
| `MP3` | `audio/mpeg` | `.mp3` |
| `PCM_S16LE` | `audio/wav` | `.wav` |
| `OPUS` | `audio/ogg` | `.ogg`, `.opus` |
| `FLAC` | `audio/flac` | `.flac` |
| `ALAW` | `audio/alaw` | `.alaw` |
| `MULAW` | `audio/mulaw` | `.mulaw` |
## Supported languages
`ru-RU`, `en-US`, `kk-KZ` (Kazakh), `ky-KG` (Kyrgyz), `uz-UZ` (Uzbek).
## Workflow
1. **Identify input files** — from the user request.
2. **Read credentials** — take `SALUTE_AUTH_DATA` from the host environment.
3. **Run transcription** — execute `salute_transcribe.py` with `uv` and the appropriate arguments (the underlying async flow is sketched after this list).
4. **Deliver results** — present a human-readable transcript with timestamps to the user and give a direct link to the output files.
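For orientation, step 3's async flow (upload, start task, poll, download) looks roughly like the sketch below. The SmartSpeech host, endpoint paths, and response field names are assumptions based on the public Salute Speech REST API, and `get_access_token` is the hypothetical helper from the Requirements section:

```python
import time

import requests

API = "https://smartspeech.sber.ru/rest/v1"  # assumed Salute Speech REST host

def transcribe_async(path: str, token: str, lang: str = "ru-RU",
                     encoding: str = "MP3", content_type: str = "audio/mpeg",
                     max_wait: int = 300) -> bytes:
    headers = {"Authorization": f"Bearer {token}"}

    # 1. Upload the audio file; the service returns a file id.
    with open(path, "rb") as f:
        up = requests.post(f"{API}/data:upload",
                           headers={**headers, "Content-Type": content_type},
                           data=f)
    up.raise_for_status()
    file_id = up.json()["result"]["request_file_id"]

    # 2. Start the async recognition task for that file.
    task = requests.post(f"{API}/speech:async_recognize", headers=headers, json={
        "options": {"language": lang, "audio_encoding": encoding},
        "request_file_id": file_id,
    })
    task.raise_for_status()
    task_id = task.json()["result"]["id"]

    # 3. Poll the task until it is DONE or the deadline passes.
    deadline = time.time() + max_wait
    while time.time() < deadline:
        st = requests.get(f"{API}/task:get", headers=headers,
                          params={"id": task_id})
        st.raise_for_status()
        result = st.json()["result"]
        if result["status"] == "DONE":
            # 4. Download the raw recognition JSON.
            dl = requests.get(f"{API}/data:download", headers=headers,
                              params={"response_file_id": result["response_file_id"]})
            dl.raise_for_status()
            return dl.content
        time.sleep(5)
    raise TimeoutError(f"task {task_id} not finished within {max_wait}s")
```

Polling against a deadline is what the `--max-wait-time` argument below controls.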
## Usage

```bash
uv run --with requests {baseDir}/salute_transcribe.py \
  --file /path/to/audio.mp3 \
  --output_dir ~/.openclaw/workspace/transcriptions \
  --lang ru-RU
```

## Arguments
| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--file` | **Yes** | — | Path to audio/video file |
| `--output_dir` | No | `~/.openclaw/workspace/transcriptions` | Output directory for results |
| `--lang` | No | `ru-RU` | Language code: `ru-RU`, `en-US`, `kk-KZ`, `ky-KG`, `uz-UZ` |
| `--audio-encoding` | No | `MP3` | Codec: `MP3`, `PCM_S16LE`, `OPUS`, `FLAC`, `ALAW`, `MULAW` |
| `--model` | No | `general` | Recognition model: `general` or `callcenter` |
| `--hyp-count` | No | `1` | Number of alternative hypotheses: `1` or `2` |
| `--max-wait-time` | No | `300` | Max seconds to wait for async result |
| `--print` | No | off | Also print transcription to stdout |
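For reference, the CLI surface above could be declared with `argparse` roughly as follows; names and defaults mirror the table, everything else is illustrative rather than the script's actual code:

```python
import argparse

parser = argparse.ArgumentParser(description="Transcribe audio via Salute Speech")
parser.add_argument("--file", required=True, help="Path to audio/video file")
parser.add_argument("--output_dir", default="~/.openclaw/workspace/transcriptions",
                    help="Output directory for results")
parser.add_argument("--lang", default="ru-RU",
                    choices=["ru-RU", "en-US", "kk-KZ", "ky-KG", "uz-UZ"])
parser.add_argument("--audio-encoding", default="MP3",
                    choices=["MP3", "PCM_S16LE", "OPUS", "FLAC", "ALAW", "MULAW"])
parser.add_argument("--model", default="general", choices=["general", "callcenter"])
parser.add_argument("--hyp-count", type=int, default=1, choices=[1, 2])
parser.add_argument("--max-wait-time", type=int, default=300,
                    help="Max seconds to wait for the async result")
# dest avoids shadowing the print builtin on the namespace
parser.add_argument("--print", action="store_true", dest="print_stdout",
                    help="Also print transcription to stdout")
args = parser.parse_args()
```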
## Content-Type mapping

The script's default `content_type` is `audio/mpeg` (MP3). When uploading other formats, use the Content-Type that matches the encoding in the table above (e.g. `audio/wav` for `.wav` files), either by adjusting the script or by adding extension-based detection.
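If you add that detection, one minimal sketch keyed off the formats table above (the dict and helper name are illustrative, not part of the script):

```python
from pathlib import Path

# Extension -> (audio encoding, Content-Type), per the formats table above.
ENCODINGS = {
    ".mp3": ("MP3", "audio/mpeg"),
    ".wav": ("PCM_S16LE", "audio/wav"),
    ".ogg": ("OPUS", "audio/ogg"),
    ".opus": ("OPUS", "audio/ogg"),
    ".flac": ("FLAC", "audio/flac"),
    ".alaw": ("ALAW", "audio/alaw"),
    ".mulaw": ("MULAW", "audio/mulaw"),
}

def guess_encoding(path: str) -> tuple[str, str]:
    ext = Path(path).suffix.lower()
    try:
        return ENCODINGS[ext]
    except KeyError:
        raise ValueError(f"unsupported extension: {ext}") from None
```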
## Output files
For input file `meetingABC.mp3` the script produces:
| File | Description |
|------|-------------|
| `meetingABC_recognition_orig.json` | Raw API response (full JSON with all hypotheses, timing, confidence) |
| `meetingABC_pretty.txt` | Formatted human-readable transcript with timestamps |
## Output text format

```text
[00:01 - 00:20]:
Ну, даже если сосредоточиться на идее узкой щели.
[00:20 - 00:45]:
Следующий фрагмент текста здесь.
```
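The pretty view can be derived from the raw JSON along these lines, assuming each chunk in `*_recognition_orig.json` carries `results` entries with `normalized_text` and `start`/`end` offsets formatted like `"12.48s"` (the field names are assumptions; check them against a real response):

```python
import json

def fmt(seconds: float) -> str:
    # Render seconds as MM:SS for the [MM:SS - MM:SS] header lines.
    m, s = divmod(int(seconds), 60)
    return f"{m:02d}:{s:02d}"

def pretty(raw_json: str) -> str:
    lines = []
    for chunk in json.loads(raw_json):
        for res in chunk.get("results", []):
            start = float(res["start"].rstrip("s"))
            end = float(res["end"].rstrip("s"))
            lines.append(f"[{fmt(start)} - {fmt(end)}]:")
            lines.append(res["normalized_text"])
    return "\n".join(lines)
```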