Haven't installed OpenClaw yet? One-line install commands:
macOS/Linux:
curl -fsSL https://openclaw.ai/install.sh | bash
Windows (PowerShell):
iwr -useb https://openclaw.ai/install.ps1 | iex
Windows (CMD):
curl -fsSL https://openclaw.ai/install.cmd -o install.cmd && install.cmd && del install.cmd
- OpenClaw supports bidirectional voice interaction: high-quality voice synthesis (TTS) via ElevenLabs, and speech-to-text (STT) via OpenAI Whisper[1]
- Voice features are managed through the unified SAG (Speech-Audio Gateway) module. Setting the API keys is enough to enable them; no additional hardware is required[6]
- In channels that support voice messages, such as Telegram, you can send voice messages directly to the agent and the agent can reply with voice, giving you a true "voice assistant" experience[7]
- ElevenLabs offers over 30 preset voices and custom voice cloning, letting you have the agent speak in your preferred voice[2]
1. Voice Feature Overview
OpenClaw's voice features solve a practical problem: sometimes typing isn't convenient. When you're driving, cooking, or exercising and want the AI agent to do something for you, voice is the most natural way to interact.[5]
Voice interaction includes two directions:
- Voice Input (STT): You speak a command -> Whisper converts it to text -> the agent interprets and executes it
- Voice Output (TTS): The agent completes a task -> produces a text result -> ElevenLabs converts it to speech for the reply
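The two directions above can be read as one round trip. The following is a minimal sketch of that flow, not OpenClaw's actual implementation: `handle_voice_message`, `VoiceReply`, and the three callables are hypothetical names introduced only for illustration, and the speech services are passed in as plain functions so the flow can be followed without any API keys.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoiceReply:
    text: str                # the agent's text result
    audio: Optional[bytes]   # synthesized speech, or None for a text-only reply


def handle_voice_message(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # STT step (Whisper in the article)
    run_agent: Callable[[str], str],      # the agent executing the command
    synthesize: Callable[[str], bytes],   # TTS step (ElevenLabs in the article)
    reply_with_voice: bool = True,
) -> VoiceReply:
    """Hypothetical round trip: speech -> text -> agent -> text -> speech."""
    command = transcribe(audio)
    result = run_agent(command)
    speech = synthesize(result) if reply_with_voice else None
    return VoiceReply(text=result, audio=speech)
```

Passing `reply_with_voice=False` skips the TTS step entirely, which mirrors a text-only reply setting.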
2. ElevenLabs TTS Voice Synthesis Setup
2.1 Obtain API Key
Go to the ElevenLabs website and register an account (the free plan provides 10,000 characters of voice quota per month). Get your API Key from the Profile page.[2]
2.2 Configure OpenClaw
Write the API Key to OpenClaw settings:[6]
openclaw config set sag.elevenlabs_api_key "your_ELEVENLABS_API_KEY"
Restart the Gateway:
openclaw gateway restart
2.3 Choose a Voice
ElevenLabs provides multiple preset voices. After previewing them on their platform, set the voice ID as the agent's default voice:
openclaw config set sag.elevenlabs_voice_id "VOICE_ID"
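If you would rather look up voice IDs from a script than from the ElevenLabs web interface, the sketch below queries the ElevenLabs REST API directly, outside OpenClaw. It assumes the `GET /v1/voices` endpoint, the `xi-api-key` header, and a response of the form `{"voices": [{"voice_id": ..., "name": ...}]}`; the function names are hypothetical, and the response handling should be checked against the current API reference.

```python
import json
import urllib.request

VOICES_URL = "https://api.elevenlabs.io/v1/voices"


def fetch_voices(api_key: str) -> dict:
    """Fetch the voices visible to this account (needs network and a valid key)."""
    request = urllib.request.Request(VOICES_URL, headers={"xi-api-key": api_key})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def voice_ids_by_name(payload: dict) -> dict:
    """Map voice name -> voice_id; if two voices share a name, the last one wins."""
    return {voice["name"]: voice["voice_id"] for voice in payload.get("voices", [])}
```

Calling `voice_ids_by_name(fetch_voices("your_ELEVENLABS_API_KEY"))` would return a dictionary from which the chosen ID can be copied into `sag.elevenlabs_voice_id`.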
ElevenLabs also supports custom voice cloning -- upload voice samples to create a unique voice. This is particularly valuable for enterprise applications requiring brand consistency.[2]
3. Whisper Speech Recognition Setup
3.1 OpenAI Whisper Integration
Whisper is a speech recognition model developed by OpenAI, supporting over 90 languages (including Chinese).[3]
openclaw config set sag.whisper_provider "openai"
Whisper API calls use your existing OpenAI API Key -- no additional authentication is needed.
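To check that your OpenAI key can reach Whisper independently of OpenClaw, a direct transcription call is a quick test. This is a minimal sketch assuming the official `openai` Python SDK and the `whisper-1` model; the `transcribe` helper is a hypothetical name, and how OpenClaw itself calls the API is not shown here. The optional `language` argument is an ISO-639-1 hint (for example `"zh"`) that the transcription endpoint accepts.

```python
from typing import Optional


def transcribe(client, audio_path: str, language: Optional[str] = None) -> str:
    """Send one audio file to the transcription endpoint and return the text.

    `client` is expected to be an `openai.OpenAI()` instance (an assumption of
    this sketch); it is passed in so the function has no import-time dependency.
    """
    kwargs = {"model": "whisper-1"}
    if language:
        kwargs["language"] = language  # skip auto-detection when the language is known
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(file=audio_file, **kwargs)
    return result.text
```

With the SDK installed and `OPENAI_API_KEY` set in the environment, usage would look like `transcribe(OpenAI(), "command.ogg", language="zh")`, where `command.ogg` is a placeholder file name.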
3.2 Chinese Speech Recognition Quality
Whisper's recognition accuracy for Chinese (Mandarin) exceeds 95% in quiet environments. However, note that:
- Dialects and accents: Strong dialect accents may reduce accuracy
- Background noise: Noise-cancelling microphones are recommended in noisy environments
- Technical terminology: Technical terms (such as API, Docker, Kubernetes) are usually recognized correctly
4. Practical Application Scenarios
4.1 Telegram Voice Commands
In Telegram, you can press and hold the record button and speak your command directly:[7]
"Check the server's disk usage. If it exceeds 80%, tell me which directories are taking up the most space."
Once the voice message arrives, Whisper converts it to text; the agent then executes the task and replies with text or voice, depending on your settings.
4.2 Voice Reports
Combined with Cron scheduled tasks, the agent can deliver important information to you by voice every morning -- like a personal news anchor.
4.3 Accessible Interaction
Voice features enable visually impaired users or those with limited mobility to operate the AI agent without touching a keyboard or screen.
5. Cost Estimation
| Service | Free Quota | Paid Pricing |
|---|---|---|
| ElevenLabs TTS | 10,000 characters/month | Starting at $5/month (30,000 characters) |
| OpenAI Whisper | No free quota | $0.006/minute |
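Estimated for daily use: at 10 voice interactions per day, each averaging 30 seconds of voice input and a 200-character voice response, a 30-day month comes to about 150 minutes of Whisper transcription (roughly $0.90) and about 60,000 characters of TTS. That TTS volume exceeds the 30,000 characters of the $5 plan listed above, so staying near the approximately $2-$5 USD monthly estimate requires shorter or fewer voice replies (for example, answering part of the interactions by text).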
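The arithmetic behind such an estimate is easy to rerun for your own usage pattern. The sketch below uses only the prices and quotas from the table above; the 30-day month and the function names are assumptions made for illustration.

```python
WHISPER_USD_PER_MINUTE = 0.006        # from the table above
ELEVENLABS_FREE_CHARS = 10_000        # free quota per month
ELEVENLABS_ENTRY_PLAN_CHARS = 30_000  # $5/month plan


def monthly_usage(interactions_per_day=10, input_seconds=30, reply_chars=200, days=30):
    """Return monthly STT minutes, STT cost in USD, and TTS characters."""
    interactions = interactions_per_day * days
    stt_minutes = interactions * input_seconds / 60
    return {
        "stt_minutes": stt_minutes,
        "stt_cost_usd": round(stt_minutes * WHISPER_USD_PER_MINUTE, 2),
        "tts_chars": interactions * reply_chars,
    }


def tts_tier(chars):
    """Classify a monthly character count against the two tiers in the table."""
    if chars <= ELEVENLABS_FREE_CHARS:
        return "free quota"
    if chars <= ELEVENLABS_ENTRY_PLAN_CHARS:
        return "entry plan"
    return "above entry plan"


usage = monthly_usage()
print(usage, tts_tier(usage["tts_chars"]))
```

With the default inputs this gives 150 minutes of transcription ($0.90) and 60,000 TTS characters, which lands above the entry plan; halving `reply_chars` to 100 brings the total down to 30,000 characters.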
6. Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| No sound in voice replies | ElevenLabs API Key not set or invalid | Verify sag.elevenlabs_api_key is configured correctly |
| High speech recognition error rate | Poor audio quality or background noise | Use a noise-cancelling microphone; record in a quiet environment |
| Chinese commands recognized as English | Whisper language detection error | Start voice input with a clear Chinese sentence |
| Voice reply latency too high | ElevenLabs API response slow | Choose a lower-latency voice model; check network connection |
| Free quota exhausted | ElevenLabs monthly limit depleted | Upgrade plan or temporarily disable TTS and switch to text-only replies |
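The last row suggests switching to text-only replies once the quota runs out. One way to reason about that fallback is a simple character budget, sketched below. This is a hypothetical illustration rather than an OpenClaw feature: `TtsQuota` and `choose_reply_mode` are invented names, and the 10,000-character default is the free quota from the cost table.

```python
class TtsQuota:
    """Track characters sent to TTS against a monthly limit (illustrative only)."""

    def __init__(self, monthly_limit=10_000):
        self.monthly_limit = monthly_limit
        self.used = 0

    def try_consume(self, text):
        """Reserve len(text) characters; return False if that would exceed the limit."""
        if self.used + len(text) > self.monthly_limit:
            return False
        self.used += len(text)
        return True


def choose_reply_mode(quota, text):
    """Reply by voice while budget remains, otherwise fall back to text."""
    return "voice" if quota.try_consume(text) else "text"
```

A rejected reply does not consume budget, so shorter replies can still be spoken later in the month.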
Conclusion
Voice features elevate OpenClaw from a "text command tool" to a "voice assistant."[1] Setup requires just two API keys and a few commands, but the improvement in interaction experience is a qualitative leap -- especially in scenarios where you can't type.
Voice features depend on channel support. If you haven't set up Telegram yet, we recommend completing the Telegram Integration Guide first. For questions about OpenClaw's complete configuration, refer to the Configuration Complete Guide.



