If OpenClaw is not installed yet, use the one-line install command for your platform:
macOS / Linux: curl -fsSL https://openclaw.ai/install.sh | bash
Windows (PowerShell): iwr -useb https://openclaw.ai/install.ps1 | iex
Windows (CMD): curl -fsSL https://openclaw.ai/install.cmd -o install.cmd && install.cmd && del install.cmd
Key Findings
  • OpenClaw supports bidirectional voice interaction: high-quality voice synthesis (TTS) via ElevenLabs, and speech-to-text (STT) via OpenAI Whisper[1]
  • Voice features are managed through the unified SAG (Speech-Audio Gateway) module: configure the API keys to enable them; no additional hardware is required[6]
  • In channels that support voice messages like Telegram, you can send voice messages directly to the agent, and the agent can reply with voice -- enabling a true "voice assistant" experience[7]
  • ElevenLabs offers over 30 preset voices and custom voice cloning, letting you have the agent speak in your preferred voice[2]

1. Voice Feature Overview

OpenClaw's voice features solve a practical problem: sometimes typing isn't convenient. When you're driving, cooking, or exercising and want the AI agent to do something for you, voice is the most natural way to interact.[5]

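Voice interaction runs in two directions: speech-to-text (STT), where OpenAI Whisper transcribes what you say into text for the agent, and text-to-speech (TTS), where ElevenLabs turns the agent's reply into audio.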

2. ElevenLabs TTS Voice Synthesis Setup

2.1 Obtain API Key

Go to the ElevenLabs website and register an account (the free plan provides 10,000 characters of voice quota per month). Get your API Key from the Profile page.[2]

2.2 Configure OpenClaw

Write the API Key to OpenClaw settings:[6]

openclaw config set sag.elevenlabs_api_key "your_ELEVENLABS_API_KEY"

Restart the Gateway:

openclaw gateway restart
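
The two steps above can be wrapped in a small shell function so the key comes from an environment variable (for example one set by a secrets manager) instead of being pasted into each command. This is a minimal sketch: the function name `configure_tts` and the `ELEVENLABS_API_KEY` variable are assumptions for illustration; only the two `openclaw` commands come from this guide.

```shell
# Hypothetical helper: configure_tts is not part of OpenClaw.
# Assumes the key was exported beforehand as ELEVENLABS_API_KEY.
configure_tts() {
  if [ -z "${ELEVENLABS_API_KEY:-}" ]; then
    echo "ELEVENLABS_API_KEY is not set" >&2
    return 1
  fi
  # Same commands as above; restart only if the key was stored successfully.
  openclaw config set sag.elevenlabs_api_key "$ELEVENLABS_API_KEY" &&
    openclaw gateway restart
}
```

Calling `configure_tts` with the variable unset stops before any setting is changed, which avoids storing an empty key by accident.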

2.3 Choose a Voice

ElevenLabs provides multiple preset voices. After previewing them on their platform, set the voice ID as the agent's default voice:

openclaw config set sag.elevenlabs_voice_id "VOICE_ID"

ElevenLabs also supports custom voice cloning -- upload voice samples to create a unique voice. This is particularly valuable for enterprise applications requiring brand consistency.[2]

3. Whisper Speech Recognition Setup

3.1 OpenAI Whisper Integration

Whisper is a speech recognition model developed by OpenAI, supporting over 90 languages (including Chinese).[3] To use it, set OpenAI as the Whisper provider:

openclaw config set sag.whisper_provider "openai"

Whisper API calls use your existing OpenAI API Key -- no additional authentication is needed.
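
To check that the key can reach Whisper independently of OpenClaw, a short recording can be sent straight to OpenAI's transcription endpoint. This is a minimal sketch under stated assumptions: `transcribe_sample` is a hypothetical helper, the `OPENAI_API_KEY` variable and the file name are placeholders, and the request goes to the OpenAI API directly rather than through OpenClaw. The optional second argument passes a language hint (such as `zh`) so Whisper does not have to auto-detect the language.

```shell
# Hypothetical helper, not an OpenClaw command.
# Usage: transcribe_sample FILE [LANG]   e.g. transcribe_sample sample.mp3 zh
transcribe_sample() {
  file=$1
  lang=${2:-}
  # Start with the required form fields, then append the optional hint.
  set -- -F "file=@$file" -F "model=whisper-1"
  if [ -n "$lang" ]; then
    set -- "$@" -F "language=$lang"
  fi
  curl -fsS https://api.openai.com/v1/audio/transcriptions \
    -H "Authorization: Bearer ${OPENAI_API_KEY:-}" \
    "$@"
}
```

If the transcript looks right here but not inside the agent, the problem is more likely in the channel or gateway configuration than in the key or the audio.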

3.2 Chinese Speech Recognition Quality

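Whisper's recognition accuracy for Chinese (Mandarin) exceeds 95% in quiet environments. However, accuracy drops with background noise or poor audio quality, and language detection can occasionally treat Chinese commands as English (see Troubleshooting).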

4. Practical Application Scenarios

4.1 Telegram Voice Commands

In Telegram, you can press and hold the record button and speak your command directly:[7]

"Check the server's disk usage. If it exceeds 80%, tell me which directories are taking up the most space."

After the voice message arrives, Whisper converts it to text; the agent then executes the task and replies with text or voice, depending on your settings.

4.2 Voice Reports

Combined with Cron scheduled tasks, the agent can deliver important information to you by voice every morning -- like a personal news anchor.

4.3 Accessible Interaction

Voice features enable visually impaired users or those with limited mobility to operate the AI agent without touching a keyboard or screen.

5. Cost Estimation

| Service | Free Quota | Paid Pricing |
|---|---|---|
| ElevenLabs TTS | 10,000 characters/month | Starting at $5/month (30,000 characters) |
| OpenAI Whisper | No free quota | $0.006/minute |

Estimate for daily use: with 10 voice interactions per day, each averaging 30 seconds of voice input and 200 characters of voice response, the monthly cost is approximately $2-$5 USD.
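
The volumes behind that estimate can be worked out with a few lines of arithmetic. The sketch below is illustrative only: `estimate_voice_usage` is a hypothetical helper, the 30-day month is an assumption, and the only price it applies is the $0.006/minute Whisper rate from the table above.

```shell
# Hypothetical helper: estimate monthly usage for a voice scenario.
# Args: interactions/day, seconds of input each, characters of reply each, [days]
estimate_voice_usage() {
  awk -v n="$1" -v secs="$2" -v chars="$3" -v days="${4:-30}" 'BEGIN {
    minutes = n * secs * days / 60    # Whisper is billed per minute of audio
    stt_usd = minutes * 0.006         # $0.006/minute, from the pricing table
    tts_chars = n * chars * days      # characters sent to ElevenLabs
    printf "stt_minutes=%.0f stt_usd=%.2f tts_chars=%d\n", minutes, stt_usd, tts_chars
  }'
}
```

For the scenario above, `estimate_voice_usage 10 30 200` prints `stt_minutes=150 stt_usd=0.90 tts_chars=60000`. The character total is the number to compare against the ElevenLabs quotas in the table, since the TTS share of the bill depends on which plan covers it and on how many replies are actually spoken.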

6. Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| No sound in voice replies | ElevenLabs API Key not set or invalid | Verify sag.elevenlabs_api_key is configured correctly |
| High speech recognition error rate | Poor audio quality or background noise | Use a noise-cancelling microphone; record in a quiet environment |
| Chinese commands recognized as English | Whisper language detection error | Start voice input with a clear Chinese sentence |
| Voice reply latency too high | ElevenLabs API response slow | Choose a lower-latency voice model; check network connection |
| Free quota exhausted | ElevenLabs monthly limit depleted | Upgrade plan or temporarily disable TTS and switch to text-only replies |
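
For the "no sound" and "quota exhausted" rows, it can help to query ElevenLabs directly and see whether the key is accepted and how many characters remain. The following is a rough sketch, not an OpenClaw feature: both function names are hypothetical, the `ELEVENLABS_API_KEY` variable is an assumption, and it relies on the `character_count` and `character_limit` fields that the ElevenLabs subscription endpoint returns at the time of writing.

```shell
# Hypothetical helpers, not OpenClaw commands.
# remaining_chars: read subscription JSON on stdin, print characters left.
remaining_chars() {
  json=$(cat)
  used=$(printf '%s\n' "$json" | sed -n 's/.*"character_count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  limit=$(printf '%s\n' "$json" | sed -n 's/.*"character_limit"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  if [ -z "$used" ] || [ -z "$limit" ]; then
    echo "could not read quota fields" >&2
    return 1
  fi
  echo $((limit - used))
}

# check_tts_quota: with -f, curl reports an error instead of a body when the
# key is rejected, so remaining_chars then fails rather than printing a number.
check_tts_quota() {
  curl -fsS -H "xi-api-key: ${ELEVENLABS_API_KEY:-}" \
    https://api.elevenlabs.io/v1/user/subscription | remaining_chars
}
```

A printed number near zero points to the quota row; an HTTP error from `curl` points to the API key row.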

Conclusion

Voice features elevate OpenClaw from a "text command tool" to a "voice assistant."[1] Setup requires just two API keys and a few commands, but the improvement in interaction experience is a qualitative leap -- especially in scenarios where you can't type.

Voice features depend on channel support. If you haven't set up Telegram yet, we recommend completing the Telegram Integration Guide first. For questions about OpenClaw's complete configuration, refer to the Configuration Complete Guide.