OpenClaw + Whisper + ElevenLabs: Voice & Speech-to-Text Setup

Haven't installed OpenClaw yet? Use the one-line install command for your platform:

macOS / Linux:

curl -fsSL https://openclaw.ai/install.sh | bash

PowerShell:

iwr -useb https://openclaw.ai/install.ps1 | iex

CMD:

curl -fsSL https://openclaw.ai/install.cmd -o install.cmd && install.cmd && del install.cmd


Key Findings

OpenClaw supports bidirectional voice interaction: high-quality voice synthesis (TTS) via ElevenLabs, and speech-to-text (STT) via OpenAI Whisper^[1]
Voice features are managed through the unified SAG (Speech-Audio Gateway) module. Setting the API keys is enough to enable them; no additional hardware is required^[6]
In channels that support voice messages, such as Telegram, you can send voice messages directly to the agent and it can reply with voice, enabling a true "voice assistant" experience^[7]
ElevenLabs offers over 30 preset voices and custom voice cloning, letting you have the agent speak in your preferred voice^[2]

1. Voice Feature Overview

OpenClaw's voice features solve a practical problem: sometimes typing isn't convenient. When you're driving, cooking, or exercising and want the AI agent to do something for you, voice is the most natural way to interact.^[5]

Voice interaction runs in two directions:

Voice Input (STT): You speak a command -> Whisper converts it to text -> The agent interprets and executes it
Voice Output (TTS): The agent completes a task -> It produces a text result -> ElevenLabs converts the text to speech and sends it as the reply
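
The two directions above can be sketched as one round trip. This is a minimal illustration, not OpenClaw's actual implementation: `handle_voice_message`, `Reply`, and the three callables are hypothetical names standing in for Whisper, the agent, and ElevenLabs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reply:
    text: str                      # the agent's text result
    audio: Optional[bytes] = None  # synthesized speech, if TTS is enabled


def handle_voice_message(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],   # STT stand-in (e.g. Whisper)
    run_agent: Callable[[str], str],      # agent stand-in: command text -> result text
    synthesize: Optional[Callable[[str], bytes]] = None,  # TTS stand-in (e.g. ElevenLabs)
) -> Reply:
    """Voice in -> text -> agent -> text -> (optional) voice out."""
    command = transcribe(audio_in)   # Voice Input (STT)
    result = run_agent(command)      # the agent interprets and executes
    audio_out = synthesize(result) if synthesize else None  # Voice Output (TTS)
    return Reply(text=result, audio=audio_out)


if __name__ == "__main__":
    # Stub callables so the flow can be run without any API keys.
    reply = handle_voice_message(
        b"fake-audio",
        transcribe=lambda audio: "check disk usage",
        run_agent=lambda cmd: f"done: {cmd}",
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(reply.text)
```

Passing the three stages in as callables keeps the flow testable and makes the text-only case explicit: leaving out `synthesize` yields a reply with no audio.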

2. ElevenLabs TTS Voice Synthesis Setup

2.1 Obtain API Key

Go to the ElevenLabs website and register an account (the free plan provides 10,000 characters of voice quota per month). Get your API Key from the Profile page.^[2]

2.2 Configure OpenClaw

Write the API Key to OpenClaw settings:^[6]

openclaw config set sag.elevenlabs_api_key "your_ELEVENLABS_API_KEY"

Restart the Gateway:

openclaw gateway restart

2.3 Choose a Voice

ElevenLabs provides multiple preset voices. After previewing them on their platform, set the voice ID as the agent's default voice:

openclaw config set sag.elevenlabs_voice_id "VOICE_ID"

ElevenLabs also supports custom voice cloning -- upload voice samples to create a unique voice. This is particularly valuable for enterprise applications requiring brand consistency.^[2]
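
To look up a voice ID without clicking through the web UI, the voices can also be listed through the ElevenLabs REST API. The sketch below assumes the `GET /v1/voices` endpoint, the `xi-api-key` header, and a response containing a `voices` array with `name` and `voice_id` fields; check these against the current ElevenLabs API reference. The helper names and the sample payload are made up for illustration.

```python
import json
import urllib.request

VOICES_URL = "https://api.elevenlabs.io/v1/voices"  # assumed endpoint


def build_voices_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the request that lists available voices."""
    return urllib.request.Request(VOICES_URL, headers={"xi-api-key": api_key})


def parse_voices(payload: dict) -> dict:
    """Map voice name -> voice_id from a decoded /v1/voices response."""
    return {v["name"]: v["voice_id"] for v in payload.get("voices", [])}


def fetch_voices(api_key: str) -> dict:
    """Send the request and return the name -> voice_id mapping."""
    with urllib.request.urlopen(build_voices_request(api_key), timeout=30) as resp:
        return parse_voices(json.load(resp))


if __name__ == "__main__":
    # Made-up payload in the assumed response shape, so this runs offline.
    sample = {"voices": [{"name": "Example Voice", "voice_id": "abc123"}]}
    for name, voice_id in parse_voices(sample).items():
        print(f"{name}: {voice_id}")
```

The ID listed for the chosen voice is the value to pass to `openclaw config set sag.elevenlabs_voice_id`.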

3. Whisper Speech Recognition Setup

3.1 OpenAI Whisper Integration

Whisper is a speech recognition model developed by OpenAI, supporting over 90 languages (including Chinese).^[3] Set OpenAI as the Whisper provider:

openclaw config set sag.whisper_provider "openai"

Whisper API calls use your existing OpenAI API Key -- no additional authentication is needed.
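
For reference, this is roughly what a direct Whisper transcription call looks like outside OpenClaw. The sketch assumes the official `openai` Python package, whose client exposes `audio.transcriptions.create` with the `whisper-1` model. The `transcribe_file` wrapper and the file name in the usage comment are hypothetical, and how OpenClaw itself calls the API is not covered in this article.

```python
from typing import Any, Optional


def transcribe_file(client: Any, path: str, language: Optional[str] = None) -> str:
    """Send one audio file to the transcription endpoint and return the text.

    `client` is expected to be an `openai.OpenAI()` instance; it is passed in
    so the function can be exercised with a stub and without network access.
    `language` is an optional ISO-639-1 hint such as "zh" or "en".
    """
    kwargs = {"model": "whisper-1"}
    if language:
        # Hint the input language instead of relying on auto-detection.
        kwargs["language"] = language
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(file=audio, **kwargs)
    return result.text


# Usage (requires `pip install openai` and OPENAI_API_KEY in the environment):
#   from openai import OpenAI
#   print(transcribe_file(OpenAI(), "command.mp3", language="zh"))
```

The optional `language` hint is relevant when auto-detection picks the wrong language for short commands.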

3.2 Chinese Speech Recognition Quality

Whisper's recognition accuracy for Chinese (Mandarin) exceeds 95% in quiet environments. Keep the following in mind:

Dialects and accents: Strong dialect accents may reduce accuracy
Background noise: Noise-cancelling microphones are recommended in noisy environments
Technical terminology: Terms such as API, Docker, and Kubernetes are usually recognized correctly

4. Practical Application Scenarios

4.1 Telegram Voice Commands

In Telegram, you can press and hold the record button and speak your command directly:^[7]

"Check the server's disk usage. If it exceeds 80%, tell me which directories are taking up the most space."

Once the voice message arrives, Whisper converts it to text; the agent then executes the task and replies with text or voice, depending on your settings.

4.2 Voice Reports

Combined with Cron scheduled tasks, the agent can deliver important information to you by voice every morning -- like a personal news anchor.
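
The synthesis step behind such a report can be sketched as a single HTTP call. The code below assumes the ElevenLabs `POST /v1/text-to-speech/{voice_id}` endpoint, the `xi-api-key` header, and the `eleven_multilingual_v2` model ID; verify all three against the current ElevenLabs API reference. `build_tts_request` and `synthesize_to_file` are hypothetical helper names, and scheduling the script is left out.

```python
import json
import urllib.request

TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"  # assumed endpoint


def build_tts_request(api_key: str, voice_id: str, text: str,
                      model_id: str = "eleven_multilingual_v2") -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request."""
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        TTS_URL.format(voice_id=voice_id),
        data=body,
        method="POST",
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    )


def synthesize_to_file(api_key: str, voice_id: str, text: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    req = build_tts_request(api_key, voice_id, text)
    with urllib.request.urlopen(req, timeout=60) as resp, open(out_path, "wb") as out:
        out.write(resp.read())


if __name__ == "__main__":
    # Placeholders only; nothing is sent over the network here.
    req = build_tts_request("YOUR_ELEVENLABS_API_KEY", "VOICE_ID", "Good morning.")
    print(req.get_method(), req.full_url)
```

Every character in `text` counts against the monthly quota, so a long daily briefing uses it up much faster than short replies do.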

4.3 Accessible Interaction

Voice features enable visually impaired users or those with limited mobility to operate the AI agent without touching a keyboard or screen.

5. Cost Estimation

Service	Free Quota	Paid Pricing
ElevenLabs TTS	10,000 characters/month	Starting at $5/month (30,000 characters)
OpenAI Whisper	No free quota	$0.006/minute

Estimated for daily use: at 10 voice interactions per day, each averaging 30 seconds of voice input and 200 characters of voice response, the monthly cost is approximately $2-$5 USD, provided the TTS volume fits within your ElevenLabs plan.
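
The usage behind this estimate can be worked through explicitly. The sketch below uses only the figures from the table and the scenario above, plus one assumption of its own: a 30-day month. The function name is illustrative.

```python
WHISPER_USD_PER_MINUTE = 0.006   # from the pricing table
DAYS_PER_MONTH = 30              # assumption


def monthly_usage(interactions_per_day: int, input_seconds: float, reply_chars: int):
    """Return (whisper_minutes, whisper_cost_usd, tts_characters) per month."""
    n = interactions_per_day * DAYS_PER_MONTH
    minutes = n * input_seconds / 60
    cost = minutes * WHISPER_USD_PER_MINUTE
    chars = n * reply_chars
    return minutes, cost, chars


if __name__ == "__main__":
    minutes, cost, chars = monthly_usage(10, 30, 200)
    print(f"Whisper: {minutes:.0f} min -> ${cost:.2f}")
    print(f"TTS: {chars:,} characters")
```

For the scenario above this gives 150 minutes of transcription (about $0.90) and 60,000 characters of synthesis per month. The character total is above both ElevenLabs tiers listed in the table, so the TTS side of the bill depends on which plan covers that volume.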

6. Troubleshooting

Issue	Cause	Solution
No sound in voice replies	ElevenLabs API Key not set or invalid	Verify `sag.elevenlabs_api_key` is configured correctly
High speech recognition error rate	Poor audio quality or background noise	Use a noise-cancelling microphone; record in a quiet environment
Chinese commands recognized as English	Whisper language detection error	Start voice input with a clear Chinese sentence
Voice reply latency too high	ElevenLabs API response slow	Choose a lower-latency voice model; check network connection
Free quota exhausted	ElevenLabs monthly limit depleted	Upgrade plan or temporarily disable TTS and switch to text-only replies
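
For the quota case in the last row, the remaining characters can be checked before they run out. This sketch assumes the ElevenLabs `GET /v1/user/subscription` endpoint and its `character_count` and `character_limit` response fields; confirm both against the current API reference. The helper names, the sample payload, and the fallback rule are illustrative only and do not describe OpenClaw's behavior.

```python
import json
import urllib.request

SUBSCRIPTION_URL = "https://api.elevenlabs.io/v1/user/subscription"  # assumed endpoint


def remaining_characters(subscription: dict) -> int:
    """Characters left in the current billing period (never negative)."""
    used = subscription["character_count"]
    limit = subscription["character_limit"]
    return max(limit - used, 0)


def should_use_voice(subscription: dict, reply_text: str) -> bool:
    """True if the reply fits in the remaining quota; otherwise reply as text."""
    return len(reply_text) <= remaining_characters(subscription)


def fetch_subscription(api_key: str) -> dict:
    """Fetch the subscription record for the given API key."""
    req = urllib.request.Request(SUBSCRIPTION_URL, headers={"xi-api-key": api_key})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Made-up payload in the assumed response shape, so this runs offline.
    sample = {"character_count": 9900, "character_limit": 10000}
    print(remaining_characters(sample))
    print(should_use_voice(sample, "x" * 200))
```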

Conclusion

Voice features elevate OpenClaw from a "text command tool" to a "voice assistant."^[1] Setup requires just two API keys and a few commands, but the improvement in interaction experience is a qualitative leap -- especially in scenarios where you can't type.

Voice features depend on channel support. If you haven't set up Telegram yet, we recommend completing the Telegram Integration Guide first. For questions about OpenClaw's complete configuration, refer to the Configuration Complete Guide.

References

[1] OpenClaw Documentation. (2026). Voice & Audio - OpenClaw Official Docs. docs.openclaw.ai

[2] ElevenLabs. (2025). API Documentation - ElevenLabs. docs.elevenlabs.io

[3] OpenAI. (2024). Whisper - Large-Scale Weak Supervised Speech Recognition. OpenAI. openai.com

[4] OpenClaw Documentation. (2026). Getting Started - OpenClaw Official Docs. docs.openclaw.ai

[5] Scientific American. (2026). OpenClaw is an open-source AI agent that runs your computer. Scientific American. scientificamerican.com

[6] OpenClaw Documentation. (2026). SAG (Speech-Audio Gateway) Configuration. docs.openclaw.ai

[7] OpenClaw Documentation. (2026). Channels - Telegram Integration. docs.openclaw.ai
