Integrating Whisper MCP Server with Claude Code
Overview
This guide shows how to integrate the Whisper MCP server (mcp-server-whisper) with Claude Code to enable direct audio transcription. With this integration, you can:
- Drop voice recordings → Claude transcribes → Get text files back
- Create content faster by speaking instead of typing
- Work with AI more efficiently using natural voice input
Real-world example: Tennis coaching – transcribe 40-minute match recordings for analysis
Why Customize This MCP Server?
The original Whisper MCP server uses OpenAI’s paid API for transcription. While this works well for small files, it has several limitations:
Cost and Privacy Concerns: OpenAI charges per minute of audio transcription, which adds up quickly when processing multiple long recordings. Additionally, all your audio data is sent to OpenAI’s servers.
File Size Restrictions: The standard setup has a 25MB file size limit, requiring you to manually compress or split larger files before transcription.
Limited Control: You’re completely dependent on OpenAI’s service availability and pricing changes.
The customized server described in this guide removes these limitations and adds two benefits:
Drop-and-Process Simplicity: Just drop audio files into the ./audio_file directory and ask Claude to transcribe. The server handles everything automatically – chunking, compression, transcription, and combining results into a single text file.
Full Control & Customization: Since you own the infrastructure, you can modify the transcription process, add custom features, and never worry about third-party service availability or pricing changes.
Prerequisites
Before starting, ensure you have:
- ✅ Node.js installed
- ✅ MCP CLI installed
- ✅ Python & uv installed
- ✅ Git (to clone repository)
Quick verification:
node --version
mcp --version
uv --version
git --version
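The same check can be scripted. The following is a minimal sketch, assuming the four executables are named exactly as in the commands above; `find_missing_tools` and `REQUIRED_TOOLS` are hypothetical names introduced only for this illustration.

```python
import shutil

# Executable names taken from the verification commands above.
REQUIRED_TOOLS = ["node", "mcp", "uv", "git"]


def find_missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]
```

Calling `find_missing_tools(REQUIRED_TOOLS)` returns an empty list when every prerequisite is on the PATH. It only confirms that the executables exist; it does not check their versions.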
Installation Steps
Step 1: Clone Repository
git clone https://github.com/arcaputo3/mcp-server-whisper.git
cd mcp-server-whisper
Step 2: Install Python Dependencies
# Install dependencies with uv
uv sync
# Verify installation
uv run pytest # Optional: run tests
Step 3: Create Audio Directory
mkdir -p ./audio_file
ls -la ./audio_file
Configuration Files
The Whisper MCP integration requires two configuration files:
1. ~/.claude.json (MCP Server Configuration)
{
  "mcpServers": {
    "whisper": {
      "command": "mcp",
      "args": ["dev", "/absolute/path/to/mcp-server-whisper/src/mcp_server_whisper/server.py"],
      "env": {
        "USE_CUSTOM_WHISPER": "true",
        "CUSTOM_WHISPER_ENDPOINT": "https://whisper.adventuretube.net/whisper",
        "AUDIO_FILES_PATH": "/absolute/path/to/mcp-server-whisper/audio_file"
      }
    }
  }
}
2. .env File (Environment Variables)
AUDIO_FILES_PATH=./audio_file
USE_CUSTOM_WHISPER=true
CUSTOM_WHISPER_ENDPOINT=https://whisper.adventuretube.net/whisper
Code Modifications Made
To use a custom Whisper endpoint instead of OpenAI:
1. Added httpx for HTTP Requests
import httpx
2. Added Configuration Variables
CUSTOM_WHISPER_ENDPOINT = os.getenv("CUSTOM_WHISPER_ENDPOINT", "https://whisper.adventuretube.net/whisper")
USE_CUSTOM_WHISPER = os.getenv("USE_CUSTOM_WHISPER", "false").lower() == "true"
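One detail of the flag parsing above is easy to miss: only the string "true" (in any letter case) enables the custom endpoint, so values such as "1" or "yes" leave it disabled. The sketch below groups the three variables into one object to make that behavior visible. `WhisperSettings` and `load_settings` are hypothetical names, and the default for `AUDIO_FILES_PATH` is an assumption, not something the modified server is known to use.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_ENDPOINT = "https://whisper.adventuretube.net/whisper"


@dataclass(frozen=True)
class WhisperSettings:
    use_custom_whisper: bool
    endpoint: str
    audio_files_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> WhisperSettings:
    """Read the three variables used in this guide from an environment mapping."""
    return WhisperSettings(
        # Same rule as the snippet above: only "true" (case-insensitive) enables it.
        use_custom_whisper=env.get("USE_CUSTOM_WHISPER", "false").lower() == "true",
        endpoint=env.get("CUSTOM_WHISPER_ENDPOINT", DEFAULT_ENDPOINT),
        # Assumed default; the real server may require this variable to be set.
        audio_files_path=env.get("AUDIO_FILES_PATH", "./audio_file"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to test without touching the real environment.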
3. Created Custom Whisper Function
async def transcribe_with_custom_whisper(file_path: Path) -> dict[str, Any]:
    """Transcribe audio using custom Whisper endpoint."""
    # Handles 10-minute chunking automatically
    # Sends to custom endpoint via HTTP POST
    # Returns combined transcript
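The body of the function is not shown here, so the following is only a sketch of what the "HTTP POST" step for a single chunk might look like. It makes two assumptions about the custom endpoint that this guide does not confirm: that it accepts a multipart upload in a field named "file", and that it replies with JSON containing a "text" key. `post_chunk` and `AsyncHttpClient` are hypothetical names. The client is passed in so the sketch has no hard dependency; in the real server an `httpx.AsyncClient` would fit this shape.

```python
from pathlib import Path
from typing import Any, Protocol


class AsyncHttpClient(Protocol):
    """The small slice of an httpx.AsyncClient-like object this sketch uses."""

    async def post(self, url: str, *, files: dict[str, Any]) -> Any: ...


async def post_chunk(client: AsyncHttpClient, endpoint: str, chunk_path: Path) -> str:
    """Upload one audio chunk and return the transcript text."""
    # Reading the file synchronously keeps the sketch short; a production
    # version might use aiofiles to avoid blocking the event loop.
    audio_bytes = chunk_path.read_bytes()

    # ASSUMPTION: multipart field name "file" and MP3 content type.
    response = await client.post(
        endpoint,
        files={"file": (chunk_path.name, audio_bytes, "audio/mpeg")},
    )
    response.raise_for_status()

    # ASSUMPTION: the endpoint returns JSON shaped like {"text": "..."}.
    return response.json()["text"]
```

If the endpoint uses a different field name or response shape, only the two lines marked as assumptions need to change.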
4. Added httpx Dependency
mcp = FastMCP("whisper", dependencies=["openai", "pydub", "aiofiles", "httpx"])
Architecture: How It Works
Component Flow
User → Claude Code → MCP CLI → Whisper MCP Server → Custom Whisper API → Transcript
Key Components
- Claude Code (MCP Client) – User interface where commands are issued
- MCP CLI (Bridge) – Launches Whisper MCP server as stdio process
- Whisper MCP Server (Translation Layer) – Processes audio files (chunking, compression), speaks HTTP with Whisper API
- Custom Whisper API – Performs actual transcription
Using Whisper in Claude Code
Basic Transcription
> Claude, transcribe the audio file in ./audio_file
Transcribe Latest File
> Transcribe my latest recording
Transcribe and Analyze
> Transcribe match_recording.WAV and analyze the key points
Batch Processing
> Find all my recordings from this week and transcribe them
Features Added by Customization
These features are NOT in the original Whisper MCP server:
1. Automatic 10-Minute Chunking
Prevents timeout errors on long files. Splits audio into manageable segments and processes each chunk independently.
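The splitting step reduces to computing start and end offsets for each segment. A minimal sketch, assuming offsets in milliseconds and a final chunk that simply holds whatever audio remains; `chunk_ranges` is a hypothetical helper, not a function from the modified server.

```python
TEN_MINUTES_MS = 10 * 60 * 1000


def chunk_ranges(duration_ms: int, chunk_ms: int = TEN_MINUTES_MS) -> list[tuple[int, int]]:
    """Return (start, end) offsets in milliseconds covering the whole recording."""
    if duration_ms < 0 or chunk_ms <= 0:
        raise ValueError("duration_ms must be >= 0 and chunk_ms must be > 0")
    return [
        # The last chunk is clipped to the end of the recording.
        (start, min(start + chunk_ms, duration_ms))
        for start in range(0, duration_ms, chunk_ms)
    ]
```

A 40-minute recording (2,400,000 ms) yields four ranges, and a 25-minute one yields three, the last covering five minutes. Since pydub's `AudioSegment` supports slicing by milliseconds, each pair could be applied as `audio[start:end]`.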
2. MP3 Compression
Reduces file size: each 10-minute MP3 chunk is roughly 9 to 10MB, compared with 230MB for the original 40-minute WAV. Uploads are faster and use less bandwidth.
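The sizes quoted in this guide (a 230MB original and MP3 chunks of about 9.6MB) can be sanity-checked with simple arithmetic. The recording format and MP3 bitrate are not stated anywhere in the guide, so the values below (48 kHz, 16-bit, mono WAV; 128 kbps constant-bitrate MP3) are assumptions chosen because they reproduce those figures. Both helper names are hypothetical, and container headers are ignored.

```python
def pcm_wav_bytes(seconds: int, sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Approximate size of uncompressed PCM audio, ignoring the WAV header."""
    bytes_per_sample = bits_per_sample // 8
    return seconds * sample_rate_hz * bytes_per_sample * channels


def cbr_mp3_bytes(seconds: int, bitrate_kbps: int) -> int:
    """Approximate size of a constant-bitrate MP3, ignoring tags and framing."""
    return seconds * bitrate_kbps * 1000 // 8


# ASSUMED format: 40 minutes of 48 kHz, 16-bit, mono PCM.
wav_size = pcm_wav_bytes(40 * 60, 48_000, 16, 1)   # 230_400_000 bytes

# ASSUMED bitrate: 10 minutes at 128 kbps.
mp3_chunk_size = cbr_mp3_bytes(10 * 60, 128)       # 9_600_000 bytes
```

Under these assumptions, the four MP3 chunks together come to about 38MB, roughly one sixth of the original file.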
3. Individual Chunk Transcripts
Each chunk gets its own .txt file. Useful for debugging failed chunks and allows partial transcription recovery.
4. Combined Transcript with Segment Markers
Merges all chunk transcripts into a single file and adds [Segment N] markers so each part can be traced back to its chunk.
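The merge step could be as simple as the sketch below. The exact layout of the combined file is not shown in this guide, so placing each marker on its own line with a blank line between segments is an assumption; `combine_transcripts` is a hypothetical name.

```python
def combine_transcripts(chunk_texts: list[str]) -> str:
    """Join per-chunk transcripts, prefixing each with a [Segment N] marker."""
    sections = [
        # Segments are numbered from 1, matching the chunk file numbering.
        f"[Segment {number}]\n{text.strip()}"
        for number, text in enumerate(chunk_texts, start=1)
    ]
    return "\n\n".join(sections)
```

Numbering by position means a marker always points back to the chunk file with the same index, which helps when one segment needs to be re-transcribed.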
5. Graceful Failure Handling
Continues processing even if some chunks fail. Reports which chunks succeeded/failed. Saves partial results.
6. Progress Tracking
Real-time updates for each chunk. Shows processing status and provides transparency.
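Failure handling and progress tracking fit naturally into one loop. The following is a sketch under stated assumptions: chunks are processed one at a time, the per-chunk transcription is supplied as an async callable, and progress is reported through a plain callback. `transcribe_chunks` and its parameters are hypothetical, not the modified server's actual interface.

```python
from typing import Awaitable, Callable


async def transcribe_chunks(
    chunks: list[str],
    transcribe: Callable[[str], Awaitable[str]],
    report: Callable[[str], None] = print,
) -> tuple[dict[str, str], dict[str, str]]:
    """Transcribe chunks in order, collecting successes and failures separately."""
    transcripts: dict[str, str] = {}
    failures: dict[str, str] = {}
    total = len(chunks)

    for index, chunk in enumerate(chunks, start=1):
        report(f"[{index}/{total}] transcribing {chunk}")
        try:
            transcripts[chunk] = await transcribe(chunk)
        except Exception as exc:
            # Record the error and keep going so earlier results are not lost.
            failures[chunk] = str(exc)
            report(f"[{index}/{total}] failed: {exc}")

    return transcripts, failures
```

Returning the two dictionaries separately lets the caller save partial results and list exactly which chunks need a retry.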
Real-World Example: Tennis Coaching
The Challenge
- Record 40-minute tennis match commentary
- Need text transcription for analysis
- Want to track scores, shots, and player observations
- Manual transcription takes hours
The Solution
- Record Match: Use phone/recorder to capture live commentary
- Drop Audio File: Save the .WAV file to the ./audio_file/ directory
- Ask Claude: “Transcribe match_recording.WAV”
- Get Results: Automated transcription in minutes
File Structure Created
audio_file/
├── DJI_32_20251027_190850.WAV (original 230MB file)
└── chunks_DJI_32_20251027_190850/
    ├── DJI_32_20251027_190850_chunk_01.mp3 (9.6MB)
    ├── DJI_32_20251027_190850_chunk_01.txt (individual transcript)
    ├── ...
    └── DJI_32_20251027_190850_chunk_04.txt
text_file/
└── DJI_32_20251027_190850.txt (combined transcript)
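The naming pattern in this listing can be expressed as a small path helper. This is a sketch inferred from the listing alone: it assumes text_file/ sits next to audio_file/ and that chunk numbers are zero-padded to two digits. `output_paths` is a hypothetical name.

```python
from pathlib import Path


def output_paths(audio_path: Path, chunk_count: int) -> dict:
    """Derive chunk and transcript locations for one source recording."""
    stem = audio_path.stem
    chunk_dir = audio_path.parent / f"chunks_{stem}"
    # ASSUMPTION: text_file/ is a sibling of the audio directory.
    text_dir = audio_path.parent.parent / "text_file"

    numbers = range(1, chunk_count + 1)
    return {
        "chunk_dir": chunk_dir,
        "chunk_audio": [chunk_dir / f"{stem}_chunk_{n:02d}.mp3" for n in numbers],
        "chunk_text": [chunk_dir / f"{stem}_chunk_{n:02d}.txt" for n in numbers],
        "combined": text_dir / f"{stem}.txt",
    }
```

For `Path("audio_file/DJI_32_20251027_190850.WAV")` and four chunks, the helper reproduces the file names shown in the listing.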
Tennis Data Successfully Captured
- Score tracking: “15-0”, “love-15”, “40-30”, “deuce”
- Shot analysis: “backhand down the line”, “forehand winner”, “double fault”
- Player observations: “UTR 9.8”, “consistent serve”, “weak backhand”
- Match progression: Game-by-game commentary
Results
- Processing time: ~15 minutes for 40-minute audio
- Chunks processed: 4 chunks (3 x 10 min, plus a final chunk of about 10 min)
- Success rate: 100% (all chunks transcribed)
- Output quality: Accurate tennis terminology recognition
Troubleshooting
Issue #1: Server Won’t Connect
/mcp
✘ failed · Failed to reconnect to whisper
Solutions:
- Use absolute paths in ~/.claude.json
- Verify MCP CLI installation: mcp --version
- Reinstall dependencies: cd /path/to/mcp-server-whisper && uv sync
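The first item can be checked mechanically. The following is a sketch that inspects a parsed ~/.claude.json and lists path values that are not absolute. It assumes POSIX-style paths and the config layout shown earlier in this guide. Treating AUDIO_FILES_PATH as a value that should be absolute is an interpretation of the "absolute paths" advice, not a documented requirement. `relative_path_problems` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Any


def relative_path_problems(config: dict[str, Any], server: str = "whisper") -> list[str]:
    """List path values in one MCP server entry that are not absolute."""
    entry = config.get("mcpServers", {}).get(server)
    if entry is None:
        return [f"no '{server}' entry under mcpServers"]

    problems = []
    # Script paths passed as arguments, such as the path to server.py.
    for arg in entry.get("args", []):
        if arg.endswith(".py") and not PurePosixPath(arg).is_absolute():
            problems.append(f"args path is not absolute: {arg}")

    # The audio directory from the env block (see the assumption above).
    audio_dir = entry.get("env", {}).get("AUDIO_FILES_PATH")
    if audio_dir is not None and not PurePosixPath(audio_dir).is_absolute():
        problems.append(f"env AUDIO_FILES_PATH is not absolute: {audio_dir}")

    return problems
```

The function takes an already-parsed dictionary, so it can be fed the result of `json.loads` on the contents of ~/.claude.json. A path beginning with `~` is reported as well, because JSON values are not shell-expanded.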
Issue #2: Custom Endpoint Not Working
Solutions:
- Check USE_CUSTOM_WHISPER is set to true
- Verify endpoint is accessible: curl https://whisper.adventuretube.net/whisper
- Update ~/.claude.json with correct values
Issue #3: Large Files Failing
This should NOT happen with the custom endpoint setup. If it does:
- Verify USE_CUSTOM_WHISPER=true is set
- Check that the transcribe_with_custom_whisper function is being called
- Look for errors in chunk processing logs
Summary
- ✅ Custom stdio MCP server (modified mcp-server-whisper)
- ✅ User-scoped installation (available across all projects)
- ✅ Absolute paths for all file references
- ✅ Direct custom Whisper API communication
- ✅ Automatic 10-minute chunking for large files
