Integrating Whisper MCP Server with Claude Code

Overview

This guide shows how to integrate the Whisper MCP server (mcp-server-whisper) with Claude Code so that Claude can transcribe audio files directly. With this integration, you can:

Drop voice recordings → Claude transcribes → Get text files back
Create content faster by speaking instead of typing
Work with AI more efficiently using natural voice input

Real-world example: Tennis coaching – transcribe 40-minute match recordings for analysis

Why Customize This MCP Server?

The original Whisper MCP server uses OpenAI’s paid API for transcription. While this works well for small files, it has several limitations:

Cost and Privacy Concerns: OpenAI charges per minute of audio transcription, which adds up quickly when processing multiple long recordings. Additionally, all your audio data is sent to OpenAI’s servers.

File Size Restrictions: The standard setup has a 25MB file size limit, requiring you to manually compress or split larger files before transcription.

Limited Control: You’re completely dependent on OpenAI’s service availability and pricing changes.

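The customized server described in this guide sends audio to a custom Whisper endpoint that you control instead of OpenAI’s API. This addresses the limitations above and adds the following benefits:

Drop-and-Process Simplicity: Just drop audio files into the ./audio_file directory and ask Claude to transcribe them. The server handles chunking, compression, transcription, and combining the results into a single text file.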

Full Control & Customization: Since you own the infrastructure, you can modify the transcription process, add custom features, and never worry about third-party service availability or pricing changes.

Prerequisites

Before starting, ensure you have:

✅ Node.js installed
✅ MCP CLI installed
✅ Python & uv installed
✅ Git (to clone repository)

Quick verification:

node --version
mcp --version
uv --version
git --version
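
If you prefer a single check over four separate commands, the same verification can be sketched in Python. This is only an illustration: the helper name missing_tools is hypothetical, and the script only confirms that each tool is on PATH, not that its version is suitable.

```python
import shutil

# The four command-line tools this guide relies on.
REQUIRED_TOOLS = ["node", "mcp", "uv", "git"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All prerequisites found")
```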

Installation Steps

Step 1: Clone Repository

git clone https://github.com/arcaputo3/mcp-server-whisper.git
cd mcp-server-whisper

Step 2: Install Python Dependencies

# Install dependencies with uv
uv sync

# Verify installation
uv run pytest  # Optional: run tests

Step 3: Create Audio Directory

mkdir -p ./audio_file
ls -la ./audio_file

Configuration Files

The Whisper MCP integration requires two configuration files:

1. ~/.claude.json (MCP Server Configuration)

{
  "mcpServers": {
    "whisper": {
      "command": "mcp",
      "args": ["dev", "/absolute/path/to/mcp-server-whisper/src/mcp_server_whisper/server.py"],
      "env": {
        "USE_CUSTOM_WHISPER": "true",
        "CUSTOM_WHISPER_ENDPOINT": "https://whisper.adventuretube.net/whisper",
        "AUDIO_FILES_PATH": "./audio_file"
      }
    }
  }
}
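
Because the server path must be absolute, it can help to generate the entry rather than type it by hand. The sketch below builds the same structure from the location of your clone. The function name whisper_server_entry and the clone location ~/mcp-server-whisper are assumptions for illustration, and resolving AUDIO_FILES_PATH to an absolute path is a choice made here, not something the configuration above does.

```python
import json
from pathlib import Path


def whisper_server_entry(repo_dir, endpoint):
    """Build the "whisper" entry for mcpServers using absolute paths."""
    repo = Path(repo_dir).expanduser().resolve()
    server_py = repo / "src" / "mcp_server_whisper" / "server.py"
    return {
        "command": "mcp",
        "args": ["dev", str(server_py)],
        "env": {
            "USE_CUSTOM_WHISPER": "true",
            "CUSTOM_WHISPER_ENDPOINT": endpoint,
            # Assumption: an absolute audio directory inside the clone.
            "AUDIO_FILES_PATH": str(repo / "audio_file"),
        },
    }


if __name__ == "__main__":
    # Hypothetical clone location; adjust to where you cloned the repository.
    entry = whisper_server_entry(
        "~/mcp-server-whisper",
        "https://whisper.adventuretube.net/whisper",
    )
    print(json.dumps({"mcpServers": {"whisper": entry}}, indent=2))
```

The printed JSON can then be merged into ~/.claude.json by hand.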

2. .env File (Environment Variables)

AUDIO_FILES_PATH=./audio_file
USE_CUSTOM_WHISPER=true
CUSTOM_WHISPER_ENDPOINT=https://whisper.adventuretube.net/whisper

Code Modifications Made

To use a custom Whisper endpoint instead of OpenAI:

1. Added httpx for HTTP Requests

import httpx

2. Added Configuration Variables

import os  # needed for os.getenv

CUSTOM_WHISPER_ENDPOINT = os.getenv("CUSTOM_WHISPER_ENDPOINT", "https://whisper.adventuretube.net/whisper")
USE_CUSTOM_WHISPER = os.getenv("USE_CUSTOM_WHISPER", "false").lower() == "true"
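
One consequence of the comparison above is worth spelling out: only the string "true" (in any letter case) enables the custom endpoint, so values such as "1" or "yes" leave it disabled. The small sketch below mirrors that check in a helper so the behavior is easy to see. The name env_flag is hypothetical and does not appear in the server code.

```python
import os


def env_flag(name, default="false", environ=os.environ):
    """Mirror the check above: only "true" (any case) enables the flag."""
    return environ.get(name, default).lower() == "true"


# Demonstration with explicit dictionaries instead of the real environment.
print(env_flag("USE_CUSTOM_WHISPER", environ={"USE_CUSTOM_WHISPER": "TRUE"}))
print(env_flag("USE_CUSTOM_WHISPER", environ={"USE_CUSTOM_WHISPER": "1"}))
print(env_flag("USE_CUSTOM_WHISPER", environ={}))
```

The three calls evaluate to True, False, and False respectively.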

3. Created Custom Whisper Function

async def transcribe_with_custom_whisper(file_path: Path) -> dict[str, Any]:
    """Transcribe audio using custom Whisper endpoint."""
    # Handles 10-minute chunking automatically
    # Sends to custom endpoint via HTTP POST
    # Returns combined transcript
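
The function body is not shown in this guide. As a rough sketch of the HTTP step only, the helper below posts one chunk and returns its text. Several details are assumptions, not facts about the endpoint: the multipart field name "file", the "audio/mpeg" content type, and a JSON response with a "text" key. The helper name transcribe_chunk is also hypothetical. The client is passed in so that an httpx.AsyncClient, or any object with a compatible async post method, can be used.

```python
import asyncio
from pathlib import Path


async def transcribe_chunk(client, endpoint, chunk_path):
    """POST one audio chunk and return the transcript text.

    Assumed request/response shape (not confirmed by this guide):
    multipart upload under the field "file", JSON reply with a "text" key.
    """
    path = Path(chunk_path)
    files = {"file": (path.name, path.read_bytes(), "audio/mpeg")}
    response = await client.post(endpoint, files=files)
    response.raise_for_status()  # surface HTTP errors to the caller
    return response.json()["text"]


# Usage with httpx (requires the httpx package):
#
#   import httpx
#   async def main():
#       async with httpx.AsyncClient(timeout=600) as client:
#           print(await transcribe_chunk(client, CUSTOM_WHISPER_ENDPOINT, "chunk_01.mp3"))
#   asyncio.run(main())
```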

4. Added httpx Dependency

mcp = FastMCP("whisper", dependencies=["openai", "pydub", "aiofiles", "httpx"])

Architecture: How It Works

Component Flow

User → Claude Code → MCP CLI → Whisper MCP Server → Custom Whisper API → Transcript

Key Components

Claude Code (MCP Client) – User interface where commands are issued
MCP CLI (Bridge) – Launches Whisper MCP server as stdio process
Whisper MCP Server (Translation Layer) – Processes audio files (chunking, compression) and sends them to the Whisper API over HTTP
Custom Whisper API – Performs actual transcription

Using Whisper in Claude Code

Basic Transcription

> Claude, transcribe the audio file in ./audio_file

Transcribe Latest File

> Transcribe my latest recording

Transcribe and Analyze

> Transcribe match_recording.WAV and analyze the key points

Batch Processing

> Find all my recordings from this week and transcribe them

Features Added by Customization

These features are NOT in the original Whisper MCP server:

1. Automatic 10-Minute Chunking

Prevents timeout errors on long files. Splits audio into manageable segments and processes each chunk independently.
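
The splitting step comes down to computing start and end offsets. A minimal sketch, assuming fixed 10-minute chunks with a shorter final chunk when the recording does not divide evenly (the function name chunk_bounds is hypothetical):

```python
CHUNK_MS = 10 * 60 * 1000  # 10 minutes in milliseconds


def chunk_bounds(total_ms, chunk_ms=CHUNK_MS):
    """Return (start_ms, end_ms) pairs covering the whole recording."""
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


# A 40-minute recording yields four 10-minute chunks.
for start, end in chunk_bounds(40 * 60 * 1000):
    print(start, end)
```

Offsets are in milliseconds because pydub's AudioSegment is sliced in milliseconds, so each pair could be applied as audio[start:end].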

2. MP3 Compression

Reduces upload size: each 10-minute MP3 chunk is roughly 9 to 10MB, compared with 230MB for the original WAV file. This makes uploads faster and saves bandwidth.
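
To see where savings of this order come from, the arithmetic can be worked through directly. The encoder settings below (128 kbps MP3, 48 kHz mono 16-bit WAV) are assumptions chosen for illustration. They land close to the sizes quoted in this guide, but the guide does not state the actual settings.

```python
def mp3_size_bytes(duration_s, bitrate_kbps):
    """Estimate constant-bitrate MP3 size: bits per second / 8 * seconds."""
    return bitrate_kbps * 1000 // 8 * duration_s


def wav_size_bytes(duration_s, sample_rate_hz, channels, bytes_per_sample):
    """Estimate uncompressed PCM WAV size (header ignored)."""
    return duration_s * sample_rate_hz * channels * bytes_per_sample


chunk = mp3_size_bytes(600, 128)              # one 10-minute chunk, assumed 128 kbps
original = wav_size_bytes(2400, 48000, 1, 2)  # 40 minutes, assumed 48 kHz mono 16-bit
print(f"chunk: {chunk / 1e6:.1f} MB, original: {original / 1e6:.1f} MB")
```

Under these assumptions a chunk is about 9.6 MB and the original about 230.4 MB, so four chunks together are roughly a sixth of the original upload.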

3. Individual Chunk Transcripts

Each chunk gets its own .txt file. Useful for debugging failed chunks and allows partial transcription recovery.

4. Combined Transcript with Segment Markers

Merges all chunks into a single file and adds [Segment N] markers.
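
A minimal sketch of the merge step follows. The exact layout is an assumption: here each marker sits on its own line and segments are separated by a blank line. The function name combine_transcripts is hypothetical.

```python
def combine_transcripts(chunk_texts):
    """Join chunk transcripts, prefixing each with a [Segment N] marker."""
    parts = []
    for number, text in enumerate(chunk_texts, start=1):
        parts.append(f"[Segment {number}]\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


print(combine_transcripts(["First ten minutes.", "Second ten minutes."]))
```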

5. Graceful Failure Handling

Continues processing even if some chunks fail. Reports which chunks succeeded/failed. Saves partial results.

6. Progress Tracking

Reports status in real time as each chunk is processed, so you can see how far a long transcription has progressed.
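
The failure handling and progress reporting described above can be sketched together as a loop that records each chunk's outcome instead of aborting on the first error. This is an illustration under assumptions, not the server's actual code: it processes chunks one at a time, prints status lines to standard output, and takes the per-chunk transcription as a caller-supplied coroutine. The names transcribe_all and transcribe_one are hypothetical.

```python
import asyncio


async def transcribe_all(chunks, transcribe_one):
    """Transcribe every chunk; collect failures instead of aborting."""
    succeeded, failed = {}, {}
    total = len(chunks)
    for index, chunk in enumerate(chunks, start=1):
        try:
            succeeded[chunk] = await transcribe_one(chunk)
            print(f"[{index}/{total}] ok: {chunk}")
        except Exception as exc:  # keep going so partial results survive
            failed[chunk] = str(exc)
            print(f"[{index}/{total}] failed: {chunk} ({exc})")
    return succeeded, failed


async def _demo_transcribe(chunk):
    """Stand-in for a real transcription call; fails on one chunk."""
    if chunk == "chunk_02.mp3":
        raise RuntimeError("timeout")
    return f"text of {chunk}"


if __name__ == "__main__":
    ok, bad = asyncio.run(
        transcribe_all(["chunk_01.mp3", "chunk_02.mp3", "chunk_03.mp3"], _demo_transcribe)
    )
    print(len(ok), "succeeded;", len(bad), "failed")
```

Returning both dictionaries lets the caller save the partial transcript and report exactly which chunks need a retry.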

Real-World Example: Tennis Coaching

The Challenge

Record 40-minute tennis match commentary
Need text transcription for analysis
Want to track scores, shots, and player observations
Manual transcription takes hours

The Solution

Record Match: Use phone/recorder to capture live commentary
Drop Audio File: Save .WAV file to ./audio_file/ directory
Ask Claude: “Transcribe match_recording.WAV”
Get Results: Automated transcription in minutes

File Structure Created

audio_file/
├── DJI_32_20251027_190850.WAV          (original 230MB file)
└── chunks_DJI_32_20251027_190850/
    ├── DJI_32_20251027_190850_chunk_01.mp3  (9.6MB)
    ├── DJI_32_20251027_190850_chunk_01.txt  (individual transcript)
    ├── ...
    └── DJI_32_20251027_190850_chunk_04.txt

text_file/
└── DJI_32_20251027_190850.txt           (combined transcript)
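
The naming pattern in this layout can be derived from the original file name alone. The sketch below reproduces it with pathlib. The function name output_paths is hypothetical, and the two-digit chunk numbering is inferred from the listing above.

```python
from pathlib import Path


def output_paths(audio_path, chunk_count, text_dir="text_file"):
    """Derive chunk and transcript paths following the layout shown above."""
    audio = Path(audio_path)
    chunk_dir = audio.parent / f"chunks_{audio.stem}"
    chunks = [
        (
            chunk_dir / f"{audio.stem}_chunk_{n:02d}.mp3",
            chunk_dir / f"{audio.stem}_chunk_{n:02d}.txt",
        )
        for n in range(1, chunk_count + 1)
    ]
    combined = Path(text_dir) / f"{audio.stem}.txt"
    return chunks, combined


chunks, combined = output_paths("audio_file/DJI_32_20251027_190850.WAV", 4)
for mp3_path, txt_path in chunks:
    print(mp3_path, txt_path)
print(combined)
```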

Tennis Data Successfully Captured

Score tracking: “15-0”, “love-15”, “40-30”, “deuce”
Shot analysis: “backhand down the line”, “forehand winner”, “double fault”
Player observations: “UTR 9.8”, “consistent serve”, “weak backhand”
Match progression: Game-by-game commentary

Results

Processing time: ~15 minutes for 40-minute audio
Chunks processed: 4 chunks (4 x 10min)
Success rate: 100% (all chunks transcribed)
Output quality: Accurate tennis terminology recognition

Troubleshooting

Issue #1: Server Won’t Connect

/mcp
✘ failed · Failed to reconnect to whisper

Solutions:

Use absolute paths in ~/.claude.json
Verify MCP CLI installation: mcp --version
Reinstall dependencies: cd /path/to/mcp-server-whisper && uv sync
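
The first solution can be checked mechanically. The sketch below reads ~/.claude.json and flags script paths in the whisper entry that are relative or missing. It assumes the configuration structure shown earlier in this guide, and the function name config_problems is hypothetical.

```python
import json
from pathlib import Path


def config_problems(config, server_name="whisper"):
    """Return messages for .py paths in the server's args that look wrong."""
    server = config.get("mcpServers", {}).get(server_name)
    if server is None:
        return [f"no '{server_name}' entry under mcpServers"]
    problems = []
    for arg in server.get("args", []):
        if not arg.endswith(".py"):
            continue  # only script paths are checked
        if not Path(arg).is_absolute():
            problems.append(f"relative path: {arg}")
        elif not Path(arg).exists():
            problems.append(f"file not found: {arg}")
    return problems


if __name__ == "__main__":
    path = Path.home() / ".claude.json"
    if path.exists():
        print(config_problems(json.loads(path.read_text())) or "paths look fine")
    else:
        print(f"{path} not found")
```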

Issue #2: Custom Endpoint Not Working