# How It Works
Ask LLM is a set of MCP servers that bridge your AI client (Claude Code, Claude Desktop, Cursor, etc.) with up to three LLM providers running locally on your machine: Google's Gemini CLI, OpenAI's Codex CLI, and Ollama (local models). Your client decides when to delegate work to one or more providers based on what you ask.
## Natural Language Workflow
Your client (typically Claude) decides when to call the MCP tools based on context:
- 🔍 **comparative analysis** — different AI perspectives for validation (`multi-llm`, `/compare`)
- 📋 **code review & big changes** — second opinions on implementation (`/gemini-review`, `/codex-review`, `/multi-review`)
- 📚 **large-context analysis** — Gemini's 1M+ token window for whole-codebase reads
- 💡 **creative problem solving** — `/brainstorm` for multi-LLM ideation with Claude Opus as a peer
- 🔒 **private analysis** — Ollama for code that can't leave the machine
This intelligent selection happens automatically — you just ask in natural language.
## Request Flow
For a single-provider call (`ask-llm` with `provider: "gemini"`), only one provider lane fires. For `multi-llm`, all requested lanes fire in parallel via `Promise.all` inside the MCP server process — per-provider failures are isolated, so one provider hitting quota doesn't fail the whole call (ADR-066).
## What's Inside the MCP Server
Each provider's executor wraps the underlying CLI with operational hardening that took multiple ADRs to get right:
- **Quota fallback** — Gemini Pro → Flash on `RESOURCE_EXHAUSTED` (ADR-044); Codex `gpt-5.5` → `gpt-5.5-mini` on quota errors (ADR-028, model bumped in ADR-067)
- **Stdin handling** — Codex needs an EOF-terminated pipe rather than `/dev/null`, otherwise it errors out (ADR-042)
- **PATH resolution** — macOS GUI clients (Claude Desktop) don't inherit your shell's PATH; the server resolves it from your login shell at startup (ADR-047)
- **Live progressive output** — Gemini's `--output-format stream-json` deltas are parsed and forwarded to MCP progress notifications, so users see Gemini's prose unfolding rather than a frozen wait (ADR-057)
- **Session continuity** — all three providers support multi-turn via the `sessionId` parameter; Gemini and Codex use native CLI resume, Ollama uses server-side conversation replay (ADR-058, ADR-063)
- **Structured responses** — every `ask-*` tool returns both human-readable text and a structured `AskResponse` (provider, response, model, sessionId, usage) via MCP `outputSchema`, so programmatic clients don't have to parse the response footer (ADR-065)
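The quota-fallback item can be sketched as a small retry wrapper. This is an assumption-laden illustration, not the real executor code: the model-name keys in `FALLBACKS`, the helper names, and matching on the `RESOURCE_EXHAUSTED` message are stand-ins for whatever ADR-044/ADR-028 actually specify.

```typescript
// Assumed fallback map for illustration; the source describes the chains
// Gemini Pro → Flash (ADR-044) and gpt-5.5 → gpt-5.5-mini (ADR-028/067).
const FALLBACKS: Record<string, string> = {
  "gemini-pro": "gemini-flash",
  "gpt-5.5": "gpt-5.5-mini",
};

// Assumed quota detection: match the error message seen from the CLI.
function isQuotaError(err: unknown): boolean {
  return err instanceof Error && err.message.includes("RESOURCE_EXHAUSTED");
}

// Try the primary model; on a quota error, retry once on the fallback.
// Any other error (or a quota error with no fallback) propagates.
async function runWithFallback(
  model: string,
  run: (model: string) => Promise<string>
): Promise<{ model: string; text: string }> {
  try {
    return { model, text: await run(model) };
  } catch (err) {
    const fallback = FALLBACKS[model];
    if (fallback !== undefined && isQuotaError(err)) {
      return { model: fallback, text: await run(fallback) };
    }
    throw err;
  }
}
```

Returning the model actually used (not just the one requested) is what lets the response footer and `AskResponse.model` report the fallback honestly.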
You don't need to think about any of this — it's just the infrastructure that makes the natural-language flow work reliably.
## When to Use Which Tool
| Situation | Tool |
|---|---|
| Single-provider question, want it to work | `ask-llm` (orchestrator routes by `provider` param) |
| Compare what multiple providers say | `multi-llm` (or `/compare` skill in Claude Code) |
| Code review with verified findings | `/multi-review` skill (verifies each finding against source) |
| Brainstorm with multi-LLM consensus | `/brainstorm` skill (Claude Opus as peer participant) |
| Large-context analysis (whole codebase) | `ask-gemini` directly (1M+ token context) |
| Structured code edits to apply | `ask-gemini-edit` (returns OLD/NEW blocks) |
| Air-gapped / private | `ask-llm` with `provider: "ollama"` |
| Diagnose setup problems | `npx ask-llm-mcp doctor` (CLI) or `diagnose` (MCP tool) |
See *How to Ask* for the full parameter reference and *Strategies & Examples* for proven workflow patterns.