## Model Selection
Each provider auto-selects a sensible default model, with automatic fallback to a lighter model on quota or availability errors. Most users should never override the `model` parameter: the defaults are tuned for quality, and the fallback chain handles failures.
### Defaults & Fallbacks
| Provider | Default | Fallback | Trigger |
|---|---|---|---|
| Gemini | `gemini-3.1-pro-preview` | `gemini-3-flash-preview` | `RESOURCE_EXHAUSTED` quota error or "exhausted your capacity" pattern (ADR-044) |
| Codex | `gpt-5.5` | `gpt-5.5-mini` | Quota errors (`rate_limit_exceeded`, 429, `insufficient_quota`) (ADR-028, bumped per ADR-067) |
| Ollama | `qwen2.5-coder:7b` | `qwen2.5-coder:1.5b` | Model-not-found error (e.g., 7b not pulled but 1.5b is) |
The fallback fires automatically inside the executor: your client sees a successful response with `usage.fellBack: true` in the structured output, plus a `[Gemini stats: ... fell back]` annotation in the formatted text.
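To make that concrete, here is a sketch of the usage block a fallen-back Gemini response might carry (illustrative values; the field set follows the per-call usage object described under "Track What You're Spending" below):

```json
{
  "structuredContent": {
    "usage": {
      "provider": "gemini",
      "model": "gemini-3-flash-preview",
      "inputTokens": 48210,
      "outputTokens": 1630,
      "cachedTokens": 0,
      "thinkingTokens": 512,
      "durationMs": 9400,
      "fellBack": true
    }
  }
}
```

The sketch assumes `model` reports the model that actually answered, so `fellBack: true` alongside `gemini-3-flash-preview` tells you Pro hit quota for that call.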
### Choosing a Provider
Different providers excel at different things. Pick by what you're doing, not by which is "best":
| Task | Suggested provider | Why |
|---|---|---|
| Whole-codebase review | Gemini | 1M+ token context fits things others can't |
| Targeted code reasoning, refactor critique | Codex | GPT-5.5's strength is dense code reasoning at moderate context size |
| Private / air-gapped analysis | Ollama | Runs locally, nothing leaves your machine |
| "What do they all think?" comparison | Multi-LLM (multi-llm tool or /compare skill) | Parallel dispatch, see all responses side-by-side |
| Code review with verified findings | /multi-review skill | Gemini + Codex in parallel, then verifies each finding against source |
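For the comparison row above, a `multi-llm` call might look like the single-provider examples in the next section; the `providers` array and argument names here are assumptions for illustration, not confirmed API:

```json
{
  "name": "multi-llm",
  "arguments": {
    "providers": ["gemini", "codex", "ollama"],
    "prompt": "Review this diff for race conditions"
  }
}
```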
### Overriding the Model
Pass `model` explicitly when you have a reason to:
```
Use ask-llm with provider gemini and model gemini-3-flash-preview to quickly check this CSS file
```

Or programmatically:

```json
{
  "name": "ask-llm",
  "arguments": {
    "provider": "gemini",
    "model": "gemini-3-flash-preview",
    "prompt": "..."
  }
}
```

For Codex, common overrides:

```
Use ask-codex with model gpt-5.5-mini to summarize this commit
```

For Ollama, you can request any model you've pulled:

```bash
ollama pull deepseek-coder:6.7b
```

```
Use ask-ollama with model deepseek-coder:6.7b to review this implementation
```

### Token Limits & Cost
| Provider | Context window | Cost model |
|---|---|---|
| Gemini Pro | ~1M tokens (~250k LOC) | Free tier via OAuth, paid tiers via API key |
| Gemini Flash | ~1M tokens | Cheaper than Pro; fallback target for quota relief |
| Codex GPT-5.5 | Per OpenAI's published context window | Per OpenAI billing |
| Codex GPT-5.5-mini | Smaller context | Cheaper; fallback target |
| Ollama | Per model (e.g., 32k for qwen2.5-coder) | Free — runs locally |
### Track What You're Spending
Token usage is exposed live via:
- Per-call: `result.structuredContent.usage` on every `ask-*` tool response (provider, model, inputTokens, outputTokens, cachedTokens, thinkingTokens, durationMs, fellBack); see ADR-054
- Per-session aggregate: call the `get-usage-stats` MCP tool, or read the `usage://current-session` MCP Resource for a JSON snapshot (see the sketch below)
- In the REPL: type `/usage` for a markdown-formatted breakdown
This is in-memory only — no persistence to disk, resets when the MCP server restarts.
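To make the session aggregate concrete, here is a sketch of what the `usage://current-session` snapshot might contain; the per-provider grouping and counter names are illustrative assumptions, with the token fields mirroring the per-call usage object:

```json
{
  "calls": 5,
  "byProvider": {
    "gemini": {
      "inputTokens": 96400,
      "outputTokens": 4820,
      "cachedTokens": 21000,
      "thinkingTokens": 1100,
      "fallbacks": 1
    },
    "codex": {
      "inputTokens": 8300,
      "outputTokens": 1540,
      "cachedTokens": 0,
      "thinkingTokens": 0,
      "fallbacks": 0
    }
  }
}
```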
### Recommendations by Use Case
- General code review → defaults are correct; let the fallback chain handle quota
- Whole-codebase analysis → `ask-gemini` (Pro); Flash auto-kicks in if Pro is exhausted
- Quick fixes, fast iteration → request Flash or `gpt-5.5-mini` explicitly to skip the Pro→fallback round-trip
- Privacy-sensitive code → `ask-ollama`; code never leaves your machine
- Multi-perspective debate → `multi-llm` or `/brainstorm` skill; Claude weighs verified vs inferred