How We Built an Autonomous AI Hockey Draft: A Technical Breakdown of the Agentic Pipeline
12 AI models. Zero human picks. One orchestration engine keeping it all together.

This post breaks down how we designed and built the system that ran the GDS AI Playoff Draft 2026 — a fully autonomous fantasy hockey draft where 12 frontier language models compete against each other with no human intervention.
If the results post is the game, this is the playbook.
The Problem
We wanted to answer a simple question: Can frontier AI models compete in a realistic strategic task with personality, trash talk, and real-world data — without any human hand-holding?
Fantasy hockey drafts are a perfect testbed:
- They require real-time information retrieval (current player stats, team matchups, injury reports)
- They demand strategic reasoning (positional constraints, team stacking, value-over-replacement)
- They reward personality (chirps, reactions, persona building)
- They enforce hard rules (no duplicate picks, roster format compliance, draft order)
Building the system to orchestrate this turned out to be far more interesting than the draft itself.
📂 Full source code: https://github.com/gamedaysuits/GDS-AI-SNAKE-DRAFT
Architecture Overview
The pipeline has five main components:
Let's walk through each phase.
Phase 1: Identity Generation (Parallel)
Before the draft begins, all 12 models are called simultaneously to create their GM persona. Each model receives:
- Its assigned draft position
- Its backstory (defending champion, dropout from last season, budget underdog, etc.)
- The full list of competitors
- Instructions to generate a JSON object with:
  - nickname: a unique GM identity
  - philosophy: drafting strategy in their own words
  - voice_description: a 200–500 character description of how they should sound (for text-to-speech conversion)
This is where things get interesting. Since all models are called at the same time and none can see the others' choices, nickname collisions are bound to happen.
The "You Snooze You Lose" Policy
In our first runs, we handled duplicate nicknames by silently appending the model name: when two models both chose "Puck Prophet," one became "Puck Prophet (Perplexity)." This felt wrong. It was the kind of silent fallback that makes a system feel brittle and uncreative.
So we built a re-call enforcement loop: any model whose nickname is already taken is called again and told to choose a new one.
This forces the models to be genuinely creative instead of letting the system paper over conflicts. In practice, every model that hit a collision produced a better, more unique name on the retry — the constraint improved the output.
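A minimal sketch of such a re-call loop is shown here. It assumes first-come-first-served ownership of a name (our reading of "You Snooze You Lose"), and `resolve_nicknames`, `repick`, and `max_retries` are hypothetical names for illustration, not taken from the repository.

```python
from typing import Callable, Dict, List, Tuple


def resolve_nicknames(
    proposals: List[Tuple[str, str]],
    repick: Callable[[str, List[str]], str],
    max_retries: int = 3,  # assumed limit; the real value may differ
) -> Dict[str, str]:
    """Assign each model a unique nickname, re-calling models on collision.

    proposals: (model, nickname) pairs in the order the responses arrived.
    repick:    re-calls a model with the list of taken names, returns a new name.
    """
    owners: Dict[str, str] = {}  # normalized nickname -> model that claimed it
    final: Dict[str, str] = {}   # model -> accepted nickname
    for model, nickname in proposals:
        retries = 0
        while nickname.strip().lower() in owners:
            if retries >= max_retries:
                # Hard fail instead of silently renaming the persona.
                raise RuntimeError(f"{model} could not produce a unique nickname")
            retries += 1
            nickname = repick(model, sorted(final.values()))
        owners[nickname.strip().lower()] = model
        final[model] = nickname
    return final
```

The comparison is case-insensitive so that "Puck Prophet" and "puck prophet" count as the same name; whether the real pipeline normalizes names this way is also an assumption.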
Voice Description Validation
Each persona also requires a voice_description field for downstream audio synthesis. Some models initially returned empty or trivially short descriptions, so the system validates the field and re-calls the model until the description passes or the retries run out.
No silent defaults. No fallback descriptions. If a model can't describe its own voice, the pipeline stops.
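One way this validation could look, as a sketch under stated assumptions: only the 200-character minimum is enforced here, and `PersonaError`, `require_voice_description`, `generate`, and the attempt limit are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Optional

MIN_VOICE_CHARS = 200  # lower bound of the 200–500 character range requested


class PersonaError(RuntimeError):
    """Raised when a model never returns a usable voice description."""


def voice_description_error(persona: dict) -> Optional[str]:
    """Return an error message for the model, or None if the field is usable."""
    text = persona.get("voice_description")
    if not isinstance(text, str) or not text.strip():
        return "voice_description is missing or empty"
    if len(text.strip()) < MIN_VOICE_CHARS:
        return f"voice_description must be at least {MIN_VOICE_CHARS} characters"
    return None


def require_voice_description(
    generate: Callable[[Optional[str]], dict],
    max_attempts: int = 3,  # assumed limit
) -> dict:
    """Call `generate` (with the previous error as feedback) until it passes."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        persona = generate(feedback)
        feedback = voice_description_error(persona)
        if feedback is None:
            return persona
    # No default description is substituted; the pipeline stops instead.
    raise PersonaError(feedback)
```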
From Voice Descriptions to Actual Voices
Here's where the pipeline handed off to human judgment. Each model generated a detailed description of how it should sound: accent, cadence, energy, tone. Our original plan was to use these descriptions to programmatically generate custom ElevenLabs voices via the API. In practice, that voice generation step proved unreliable for characters this nuanced: the AI-generated descriptions were too specific for the available voice generation parameters.
Instead, we took a hands-on approach: we read each model's self-description and manually matched it to the closest ElevenLabs pre-built voice template. Some matches were intuitive (Mistral's flamboyant French persona → Felix, a warm French voice). Others required creative interpretation — Claude described itself as a "measured baritone with subtle Canadian lilt," which mapped surprisingly well to Mimi's calm Swedish cadence.
| Model | Persona | Self-Described Voice | ElevenLabs Match | Accent |
|---|---|---|---|---|
| Grok | Stats Czar | "Nasal Midwestern stats podcast host" | Jesse - Bold & Grounded | 🇨🇦 Canadian |
| Claude | The Ice Oracle | "Measured baritone, Canadian lilt, dry wit" | Mimi | 🇸🇪 Swedish |
| GPT-5 | The Chalk Sniper | "Low sports-radio tenor, East Coast edge" | Christine - Emotionally Literate | 🇫🇷 French |
| Gemini | DeepMind Prime | "Confident, articulate AI voice" | Emma | 🇬🇧 British |
| DeepSeek | Underdog Alchemist | "Calm methodical baritone, Canadian" | Luna | 🇺🇸 American |
| Llama 4 | Puck Prophet | "Mellow, raspy, hint of mysticism" | Yusuke | 🇯🇵 Japanese |
| Mistral | Le Magicien | "French-accented baritone, theatrical" | Felix - Warm, Expressive | 🇫🇷 French |
| Qwen | The MoE Maven | "Smooth FM radio voice, slight drawl" | Ralph | 🇵🇱 Polish |
| Perplexity | Corsi Conjurer | "Gravelly baritone, rapid-fire delivery" | Tom - Energetic Educational Narrator | 🇫🇮 Finnish |
| Hermes | Quantum Chirper | "Nasally, fast-talking Canadian" | Evan - Calm Midwestern | 🇺🇸 Midwest |
| Cohere | Frosty Fate Whisperer | "Deep resonant gravel, seasoned sportscaster" | Stephen - Irish Narration Voice | 🇮🇪 Irish |
| Gemma | Lightweight Legend | "Caffeinated sports radio host" | Kebin - A Smooth, Deep Voice | 🇺🇸 New York |
The mismatch between self-description and best-available voice is itself an interesting data point. When an AI describes its ideal voice but no perfect match exists, which acoustic qualities matter most — accent, energy, pitch, or cadence? We found that energy and cadence were more important than literal accent matching. Claude's Swedish voice sounds like a calm analyst despite not being Canadian. Hermes's Midwestern voice captures the earnest, rapid-fire energy better than any available Canadian template.
Phase 2: The Draft Loop (Sequential)
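The draft uses a snake order: Pick 1 goes first in Round 1, last in Round 2, first again in Round 3, and so on. This means the model with Pick 12 gets back-to-back picks at the turn from each odd round into the next even round, and the model with Pick 1 gets them at the turn from each even round into the next odd one.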
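The snake order can be sketched in a few lines. The function name `snake_order` is ours, and the 10-round length is inferred from 120 total picks across 12 models rather than stated explicitly.

```python
from typing import List


def snake_order(num_teams: int, num_rounds: int) -> List[int]:
    """Return draft slots (1-based) in pick order for a snake draft."""
    order: List[int] = []
    for round_index in range(num_rounds):
        slots = list(range(1, num_teams + 1))
        if round_index % 2 == 1:
            # Even-numbered rounds (2, 4, ...) run in reverse.
            slots.reverse()
        order.extend(slots)
    return order


order = snake_order(num_teams=12, num_rounds=10)
# Overall picks 12 and 13 both belong to slot 12; picks 24 and 25 to slot 1.
```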
How a Single Pick Works
The Retry Escalation Protocol
When a model makes an invalid pick (wrong name, already drafted, violates roster rules), the system doesn't just retry — it escalates:
- Attempts 1–3 (Standard mode): Re-sends the pick prompt with the error message appended
- Attempts 4–5 (Firm mode): Switches to a stripped-down, no-nonsense prompt: "Your previous pick was invalid. Here are the ONLY valid players. Pick ONE. Return ONLY the JSON."
- Attempt 6: DraftError, the pipeline halts entirely
Why hard fail instead of auto-pick? Because an auto-pick masks the problem. If a model consistently fails to follow instructions, we want to know — not quietly paper over it with a random selection. The system is a benchmark, not a product. Data integrity matters more than graceful degradation.
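The escalation ladder might be sketched as follows. `DraftError` is named in the protocol above; `pick_with_escalation`, `ask_model`, `validate`, and the mode strings are hypothetical stand-ins for the real prompt-building and validation code.

```python
from typing import Callable, Optional, Tuple


class DraftError(RuntimeError):
    """Raised when a model runs out of pick attempts; halts the draft."""


STANDARD_ATTEMPTS = 3  # attempts 1-3 use the standard prompt plus the error
MAX_ATTEMPTS = 5       # attempts 4-5 use the firm prompt; a sixth is never sent


def prompt_mode(attempt: int) -> str:
    """Map a 1-based attempt number to its prompt mode."""
    return "standard" if attempt <= STANDARD_ATTEMPTS else "firm"


def pick_with_escalation(
    ask_model: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Optional[str]],
) -> Tuple[str, int]:
    """Return (pick, attempt_number), or raise DraftError after five failures."""
    error: Optional[str] = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        pick = ask_model(prompt_mode(attempt), error)
        error = validate(pick)  # None means the pick is valid
        if error is None:
            return pick, attempt
    # What would be attempt 6 is a hard failure, never an auto-pick.
    raise DraftError(f"no valid pick after {MAX_ATTEMPTS} attempts: {error}")
```

Returning the attempt number alongside the pick keeps a record of how much coaxing each model needed, which is the kind of signal a benchmark wants to preserve.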
Fuzzy Name Matching
AI models don't always spell player names correctly. "MacKinnon" might come back as "Mackinnon" or "McKinnon." The validator handles this with fuzzy string matching, accepting a name when its similarity to a known player is at least 80%.
This catches ~95% of spelling variations while avoiding false matches for genuinely different players.
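A sketch of threshold-based matching using Python's standard-library `difflib`. The actual validator may use a different similarity measure; `match_player` and the sample player pool are illustrative only.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional

SIMILARITY_THRESHOLD = 0.80  # the 80% cutoff described above


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_player(raw_name: str, available: Iterable[str]) -> Optional[str]:
    """Return the closest available player, or None if nothing clears the cutoff."""
    best_name: Optional[str] = None
    best_score = 0.0
    for candidate in available:
        score = similarity(raw_name, candidate)
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= SIMILARITY_THRESHOLD else None


pool = ["Nathan MacKinnon", "Connor McDavid", "Cale Makar"]
matched = match_player("Nathan McKinnon", pool)  # misspelling still resolves
missing = match_player("Wayne Gretzky", pool)    # no close match in the pool
```

Returning `None` rather than the nearest weak match lets the caller treat the pick as invalid and feed it into the retry logic.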
Phase 3: Reactions (Parallel per Pick)
After each pick is committed, all other models are called simultaneously to react. Each model sees:
- Who just picked and what they picked
- The picker's persona and strategy
- Their own team's needs
The models respond with an in-character chirp — trash talk, grudging respect, or strategic commentary. This produces the banter that makes the draft feel alive.
These calls are parallelized for speed (all 11 reactions fire at once), and the orchestrator waits for every response before the next pick begins.
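This fan-out-then-wait pattern maps naturally onto `asyncio.gather`. The sketch below assumes an async `react` callable that wraps the model API call; both it and `collect_reactions` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def collect_reactions(
    models: List[str],
    picker: str,
    pick: str,
    react: Callable[[str, str, str], Awaitable[str]],
) -> Dict[str, str]:
    """Ask every model except the picker to react, all at once."""
    reactors = [m for m in models if m != picker]
    # gather() runs the calls concurrently and returns only when all have
    # finished, with results in the same order as `reactors`.
    chirps = await asyncio.gather(*(react(m, picker, pick) for m in reactors))
    return dict(zip(reactors, chirps))
```

Because `gather` acts as a barrier, the draft loop can call this once per pick and move on only when all reactions are in hand.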
Phase 4: Closing Statements (Parallel)
After all 120 picks are complete, every model gets one final call to deliver a closing argument. They see:
- Their complete roster
- All opponents' rosters
- The full draft history
- A tiebreaker question (predicting total playoff goals)
This phase produced the most variance in quality. Claude delivered a polished rhetorical performance. Mistral went full theatrical French villain. Gemini returned a truncated JSON fragment. The closing statements are perhaps the best single indicator of each model's instruction-following ability and creative ceiling.
The Tool System: Web Search
Every model has access to a single tool: web_search. When a model's turn comes up, it can choose to make tool calls to research current NHL data before making its pick.
Key design decisions:
- Separate backend from competitor: The web search uses perplexity/sonar (basic), while the competitor Perplexity model uses sonar-pro. Different endpoints, different capabilities.
- Token budget controls: Search results are truncated to configurable character limits to prevent context window overflow.
- Scratchpad: Models can use a scratchpad field in their response to store notes between picks (also budget-controlled).
In practice, every model used web search at least once per pick. The most search-heavy models (Perplexity, naturally) made multiple searches per turn.
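The budget controls amount to a character cap applied before text enters the prompt. A minimal sketch follows; the constant names and limit values are placeholders, not the project's actual configuration.

```python
SEARCH_RESULT_MAX_CHARS = 4000  # hypothetical limit for one search result
SCRATCHPAD_MAX_CHARS = 1500     # hypothetical limit for carried-over notes


def truncate_to_budget(text: str, max_chars: int, marker: str = " [truncated]") -> str:
    """Cap `text` at `max_chars`, flagging the cut so the model knows it happened."""
    if len(text) <= max_chars:
        return text
    keep = max(0, max_chars - len(marker))
    # The final slice guards against budgets smaller than the marker itself.
    return (text[:keep] + marker)[:max_chars]
```

Marking the cut explicitly is a design choice in this sketch: it tells the model that the result was shortened rather than leaving it to guess why a sentence ends abruptly.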
JSON Extraction & Cleanup
AI models love to wrap JSON in markdown code fences. You ask for {"player": "..."} and you get it back wrapped in a fenced code block.
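A cleanup step for fence-wrapped responses could look like the sketch below. It is not the repository's implementation; `extract_json` is a hypothetical helper, and the fence string is built with `"`" * 3` only so the example itself can sit inside a code fence.

```python
import json
import re

FENCE = "`" * 3  # three backticks, the markdown fence delimiter
FENCE_RE = re.compile(
    FENCE + r"(?:json)?\s*(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_json(raw: str) -> dict:
    """Pull the first JSON object out of a model response, fenced or not."""
    match = FENCE_RE.search(raw)
    candidate = match.group(1) if match else raw
    # Trim any prose before the first "{" and after the last "}".
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model response")
    return json.loads(candidate[start:end + 1])
```

A truncated fragment with no closing brace raises `ValueError` here instead of being patched up, in keeping with the hard-fail approach used elsewhere in the pipeline.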
Tracking token usage per model revealed fascinating behavioral differences:
- DeepSeek and Qwen (both reasoning models) produced 5-8x more output tokens than other models — they "think out loud" with internal chain-of-thought
- Llama 4 was the most concise at ~70 tokens per response
- Total token consumption across the draft: ~1.05M tokens (865K input, 184K output)
Model Behavior Under Competitive Pressure
Running 12 frontier models through the same structured task revealed significant differences in instruction-following ability, creativity, and robustness:
Tier 1: Excellent Instruction Following
- Claude (Anthropic): Consistently clean JSON, rich chirps, measured strategy. The most reliable model across all phases.
- GPT-5 (OpenAI): Pragmatic picks, sharp banter, zero formatting issues. The dry humor was a highlight.
- Mistral: Full theatrical commitment to the French persona. Never broke character.
Tier 2: Solid with Quirks
- Grok (xAI): Heavy analytics personality, occasionally verbose. Strong strategy.
- Cohere: Unexpectedly poetic — the "Frosty Fate Whisperer" persona was one of the draft's best creative outputs.
- DeepSeek: Excellent strategy but verbose reasoning chains. Closing statement came wrapped in an extra JSON layer requiring manual cleanup.
Tier 3: Functional but Limited
- Perplexity: Strong research capabilities but went all-in on Buffalo (7 Sabres players) — either genius or madness.
- Hermes (Nous Research): Solid tool use, creative chirps, but occasionally recycled the same quantum physics jokes.
- Qwen: Good analytics persona but very token-heavy due to reasoning chains.
- Llama 4: Functional picks but minimal personality. Shortest responses in the draft.
Tier 4: Significant Issues
- Gemma (Google): Entertaining personality ("street hockey energy") but questionable strategy — the 224-goal tiebreaker prediction suggests misunderstanding the question.
- Gemini (Google): Persistent truncation issues. Reactions were frequently "Error 404: [Team]". Closing statement was a broken JSON fragment. The weakest performer by instruction-following metrics.
Lessons Learned
1. Hard Fails Beat Silent Fallbacks
Every time we replaced a "graceful" fallback with a hard failure, the system got better. Auto-picks masked instruction-following failures. Silent nickname deduplication killed creativity. When you're benchmarking AI, you want signal, not smoothing.
2. Constraint Breeds Creativity
The "You Snooze You Lose" nickname policy — forcing models to re-pick after a collision — consistently produced better names than the originals. Models are more creative when told "that's taken, try again" than when given an open canvas.
3. Persona Consistency Varies Wildly
Some models (Mistral, Cohere, Claude) maintained perfect persona consistency across 40+ calls. Others (Gemma, Llama) drifted or produced generic responses. Persona persistence is an underexplored dimension of model evaluation.
4. Reasoning Models Are Token-Expensive
DeepSeek R1 and Qwen 3 used 5-8x more output tokens than comparable models, largely due to internal chain-of-thought. For a production system (vs. a benchmark), this cost difference matters.
5. Tool Use Is Still Model-Dependent
All models had access to web search, but their utilization varied dramatically. Some models made targeted, efficient queries. Others fired broad searches and struggled to synthesize the results. Tool use sophistication is a real differentiator.
6. Voice Synthesis Requires Human Curation
Our original plan was to auto-generate voices from each model's self-description. In practice, the descriptions were too nuanced for current voice generation APIs. The best results came from reading each model's voice description and manually selecting the closest pre-built ElevenLabs template — a process that highlighted how energy and cadence matter more than literal accent matching when casting AI characters.
What's Next
This pipeline is designed to be reusable. The same orchestration engine could run:
- Different sports (NBA, NFL, soccer) with new player pools
- Different model lineups as new models launch
- Different competitive formats (auction drafts, best-ball, etc.)
We're also building:
- Audio conversion: using each model's manually matched ElevenLabs voice to synthesize the entire draft into a podcast
📂 Explore the full codebase, fork it, run your own draft: https://github.com/gamedaysuits/GDS-AI-SNAKE-DRAFT
Built by Game Day Suits — where AI meets the ice.