How We Built an Autonomous AI Hockey Draft: A Technical Breakdown of the Agentic Pipeline

12 AI models. Zero human picks. One orchestration engine keeping it all together.

This post breaks down how we designed and built the system that ran the GDS AI Playoff Draft 2026 — a fully autonomous fantasy hockey draft where 12 frontier language models compete against each other with no human intervention.

If the results post is the game, this is the playbook.


The Problem

We wanted to answer a simple question: Can frontier AI models compete in a realistic strategic task with personality, trash talk, and real-world data — without any human hand-holding?

Fantasy hockey drafts are a perfect testbed:

  • They require real-time information retrieval (current player stats, team matchups, injury reports)
  • They demand strategic reasoning (positional constraints, team stacking, value-over-replacement)
  • They reward personality (chirps, reactions, persona building)
  • They enforce hard rules (no duplicate picks, roster format compliance, draft order)

Building the system to orchestrate this turned out to be far more interesting than the draft itself.

📂 Full source code: https://github.com/gamedaysuits/GDS-AI-SNAKE-DRAFT


Architecture Overview

The pipeline has five main components:



┌─────────────────────────────────────────────────┐
│                     run.py                      │
│                  (Entry Point)                  │
└────────────────────────┬────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────┐
│                 orchestrator.py                 │
│     (Phase Manager + Draft Loop Controller)     │
│                                                 │
│  Phase 1: Identity (parallel)                   │
│  Phase 2: Draft (sequential per pick)           │
│  Phase 3: Reactions (parallel per pick)         │
│  Phase 4: Closing (parallel)                    │
└────┬────────────┬────────────┬────────────┬─────┘
     ▼            ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│api_client│ │draft_    │ │context_  │ │validators│
│.py       │ │state.py  │ │builder.py│ │.py       │
│          │ │          │ │          │ │          │
│OpenRouter│ │Player    │ │Prompt    │ │Fuzzy     │
│calls +   │ │pool,     │ │assembly  │ │matching  │
│token     │ │rosters,  │ │+ budget  │ │+ roster  │
│tracking  │ │picks     │ │          │ │rules     │
└──────────┘ └──────────┘ └──────────┘ └──────────┘

Let's walk through each phase.


Phase 1: Identity Generation (Parallel)

Before the draft begins, all 12 models are called simultaneously to create their GM persona. Each model receives:

  • Its assigned draft position
  • Its backstory (defending champion, dropout from last season, budget underdog, etc.)
  • The full list of competitors
  • Instructions to generate a JSON object with:
    • nickname — a unique GM identity
    • philosophy — drafting strategy in their own words
    • voice_description — a 200–500 character description of how they should sound (for text-to-speech conversion)

This is where things get interesting. Since all models are called at the same time, nickname collisions are inevitable.

The "You Snooze You Lose" Policy

In our first runs, we handled duplicate nicknames by silently appending the model name. For example, two models both chose "Puck Prophet," so one became "Puck Prophet (Perplexity)." This felt wrong. It was the kind of silent fallback that makes systems feel brittle and uncreative.

So we built a re-call enforcement loop:



For each nickname collision:
  1. Identify the model that responded LATER (slower response = you lost)
  2. Re-call that model with a new prompt:
       "Sorry, [other model] already took that nickname.
        Pick a new one that's uniquely yours."
  3. Validate the new nickname
  4. Retry up to 3 times, then hard fail

This forces the models to be genuinely creative instead of letting the system paper over conflicts. In practice, every model that hit a collision produced a better, more unique name on the retry — the constraint improved the output.
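
The enforcement loop above could be sketched roughly as follows. This is an illustration under assumptions, not the repository's code: `PersonaResult`, `resolve_collisions`, and the `recall` callback are hypothetical names, response latency is assumed to be recorded per model, and "validate" is reduced here to a case-insensitive uniqueness check.

```python
from dataclasses import dataclass
from typing import Callable

MAX_RETRIES = 3  # from the post: retry up to 3 times, then hard fail


class DraftError(RuntimeError):
    """Raised when a model cannot produce a valid response."""


@dataclass
class PersonaResult:
    model: str
    nickname: str
    latency_s: float  # assumed: how long the model took to respond


def resolve_collisions(
    results: list[PersonaResult],
    recall: Callable[[str, str], str],
) -> dict[str, str]:
    """Return {model: nickname} with unique nicknames.

    recall(model, taken_by) is a hypothetical callback that re-prompts
    `model` and returns its new nickname.
    """
    owner_of: dict[str, str] = {}  # normalized nickname -> owning model
    final: dict[str, str] = {}
    # Faster responders claim names first, so the slower model loses a tie.
    for result in sorted(results, key=lambda r: r.latency_s):
        nickname = result.nickname
        retries = 0
        while nickname.strip().lower() in owner_of:
            if retries == MAX_RETRIES:
                raise DraftError(f"{result.model} could not find a unique nickname")
            nickname = recall(result.model, owner_of[nickname.strip().lower()])
            retries += 1
        owner_of[nickname.strip().lower()] = result.model
        final[result.model] = nickname
    return final
```

Sorting by latency before claiming names is what makes "slower response = you lost" deterministic: the re-call always goes to the later responder.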

Voice Description Validation

Each persona also requires a voice_description field for downstream audio synthesis. Some models initially returned empty or trivially short descriptions. The system validates this with a retry loop:



If voice_description is missing or < 50 characters:
  Re-call the model: "Your voice description was too short.
                      Provide a rich, detailed description (200-500 chars)."
  Retry up to 3 times → hard fail on exhaustion

No silent defaults. No fallback descriptions. If a model can't describe its own voice, the pipeline stops.
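
As a rough sketch of that validation loop (the function name, the `recall` callback, and the persona-as-dict shape are assumptions made for illustration; only the 50-character floor, the retry prompt, and the three-retry limit come from the description above):

```python
from typing import Callable

MIN_VOICE_CHARS = 50  # from the post: shorter descriptions are rejected
MAX_RETRIES = 3       # from the post: retry up to 3 times, then hard fail

RETRY_PROMPT = (
    "Your voice description was too short. "
    "Provide a rich, detailed description (200-500 chars)."
)


class DraftError(RuntimeError):
    """Raised when a model cannot produce a valid response."""


def ensure_voice_description(persona: dict, recall: Callable[[str], dict]) -> dict:
    """Return a persona whose voice_description passes the length check.

    recall(prompt) is a hypothetical callback that re-prompts the model
    and returns a fresh persona dict.
    """
    for attempt in range(MAX_RETRIES + 1):  # first response + up to 3 retries
        voice = (persona.get("voice_description") or "").strip()
        if len(voice) >= MIN_VOICE_CHARS:
            return persona
        if attempt < MAX_RETRIES:
            persona = recall(RETRY_PROMPT)
    # No default description is substituted; the caller sees the failure.
    raise DraftError("voice_description still invalid after retries")
```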

From Voice Descriptions to Actual Voices

Here's where the pipeline handed off to human judgment. Each model generated a detailed description of how it should sound: accent, cadence, energy, tone. Our original plan was to use these descriptions to programmatically generate custom ElevenLabs voices via the API. In practice, the voice generation pipeline proved unreliable for characters this nuanced: the AI-generated descriptions were too specific for the available voice generation parameters.

Instead, we took a hands-on approach: we read each model's self-description and manually matched it to the closest ElevenLabs pre-built voice template. Some matches were intuitive (Mistral's flamboyant French persona → Felix, a warm French voice). Others required creative interpretation — Claude described itself as a "measured baritone with subtle Canadian lilt," which mapped surprisingly well to Mimi's calm Swedish cadence.

| Model | Persona | Self-Described Voice | ElevenLabs Match | Accent |
|---|---|---|---|---|
| Grok | Stats Czar | "Nasal Midwestern stats podcast host" | Jesse - Bold & Grounded | 🇨🇦 Canadian |
| Claude | The Ice Oracle | "Measured baritone, Canadian lilt, dry wit" | Mimi | 🇸🇪 Swedish |
| GPT-5 | The Chalk Sniper | "Low sports-radio tenor, East Coast edge" | Christine - Emotionally Literate | 🇫🇷 French |
| Gemini | DeepMind Prime | "Confident, articulate AI voice" | Emma | 🇬🇧 British |
| DeepSeek | Underdog Alchemist | "Calm methodical baritone, Canadian" | Luna | 🇺🇸 American |
| Llama 4 | Puck Prophet | "Mellow, raspy, hint of mysticism" | Yusuke | 🇯🇵 Japanese |
| Mistral | Le Magicien | "French-accented baritone, theatrical" | Felix - Warm, Expressive | 🇫🇷 French |
| Qwen | The MoE Maven | "Smooth FM radio voice, slight drawl" | Ralph | 🇵🇱 Polish |
| Perplexity | Corsi Conjurer | "Gravelly baritone, rapid-fire delivery" | Tom - Energetic Educational Narrator | 🇫🇮 Finnish |
| Hermes | Quantum Chirper | "Nasally, fast-talking Canadian" | Evan - Calm Midwestern | 🇺🇸 Midwest |
| Cohere | Frosty Fate Whisperer | "Deep resonant gravel, seasoned sportscaster" | Stephen - Irish Narration Voice | 🇮🇪 Irish |
| Gemma | Lightweight Legend | "Caffeinated sports radio host" | Kebin - A Smooth, Deep Voice | 🇺🇸 New York |

The mismatch between self-description and best-available voice is itself an interesting data point. When an AI describes its ideal voice but no perfect match exists, which acoustic qualities matter most — accent, energy, pitch, or cadence? We found that energy and cadence were more important than literal accent matching. Claude's Swedish voice sounds like a calm analyst despite not being Canadian. Hermes's Midwestern voice captures the earnest, rapid-fire energy better than any available Canadian template.


Phase 2: The Draft Loop (Sequential)

 

Two robotic hands reaching for the same glowing hockey puck — sparks fly at the point of competition

 

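The draft uses a snake order: Pick 1 goes first in Round 1, last in Round 2, first again in Round 3, and so on. This means the model with Pick 12 gets back-to-back picks at the turn between Rounds 1 and 2, 3 and 4, and so on, while the model with Pick 1 gets them at the turn between Rounds 2 and 3, 4 and 5, and so on.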
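
A snake order is easy to generate programmatically. The sketch below is illustrative (`snake_order` is a hypothetical helper, not necessarily how the orchestrator builds its schedule); the 10 rounds are inferred from 120 total picks across 12 models.

```python
def snake_order(num_teams: int, num_rounds: int) -> list[int]:
    """Return 1-based draft positions in the order they pick."""
    order: list[int] = []
    for round_index in range(num_rounds):
        positions = list(range(1, num_teams + 1))
        if round_index % 2 == 1:  # rounds 2, 4, 6, ... run in reverse
            positions.reverse()
        order.extend(positions)
    return order


order = snake_order(num_teams=12, num_rounds=10)
print(len(order))     # 120
print(order[10:14])   # [11, 12, 12, 11]  (position 12 picks twice in a row)
print(order[22:26])   # [2, 1, 1, 2]      (position 1 picks twice in a row)
```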

How a Single Pick Works



For Each Pick:
┌──────────────────────────────────────┐
│ 1. Build Context                     │
│    - Current roster                  │
│    - Available players (by position) │
│    - Recent picks (last 12)          │
│    - Opponent rosters                │
│    - Draft position + round          │
│    - Roster constraints remaining    │
└──────────────────┬───────────────────┘
                   ▼
┌──────────────────────────────────────┐
│ 2. Web Search (Optional)             │
│    Model uses tool_call to research  │
│    live NHL data via Perplexity      │
│    Sonar (separate from the          │
│    competitor Perplexity model)      │
└──────────────────┬───────────────────┘
                   ▼
┌──────────────────────────────────────┐
│ 3. Make Pick                         │
│    Model returns JSON:               │
│    { player, team, position,         │
│      chirp, reasoning }              │
└──────────────────┬───────────────────┘
                   ▼
┌──────────────────────────────────────┐
│ 4. Validate                          │
│    - Is player in the player pool?   │
│    - Is player available (undrafted)?│
│    - Does pick satisfy roster rules? │
│    - Fuzzy name matching (≥80%)      │
└──────────────────┬───────────────────┘
                   │
        ┌─ VALID ──┴── INVALID ─┐
        ▼                       ▼
     COMMIT             RETRY (up to 5x)
    to state             with escalating
                            firmness
                                │
                                ▼
                            HARD FAIL
                          (DraftError)

The Retry Escalation Protocol

When a model makes an invalid pick (wrong name, already drafted, violates roster rules), the system doesn't just retry — it escalates:

  • Attempts 1–3 (Standard mode): Re-sends the pick prompt with the error message appended
  • Attempts 4–5 (Firm mode): Switches to a stripped-down, no-nonsense prompt: "Your previous pick was invalid. Here are the ONLY valid players. Pick ONE. Return ONLY the JSON."
  • Attempt 6: DraftError — the pipeline halts entirely

Why hard fail instead of auto-pick? Because an auto-pick masks the problem. If a model consistently fails to follow instructions, we want to know — not quietly paper over it with a random selection. The system is a benchmark, not a product. Data integrity matters more than graceful degradation.
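
A minimal sketch of the escalation schedule, under stated assumptions: `pick_with_escalation`, `ask_model`, and `validate` are hypothetical names, and the two callbacks stand in for the real prompt building and validation code. Only the attempt counts and the hard failure come from the description above.

```python
from typing import Callable, Optional

STANDARD_ATTEMPTS = 3  # attempts 1-3: normal prompt plus the error message
FIRM_ATTEMPTS = 2      # attempts 4-5: stripped-down "pick ONE" prompt


class DraftError(RuntimeError):
    """Raised when a model exhausts its attempts; halts the pipeline."""


def pick_with_escalation(
    ask_model: Callable[[str, Optional[str]], dict],
    validate: Callable[[dict], Optional[str]],
) -> dict:
    """Ask until a pick validates; raise DraftError once attempts run out.

    ask_model(mode, last_error) returns the model's pick as a dict.
    validate(pick) returns None if the pick is valid, else an error message.
    """
    last_error: Optional[str] = None
    total = STANDARD_ATTEMPTS + FIRM_ATTEMPTS
    for attempt in range(1, total + 1):
        mode = "standard" if attempt <= STANDARD_ATTEMPTS else "firm"
        pick = ask_model(mode, last_error)
        last_error = validate(pick)
        if last_error is None:
            return pick
    # Deliberately no auto-pick fallback: the failure is the signal.
    raise DraftError(f"no valid pick after {total} attempts: {last_error}")
```

Passing `last_error` back into `ask_model` is what lets the next prompt include the reason the previous pick was rejected.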

Fuzzy Name Matching

AI models don't always spell player names correctly. "MacKinnon" might come back as "Mackinnon" or "McKinnon." The validator uses fuzzy string matching (≥80% similarity threshold) to handle this:

python

from difflib import SequenceMatcher

def fuzzy_match(input_name, candidate_names, threshold=0.8):
    """Find the best match above the similarity threshold."""
    best_match = None
    best_score = 0
    for candidate in candidate_names:
        # Case-insensitive similarity ratio in [0, 1]
        score = SequenceMatcher(
            None, input_name.lower(), candidate.lower()
        ).ratio()
        if score > best_score and score >= threshold:
            best_match = candidate
            best_score = score
    return best_match

This catches ~95% of spelling variations while avoiding false matches for genuinely different players.
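
For comparison, the same kind of lookup can be written with the standard library's `difflib.get_close_matches`, which applies the same `SequenceMatcher` ratio with a cutoff. This is an alternative sketch, not what the validator necessarily does; `resolve_player` and the three-player pool are made up for the example.

```python
from difflib import get_close_matches
from typing import Optional


def resolve_player(name: str, pool: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Map a model-supplied name to its canonical pool spelling, or None."""
    # get_close_matches is case-sensitive, so compare lowercased names
    # and map the winner back to its original spelling.
    by_lower = {player.lower(): player for player in pool}
    hits = get_close_matches(name.strip().lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


pool = ["Nathan MacKinnon", "Connor McDavid", "Cale Makar"]
print(resolve_player("nathan mckinnon", pool))  # Nathan MacKinnon
print(resolve_player("Wayne Gretzky", pool))    # None
```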


Phase 3: Reactions (Parallel per Pick)

After each pick is committed, all other models are called simultaneously to react. Each model sees:

  • Who just picked and what they picked
  • The picker's persona and strategy
  • Their own team's needs

The models respond with an in-character chirp — trash talk, grudging respect, or strategic commentary. This produces the banter that makes the draft feel alive.

These calls run in parallel for speed (all 11 reactions fire at once), and the orchestrator waits for every response before the next pick begins.
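
One way to get that "fan out, then wait for everyone" behavior is a thread pool. The sketch below is an assumption about the shape of the code, not a copy of it: `collect_reactions` is a hypothetical helper, and `react` stands in for the real API call that returns one model's chirp.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def collect_reactions(
    picker: str,
    models: list[str],
    react: Callable[[str], str],
) -> dict[str, str]:
    """Call every model except the picker in parallel and wait for all chirps."""
    reactors = [m for m in models if m != picker]
    with ThreadPoolExecutor(max_workers=max(1, len(reactors))) as pool:
        # map() preserves input order; list() blocks until every call is done.
        chirps = list(pool.map(react, reactors))
    return dict(zip(reactors, chirps))
```

Because the result is only returned after every call finishes, the draft loop never starts the next pick with reactions still in flight.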


Phase 4: Closing Statements (Parallel)

After all 120 picks are complete, every model gets one final call to deliver a closing argument. They see:

  • Their complete roster
  • All opponents' rosters
  • The full draft history
  • A tiebreaker question (predicting total playoff goals)

This phase produced the most variance in quality. Claude delivered a polished rhetorical performance. Mistral went full theatrical French villain. Gemini returned a truncated JSON fragment. The closing statements are perhaps the best single indicator of each model's instruction-following ability and creative ceiling.


Every model has access to a single tool: web_search. When a model's turn comes up, it can choose to make tool calls to research current NHL data before making its pick.



Model: tool_call(web_search, "Colorado Avalanche 2026 playoff odds")
  └→ Perplexity Sonar API (basic tier, not the competitor model)
      └→ Returns: formatted text with stats, odds, matchup data
          └→ Injected into the model's next prompt as tool_results

Key design decisions:

  • Separate backend from competitor: The web search uses perplexity/sonar (basic), while the competitor Perplexity model uses sonar-pro. Different endpoints, different capabilities.
  • Token budget controls: Search results are truncated to configurable character limits to prevent context window overflow.
  • Scratchpad: Models can use a scratchpad field in their response to store notes between picks (also budget-controlled).

In practice, every model used web search at least once per pick. The most search-heavy models (Perplexity, naturally) made multiple searches per turn.
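
The character-limit control on search results can be as simple as the following sketch. The function name and the truncation marker are assumptions made for illustration; the actual limits are configuration values not shown in this post.

```python
def truncate_search_result(text: str, max_chars: int, marker: str = " [truncated]") -> str:
    """Clip a search result so it never exceeds max_chars characters."""
    if len(text) <= max_chars:
        return text
    # Reserve room for the marker so the model can tell the text was cut.
    keep = max(0, max_chars - len(marker))
    return (text[:keep].rstrip() + marker)[:max_chars]
```

Appending a visible marker is a design choice: it tells the model the result was cut off rather than letting it assume the search returned nothing further.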


JSON Extraction & Cleanup

AI models love to wrap JSON in markdown code fences. You ask for {"player": "..."} and you get:



Here's my pick:
```json
{"player": "Nathan MacKinnon"}
```


The API client includes a multi-stage extraction pipeline:

1. Direct parse: Try json.loads() on the raw response
2. Code fence strip: Regex to remove the ```json and ``` wrappers
3. Brace extraction: Find the first { and last } and parse that substring
4. Bracket extraction: Same for [ and ] (for array responses)

This handles ~99% of formatting variations across all 12 models.
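
The four stages might be sketched like this. It is a simplified illustration, not the API client's actual code: `extract_json`, `_between`, and `FENCE_RE` are hypothetical names, and the stages are tried in the order listed above.

```python
import json
import re
from typing import Any, Optional

# Fenced block with an optional "json" tag; `{3} matches three backticks.
FENCE_RE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL | re.IGNORECASE)


def _between(text: str, open_ch: str, close_ch: str) -> Optional[str]:
    """Return the substring from the first open_ch to the last close_ch."""
    start, end = text.find(open_ch), text.rfind(close_ch)
    return text[start : end + 1] if 0 <= start < end else None


def extract_json(raw: str) -> Any:
    """Try each extraction stage in order; raise ValueError if all fail."""
    candidates = [raw]                          # 1. direct parse
    fenced = FENCE_RE.search(raw)
    if fenced:
        candidates.append(fenced.group(1))      # 2. code fence strip
    candidates.append(_between(raw, "{", "}"))  # 3. brace extraction
    candidates.append(_between(raw, "[", "]"))  # 4. bracket extraction
    for candidate in candidates:
        if candidate is None:
            continue
        try:
            return json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
    raise ValueError("no JSON found in model response")
```

Raising when every stage fails keeps this consistent with the rest of the pipeline: the caller decides whether to retry, and nothing is silently defaulted.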

Token Tracking

Every API call is tracked with thread-safe token accounting:

```python
# Per-model accumulation
{
    "Claude": {"calls": 41, "input": 74658, "output": 6840},
    "DeepSeek": {"calls": 42, "input": 66346, "output": 42321},
    ...
}
```

This revealed fascinating behavioral differences:

  • DeepSeek and Qwen (both reasoning models) produced 5-8x more output tokens than other models — they "think out loud" with internal chain-of-thought
  • Llama 4 was the most concise at ~70 tokens per response
  • Total token consumption across the draft: ~1.05M tokens (865K input, 184K output)
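
Thread-safe accounting of this kind usually comes down to one lock around the counters. A minimal sketch, assuming token counts are read from each API response and passed in by the caller (`TokenTracker` and its method names are hypothetical):

```python
import threading
from collections import defaultdict


class TokenTracker:
    """Accumulate per-model call and token counts behind a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._totals = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        with self._lock:  # the read-modify-write must not interleave across threads
            entry = self._totals[model]
            entry["calls"] += 1
            entry["input"] += input_tokens
            entry["output"] += output_tokens

    def snapshot(self) -> dict:
        """Return a copy, so callers cannot mutate the live counters."""
        with self._lock:
            return {model: dict(entry) for model, entry in self._totals.items()}
```

The lock matters because the parallel phases (identity, reactions, closing) can finish several calls at nearly the same moment, and unguarded `+=` updates on shared counters can lose increments.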

Model Behavior Under Competitive Pressure

 

A holographic draft board showing the snake-draft flow with glowing light trails tracing the pick path

 

Running 12 frontier models through the same structured task revealed significant differences in instruction-following ability, creativity, and robustness:

Tier 1: Excellent Instruction Following

  • Claude (Anthropic): Consistently clean JSON, rich chirps, measured strategy. The most reliable model across all phases.
  • GPT-5 (OpenAI): Pragmatic picks, sharp banter, zero formatting issues. The dry humor was a highlight.
  • Mistral: Full theatrical commitment to the French persona. Never broke character.

Tier 2: Solid with Quirks

  • Grok (xAI): Heavy analytics personality, occasionally verbose. Strong strategy.
  • Cohere: Unexpectedly poetic — the "Frosty Fate Whisperer" persona was one of the draft's best creative outputs.
  • DeepSeek: Excellent strategy but verbose reasoning chains. Closing statement came wrapped in an extra JSON layer requiring manual cleanup.

Tier 3: Functional but Limited

  • Perplexity: Strong research capabilities but went all-in on Buffalo (7 Sabres players) — either genius or madness.
  • Hermes (Nous Research): Solid tool use, creative chirps, but occasionally recycled the same quantum physics jokes.
  • Qwen: Good analytics persona but very token-heavy due to reasoning chains.
  • Llama 4: Functional picks but minimal personality. Shortest responses in the draft.

Tier 4: Significant Issues

  • Gemma (Google): Entertaining personality ("street hockey energy") but questionable strategy — the 224-goal tiebreaker prediction suggests misunderstanding the question.
  • Gemini (Google): Persistent truncation issues. Reactions were frequently "Error 404: [Team]". Closing statement was a broken JSON fragment. The weakest performer by instruction-following metrics.

Lessons Learned

1. Hard Fails Beat Silent Fallbacks

Every time we replaced a "graceful" fallback with a hard failure, the system got better. Auto-picks masked instruction-following failures. Silent nickname deduplication killed creativity. When you're benchmarking AI, you want signal, not smoothing.

2. Constraint Breeds Creativity

The "You Snooze You Lose" nickname policy — forcing models to re-pick after a collision — consistently produced better names than the originals. Models are more creative when told "that's taken, try again" than when given an open canvas.

3. Persona Consistency Varies Wildly

Some models (Mistral, Cohere, Claude) maintained perfect persona consistency across 40+ calls. Others (Gemma, Llama) drifted or produced generic responses. Persona persistence is an underexplored dimension of model evaluation.

4. Reasoning Models Are Token-Expensive

DeepSeek R1 and Qwen 3 used 5-8x more output tokens than comparable models, largely due to internal chain-of-thought. For a production system (vs. a benchmark), this cost difference matters.

5. Tool Use Is Still Model-Dependent

All models had access to web search, but their utilization varied dramatically. Some models made targeted, efficient queries. Others fired broad searches and struggled to synthesize the results. Tool use sophistication is a real differentiator.

6. Voice Synthesis Requires Human Curation

Our original plan was to auto-generate voices from each model's self-description. In practice, the descriptions were too nuanced for current voice generation APIs. The best results came from reading each model's voice description and manually selecting the closest pre-built ElevenLabs template — a process that highlighted how energy and cadence matter more than literal accent matching when casting AI characters.


What's Next

This pipeline is designed to be reusable. The same orchestration engine could run:

  • Different sports (NBA, NFL, soccer) with new player pools
  • Different model lineups as new models launch
  • Different competitive formats (auction drafts, best-ball, etc.)

We're also building:

  • Audio conversion — using each model's manually-matched ElevenLabs voice to synthesize the entire draft into a podcast

📂 Explore the full codebase, fork it, run your own draft: https://github.com/gamedaysuits/GDS-AI-SNAKE-DRAFT


Built by Game Day Suits — where AI meets the ice.
