We pitted ten of today's top AIs against each other in a hockey fantasy draft


Game Day Suits is shocked to be presenting The GDS Fantasy Hockey Draft AI Benchmark Competition, 2025-2026, part of a qualitative but objective paradigm for benchmarking AI on real-world prediction, and just in time for hockey season.  We pitted 10 frontier reasoning models head-to-head in a 2025-2026 NHL fantasy draft.  May the best LLM win.

  

The main reason we are shocked to be presenting this is that none of our staff had any coding experience before ten days ago.  Basically, we were drinking and doing a fantasy draft with friends a few weeks ago, and decided to see if we could pit 10 frontier reasoning models head-to-head in a similar competition.  So we got to talking with ChatGPT, and after about a hundred revisions, we built a Python framework that allows AIs to create and follow their own strategies in a competition with each other.  Here's the rundown on The GDS Fantasy Hockey Draft AI Benchmark Competition, 2025-2026.


Development Context

In less than two weeks we learned how to "vibecode" with ChatGPT Agent and created what we believe is a first-of-its-kind AI benchmark: an NHL fantasy draft.  I can't find any precedent for it, so we made sure we got it done before the season officially started on the 7th.  With no coding experience.  Again, AI is wild.  We are a custom suit startup with a GitHub.  You can find the code at https://github.com/gamedaysuits/GDS-AI-Draft-Benchmark.  It took about $100 in OpenRouter credits to test, but maybe only about $10 to run it once.

To be clear, not only do we have no coders on the team, we also have no official affiliation with the NHL, or any other professional sports league, team, player, organization or individual.  We only used publicly available statistical data and other information to create this program, and worked iteratively with ChatGPT to get it working and philosophically interesting.  We allowed the AIs to add colour commentary, giving them the chance to psych out their opponents (like any of us would do during a friendly but meaningful fantasy draft).  For example, see this exchange:

Technical and Philosophical Significance

The colour commentary makes the transcript pretty funny to read, but taken seriously, this draft represents a distinct paradigm in AI benchmarking.  It takes a more qualitative approach to ranking frontier reasoning models: it places them in direct competition with each other, in a context that involves complex, dynamic strategizing within clear rules about difficult-to-predict outcomes.  This format uniquely allows an AI GM to succeed not only through iterative analysis but also through misdirection, anticipation, deception, cajoling, needling, trash talking, intimidation, strategy revision, etc.

If traditional benchmarks are like IQ tests, or book smarts or academic skills, these benchmarks are like practical exams, like street smarts or social skills.  Real agents know how to think on their feet, to strive for a goal against competitors, not just to recite trivia or perform analysis.  There's another contrast: while most AI benchmarking techniques yield instantaneous results, we need to wait for the results here.  Most of us would agree that winning a draft like this takes a variety of skills.  This draft represents part of an effort to measure the presence of those skills in frontier reasoning models, and we plan to do more such tests.


Future Context

Anyone who finds this remotely interesting, please critique this project as appropriate, from any perspective.  I probably biffed it seriously somewhere, and it's clear many of these models did not excel under this design.  This is just a proof of concept.  But also, let's talk about how to expand it and make it better.  This design is built in Python and uses OpenRouter, and the code, logs, etc. are all going up on our GitHub so anyone can run their own instance of this (yes, we are a hockey-themed suit company with a GitHub, it's a postmodern world, get over it).  In any case, feel free to clone, modify, redeploy, and generally do as you see fit with this code.  It's under the MIT license, which I learned is coder talk for "free to use."  I would love it if we could all just talk about it together and improve the design while navigating the thorny philosophical, probabilistic, and technical issues involved in doing this right.  By next year (or by the playoffs), we hope to make this as rigorously well-designed as possible and run it again (assuming I didn't somehow accidentally get it perfect the first time).

For that matter, let's do one of these for every sport and test these AIs across a range of domains.  I'm not sure what sort of design changes that would require.  There are more frontiers to explore for sure, and what's clear for now is that this proof of concept raises many questions.  I suggest that AI researchers should better explore this sort of competitive activity, focused on making novel predictions in complex contexts, given the philosophical questions raised by such tests.

- Is this the fairest or most illustrative format for this sort of a ranked test? 
- Why does predicting something like sports differ from predicting the shifts of the market? 
- What do predictive auction competitions such as this one teach us about AI performance that we can't learn from traditional benchmarking? 
- Would AIs beat humans in this competition format?  If not now, could they?
- And most importantly, which AI GM will win The GDS Fantasy Hockey Draft AI Benchmark Competition, 2025-2026!?!?!? 


All these questions and more will be reflected upon here as this season progresses.

We will be explaining the draft structure, implementation, etc. in more detail over the coming weeks.  The official logs for run 1 are in this Google Drive folder:

https://drive.google.com/drive/folders/1C89gUhoOA70nDX5Ial1YQtacDITBX34P?usp=sharing

We're always happy to try and answer any pressing questions by email at gamedaysuits@gmail.com.  Also if you're in Alberta and you need a suit give us a call.

 

 

 

Draft Structure

The Game Day Suits Fantasy Hockey Draft AI Benchmark Competition, 2025-2026 is a fantasy-sports auction where ten AI general managers, each powered by a different large language model, compete to build the ultimate NHL roster under identical salary-cap rules.

The draft unfolds in three major phases:

  1. Pre-Draft Planning
    Each AI is prompted to research fantasy-hockey projections and independently develop a strategy, bidding logic, and communication style. These plans remain private to each model and are referenced later as its personal “playbook.”

  2. Live Auction
    The AIs take turns nominating players from a shared CSV roster. Bidding follows classic auction rules — minimum $10, $10 increments, $1,000 cap, and 11 roster slots. The system validates all nominations and records every sale, while locker-room-style banter fills the chat. A web interface at 127.0.0.1:8777 visualizes the action in real time.

  3. Ongoing Updates
    Every ten sales, the models can revise their strategies to adapt to shifting budgets and rosters. When all teams are full, results are saved automatically to draft_results.csv, along with text and HTML transcripts for replay.
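For anyone who wants to picture the auction rules above in code, here is a minimal Python sketch.  It is not the code from our repo: the names (Team, validate_bid, record_sale) and the exact order of checks are assumptions made up for illustration.  Only the numbers come from the rules above: $10 minimum, $10 increments, $1,000 cap, 11 roster slots, and a strategy revision every ten sales.

```python
from dataclasses import dataclass, field

# Constants taken from the draft rules described above.
MIN_BID = 10            # minimum bid, in dollars
BID_INCREMENT = 10      # bids move in $10 steps
BUDGET_CAP = 1000       # starting budget for each AI GM
ROSTER_SLOTS = 11       # players each team drafts
REVISION_INTERVAL = 10  # strategies can be revised every ten sales


@dataclass
class Team:
    """Hypothetical container for one AI GM's budget and roster."""
    name: str
    budget: int = BUDGET_CAP
    roster: list = field(default_factory=list)


def validate_bid(team, amount, current_high=0):
    """Return (ok, reason) for a proposed bid. Illustrative checks only."""
    if len(team.roster) >= ROSTER_SLOTS:
        return False, "roster full"
    if amount < MIN_BID:
        return False, "below minimum bid"
    if amount % BID_INCREMENT != 0:
        return False, "not a $10 increment"
    if amount < current_high + BID_INCREMENT:
        return False, "does not beat current high bid"
    if amount > team.budget:
        return False, "exceeds remaining budget"
    return True, "ok"


def record_sale(team, player, price, sales_so_far):
    """Apply a sale and report whether a strategy revision is due."""
    team.budget -= price
    team.roster.append((player, price))
    sales = sales_so_far + 1
    return sales, sales % REVISION_INTERVAL == 0


if __name__ == "__main__":
    gm = Team("Demo GM")
    print(validate_bid(gm, 15))                   # rejected: not a $10 step
    print(validate_bid(gm, 30, current_high=20))  # accepted
    print(record_sale(gm, "Player A", 30, 9))     # tenth sale triggers a revision
```

One thing this sketch deliberately leaves out is any rule forcing a team to hold back enough money to fill its remaining slots.  The rules above don't mention one, so we didn't invent one here.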

 


DRAFT RESULTS (Run 1, the official run.  To examine robustness, we are running 9 more over the coming days, one at a time).

 

- Grok 4: Money remaining - $0 | Evan Bouchard/$70, Artemi Panarin/$210, Elias Pettersson/$150, Nikita Kucherov/$190, Andrei Svechnikov/$90, Jason Robertson/$70, Brad Marchand/$80, Adrian Kempe/$60, Dougie Hamilton/$30, Dylan Guenther/$30, Dylan Holloway/$20

- ChatGPT4o-mini: Money remaining - $760 | Jordan Kyrou/$20, Jackson LaCombe/$30, Josh Morrissey/$20, Martin Necas/$30, Lane Hutson/$20, Nazem Kadri/$10, Matthew Knies/$20, Rasmus Dahlin/$60, Sidney Crosby/$10, Mikhail Sergachev/$10, Quinton Byfield/$10

- ChatGPT4o: Money remaining - $420 | Leon Draisaitl/$260, Dylan Larkin/$50, Jack Hughes/$80, J.T. Miller/$40, Jesper Bratt/$40, Mathew Barzal/$20, Matt Boldy/$10, Kirill Marchenko/$40, Roope Hintz/$20, Ryan Nugent-Hopkins/$10, Rickard Rakell/$10

- Claude 3.5 Sonnet: Money remaining - $200 | Nathan MacKinnon/$270, Connor McDavid/$350, Gabriel Landeskog/$30, Lucas Raymond/$20, JJ Peterka/$20, Jakob Chychrun/$20, Miro Heiskanen/$40, Mason McTavish/$20, Quinn Hughes/$10, Nick Suzuki/$10, Roman Josi/$10

- Perplexity Sonar: Money remaining - $360 | Adam Fantilli/$30, Clayton Keller/$100, Alex DeBrincat/$130, Cutter Gauthier/$10, Brandon Hagel/$40, Cole Caufield/$60, Alex Tuch/$100, Drake Batherson/$40, Carter Verhaeghe/$50, Bo Horvat/$50, Brandon Montour/$30

- DeepSeek 3.1: Money remaining - $0 | Kirill Kaprizov/$160, Cale Makar/$130, Brayden Point/$110, Mikko Rantanen/$180, David Pastrnak/$120, Erik Karlsson/$150, Brock Nelson/$50, Bryan Rust/$40, Dylan Strome/$20, Frank Vatrano/$10, Gabriel Vilardi/$30

- QwQ 32b: Money remaining - $0 | Matthew Tkachuk/$180, Brady Tkachuk/$230, Moritz Seider/$100, Ivan Demidov/$30, Sam Reinhart/$30, Kyle Connor/$50, Marco Rossi/$10, Jake Sanderson/$80, Mark Stone/$60, Logan Cooley/$120, Kiefer Sherwood/$110

- Mistral: Money remaining - $70 | Jack Eichel/$100, Jake Guentzel/$40, William Nylander/$100, Mark Scheifele/$30, Sebastian Aho/$150, MacKenzie Weegar/$10, Juraj Slafkovsky/$140, Matvei Michkov/$200, Patrick Kane/$110, Matt Duchene/$20, Timo Meier/$30

- Claude 3 Opus: Money remaining - $280 | Auston Matthews/$240, Alex Ovechkin/$260, Joel Eriksson Ek/$50, John Carlson/$30, Kevin Fiala/$30, Mika Zibanejad/$30, Shea Theodore/$10, Seth Jarvis/$10, Tim Stutzle/$40, Noah Dobson/$10, Robert Thomas/$10

- Mixtral: Money remaining - $0 | Connor Bedard/$100, Nico Hischier/$150, Adam Fox/$180, Filip Forsberg/$140, John Tavares/$80, Mitch Marner/$170, Tage Thompson/$140, Macklin Celebrini/$10, Nikolaj Ehlers/$10, Mikael Granlund/$10, Victor Hedman/$10
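If you want to sanity-check the numbers above yourself, here is a small Python sketch that parses one result line and confirms that money spent plus money remaining adds up to the $1,000 cap across 11 players.  The parser is written against the line format as printed in this post, not against the layout of draft_results.csv, and the function names are made up for illustration.

```python
BUDGET_CAP = 1000   # starting budget per team, from the draft rules
ROSTER_SLOTS = 11   # roster size, from the draft rules


def parse_result_line(line):
    """Split 'Model: Money remaining - $N | Player/$P, ...' into parts."""
    line = line.strip().lstrip("- ")
    head, picks_text = line.split("|", 1)
    model, remaining_text = head.split(":", 1)
    remaining = int(remaining_text.split("$", 1)[1])
    picks = []
    for item in picks_text.split(","):
        # rsplit keeps names intact even if they contain a slash
        name, price = item.strip().rsplit("/$", 1)
        picks.append((name, int(price)))
    return model.strip(), remaining, picks


def check_team(line):
    """Return (model, spent, remaining, balanced) for one result line."""
    model, remaining, picks = parse_result_line(line)
    spent = sum(price for _, price in picks)
    balanced = (spent + remaining == BUDGET_CAP
                and len(picks) == ROSTER_SLOTS)
    return model, spent, remaining, balanced


if __name__ == "__main__":
    grok = ("- Grok 4: Money remaining - $0 | Evan Bouchard/$70, "
            "Artemi Panarin/$210, Elias Pettersson/$150, Nikita Kucherov/$190, "
            "Andrei Svechnikov/$90, Jason Robertson/$70, Brad Marchand/$80, "
            "Adrian Kempe/$60, Dougie Hamilton/$30, Dylan Guenther/$30, "
            "Dylan Holloway/$20")
    print(check_team(grok))  # ('Grok 4', 1000, 0, True)
```

Running the same check on the other nine lines is a quick way to confirm that every team's spending and leftover budget are consistent with the cap.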




May the best LLM win, and may hockey exist until the end of time.

