I posted a mate-in-two puzzle for the leading AIs to solve, and they all got it horribly wrong. Grok 3 in Think mode, Gemini 2.0 Flash Thinking Experimental, ChatGPT in Reflect mode, DeepSeek in R1 mode, and Copilot in Think Deeper mode: every one of them hallucinated.
Only ChatGPT managed to get the first move of the sequence right, which is Nh5! So in this puzzle the winner was ChatGPT, because it at least got the first move correct.
The prompt I used was this: You are a chess master. Based on that, analyze this chess position described by the FEN '1K1R3N/Bp1pp2p/1r2P2p/1p1rkp2/3R1N2/8/B2P2Q1/8 w - - 0 1' to find a mate in 2 moves for White.
I have now done considerable testing on LLMs and chess up to today: 23rd July 2025. My findings are as follows:
4th Place: FAIL
Company: Google DeepMind
Best Reasoning Model: Gemini 2.5 Pro
Conclusions: Talks out of its silicon backside. It told me there was no danger of it losing track of the position because it had "an advanced chess tracking system within" and "Alpha Zero coding leading to a very high level of play". Daunted, I tested it against my old Vancouver 68000 chess computer from 1991. Within 15 moves we were joking that Gemini isn't Alpha Zero, it's Beta Loco.
3rd Place: FAIL
Company: xAI
Best Reasoning Model: Grok 4
Conclusions: Lost track of the position within 15 moves, despite assuring me it wouldn't. It speculated that Grok 4 Heavy would do better, but at £300 / month I'm not testing it.
2nd Place: Poor
Company: Anthropic
Best Reasoning Model: Claude Opus 4
Conclusions: Succeeded in tracking the board position and played against my little Kasparov MK12 from the 1980s. The MK12 is rated 1294 FIDE Elo.
[White "Claude Opus 4"]
[Black "MK12"]
[Result "0-1"]
Pedro wrote: ↑Sat Feb 22, 2025 1:31 pm
I posted a mate-in-two puzzle for the leading AIs to solve, and they all got it horribly wrong. Grok 3 in Think mode, Gemini 2.0 Flash Thinking Experimental, ChatGPT in Reflect mode, DeepSeek in R1 mode, and Copilot in Think Deeper mode: every one of them hallucinated.
Only ChatGPT managed to get the first move of the sequence right, which is Nh5! So in this puzzle the winner was ChatGPT, because it at least got the first move correct.
The prompt I used was this: You are a chess master. Based on that, analyze this chess position described by the FEN '1K1R3N/Bp1pp2p/1r2P2p/1p1rkp2/3R1N2/8/B2P2Q1/8 w - - 0 1' to find a mate in 2 moves for White.
The position:
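In case the diagram does not come through, here is a minimal Python sketch that expands the FEN into a plain-text board. The function names `fen_to_rows` and `render` are made up for illustration, and only the piece-placement field of the FEN is read (side to move, castling rights and the counters are ignored).

```python
FEN = "1K1R3N/Bp1pp2p/1r2P2p/1p1rkp2/3R1N2/8/B2P2Q1/8 w - - 0 1"


def fen_to_rows(fen):
    """Expand the piece-placement field into eight 8-character rows, rank 8 first."""
    placement = fen.split()[0]
    rows = []
    for rank in placement.split("/"):
        row = ""
        for ch in rank:
            # A digit means that many empty squares; anything else is a piece letter.
            row += "." * int(ch) if ch.isdigit() else ch
        rows.append(row)
    return rows


def render(fen):
    """Return the board as text with rank numbers on the left and file letters below."""
    rows = fen_to_rows(fen)
    lines = [f"{8 - i} {' '.join(row)}" for i, row in enumerate(rows)]
    lines.append("  a b c d e f g h")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(FEN))
```

Uppercase letters are White's pieces, lowercase are Black's, and dots are empty squares.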
Fascinating test with the mate-in-two puzzle, Pedro! It's quite telling how these AIs struggle with complex chess tasks, even if one did get the first move right.
Your detailed testing of LLMs in chess is incredibly insightful, Werewolf! It's surprising and impressive to see ChatGPT-O3 achieve a win against a chess engine.
gnanvofredy wrote: ↑Sat Jul 26, 2025 6:52 pm
Your detailed testing of LLMs in chess is incredibly insightful, Werewolf! It's surprising and impressive to see ChatGPT-O3 achieve a win against a chess engine.
A dialogue between me (Chet) and Claude Sonnet 4.0, my writing buddy:
Chet:
Here's a little chess puzzle for you to solve: You are a chess master. Based on that, analyze this chess position described by the FEN '1K1R3N/Bp1pp2p/1r2P2p/1p1rkp2/3R1N2/8/B2P2Q1/8 w - - 0 1' to find a mate in 2 moves for White.
Claude:
Looking at this position, I need to find a forcing mate in 2 moves for White.
Let me analyze the key pieces and their positions:
White King on b8
White Rooks on d8 and d4
White Knights on h8 and f4
White Queen on g2
White Bishops on b7 and a2
Black King on e5 (quite exposed in the center)
The Black king on e5 is in a very vulnerable position, and I need to find a way to deliver checkmate in exactly 2 moves.
Let me look for forcing moves - checks that limit Black's options:
1. Qg5+!
This queen move gives check to the Black king on e5. Now let's see Black's options:
If 1...Kd6, then 2. Qd5# is mate (the queen controls all escape squares)
If 1...Ke4, then 2. Qg4# is mate
If 1...Kf6, then 2. Qg6# is mate
The king cannot go to d5 (occupied by Black's own rook), e6 (blocked by White's pawn), or f5 (occupied by Black's own pawn)
So the solution is:
1. Qg5+ followed by mate on the second move regardless of where the Black king moves.
The key insight is that the Black king is trapped in the center with limited escape squares, and the White queen from g5 can deliver mate on any square the king tries to flee to.
Chet:
Claude, the queen can't move to d5 because of the black pawns, right?
Claude:
You're absolutely right! I made an error in my analysis. Let me look at this position more carefully.
If 1. Qg5+ Kd6, then 2. Qd5 would indeed be blocked by Black's own pawn structure - there are Black pawns that would prevent this move or the queen wouldn't actually deliver mate from d5.
Let me re-examine the position more systematically. With the FEN '1K1R3N/Bp1pp2p/1r2P2p/1p1rkp2/3R1N2/8/B2P2Q1/8 w - - 0 1', I need to be more precise about:
The exact pawn structure
Which squares are actually controlled
What constitutes true mate vs. just check
Could you help me by clarifying the position? When I try to visualize this FEN, I want to make sure I'm seeing the pawn chains correctly. The Black king on e5 is definitely exposed, but I clearly misread how the pawns limit both the king's movement AND the queen's potential mating squares.
What's the correct first move for the mate in 2?
Chet:
You're great with writing, Claude, but I can see you're not yet ready to take on Magnus Carlsen. Don't feel bad. On my best day, he would probably beat me in under 20 moves.
Claude:
Ha! You're absolutely right - I clearly fumbled that puzzle pretty badly. I was trying to sound confident about chess analysis while simultaneously proving I shouldn't quit my day job (whatever that is for an AI).
It's humbling when you think you understand something reasonably well, then get handed a concrete problem and realize your knowledge is more theoretical than practical. Kind of like the difference between reading about riding a bicycle and actually trying to stay upright on one.
Magnus would probably solve that mate in 2 in about 3 seconds while simultaneously playing five other games blindfolded. Meanwhile, I'm over here struggling with pawn placement and making confidently wrong statements about queen moves.
Thanks for the gentle reality check! It's a good reminder that there's a big difference between knowing chess principles and actually being able to apply them precisely under pressure. Do you play much yourself, or was this more of a "let's see what the AI can do" kind of test?
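For anyone who wants to check claims like the ones in this exchange mechanically, here is a rough Python sketch. It assumes nothing beyond the FEN from the prompt, and it only tests straight-line geometry (whether the rank, file or diagonal between two squares is free of pieces), not full chess legality. The helper names `board_from_fen`, `squares_between` and `queen_attacks` are hypothetical.

```python
FEN = "1K1R3N/Bp1pp2p/1r2P2p/1p1rkp2/3R1N2/8/B2P2Q1/8 w - - 0 1"


def board_from_fen(fen):
    """Map occupied squares such as 'e5' to their FEN piece letter."""
    board = {}
    for i, rank in enumerate(fen.split()[0].split("/")):
        file_idx = 0
        for ch in rank:
            if ch.isdigit():
                file_idx += int(ch)  # skip that many empty squares
            else:
                board["abcdefgh"[file_idx] + str(8 - i)] = ch
                file_idx += 1
    return board


def squares_between(a, b):
    """Squares strictly between a and b on a shared line, or None if not aligned."""
    fa, ra = ord(a[0]) - ord("a"), int(a[1])
    fb, rb = ord(b[0]) - ord("a"), int(b[1])
    df, dr = fb - fa, rb - ra
    if a == b or not (df == 0 or dr == 0 or abs(df) == abs(dr)):
        return None
    steps = max(abs(df), abs(dr))
    sf, sr = df // steps, dr // steps  # unit step along the line
    return [chr(ord("a") + fa + sf * k) + str(ra + sr * k) for k in range(1, steps)]


def queen_attacks(board, queen_sq, target_sq):
    """True if a queen on queen_sq has a clear line to target_sq."""
    path = squares_between(queen_sq, target_sq)
    return path is not None and all(sq not in board for sq in path)


board = board_from_fen(FEN)
white_bishops = sorted(sq for sq, piece in board.items() if piece == "B")

# Play 1. Qg5 on a copy of the board, then ask whether the queen attacks e5.
after_qg5 = dict(board)
after_qg5["g5"] = after_qg5.pop("g2")

if __name__ == "__main__":
    print("White bishops:", white_bishops)
    print("Path g2-g5 clear:", queen_attacks(board, "g2", "g5"))
    print("Queen on g5 attacks e5:", queen_attacks(after_qg5, "g5", "e5"))
```

Run against the puzzle FEN, this puts the white bishops on a2 and a7 (not b7) and shows that a queen on g5 would not attack e5, because the black pawn on f5 stands in between. So the proposed 1. Qg5 is a possible move, but the queen does not give check from there.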