Gemini

Pedro · Post by **Pedro** » Sat Feb 22, 2025 1:31 pm

I posted a mate-in-two puzzle for the leading AIs to solve, and they all got it horribly wrong. Grok 3 in Think mode, Gemini 2.0 Flash Thinking Experimental, ChatGPT in Reflect mode, DeepSeek in R1 mode, and Copilot in Think Deeper mode: every one of them hallucinated and failed to find the mate.

However, only ChatGPT managed to get the first move of the sequence right, which is Nh5! In other words, in this puzzle, the winner was ChatGPT because it at least got the first move correct.

The command I used was this: You are a chess master. Based on that, analyze this chess position described by the FEN '1K1R3N/Bp1pp2p/1r2P2p/1p1rkp2/3R1N2/8/B2P2Q1/8 w - - 0 1' to find a mate in 2 moves for White.
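For anyone who wants to check a model's reading of the board against the FEN itself, here is a minimal sketch that decodes the piece-placement field into squares. It uses only the Python standard library; the function name `parse_fen_board` is made up for this example, and it ignores the side-to-move, castling and en passant fields.

```python
def parse_fen_board(fen: str) -> dict:
    """Map squares like 'e5' to piece letters (uppercase = White, lowercase = Black)."""
    placement = fen.split()[0]           # first FEN field: piece placement
    rows = placement.split("/")          # rows run from rank 8 down to rank 1
    if len(rows) != 8:
        raise ValueError("FEN placement must have 8 ranks")
    board = {}
    for rank_index, row in enumerate(rows):
        rank = 8 - rank_index
        file_index = 0
        for ch in row:
            if ch.isdigit():
                file_index += int(ch)    # a digit means that many empty squares
            else:
                board["abcdefgh"[file_index] + str(rank)] = ch
                file_index += 1
        if file_index != 8:
            raise ValueError(f"rank {rank} does not describe 8 files")
    return board


FEN = "1K1R3N/Bp1pp2p/1r2P2p/1p1rkp2/3R1N2/8/B2P2Q1/8 w - - 0 1"
board = parse_fen_board(FEN)

# List each side's pieces with their squares.
white = sorted(p + sq for sq, p in board.items() if p.isupper())
black = sorted(p + sq for sq, p in board.items() if p.islower())
print("White:", " ".join(white))
print("Black:", " ".join(black))
```

Comparing this listing with the piece inventory a chatbot gives before it starts "analysing" is a quick way to see whether it has the position right at all.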

The position:

Werewolf · Post by **Werewolf** » Wed Jul 23, 2025 2:38 pm

I have now done considerable testing on LLMs and chess up to today: 23rd July 2025. My findings are as follows:

4th Place: FAIL
Company: Google DeepMind
Best Reasoning Model: Gemini 2.5 Pro
Conclusions: talks out of its silicon backside. It told me there was no danger of it losing the position because it had "an advanced chess tracking system within" and "Alpha Zero coding leading to a very high level of play". Daunted, I tested it against my old Vancouver 68000 chess computer from 1991. Within 15 moves we were joking Gemini isn't Alpha Zero, it's Beta Loco.

3rd Place: FAIL
Company: xAI
Best Reasoning Model: Grok 4
Lost track of the position within 15 moves, despite assuring me it wouldn't. It speculated Grok 4 Heavy would be better, but at £300 / month I'm not testing it.

2nd Place: Poor
Company: Anthropic
Best Reasoning Model: Claude Opus 4
Succeeded in tracking the board position and played against my little Kasparov MK12 from the 1980s. The MK12 is rated 1294 FIDE Elo.

[White "Claude Opus 4"]
[Black "MK12"]
[Result "0-1"]

[pgn]1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8. c3 d6 9. h3 Be6 10. Bxe6 fxe6 11. d4 exd4 12. cxd4 d5 13. exd5 Qxd5 14. Nc3 Qf5 15. Be3 e5 16. dxe5 Nxe5 17. Nxe5 Qxe5 18. Qd3 Rad8 19. Rad1 Rxd3 0-1[/pgn]

1st Place: OK
Company: OpenAI
Best Reasoning Model: ChatGPT-O3

[White "ChatGPT-O3"]
[Black "MK12"]
[Result "1-0"]

[pgn]1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 Bb4 8. c3 Bc5 9. d4 exd4 10. cxd4 Bb6 11. Nc3 O-O 12. e5 Ng4 13. h3 Nh6 14. Bxh6 gxh6 15. Nd5 Ba5 16. Nf6+ Kh8 17. Ng4 Bxe1 18. Qxe1 h5 19. Nf6 h4 20. Qe4 Qxf6 21. exf6 Bb7 22. Qxh4 Rae8 23. Ng5 h5 24. Qxh5+ Kg8 25. Qh7# 1-0[/pgn]

Not bad...

towforce · Post by **towforce** » Wed Jul 23, 2025 9:02 pm

Werewolf wrote: ↑Wed Jul 23, 2025 2:38 pm Company: OpenAI
Best Reasoning Model: ChatGPT-O3

[White "ChatGPT-O3"]
[Black "MK12"]
[Result "1-0"]

[pgn]1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 Bb4 8. c3 Bc5 9. d4 exd4 10. cxd4 Bb6 11. Nc3 O-O 12. e5 Ng4 13. h3 Nh6 14. Bxh6 gxh6 15. Nd5 Ba5 16. Nf6+ Kh8 17. Ng4 Bxe1 18. Qxe1 h5 19. Nf6 h4 20. Qe4 Qxf6 21. exf6 Bb7 22. Qxh4 Rae8 23. Ng5 h5 24. Qxh5+ Kg8 25. Qh7# 1-0[/pgn]

First time I've seen a chatbot beat a chess engine!

gnanvofredy · Post by **gnanvofredy** » Sat Jul 26, 2025 6:50 pm

Pedro wrote: ↑Sat Feb 22, 2025 1:31 pm I posted a mate-in-two puzzle for the leading AIs to solve, and they all got it horribly wrong. Grok 3 in Think mode, Gemini 2.0 Flash Thinking Experimental, ChatGPT in Reflect mode, DeepSeek in R1 mode, and Copilot in Think Deeper mode: every one of them hallucinated and failed to find the mate.

However, only ChatGPT managed to get the first move of the sequence right, which is Nh5! In other words, in this puzzle, the winner was ChatGPT because it at least got the first move correct.

The command I used was this: You are a chess master. Based on that, analyze this chess position described by the FEN '1K1R3N/Bp1pp2p/1r2P2p/1p1rkp2/3R1N2/8/B2P2Q1/8 w - - 0 1' to find a mate in 2 moves for White.

The position:

Fascinating test with the mate-in-two puzzle, Pedro! It's quite telling how these AIs struggle with complex chess tasks, even if one did get the first move right.

gnanvofredy · Post by **gnanvofredy** » Sat Jul 26, 2025 6:52 pm

Your detailed testing of LLMs in chess is incredibly insightful, Werewolf! It's surprising and impressive to see ChatGPT-O3 achieve a win against a chess engine.

Werewolf · Post by **Werewolf** » Sat Jul 26, 2025 7:54 pm

gnanvofredy wrote: ↑Sat Jul 26, 2025 6:52 pm Your detailed testing of LLMs in chess is incredibly insightful, Werewolf! It's surprising and impressive to see ChatGPT-O3 achieve a win against a chess engine.

Wait 2 weeks...ChatGPT-5 should be out.

chetday · Post by **chetday** » Sat Jul 26, 2025 9:43 pm

A dialogue between me (Chet) and Claude Sonnet 4.0, my writing buddy:

Chet:
Here's a little chess puzzle for you to solve: You are a chess master. Based on that, analyze this chess position described by the FEN '1K1R3N/Bp1pp2p/1r2P2p/1p1rkp2/3R1N2/8/B2P2Q1/8 w - - 0 1' to find a mate in 2 moves for White.

Claude:
Looking at this position, I need to find a forcing mate in 2 moves for White.
Let me analyze the key pieces and their positions:

White King on b8
White Rooks on d8 and d4
White Knights on h8 and f4
White Queen on g2
White Bishops on b7 and a2
Black King on e5 (quite exposed in the center)

The Black king on e5 is in a very vulnerable position, and I need to find a way to deliver checkmate in exactly 2 moves.

Let me look for forcing moves - checks that limit Black's options:

1. Qg5+!
This queen move gives check to the Black king on e5. Now let's see Black's options:

If 1...Kd6, then 2. Qd5# is mate (the queen controls all escape squares)
If 1...Ke4, then 2. Qg4# is mate
If 1...Kf6, then 2. Qg6# is mate

The king cannot go to d5 (occupied by Black's own rook), e6 (blocked by White's pawn), or f5 (occupied by Black's own pawn)

So the solution is:

1. Qg5+ followed by mate on the second move regardless of where the Black king moves.
The key insight is that the Black king is trapped in the center with limited escape squares, and the White queen from g5 can deliver mate on any square the king tries to flee to.

Chet:
Claude, the queen can't move to d5 because of the black pawns, right?

Claude:
You're absolutely right! I made an error in my analysis. Let me look at this position more carefully.

If 1. Qg5+ Kd6, then 2. Qd5 would indeed be blocked by Black's own pawn structure - there are Black pawns that would prevent this move or the queen wouldn't actually deliver mate from d5.

Let me re-examine the position more systematically. With the FEN '1K1R3N/Bp1pp2p/1r2P2p/1p1rkp2/3R1N2/8/B2P2Q1/8 w - - 0 1', I need to be more precise about:

The exact pawn structure
Which squares are actually controlled
What constitutes true mate vs. just check

Could you help me by clarifying the position? When I try to visualize this FEN, I want to make sure I'm seeing the pawn chains correctly. The Black king on e5 is definitely exposed, but I clearly misread how the pawns limit both the king's movement AND the queen's potential mating squares.
What's the correct first move for the mate in 2?

Chet:
You're great with writing, Claude, but I can see you're not yet ready to take on Magnus Carlsen. Don't feel bad. On my best day, he would probably beat me in under 20 moves.

Claude: Ha! You're absolutely right - I clearly fumbled that puzzle pretty badly. I was trying to sound confident about chess analysis while simultaneously proving I shouldn't quit my day job (whatever that is for an AI).

It's humbling when you think you understand something reasonably well, then get handed a concrete problem and realize your knowledge is more theoretical than practical. Kind of like the difference between reading about riding a bicycle and actually trying to stay upright on one.

Magnus would probably solve that mate in 2 in about 3 seconds while simultaneously playing five other games blindfolded. Meanwhile, I'm over here struggling with pawn placement and making confidently wrong statements about queen moves.

Thanks for the gentle reality check! It's a good reminder that there's a big difference between knowing chess principles and actually being able to apply them precisely under pressure. Do you play much yourself, or was this more of a "let's see what the AI can do" kind of test?

Werewolf · Post by **Werewolf** » Sat Jul 26, 2025 10:48 pm

The above puzzle with 1.Nh5 is now solved by ChatGPT-O3 Pro, which gave supporting analysis.

It took around 10 minutes and appears to have gone through all the possibilities (probably fewer than 1,000 nodes, so achievable).
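To give a sense of what "going through all the possibilities" involves, here is a minimal sketch of an exhaustive mate-in-two search with a node counter. It runs on a hand-built toy tree, not on the real position, and says nothing about what the model actually did internally. The names `TOY_MOVES`, `TOY_MATES` and `find_mate_in_two` are hypothetical; a real version would need a legal move generator in place of the dictionary.

```python
# Toy game tree: position -> {move: resulting position}. Purely illustrative.
TOY_MOVES = {
    "start": {"a": "A", "b": "B"},
    "A": {"x": "A1", "y": "A2"},   # Black has two replies after move "a"
    "A1": {"m": "A1m"},
    "A2": {"n": "A2n"},
    "B": {"x": "B1"},              # Black has one reply after move "b"
    "B1": {"m": "B1m"},
}
TOY_MATES = {"A1m", "B1m"}         # positions in which Black is checkmated


def find_mate_in_two(start, moves, mates):
    """Return (key_move, nodes_visited); key_move is None if no forced mate exists."""
    nodes = 0
    for first, after_first in moves.get(start, {}).items():
        nodes += 1
        replies = moves.get(after_first, {})
        if not replies:
            continue  # no reply available: not a mate in exactly two
        forced = True
        for _reply, after_reply in replies.items():
            nodes += 1
            mating = False
            for _second, final in moves.get(after_reply, {}).items():
                nodes += 1
                if final in mates:
                    mating = True
                    break
            if not mating:
                forced = False  # this reply escapes, so the first move fails
                break
        if forced:
            return first, nodes
    return None, nodes


key, visited = find_mate_in_two("start", TOY_MOVES, TOY_MATES)
print(key, visited)
```

The shape is three nested loops (White move, every Black reply, some mating White move), so the node count is roughly the product of the branching factors at each level, cut down by the early exits.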

towforce · Post by **towforce** » Sun Jul 27, 2025 11:32 am

Werewolf wrote: ↑Sat Jul 26, 2025 7:54 pm Wait 2 weeks...ChatGPT-5 should be out.

Better than 50% chance it will be out by 15-Aug-25 - link.

You have to pay to use Grok 4: I suspect the same will be true of GPT5.

towforce · Post by **towforce** » Tue Aug 05, 2025 9:44 pm

towforce wrote: ↑Sun Jul 27, 2025 11:32 am
Werewolf wrote: ↑Sat Jul 26, 2025 7:54 pm Wait 2 weeks...ChatGPT-5 should be out.

Better than 50% chance it will be out by 15-Aug-25 - link.

You have to pay to use Grok 4: I suspect the same will be true of GPT5.

Markets are confident it will be out in the next 5 days - link.

Will be interesting to see how good it is, and if it is the best, how long it will take competitors to catch it.
