It's very odd that o1-preview and o1-mini aren't even mentioned. They are both superior to GPT-4o.
Still, this is interesting.
						Gemini
Moderator: Ras
- 
				Werewolf
- Posts: 2052
- Joined: Thu Sep 18, 2008 10:24 pm
Re: Gemini
I decided to test this for myself.
I gave these two models the same coding test:
o1-preview
Llama-3.1-Nemotron-70B-Instruct-HF
Each model was given a Connect 4 game (you drop discs into slots to make 4 in a row) codebase about 450 lines long. It uses a bitboard representation, is single-threaded, and does no pruning, just brute force. The codebase had 50 errors in it. I asked both models to fix the errors and produce fully working code.
o1-preview needed a few goes but got there. Llama-3.1-Nemotron-70B-Instruct-HF failed and produced code with 55 errors in it. It also codes at about 1/3 the speed.
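For readers who haven't seen a Connect 4 bitboard, here is a minimal sketch of the usual idea. It is not the codebase from the test: the 7-bits-per-column layout (6 playable rows plus one always-empty sentinel bit) and the names `bit` and `has_four` are assumptions chosen for illustration.

```python
WIDTH, HEIGHT = 7, 6
H1 = HEIGHT + 1  # bits per column: 6 playable rows plus 1 empty sentinel bit


def bit(col: int, row: int) -> int:
    """Return the single-bit mask for the cell at (col, row); row 0 is the bottom."""
    return 1 << (col * H1 + row)


def has_four(bb: int) -> bool:
    """Return True if one player's bitboard `bb` contains four in a row.

    Shifts: 1 = vertical, H1 = horizontal, H1 - 1 and H1 + 1 = the two diagonals.
    The sentinel bit above each column stops lines wrapping between columns.
    """
    for shift in (1, H1, H1 - 1, H1 + 1):
        pairs = bb & (bb >> shift)          # cells that start a run of two
        if pairs & (pairs >> (2 * shift)):  # two such runs, two steps apart
            return True
    return False


# Usage: four discs along the bottom row, columns 0 to 3.
bottom_row = bit(0, 0) | bit(1, 0) | bit(2, 0) | bit(3, 0)
print(has_four(bottom_row))
```

The point of the representation is that a win check is a handful of shifts and ANDs rather than nested loops over the board, which matters a lot when the search is pure brute force.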
- 
				towforce  
- Posts: 12567
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: Graham Laight
Re: Gemini
Thank you for this interesting comparison!
Human chess is partly about tactics and strategy, but mostly about memory
						- 
				Werewolf
- Posts: 2052
- Joined: Thu Sep 18, 2008 10:24 pm
Re: Gemini
As I see it now we have these main competitors in roughly this order:
1. OpenAI with o1-preview. Full version coming soon and rumoured to be better.
2. Anthropic with Claude 3.5 Sonnet. Not really no. 2 anymore, but so many people have left OpenAI to join Anthropic, and Amazon are funding them, so anything is possible. We're due an update here which could be big.
3. Google (and DeepMind) with Gemini 1.5 Pro. This seems to be regularly updated and DeepMind should not be underestimated.
4. Elon Musk's Grok. Grok was laughed at in version 1 but version 2 is decent and version 3 apparently has more hardware than competitors to train it. Version 3 is due in December 2024. Even Nvidia's boss thinks Elon has done amazingly well:
https://www.tomshardware.com/pc-compone ... es-4-years
5. Now Nvidia, who clearly have the hardware (and are the only ones able to access next-gen Blackwell), and they like to win.
- 
				towforce  
- Posts: 12567
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: Graham Laight
Re: Gemini
Werewolf wrote: ↑Fri Oct 18, 2024 11:20 pm
As I see it now we have these main competitors in roughly this order:
1. OpenAI with o1-preview. Full version coming soon and rumoured to be better.
2. Anthropic with Claude 3.5 Sonnet. Not really no. 2 anymore, but so many people have left OpenAI to join Anthropic, and Amazon are funding them, so anything is possible. We're due an update here which could be big.
3. Google (and DeepMind) with Gemini 1.5 Pro. This seems to be regularly updated and DeepMind should not be underestimated.
4. Elon Musk's Grok. Grok was laughed at in version 1 but version 2 is decent and version 3 apparently has more hardware than competitors to train it. Version 3 is due in December 2024. Even Nvidia's boss thinks Elon has done amazingly well:
https://www.tomshardware.com/pc-compone ... es-4-years
5. Now Nvidia, who clearly have the hardware (and are the only ones able to access next-gen Blackwell), and they like to win.
Thank you for this thoughtful list. We might be referring back to it.
Considering there are only a few competitors at the top of this game, the speed at which it is moving is astounding: OpenAI claim (very reasonably) that o1 has reached human-level reasoning skills (something that 50% of the human population have not!). This video covers:
1. The astonishing speed at which AI has been moving since the GPT-3.5 breakthrough in November 2022
2. OpenAI's 5 level model (GPT-3.5 was level 1, o1 is at level 2)
3. Even at the same level as humans, machines win because they're faster
4. The astonishing amount of electricity AI is consuming
5. The danger of AI destroying humanity (in other threads, I have stated my case that there's an effective upper limit to intelligence, so I'm not worried about this)
						- 
				towforce  
- Posts: 12567
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: Graham Laight
Re: Gemini
A forthcoming version of Claude will be able to use your computer (not good news for software testers!).
						- 
				Werewolf
- Posts: 2052
- Joined: Thu Sep 18, 2008 10:24 pm
Re: Gemini
Tested it. 
Claude 3.5 Sonnet (new - whatever that means) is a step up for coding. I'd say it's now close to o1-preview / o1-mini.
However, neither of them is good enough to write a chess program yet. I'm debugging their code on a Connect 4 program and that's hard enough.
- 
				towforce  
- Posts: 12567
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: Graham Laight
Re: Gemini
Without doubt, it won't be long before chatbots can write good working programs to play complex games in just a few seconds.
In the meantime, maybe try the following (I haven't tried it, so no idea whether it will actually help):
1. Ask the chatbot to come up with the classes and methods needed to build a complex program
2. Ask it to build one of the classes
3. Ask it to build tests for each one of the class's methods
4. Tell it where one of the tests failed, and ask it to correct the code
It might be possible to use the API in a program, or some kind of automated template, to get working code from a requirement with some variation of this idea.
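A rough sketch of what steps 2 to 4 could look like when automated. Everything here is hypothetical: `ask_model` stands in for whatever chatbot API call you would use (no real vendor API is shown), and `fake_model` is a canned stand-in so the loop can be run without any network access. The tests are run with `exec` in the same process purely to keep the sketch short; untested model output really belongs in a sandboxed subprocess.

```python
import traceback
from typing import Callable, Optional


def run_tests(code: str, tests: str) -> tuple[bool, str]:
    """Execute candidate code, then its tests; return (passed, error text)."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate functions/classes
        exec(tests, namespace)  # plain assert statements against them
    except Exception:
        return False, traceback.format_exc()
    return True, ""


def repair_loop(ask_model: Callable[[str], str], requirement: str,
                tests: str, max_rounds: int = 5) -> Optional[str]:
    """Ask for code, run the tests, feed failures back until they pass."""
    prompt = f"Write Python code for: {requirement}"
    for _ in range(max_rounds):
        code = ask_model(prompt)
        passed, error = run_tests(code, tests)
        if passed:
            return code
        prompt = (f"This code failed its tests.\n\nCode:\n{code}\n\n"
                  f"Error:\n{error}\n\nPlease return corrected code.")
    return None  # gave up after max_rounds attempts


def fake_model() -> Callable[[str], str]:
    """Hypothetical stand-in for a chatbot: wrong first, right second."""
    replies = iter([
        "def add(a, b):\n    return a - b\n",
        "def add(a, b):\n    return a + b\n",
    ])
    return lambda prompt: next(replies)


result = repair_loop(fake_model(), "add two numbers", "assert add(2, 3) == 5")
print(result)
```

Working one class at a time, as in the list above, keeps each prompt and each set of tests small, which is probably where a loop like this has the best chance of converging.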
						- 
				towforce  
- Posts: 12567
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: Graham Laight
Re: Gemini
A popular question for chatbots doing the rounds right now:
What is the smallest integer whose square is between 5 and 17?
The top chatbots (Gemini, Claude and ChatGPT) all gave the wrong answer: they said "3".
I re-prompted with "You forgot to check the squares of negative integers."
All three apologised and recalculated. ChatGPT and Claude then both said -3. The maths hero was Gemini, who then said -4. I am using Gemini Advanced, which may be an unfair advantage.
Worst of all was pi.ai, which still gave "3" even after being re-prompted. But different chatbots have different strengths, and pi is known for its emotional intelligence (so a good choice for coaching if you fear negative feedback).
To be fair to the chatbots, most humans give "3" as the answer as well - and chatbots are language models, not CAS (computer algebra systems).
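The question is easy to check by brute force. A quick sketch, reading "between" as strictly between (it makes no difference here, since neither 5 nor 17 is a perfect square); the function name is just for illustration:

```python
import math


def integers_with_square_between(low: int, high: int) -> list[int]:
    """Return every integer n, negative or positive, with low < n*n < high."""
    bound = math.isqrt(high) + 1  # no integer beyond this can qualify
    return [n for n in range(-bound, bound + 1) if low < n * n < high]


candidates = integers_with_square_between(5, 17)
print(candidates)       # [-4, -3, 3, 4]
print(min(candidates))  # -4
```

The only squares in range are 9 and 16, so the candidates are ±3 and ±4, and the smallest is -4.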
						- 
				Werewolf
- Posts: 2052
- Joined: Thu Sep 18, 2008 10:24 pm
Re: Gemini
towforce wrote: ↑Thu Oct 24, 2024 5:42 pm
A popular question for chatbots doing the rounds right now:
What is the smallest integer whose square is between 5 and 17?
The top chatbots (Gemini, Claude and ChatGPT) all gave the wrong answer: they said "3".
I re-prompted with "You forgot to check the squares of negative integers."
All three apologised and recalculated. ChatGPT and Claude then both said -3. The maths hero was Gemini, who then said -4. I am using Gemini Advanced, which may be an unfair advantage.
Worst of all was pi.ai, which still gave "3" even after being re-prompted. But different chatbots have different strengths, and pi is known for its emotional intelligence (so a good choice for coaching if you fear negative feedback).
To be fair to the chatbots, most humans give "3" as the answer as well - and chatbots are language models, not CAS (computer algebra systems).
ChatGPT 4o - failed.
ChatGPT 4o with canvas - failed.
o1-mini - passed first time and gave -4. I can't test o1-preview but presumably it would also pass.
Claude 3.5 Sonnet (New) - failed.