Strawberry was released last night by OpenAI.
It is not the ChatGPT 5 I was hoping for, but it seems to be a stepping stone towards GPT-5.
The major advances are improved reasoning and a massive reduction in hallucinations.
The AI takes about 30 seconds to answer a question (because it "thinks"), and the number of requests even paying customers can make is very limited. A game of chess with it is therefore out of the question - we'd only get through about 15 moves per month, and I'd rather use those tokens for other things.
There is a dark rumour that an unlimited version will come out soon for $2000/month.
Anyway, I tested it on an opening position which stumped cheap chess computers until c.1985, and which no version of ChatGPT has been able to solve so far:
1.e4 e5
2.Nf3 g6
3.Nc3 d6
4.Bc4 Bg4
I asked it to play White. It found the right move, gave it an exclamation mark, and explained why.
Note that I jumbled up the order of the opening moves, so there was no way it was playing by rote.
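For anyone who wants to reproduce the test, here is a minimal sketch using the python-chess library (my choice of tool, nothing from the post); as far as I can tell this is the classic Légal-trap setup, where the strong reply is 5.Nxe5!:

Code:
import chess

# Set up the position after 1.e4 e5 2.Nf3 g6 3.Nc3 d6 4.Bc4 Bg4
board = chess.Board()
for san in ["e4", "e5", "Nf3", "g6", "Nc3", "d6", "Bc4", "Bg4"]:
    board.push_san(san)

print(board)  # White to move; 5.Nxe5! (the Legal-trap motif) is the move in question
print(any(board.san(m) == "Nxe5" for m in board.legal_moves))  # True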
Impressed.
Gemini
Moderators: hgm, chrisw, Rebel
-
- Posts: 12106
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: Graham Laight
Re: Gemini
Looks as though only subscribers have access to it right now. I watched this video of its capabilities:
From that video, I took the question, "How many of the letter r are in the word strawberry?"
The following LLM got it right: standard ChatGPT (4o).
The following LLMs got it wrong (all answered 2): Gemini Advanced, Claude, Pi.ai
Conclusion: different LLMs have different strengths, and ChatGPT is strong in whatever skill is required to answer this particular question.
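Just to illustrate why this failure is so odd: the task is trivial at the character level (a Python one-liner of my own, not from the video); the usual explanation is that LLMs see tokens rather than individual letters:

Code:
print("strawberry".count("r"))   # prints 3 - trivial for code, awkward for a token-level LLM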
I admit to being disappointed in Gemini Advanced here, though: I had previously thought of it as the most intelligent LLM, and it can solve maths problems (and explain the solutions in detail) that the others cannot. (I'm not going to test it on the one shown in the video, though - unless someone writes it out in full for me.)
Also in that video, he has Strawberry write a chess program. It's a good start (it wasn't able to play chess properly, though) - but I was disappointed that he didn't notice that the starting position is wrong!
Want to attract exceptional people? Be exceptional.
-
- Posts: 3021
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Gemini
Haha, Nvidia's market cap rose from ~2.5T USD to ~2.9T USD after the Strawberry release:
https://stockanalysis.com/stocks/nvda/market-cap/
One on Tech Bubbles
https://luddite.app26.de/post/one-on-tech-bubbles/
--
Srdja
-
- Posts: 12106
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: Graham Laight
Re: Gemini
This is a news report, so I'm not really sure how credible it is, but the article says that Strawberry is concerning in 2 ways:
1. It deceives people about how it's doing its reasoning: this is probably less a case of deception and more a case of it being prohibitively difficult to work out what's going on in a large NN. The explanation is useful even if it's dishonest.
2. Apparently, it's clever enough to be dangerous in the hands of somebody who jailbreaks it: they could use it to help them do bad things, or to make dangerous things.
https://www.vox.com/future-perfect/3718 ... strawberry
Want to attract exceptional people? Be exceptional.
-
- Posts: 1927
- Joined: Thu Sep 18, 2008 10:24 pm
Re: Gemini
I have now spent many hours writing a chess engine with 4 AIs.
There is no doubt about it: the order of coding usefulness goes like this:
1) ChatGPT o1-mini
2) ChatGPT o1-preview
Then, miles behind:
3) Gemini Advanced
4) Claude 3.5 Sonnet
-
- Posts: 12106
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: Graham Laight
Re: Gemini
What ChatGPT o1 seems to be doing to achieve more steps of logic is, roughly:
1. Solving a bit of what has been asked
2. Taking this bit of the solution and automatically re-prompting itself to get the next bit
So basically, it is using itself iteratively. Would you agree?
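Something like this loop, perhaps (a rough sketch of my own; call_llm is a hypothetical stand-in for a single model call - we don't actually know what OpenAI does internally):

Code:
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for one LLM call; not a real OpenAI API."""
    raise NotImplementedError

def solve_iteratively(question: str, max_steps: int = 10) -> str:
    scratchpad = ""                                   # accumulated partial solution
    for _ in range(max_steps):
        step = call_llm(
            f"Question: {question}\n"
            f"Work so far: {scratchpad}\n"
            "Give the next step, or 'FINAL: <answer>' if finished."
        )
        scratchpad += "\n" + step                     # feed the partial solution back in
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
    return scratchpad                                 # give up after max_steps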
In case anyone is unaware, the price of using ChatGPT o1 is a lot higher than that of other LLMs (in the sense that you are allowed far fewer prompts for your subscription fee), but it will still be the best option for some use cases.
Want to attract exceptional people? Be exceptional.
-
- Posts: 3021
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Gemini
Maybe worth mentioning that Strawberry/ChatGPT o1 now uses some kind of reinforcement learning (think of the transition from AlphaGo to AlphaZero) plus a tree of thoughts for reasoning:
https://en.wikipedia.org/wiki/Reinforcement_learning
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
https://arxiv.org/abs/2305.10601
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%.
--
Srdja
-
- Posts: 12106
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: Graham Laight
Re: Gemini
smatovic wrote: ↑Sat Sep 28, 2024 7:30 am
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
https://arxiv.org/abs/2305.10601
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%.
Very good find, and a good read! Thank you.
Users are forbidden from trying to work out how o1 works (link), but hopefully open source systems with similar capabilities will become available in the not-too-distant future. The text above certainly looks promising in this respect. It's possible that an important reason why OpenAI wants to hide the way o1 works is that what they've done is actually close to what's in that paper.
Btw - I know it's not the point - but I have no doubt that a chess programmer could build a better Game Of 24 solver without using an LLM! LLMs obviously aren't always the best solution.
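For what it's worth, a complete brute-force solver fits in a few lines (my own sketch, nothing to do with the paper's method):

Code:
from itertools import permutations, product

def solve24(nums, target=24, eps=1e-6):
    """Brute-force Game of 24: try every ordering, operator choice and bracketing."""
    for a, b, c, d in permutations(nums):
        for o1, o2, o3 in product("+-*/", repeat=3):
            # the five distinct ways to bracket four operands
            for e in (f"(({a}{o1}{b}){o2}{c}){o3}{d}",
                      f"({a}{o1}({b}{o2}{c})){o3}{d}",
                      f"({a}{o1}{b}){o2}({c}{o3}{d})",
                      f"{a}{o1}(({b}{o2}{c}){o3}{d})",
                      f"{a}{o1}({b}{o2}({c}{o3}{d}))"):
                try:
                    if abs(eval(e) - target) < eps:
                        return e
                except ZeroDivisionError:
                    continue
    return None

print(solve24([4, 7, 8, 8]))   # e.g. (7-(8/8))*4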
Want to attract exceptional people? Be exceptional.
-
- Posts: 12106
- Joined: Thu Mar 09, 2006 12:57 am
- Location: Birmingham UK
- Full name: Graham Laight
Re: Gemini
smatovic wrote: ↑Sat Sep 28, 2024 7:30 am
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
https://arxiv.org/abs/2305.10601
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%.
One more thing about this: these trees would have similarities to game trees in computer chess, in that they're both generated data in tree structures. This could be an important force multiplier for chatbots going forward. Here's a brief overview of the history of game trees in chess:
* hand-coded evaluations (HCEs) used to be weak (in the sense that they misevaluated a lot of chess positions)
* game trees acted as a force multiplier for these HCEs
* likewise, HCEs compensated for the weaknesses of game trees (especially the horizon effect)
* as computers got faster, engine developers found that the enlarged game tree was more useful than snippets of knowledge in the HCE (and some of that knowledge even became redundant), so knowledge was sacrificed in order to grow the game tree
* DeepMind showed us that using NNs for the eval instead of an HCE gave a huge boost in playing strength - despite the fact that they take a lot longer to run than HCEs (and hence shrink the game tree)
* so once again, the eval and the game tree are acting as force multipliers for each other
Now, the study linked by Srdja shows us that trees can act as force multipliers for LLMs too.
It is crystal clear to me that there is enough similarity between chatbot progress and chess progress to merit discussion of the analogy between chess and chat!
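To make the analogy concrete, here is a very rough Tree-of-Thoughts-style search loop (my own sketch; propose_thoughts and score_thought are hypothetical stand-ins for LLM calls, not the paper's actual prompts):

Code:
import heapq

def propose_thoughts(state: str, k: int = 3) -> list[str]:
    """Hypothetical LLM call: propose k candidate next 'thoughts' for a partial solution."""
    raise NotImplementedError

def score_thought(state: str) -> float:
    """Hypothetical LLM self-evaluation: how promising is this partial solution?"""
    raise NotImplementedError

def tree_of_thoughts(problem: str, beam_width: int = 3, depth: int = 4) -> str:
    # Beam search over partial solutions - much like a chess engine searching over
    # positions, but with the LLM acting as both move generator and evaluation.
    frontier = [problem]
    for _ in range(depth):
        candidates = [s + "\n" + t for s in frontier for t in propose_thoughts(s)]
        frontier = heapq.nlargest(beam_width, candidates, key=score_thought)
    return max(frontier, key=score_thought)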
Want to attract exceptional people? Be exceptional.
-
- Posts: 3021
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Gemini
You are not the only one to draw these similarities:
Re: Insight About Genetic (Evolutionary) Algorithms
viewtopic.php?p=949466#p949466
--
Srdja