Reinforcement Learning (RL) in real time paradigm

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Reinforcement Learning (RL) in real time paradigm

Post by Michael Sherwin »

HOW IT CAN BE DONE USING ONLY ONE THREAD
From any given position, using shallow alpha-beta searches, play a game of chess (or a fragment of a game) until a decisive (or pseudo-decisive) result. Add the moves into a tree structure in RAM, updating the RL values. Play as many games as possible in, say, half the allotted time. The games will vary because of the RL values. Load the tree structure into the hash table and do a normal alpha-beta search. This will take RomiChess-style post-game RL into real-time RL. It will make results from very deep, intelligently investigated lines available to the main search.
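A minimal sketch of that single-threaded loop, assuming hypothetical engine hooks and types (Position, Move, shallow_search, make_move, unmake_move, game_result, plus a learn-tree update and a routine that seeds the hash table); none of these names come from RomiChess, they only illustrate the time split and the flow of data:

Code:

/* Hypothetical sketch: half the clock on shallow training games, the rest on the normal search. */
void realtime_rl(Position *root, int time_for_move_ms)
{
    int train_budget_ms = time_for_move_ms / 2;

    while (elapsed_ms() < train_budget_ms) {
        Move game[MAX_GAME_PLY];
        int nmoves = 0;

        /* play one shallow game (or game fragment) forward from the root */
        while (!is_decisive(root) && nmoves < MAX_GAME_PLY) {
            Move m = shallow_search(root, TRAIN_DEPTH);   /* e.g. 4-6 ply */
            game[nmoves++] = m;
            make_move(root, m);
        }

        int result = game_result(root);        /* +1 / -1 / 0 from White's point of view */
        learn_update(game, nmoves, result);    /* overlay the moves, adjust the RL values */

        while (nmoves > 0)                     /* walk back to the root for the next game */
            unmake_move(root, game[--nmoves]);
    }

    learn_to_hash(learn_root, root);           /* seed the hash table with the learned values */
    main_search(root, time_for_move_ms - train_budget_ms);
}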

ON MANY THREADS
The initial problem is that all the threads would be "chasing the same rabbit", meaning the threads need to be distributed evenly over the search tree. Therefore a preliminary tree structure N ply deep, with N depending on how many threads are in use, would be placed in RAM. Then each thread would be assigned a different leaf node as a starting position. Leaf nodes can be invalidated during this process so that they need not be investigated further and the most promising ones can be investigated more intensely.
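A sketch of that thread distribution, again with hypothetical helpers (expand_tree, collect_leaves, launch_thread); the round-robin assignment and the invalid flag are just one way to read the paragraph above:

Code:

/* Hypothetical sketch: spread the training threads over the leaves of a shallow preliminary tree. */
void start_training_threads(Position *root, int nthreads)
{
    Node *tree = expand_tree(root, nthreads);       /* N ply deep, enough leaves for the threads */

    Node *leaves[MAX_LEAVES];
    int nleaves = collect_leaves(tree, leaves, MAX_LEAVES);

    for (int t = 0; t < nthreads; t++) {
        Node *leaf = leaves[t % nleaves];
        if (leaf->invalid)
            continue;                               /* refuted leaves are skipped, freeing effort */
        launch_thread(train_from_leaf, leaf);       /* each thread runs the single-thread loop */
    }
}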

I have been wanting to do this for a long time. Circumstances prevented me from trying. Now I'm officially sharing this idea. It is my opinion that this will produce an ELO super monster! :D
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Reinforcement Learning (RL) in real time paradigm

Post by Michael Sherwin »

:? :(
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Reinforcement Learning (RL) in real time paradigm

Post by Michael Sherwin »

Is there nobody who would like to talk about this idea? I think it would be possible to play thousands of training games before each main search. The main search's eval function would not need to be used (but could be).

This hinges on another idea that I have posted about: a search based on statistics instead. The most basic form of that statistical search would be to count all the beta cutoffs (and checkmates) for each side within each root move's subtree, then compute the ratio between the two. The move with the biggest ratio is played. The RL bonus and malus are then applied to the game's moves, which varies the play in subsequent games. Captures tend to have the best ratio, so the ratio would have to be adjusted so that captures are not played automatically. Even if the adjustment is not perfect, the RL values would push bad captures away from being looked at, as RL would dictate.

When the subtree is loaded into the hash table before the main search it will greatly influence the search. The RL values could also be used in LMR and move ordering, allowing the main search to get deeper faster. In RomiChess, a thousand games from any given position resulted in hundreds of Elo gained, allowing RomiChess to win against the world's best programs, as has been demonstrated time and time again. Basically, the search tree would be investigated very intelligently instead of with Monte Carlo's random investigation. Also, rating testing groups like CCRL would have no objection, as the learning need not be saved between competition games. Is there really nobody who wants to discuss this?
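A sketch of the statistical root choice described above, assuming hypothetical per-root-move counters filled in during the training searches; the capture damping factor and the additive RL term are illustrative guesses at the "adjustment", not RomiChess code:

Code:

/* Hypothetical sketch: pick the root move whose subtree produced the best cutoff ratio. */
extern Move root_moves[];        /* legal moves at the root */
extern long cutoffs_for[];       /* beta cutoffs (and mates) by the side to move in subtree i */
extern long cutoffs_against[];   /* the same for the opponent */

int pick_root_move(int nmoves)
{
    int best = 0;
    double best_ratio = -1.0;

    for (int i = 0; i < nmoves; i++) {
        double ratio = (cutoffs_for[i] + 1.0) / (cutoffs_against[i] + 1.0);  /* +1 avoids div by zero */

        if (is_capture(root_moves[i]))
            ratio *= 0.8;                  /* damp captures so they are not played automatically */

        ratio += rl_value(root_moves[i]);  /* learned bonus/malus varies later training games */

        if (ratio > best_ratio) { best_ratio = ratio; best = i; }
    }
    return best;                           /* index of the move to play in this training game */
}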
abulmo2
Posts: 433
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: Reinforcement Learning (RL) in real time paradigm

Post by abulmo2 »

I am not sure I understand your proposal.
To me, reinforcement learning (RL) is what is often called Texel tuning here, i.e. a way to tune the eval parameters from played games. Your proposal looks like building an opening book (from the current position) and loading it into the hash table to improve the normal alpha-beta search. I doubt the quality of it when done in real time. Keeping the tree in memory also looks like what is done in MCTS; I am not sure it works better than alpha-beta + hash table. Some old papers demonstrated that search algorithms that keep the search tree in memory (like SSS*, etc.) are equivalent to alpha-beta + hash table.
Do you have some preliminary results to show us how much Elo you expect to gain?
Richard Delorme
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Reinforcement Learning (RL) in real time paradigm

Post by Michael Sherwin »

abulmo2 wrote: Tue Jan 15, 2019 10:13 am I am not sure I understand your proposal.
To me, reinforcement learning (RL) is what is often called Texel tuning here, i.e. a way to tune the eval parameters from played games. Your proposal looks like building an opening book (from the current position) and loading it into the hash table to improve the normal alpha-beta search. I doubt the quality of it when done in real time. Keeping the tree in memory also looks like what is done in MCTS; I am not sure it works better than alpha-beta + hash table. Some old papers demonstrated that search algorithms that keep the search tree in memory (like SSS*, etc.) are equivalent to alpha-beta + hash table.
Do you have some preliminary results to show us how much Elo you expect to gain?
I only have my engine RomiChess as experience in this matter, although RomiChess uses post-game learning. It is a misconception that RomiChess's RL learning is some kind of book. It is not. However, RomiChess has a second learning algorithm, called "monkey see monkey do", that does create an opening book up to 180 ply deep.

In a test against Glaurung 2, in which Romi played Glaurung with the Nunn 10 PGN set ten times for a total of 200 games, Romi's score went from 5% in run 1 to 95% in run 10. In a 6000-game test against the 4 top engines using Bob's humongous opening book, Romi gained 50 Elo from the first 1000 games to the last 1000 games. Many 100-game tests have also been done against top engines from a given position, with Romi outperforming each engine in the last 50 games. Even in a more recent test against SF 8 (or maybe SF 9), Romi went from 17% in the first 50 games to, iirc, 83 or 87 percent in the last 50 games.

Most authors would never acknowledge these results, but there are other members here who would back up what I'm claiming. The best example was Leo Dicksman's (now deceased) tournaments, in which Romi climbed from the D division to the B division just on the strength of Romi's learning. And Romi was about to make it to the A division before a hard drive failure lost the learn file. That was many years before AlphaZero.

To make it clear, the RL that I propose is not an opening book. It merely tweaks the values in the hash table. The normal alpha-beta search still selects the move! Thanks for your response!! :D

Edit: The learn tree in memory is not accessed by the main search. As each training game is played, its moves are overlaid onto the learning tree and the RL values are adjusted. The tree is not traversed by any search. A mini tree, one to a few ply deep depending on the number of threads, may be traversed to assign threads different starting positions to investigate.
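A sketch of the hand-off from the learn tree to the hash table, assuming a hypothetical recursive walk over learn-tree nodes and a conventional tt_store(key, score, depth, bound, move); the low seeding depth is my own guess at how to keep seeded entries from overriding deeper real search results:

Code:

/* Hypothetical sketch: the main search never reads the learn tree, only the seeded hash entries. */
void learn_to_hash(Node *node, Position *pos)
{
    for (Node *child = node->first_child; child; child = child->sibling) {
        make_move(pos, child->move);

        tt_store(pos->key,
                 child->score + child->rl_value,   /* shallow score tweaked by the RL value */
                 SEED_DEPTH,                       /* small depth, e.g. 1 */
                 BOUND_EXACT,
                 child->move);

        learn_to_hash(child, pos);                 /* recurse down the learned line */
        unmake_move(pos, child->move);
    }
}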
PK
Posts: 893
Joined: Mon Jan 15, 2007 11:23 am
Location: Warsza

Re: Reinforcement Learning (RL) in real time paradigm

Post by PK »

I'm afraid this algorithm would search too many moves that are irrelevant for the second search to be competitive. But when I'm analysing positions under Arena, I often do the following: a short search, taking back, then a longer search. Some tactics are found faster that way. I'd suggest trying that approach (short search + long search) first. If it works, try 2 short searches + one long search. Increase the number of short searches until it stops working.
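For what it's worth, a minimal sketch of that schedule, assuming a hypothetical search(pos, depth) that returns the best move and keeps its transposition and history tables between calls; the depths and counts are placeholders:

Code:

/* Hypothetical sketch: a few cheap searches warm the shared tables before the long one. */
Move staged_search(Position *pos, int n_short, int short_depth, int long_depth)
{
    for (int i = 0; i < n_short; i++)
        search(pos, short_depth);        /* results stay in the shared hash table */

    return search(pos, long_depth);      /* the long search reuses those entries */
}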
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Reinforcement Learning (RL) in real time paradigm

Post by Michael Sherwin »

PK wrote: Tue Jan 15, 2019 11:39 am I'm afraid this algorithm would search too many moves that are irrelevant for the second search to be competitive. But when I'm analysing positions under Arena, I often do the following: a short search, taking back, then a longer search. Some tactics are found faster that way. I'd suggest trying that approach (short search + long search) first. If it works, try 2 short searches + one long search. Increase the number of short searches until it stops working.
Thanks for your reply. I'm at a loss to understand why everyone thinks that. I had that same argument with Bob 12 years ago. It has been my experience with RomiChess that very few games are needed to cause Romi to play stronger moves at the root. Bob's argument went sort of like this: if Romi learned 1. e4 e5, then Crafty would switch to 1... c5; then Bob took the argument one ply deeper. My response was something like: but if Romi can determine which of 1. e4 and 1. d4 is better for Romi, then Romi has benefited. And the argument went on. Then Bob said something like his humongous book would make Romi's learning useless. That was the reason someone played Romi 6000 games against top engines using Bob's humongous book. Bob was proven wrong. It is my humble opinion that your argument is also wrong. It takes very few training games to differentiate between two root moves, then a few more to start differentiating farther up the tree. It just leads the engine to make better moves.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Reinforcement Learning (RL) in real time paradigm

Post by Rein Halbersma »

Michael Sherwin wrote: Mon Jan 14, 2019 8:04 pm HOW IT CAN BE DONE USING ONLY ONE THREAD
From any given position, using shallow alpha-beta searches, play a game of chess (or a fragment of a game) until a decisive (or pseudo-decisive) result. Add the moves into a tree structure in RAM, updating the RL values. Play as many games as possible in, say, half the allotted time. The games will vary because of the RL values. Load the tree structure into the hash table and do a normal alpha-beta search. This will take RomiChess-style post-game RL into real-time RL. It will make results from very deep, intelligently investigated lines available to the main search.

ON MANY THREADS
The initial problem is that all the threads would be "chasing the same rabbit", meaning the threads need to be distributed evenly over the search tree. Therefore a preliminary tree structure N ply deep, with N depending on how many threads are in use, would be placed in RAM. Then each thread would be assigned a different leaf node as a starting position. Leaf nodes can be invalidated during this process so that they need not be investigated further and the most promising ones can be investigated more intensely.

I have been wanting to do this for a long time. Circumstances prevented me from trying. Now I'm officially sharing this idea. It is my opinion that this will produce an ELO super monster! :D
A similar idea has been proposed before I think: https://papers.nips.cc/paper/3722-boots ... search.pdf
User avatar
Guenther
Posts: 4605
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: Reinforcement Learning (RL) in real time paradigm

Post by Guenther »

Rein Halbersma wrote: Tue Jan 15, 2019 4:52 pm
A similar idea has been proposed before I think: https://papers.nips.cc/paper/3722-boots ... search.pdf
It must be over 10 years since I have seen Joel's name mentioned anywhere; thanks for the paper.
https://www.chessprogramming.org/Bodo
I didn't even know he created 'Meep', because that was shortly before a long hiatus on my side.
https://www.chessprogramming.org/Meep

One of those shorter exchanges between Michael and Bob in the WB forum about Romi's learning.
http://www.open-aurec.com/wbforum/viewt ... f=4&t=4835
https://rwbc-chess.de

trollwatch:
Chessqueen + chessica + AlexChess + Eduard + Sylwy
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Reinforcement Learning (RL) in real time paradigm

Post by Michael Sherwin »

Rein Halbersma wrote: Tue Jan 15, 2019 4:52 pm
Michael Sherwin wrote: Mon Jan 14, 2019 8:04 pm HOW IT CAN BE DONE USING ONLY ONE THREAD
From any given position, using shallow alpha-beta searches, play a game of chess (or a fragment of a game) until a decisive (or pseudo-decisive) result. Add the moves into a tree structure in RAM, updating the RL values. Play as many games as possible in, say, half the allotted time. The games will vary because of the RL values. Load the tree structure into the hash table and do a normal alpha-beta search. This will take RomiChess-style post-game RL into real-time RL. It will make results from very deep, intelligently investigated lines available to the main search.

ON MANY THREADS
The initial problem is that all the threads would be "chasing the same rabbit", meaning the threads need to be distributed evenly over the search tree. Therefore a preliminary tree structure N ply deep, with N depending on how many threads are in use, would be placed in RAM. Then each thread would be assigned a different leaf node as a starting position. Leaf nodes can be invalidated during this process so that they need not be investigated further and the most promising ones can be investigated more intensely.

I have been wanting to do this for a long time. Circumstances prevented me from trying. Now I'm officially sharing this idea. It is my opinion that this will produce an ELO super monster! :D
A similar idea has been proposed before I think: https://papers.nips.cc/paper/3722-boots ... search.pdf
Not really that close. Tweaking the eval parameters makes a stronger overall program, but the new, stronger eval would still not be perfect for any given position, especially when tactics are involved. Romi's RL learning tweaks each position in her experience independently of the eval function. This allows backing up learning values from very deep lines, which often incorporate very deep tactics, and also positional motifs that an eval function cannot be aware of.
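A sketch of the per-position backup as I understand it from this thread, assuming each learn-tree node carries a small rl_value that is later added to the seeded hash scores; the bonus and malus sizes are placeholders, not RomiChess's actual numbers:

Code:

/* Hypothetical sketch: reward the winner's moves, punish the loser's, along the played line. */
void learn_update(Move *game, int nmoves, int result)   /* result: +1 White win, -1 Black win, 0 draw */
{
    Node *node = learn_root;

    for (int ply = 0; ply < nmoves; ply++) {
        node = find_or_add_child(node, game[ply]);      /* overlay the move onto the learn tree */

        int white_moved = (ply % 2 == 0);
        if ((result == +1 && white_moved) || (result == -1 && !white_moved))
            node->rl_value += BONUS;                    /* this move led to a win */
        else if (result != 0)
            node->rl_value -= MALUS;                    /* this move led to a loss */
    }
}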