REINFORCEMENT LEARNING
Here is my explanation, as quoted on the Chess Programming Wiki:
RomiChess is famous for its learning approach [2]. It uses bitboards as its basic data structure, in particular Sherwin Bitboards to determine sliding piece attacks [3]. Its search is alpha-beta with a transposition table, null move pruning and LMR inside an iterative deepening framework with aspiration windows. Romi's evaluation features an oracle approach of pre-processing piece-square tables at the root.
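For readers unfamiliar with that framework, here is a minimal sketch of an iterative-deepening loop with aspiration windows. All names are my own (alpha_beta() stands in for the engine's actual search); this illustrates the standard technique, not RomiChess's code.

#include <limits.h>

#define ASPIRATION 50                 /* window half-width, centipawns */

extern int alpha_beta(int alpha, int beta, int depth);

int iterate(int max_depth)
{
    /* Full-width search at depth 1 seeds the first window. */
    int score = alpha_beta(-INT_MAX, INT_MAX, 1);
    for (int depth = 2; depth <= max_depth; depth++) {
        int alpha = score - ASPIRATION;
        int beta  = score + ASPIRATION;
        score = alpha_beta(alpha, beta, depth);
        if (score <= alpha || score >= beta)              /* window failed */
            score = alpha_beta(-INT_MAX, INT_MAX, depth); /* re-search wide */
    }
    return score;
}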
As explained by Michael Sherwin, RomiChess uses two types of learning [5]; the reinforcement learning is type 2 below:
1. Monkey see, monkey do. Romi remembers winning lines, regardless of which side played the moves, incorporates them into the opening book, and can play them back instantly, up to 180 ply, as long as the stats for that line remain good.
2. Pavlov's dog experiments adapted to computer chess. Each side's moves are given a slight bonus if that side won, and the other side's moves are given a slight penalty. So good moves can get a slight penalty and bad moves a slight bonus; over time, however, these are corrected. The bonuses and penalties are loaded into the hash table before each move by the computer (a minimal sketch of this update follows this list). If Romi is losing game after game, this causes Romi to 'fish' for better moves to play until Romi starts to win.
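Here is a minimal sketch of the type-2 update, under my own assumptions about the data layout. The names (LearnEntry, learn_update) and the bonus sizes are hypothetical; this is not RomiChess's actual code.

#include <stdint.h>

#define LEARN_SIZE  (1 << 20)   /* entries in the learn table, power of two */
#define WIN_BONUS    2          /* small per-move bonus, centipawns         */
#define LOSS_PENALTY 2          /* small per-move penalty, centipawns       */

typedef struct {
    uint64_t key;               /* Zobrist hash of the position             */
    int16_t  adjust;            /* accumulated bonus/penalty                */
} LearnEntry;

static LearnEntry learn_table[LEARN_SIZE];

/* After a game: give a small bonus to every move the winner made and a
 * small penalty to every move the loser made.  game_keys[] holds the
 * Zobrist keys of the positions reached after each move, White's first. */
void learn_update(const uint64_t *game_keys, int num_plies, int white_won)
{
    for (int ply = 0; ply < num_plies; ply++) {
        int white_moved = (ply % 2 == 0);
        int delta = (white_moved == white_won) ? WIN_BONUS : -LOSS_PENALTY;
        LearnEntry *e = &learn_table[game_keys[ply] & (LEARN_SIZE - 1)];
        e->key = game_keys[ply];     /* simple overwrite on collision */
        e->adjust += delta;
    }
}

As the description says, this blindly rewards everything on the winning side, including its mistakes; the point is that over many games the misattributed bonuses and penalties statistically wash out.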
QUESTION 1
First ask yourself this question (I don't know the answer, as I never got that far): if the main search can reach 40 ply in a given amount of time, and you took half that time to first play a number of games (or segments of games), searching each move to only 10 ply, how many games could be played? Since whole games are being played, they will most often contain more moves than 40 ply (20 full moves).
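A rough back-of-envelope (my numbers, not Sherwin's): with alpha-beta, search time grows roughly as b^d for an effective branching factor b, so a 40-ply search costs about b^30 times as much as a 10-ply search. Even with a conservative b = 2, that is 2^30, around a billion. Half the original time budget therefore buys on the order of 2^30 / 2 ≈ 5 * 10^8 ten-ply searches; at, say, 100 plies per game, that would be roughly five million complete training games under these assumptions.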
QUESTION 2
It would be a separate hash table, which is more useful than a tree structure because transpositions can be detected: two different move orders that reach the same position map to the same entry.
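Continuing the hypothetical sketch above, a probe shows why the hash table detects transpositions for free: the lookup keys on the position's Zobrist hash, not on the path that reached it.

/* Returns the learned entry for a position, or 0 if none.  Because the
 * key is the position's Zobrist hash, two move orders that transpose
 * into the same position probe the same slot. */
LearnEntry *learn_probe(uint64_t zobrist_key)
{
    LearnEntry *e = &learn_table[zobrist_key & (LEARN_SIZE - 1)];
    return (e->key == zobrist_key) ? e : 0;   /* 0: not learned yet */
}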
QUESTION 3
The RL database accumulates rewards and penalties for each position that is in the database. Before the main search, the database positions are loaded into the main hash table so the main search can take advantage of the learned values.
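A sketch of that pre-search loading step, again under my own assumptions about the layouts (TTEntry is a stand-in for the engine's real transposition-table entry; learn_table continues the earlier sketch):

#define TT_SIZE (1 << 22)

typedef struct {
    uint64_t key;
    int16_t  score;
    int8_t   depth;
    int8_t   bound;                 /* e.g. exact / lower / upper */
} TTEntry;

static TTEntry tt[TT_SIZE];

/* Before each search: copy every learned adjustment into the main
 * transposition table so the search sees it as a score bias. */
void seed_tt_from_learning(void)
{
    for (int i = 0; i < LEARN_SIZE; i++) {
        if (learn_table[i].key == 0 || learn_table[i].adjust == 0)
            continue;
        TTEntry *t = &tt[learn_table[i].key & (TT_SIZE - 1)];
        t->key   = learn_table[i].key;
        t->score = learn_table[i].adjust;
        t->depth = 0;               /* shallow: real results overwrite it */
        t->bound = 0;
    }
}

Storing the seeds at depth 0 fits the 'slight bonus/penalty' idea: in a typical probe, a depth-0 entry only steers move ordering and evaluation rather than outranking genuine deep search results.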
QUESTION 4: a, b and c
a) Yes. Since the RL database is loaded into the main hash table before the main search begins, the main hash table then contains the learning from the RL phase. Therefore the main hash table will perform much better.
b) I did not say 'evaluation function'; I said 'the evaluations'. I meant the evaluations from the main hash table as they are propagated down to the root. Sorry for being vague in my post above. I'll take the blame for that!
c) Because the main hash table is more accurate due to the RL adjustments.