REINFORCEMENT LEARNING
Here is my explanation, as quoted on the Chess Programming Wiki:
RomiChess is famous for its learning approach [2]. It uses bitboards as its basic data structure, in particular Sherwin Bitboards to determine sliding piece attacks [3]. Its search is alpha-beta with a transposition table, null move pruning and LMR inside an iterative deepening framework with aspiration windows. Romi's evaluation features an oracle approach of pre-processing piece-square tables at the root.
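For readers unfamiliar with that framework, here is a minimal sketch of an iterative-deepening loop with aspiration windows. All names are my own (alpha_beta() stands in for the engine's actual search); this illustrates the standard technique, not RomiChess's code.

#include <limits.h>

#define ASPIRATION 50                 /* window half-width, centipawns */

extern int alpha_beta(int alpha, int beta, int depth);

int iterate(int max_depth)
{
    /* Full-width search at depth 1 seeds the first window. */
    int score = alpha_beta(-INT_MAX, INT_MAX, 1);
    for (int depth = 2; depth <= max_depth; depth++) {
        int alpha = score - ASPIRATION;
        int beta  = score + ASPIRATION;
        score = alpha_beta(alpha, beta, depth);
        if (score <= alpha || score >= beta)              /* window failed */
            score = alpha_beta(-INT_MAX, INT_MAX, depth); /* re-search wide */
    }
    return score;
}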
As explained by Michael Sherwin, RomiChess uses two types of learning [5]; the reinforcement learning is type 2 below:
1. Monkey see, monkey do. Romi remembers winning lines, regardless of which side played the moves, incorporates them into the opening book, and can play them back instantly, up to 180 ply, as long as the stats for that line remain good.
2. Pavlov's dog experiments adapted to computer chess. Each side's moves are given a slight bonus if that side won, and the other side's moves are given a slight penalty. So good moves can get a slight penalty and bad moves a slight bonus; over time, however, these are corrected. The bonuses and penalties are loaded into the hash table before each move by the computer (a minimal sketch of this update follows this list). If Romi is losing game after game, this causes Romi to 'fish' for better moves to play until Romi starts to win.
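Here is a minimal sketch of the type-2 update, under my own assumptions about the data layout. The names (LearnEntry, learn_update) and the bonus sizes are hypothetical; this is not RomiChess's actual code.

#include <stdint.h>

#define LEARN_SIZE  (1 << 20)   /* entries in the learn table, power of two */
#define WIN_BONUS    2          /* small per-move bonus, centipawns         */
#define LOSS_PENALTY 2          /* small per-move penalty, centipawns       */

typedef struct {
    uint64_t key;               /* Zobrist hash of the position             */
    int16_t  adjust;            /* accumulated bonus/penalty                */
} LearnEntry;

static LearnEntry learn_table[LEARN_SIZE];

/* After a game: give a small bonus to every move the winner made and a
 * small penalty to every move the loser made.  game_keys[] holds the
 * Zobrist keys of the positions reached after each move, White's first. */
void learn_update(const uint64_t *game_keys, int num_plies, int white_won)
{
    for (int ply = 0; ply < num_plies; ply++) {
        int white_moved = (ply % 2 == 0);
        int delta = (white_moved == white_won) ? WIN_BONUS : -LOSS_PENALTY;
        LearnEntry *e = &learn_table[game_keys[ply] & (LEARN_SIZE - 1)];
        e->key = game_keys[ply];     /* simple overwrite on collision */
        e->adjust += delta;
    }
}

As the description says, this blindly rewards everything on the winning side, including its mistakes; the point is that over many games the misattributed bonuses and penalties statistically wash out.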
QUESTION 1
First ask yourself this question (I don't know the answer, as I never got that far): if the main search can reach 40 ply in a given amount of time, and you took half that time to first play a number of games (or segments of games), searching each move to only 10 ply, how many games could be played? Since whole games are being played, they will most often contain more moves than 40 ply (20 full moves).
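A rough back-of-envelope (my numbers, not Sherwin's): with alpha-beta, search time grows roughly as b^d for an effective branching factor b, so a 40-ply search costs about b^30 times as much as a 10-ply search. Even with a conservative b = 2, that is 2^30, around a billion. Half the original time budget therefore buys on the order of 2^30 / 2 ≈ 5 * 10^8 ten-ply searches; at, say, 100 plies per game, that would be roughly five million complete training games under these assumptions.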
QUESTION 2
It would be a separate hash table, which is more useful than a tree structure because transpositions can be detected: two different move orders that reach the same position map to the same entry.
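Continuing the hypothetical sketch above, a probe shows why the hash table detects transpositions for free: the lookup keys on the position's Zobrist hash, not on the path that reached it.

/* Returns the learned entry for a position, or 0 if none.  Because the
 * key is the position's Zobrist hash, two move orders that transpose
 * into the same position probe the same slot. */
LearnEntry *learn_probe(uint64_t zobrist_key)
{
    LearnEntry *e = &learn_table[zobrist_key & (LEARN_SIZE - 1)];
    return (e->key == zobrist_key) ? e : 0;   /* 0: not learned yet */
}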
QUESTION 3
The RL database accumulates rewards and penalties for each position that is in the database. Before the main search, the database positions are loaded into the main hash table so the main search can take advantage of the learned values.
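A sketch of that pre-search loading step, again under my own assumptions about the layouts (TTEntry is a stand-in for the engine's real transposition-table entry; learn_table continues the earlier sketch):

#define TT_SIZE (1 << 22)

typedef struct {
    uint64_t key;
    int16_t  score;
    int8_t   depth;
    int8_t   bound;                 /* e.g. exact / lower / upper */
} TTEntry;

static TTEntry tt[TT_SIZE];

/* Before each search: copy every learned adjustment into the main
 * transposition table so the search sees it as a score bias. */
void seed_tt_from_learning(void)
{
    for (int i = 0; i < LEARN_SIZE; i++) {
        if (learn_table[i].key == 0 || learn_table[i].adjust == 0)
            continue;
        TTEntry *t = &tt[learn_table[i].key & (TT_SIZE - 1)];
        t->key   = learn_table[i].key;
        t->score = learn_table[i].adjust;
        t->depth = 0;               /* shallow: real results overwrite it */
        t->bound = 0;
    }
}

Storing the seeds at depth 0 fits the 'slight bonus/penalty' idea: in a typical probe, a depth-0 entry only steers move ordering and evaluation rather than outranking genuine deep search results.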
QUESTION 4: a, b and c
a) Yes. Since the RL database is loaded into the main hash table before the main search begins, the main hash table then contains the learning from the RL phase. Therefore the main hash table will perform much better.
b) I did not say 'evaluation function'; I said 'the evaluations'. I meant the evaluations from the main hash table as they are propagated down to the root. Sorry for being vague in my post above. I'll take the blame for that!
c) Because the main hash table is more accurate due to the RL adjustments.