Discussion of chess software programming and technical issues.
Moderators: hgm, Dann Corbit, Harvey Williamson
Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.

hgm
 Posts: 25860
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller

Contact:
Post
by hgm » Thu Feb 18, 2021 8:26 am
Mike Sherwin wrote: ↑Wed Feb 17, 2021 10:46 pm
In RL an action is defined as any action the actor can do at any specific moment in time like playing g4 on the first move and f3 on the second move.
Says who? Certainly not the definitions I have seen. These are completely abstract, defining actions as (probabilistic) transitions between 'states' of an environment, without putting any requirement on what these states can be. For instace, identifying a 'state' with a pair of integers (nr_of_wins, nr_of_games_played) and an 'action' to playing a game using a certain strategy (e.g. a set of evaluation parameters) satisfies all requirements of the abstract model, just to name one counterexample to your claim.
In chess the entire task is the next move. And the game record is many task performed in linear time. The main problem for a chess engine is often not knowing which move (or moves) led to the other side being able to win. That is why all the winning sides moves (good or bad) get a very small positive adjustment. All the losing sides moves (good or bad) are given a very small negative adjustment. Wrongly adjusted moves are corrected over time providing the adversaries are competent.
The problem is that game results in chess are probabilistic, and whether a strategy was good or bad can only be judged after sufficently many games have been played to get a statistically significant result. Otherwise the learning process would mainly be driven by noise, and would converge only when the learning steps are made so small that they cannot significantly modify the strategy before sufficiently many games have been played to make their accumulated result statistically significant. In that limit it doesn't really make a difference whether I first collect many games under a constant strategy, and then apply the sum of all the strategy corrections the individual games suggest, or apply them one by one.

Mike Sherwin
 Posts: 156
 Joined: Thu Aug 20, 2020 11:25 pm
 Full name: Michael J Sherwin
Post
by Mike Sherwin » Thu Feb 18, 2021 12:15 pm
Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents (a chess engine)ought to take actions (a move)in an environment (rules of chess)in order to maximize the notion of cumulative reward(from previous experience).[1] Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
Reinforcement learning differs from supervised learning in not needing labelled input/output pairs be presented, and in not needing suboptimal actions to be explicitly corrected(moves getting a wrong adjustment immediately corrected). Instead the focus is on finding a balance between exploration (under an evaluator)(of uncharted territory) and exploitation (of current knowledge)(after accumulated RL adjustments).[2]
The environment is typically stated in the form of a Markov decision process (MDP), because many reinforcement learning algorithms for this context use dynamic programming techniques.[3] The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge (Rl is only concerned with results, not general knowledge)of an exact mathematical model of the MDP and they target large MDPs where exact methods become infeasible.(RL is not for improving an evaluator because an evaluator can never be precise enough in all situations. The goal of RL is to learn over time to differentiate between immediate actions at a given moment in time.)
The problems of interest in reinforcement learning have also been studied in the theory of optimal control, which is concerned mostly with the existence and characterization of optimal solutions, and algorithms for their exact computation, and less with learning or approximation ... (RL is not about training an approximator. It is about finding optimal actions in exact circumstances)
A reinforcement learning agent interacts with its environment in discrete time steps(moves). At each time t, the agent receives the current state {\displaystyle s_{t}}s_{t} and reward (from accumulated experience){\displaystyle r_{t}}r_{t}. It then chooses an action {\displaystyle a_{t}}a_{t} from the set of available actions, which is subsequently sent to the environment(move made on board). The environment moves to a new state {\displaystyle s_{t+1}}s_{t+1} and the reward {\displaystyle r_{t+1}}r_{t+1} associated with the transition {\displaystyle (s_{t},a_{t},s_{t+1})}(s_{t},a_{t},s_{t+1}) is determined. The goal of a reinforcement learning agent is to learn a policy: {\displaystyle \pi :A\times S\rightarrow [0,1]}{\displaystyle \pi :A\times S\rightarrow [0,1]}, {\displaystyle \pi (a,s)=\Pr(a_{t}=a\mid s_{t}=s)}{\displaystyle \pi (a,s)=\Pr(a_{t}=a\mid s_{t}=s)} which maximizes the expected cumulative reward.(It's policy is to play the move that scores highest after the accumulated rewards are factored in)
Two elements make reinforcement learning powerful: the use of samples (Can be RT samples of games or game segments because they return information further away from the root)to optimize performance and the use of function approximation (evaluation function)to deal with large environments(as in not solvable like chess). Thanks to these two key components, reinforcement learning can be used in large environments in the following situations:
A model of the environment is known, but an analytic solution is not available;(like in chess)
Only a simulation model of the environment is given (the subject of simulationbased optimization);[5]
The only way to collect information about the environment is to interact with it.(Through real time sampling)
The first two of these problems could be considered planning problems (since some form of model is available), while the last one could be considered to be a genuine learning problem. However, reinforcement learning converts both planning problems to machine learning problems.
Reinforcement learning requires clever exploration mechanisms(an AB search with an evaluator); randomly selecting actions, without reference to an estimated probability distribution, shows poor performance (Monte Carlo is not very good). The case of (small) finite Markov decision processes is relatively well understood. However, due to the lack of algorithms that scale well with the number of states (or scale to problems with infinite state spaces), simple exploration methods are the most practical.(shallow fast searches that can intelligently examine the search space by playing many game segments under control of what it learns after each segment)
The brute force approach entails two steps:
For each possible policy, sample returns while following it (a sample, a game segment is taken for each possible action, move)
Choose the policy with the largest expected return(then play the move with the best adjusted score and then another and another to see outcome further away in time)
Deep reinforcement learning
This approach extends reinforcement learning by using a deep neural network and without explicitly designing the state space.[29] The work on learning ATARI games by Google DeepMind increased attention to deep reinforcement learning or endtoend reinforcement learning. So what AlphaZero and the like do is extend RL. That means that RL itself does not directly cause a differentiation between moves. Instead it trains a general approximator to be a bit stronger. The general approximator is NOT RL. It is an extension of RL. The approximator cannot contain learned rewards for each specific board position and therefore is no longer directly RL.
This is from Wikipedia. I never read it before. And yet it says exactly everything I have said albeit in more general terms. I acknowledge it is not the only source of information. But it makes very clear what RL is and what RL is not. It defines RL exactly how I defined it for myself when I reinvented it in the second half of 2005 based solely on what I remembered from high school about Pavlov's dog experiments.

hgm
 Posts: 25860
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller

Contact:
Post
by hgm » Thu Feb 18, 2021 12:40 pm
Just to clear things up: the black text you quote is from Wikipedia. All the blue stuff sprang from your own imagination.
Note I am not denying that what you describe (where actions are moves, etc.) is a case of reinforcement learning. But it is just one particular example of it. An action can also be a game, or in my case, a match. It could even be swinging a baseball bat with a certain speed at a certain angle.
It seems to me that you want to usurp the the term 'reinforcement learning', and narrow its meaning to a very specific case of what you are doing.

hgm
 Posts: 25860
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller

Contact:
Post
by hgm » Fri Feb 19, 2021 8:20 pm
I came to realize that I overlooked one thing: apart from being forced into a defenseonly role, because your total material is not enough to overwhelm the opponent's Palace, you can also voluntarily choose to defend. That then makes your own Pawns practically useless, but it also makes the opponent's Advisors useless, as he cannot use those for attack. But if you have few Pawns, and the opponent many Advisors, that could actually be a very good deal. E.g. suppose both players have two Advisors, a Rook, two Horses and an Elephant each. the opponent has 3 Pawns, and you only one. So in total material you are two Pawns behind. But by using all your material for defense it has become a battle between his 3 Pawns and your two Advisors, with otherwise equal material. Conventional piece values are such that an Advisor is worth more than a Pawn (and indeed two Advisors would easily draw three Pawns if they were the only material; you would need 5 Pawns to overcome such a defense). So you would really be less than a Pawn behind.
So the switchover to a 'defensive strategy', where you no longer count friendly Pawns and enemy Advisors, should not only occur when your own 'attack surplus' has dwindled below zero, but whenever the opponent's attack surplus in smaller than his total advantage. (And your attack surplus is smaller than his, for otherwise you could still play for a win.)