Mobility eval

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Mobility eval

Post by hgm »

Don wrote:The table is non-linear so that having some mobility is more important than having a lot. In other words if one bishop is highly mobile and the other has little mobility it is much more important to help the bishop with low mobility.
I have always felt this way too, even to the point where I considered scoring different directions of the same piece that way. E.g. a Rook that has 2 moves along a file and 2 along a rank is better than a Rook with 4 moves along a file and 0 along the rank. Unfortunately the way I plan to implement mobility now makes it kind of difficult to implement such non-linearity.

If you look upon mobility as board control, however, such non-linearity makes less sense. For that purpose it would be more logical to punish 'over-controlling' of squares, e.g. aiming 5 pieces at the same square, amongst which a Pawn, while the opponent only has a single attack on it. It would probably have been better to divert part of the resources to controlling two other, now uncontrolled squares, i.e. leave this one for the Pawn, and direct two pieces to each of two other squares. This kind of non-linearity (per square rather than per piece) would be easier to do in the calculational scheme I have in mind.
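
To make that kind of table concrete, something along these lines (C, with purely illustrative values - not taken from any real engine):

/* Concave per-piece mobility table: the first few moves gained are worth
   more than the later ones (index = number of bishop moves, 0..13). */
static const int bishop_mob[14] = {
    -30, -15, -5, 0, 5, 9, 12, 15, 17, 19, 20, 21, 22, 23
};

/* The directional idea for the Rook: score file moves and rank moves
   separately, so 2+2 scores better than 4+0 (again, made-up numbers). */
static const int rook_dir_mob[8] = { -10, 0, 5, 8, 10, 11, 12, 13 };

int rook_mobility_score(int file_moves, int rank_moves)
{
    return rook_dir_mob[file_moves] + rook_dir_mob[rank_moves];
}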
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Mobility eval

Post by chrisw »

hgm wrote:
Don wrote:The table is non-linear so that having some mobility is more important than having a lot. In other words if one bishop is highly mobile and the other has little mobility it is much more important to help the bishop with low mobility.
I have always felt this way too, even to the point where I considered scoring different directions of the same piece that way. E.g. a Rook that has 2 moves along a file and 2 along a rank is better than a Rook with 4 moves along a file and 0 along the rank. Unfortunately the way I plan to implement mobility now makes it kind of difficult to implement such non-linearity.

If you look upon mobility as board control, however, such non-linearity makes less sense. For that purpose it would be more logical to punish 'over-controlling' of squares, e.g. aiming 5 pieces at the same square, amongst which a Pawn, while the opponent only has a single attack on it. It would probably have been better to divert part of the resources to controlling two other, now uncontrolled squares, i.e. leave this one for the Pawn, and direct two pieces to each of two other squares. This kind of non-linearity (per square rather than per piece) would be easier to do in the calculational scheme I have in mind.
Well, yes, but there is the concept of overprotection. It is often the best, and often a game-winning, strategy. I guess the problem is that if we define particular eval subfunctions, e.g. mobility, we can get to the point where the myriad possible tweaks of a particular subfunction bring it within the scope of another, or of an undefined, subfunction.

I'm not really that familiar anymore with recent developments, for example multiple-game testing of tweaks. Suppose you were to implement your idea above: how many games, and more importantly, how much time would a developer need to prove the idea and to establish what size of weighting it should be given?
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Mobility eval

Post by Don »

chrisw wrote:
hgm wrote:
Don wrote:The table is non-linear so that having some mobility is more important than having a lot. In other words if one bishop is highly mobile and the other has little mobility it is much more important to help the bishop with low mobility.
I have always felt this way too, even to the point where I considered scoring different directions of the same piece that way. E.g. a Rook that has 2 moves along a file and 2 along a rank is better than a Rook with 4 moves along a file and 0 along the rank. Unfortunately the way I plan to implement mobility now makes it kind of difficult to implement such non-linearity.

If you look upon mobility as board control, however, such non-linearity makes less sense. For that purpose it would be more logical to punish 'over-controlling' of squares, e.g. aiming 5 pieces at the same square, amongst which a Pawn, while the opponent only has a single attack on it. It would probably have been better to divert part of the resources to controlling two other, now uncontrolled squares, i.e. leave this one for the Pawn, and direct two pieces to each of two other squares. This kind of non-linearity (per square rather than per piece) would be easier to do in the calculational scheme I have in mind.
Well, yes, but there is the concept of overprotection. It is often the best, and often a game-winning, strategy. I guess the problem is that if we define particular eval subfunctions, e.g. mobility, we can get to the point where the myriad possible tweaks of a particular subfunction bring it within the scope of another, or of an undefined, subfunction.

I'm not really that familiar anymore with recent developments, for example multiple-game testing of tweaks. Suppose you were to implement your idea above: how many games, and more importantly, how much time would a developer need to prove the idea and to establish what size of weighting it should be given?
Hi Chris.

I'm not sure I understand the question, but in the general case when implementing some change or new "concept" we will start with an educated guess about what the right values to test are. Then we will run about 10,000 games at a very fast level (game in 2 or 3 seconds) to get a rough idea of where the idea stands and then we follow up by testing values a bit smaller and a bit larger in additional tests. If one of them shows a big improvement it's an indication that we need to move in that direction and more tests will follow. This part is a "black art" and guided by intuition and superstition. Then we graduate to much longer levels with the most likely candidates because we don't accept any change tested at ridiculously fast time controls.

We are forced to do a lot of testing at these time controls however because there is so much statistical noise in small samples. For example even after 1000 games you can show a 10-20 ELO improvement which in fact is a regression. You need 100,000 games to really be semi-confident in a 2 ELO change. So we don't have the resources to test like we wish we could.
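
The back-of-the-envelope arithmetic behind those numbers looks roughly like this (just a sketch: the 0.4 per-game standard deviation and the 2-sigma interval are assumptions, not a description of our actual test harness):

#include <math.h>
#include <stdio.h>

/* Rough 2-sigma Elo error bar after a given number of games, assuming the
   standard deviation of a single game result is about 0.4 points (typical
   with engine draw rates) and ~695 Elo per unit of score fraction near 50%. */
double elo_error_bar(int games)
{
    double sigma_score = 0.4 / sqrt((double)games);
    return 2.0 * 695.0 * sigma_score;
}

int main(void)
{
    printf("1,000 games   -> +/- %.1f Elo\n", elo_error_bar(1000));    /* ~17.6 */
    printf("100,000 games -> +/- %.1f Elo\n", elo_error_bar(100000));  /* ~1.8  */
    return 0;
}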

When we did the "progressive" mobility as I described above, it was a major improvement for us, we could see that with only a small number of games but that is not typical, most of our improvements are difficult to measure.

There are some automated means to try to zero in on the right values, we have experimented with that in the past but we still rely mostly on intuition guided weight tuning.

Did that answer your question or did I understand your question incorrectly?
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Mobility eval

Post by chrisw »

Don wrote:
chrisw wrote:
hgm wrote:
Don wrote:The table is non-linear so that having some mobility is more important than having a lot. In other words if one bishop is highly mobile and the other has little mobility it is much more important to help the bishop with low mobility.
I have always felt this way too, even to the point where I considered scoring different directions of the same piece that way. E.g. a Rook that has 2 moves along a file and 2 along a rank is better than a Rook with 4 moves along a file and 0 along the rank. Unfortunately the way I plan to implement mobility now makes it kind of difficult to implement such non-linearity.

If you look upon mobility as board control, however, such non-linearity makes less sense. For that purpose it would be more logical to punish 'over-controlling' of squares, e.g. aiming 5 pieces at the same square, amongst which a Pawn, while the opponent only has a single attack on it. It would probably have been better to divert part of the resources to controlling two other, now uncontrolled squares, i.e. leave this one for the Pawn, and direct two pieces to each of two other squares. This kind of non-linearity (per square rather than per piece) would be easier to do in the calculational scheme I have in mind.
Well, yes, but there is the concept of overprotection. It is often the best, and often a game-winning, strategy. I guess the problem is that if we define particular eval subfunctions, e.g. mobility, we can get to the point where the myriad possible tweaks of a particular subfunction bring it within the scope of another, or of an undefined, subfunction.

I'm not really that familiar anymore with recent developments, for example multiple-game testing of tweaks. Suppose you were to implement your idea above: how many games, and more importantly, how much time would a developer need to prove the idea and to establish what size of weighting it should be given?
Hi Chris.

I'm not sure I understand the question, but in the general case when implementing some change or new "concept" we will start with an educated guess about what the right values to test are. Then we will run about 10,000 games at a very fast level (game in 2 or 3 seconds) to get a rough idea of where the idea stands and then we follow up by testing values a bit smaller and a bit larger in additional tests. If one of them shows a big improvement it's an indication that we need to move in that direction and more tests will follow. This part is a "black art" and guided by intuition and superstition. Then we graduate to much longer levels with the most likely candidates because we don't accept any change tested at ridiculously fast time controls.

We are forced to do a lot of testing at these time controls however because there is so much statistical noise in small samples. For example even after 1000 games you can show a 10-20 ELO improvement which in fact is a regression. You need 100,000 games to really be semi-confident in a 2 ELO change. So we don't have the resources to test like we wish we could.

When we did the "progressive" mobility as I described above, it was a major improvement for us, we could see that with only a small number of games but that is not typical, most of our improvements are difficult to measure.

There are some automated means to try to zero in on the right values, we have experimented with that in the past but we still rely mostly on intuition guided weight tuning.

Did that answer your question or did I understand your question incorrectly?
Bonjour Don,

Pretty much, except for one thing. I guessed that was what was being done: the same old tinkering and tuning with a bit of testing that has been the norm for a long time. Clearly some people test way more extensively than others, though.

What I really wanted to know was HOW MUCH TIME does it take to test an evaluation feature tweak that involves a small(?) change to the code and a new weight? How long before you would have enough confidence to include the tweak (or throw it away)? At 100,000 times 1-minute games that would appear to be about 70 days per weight CHANGE, and therefore way longer to zero in on the best weight implementation?

So, for some relatively simple term like mobility, let's say the base and straightforward implementation would be as per the wiki: add up the pseudo-legal moves and apply a suitable weight per move. Tuning that alone takes some time, right? How long?
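
That is, nothing more sophisticated than a sketch like this (the Position type, the attacks()/pop_lsb()/popcount() helpers and the centipawn weights here are assumed placeholders, not anyone's actual code):

#include <stdint.h>

/* Count pseudo-legal moves per piece and apply one weight per piece type.
   attacks() stands for any attack generator (precomputed tables, magic
   bitboards, ...); weights are indexed by an assumed PAWN..KING enum. */
int mobility_eval(const Position *pos, int side)
{
    static const int weight[8] = { 0, 0, 4, 4, 2, 1, 0, 0 };  /* -, P, N, B, R, Q, K, - */
    int score = 0;
    uint64_t own = pos->occupied[side];

    for (int pt = KNIGHT; pt <= QUEEN; pt++) {
        uint64_t pieces = pos->pieces[side][pt];
        while (pieces) {
            int sq = pop_lsb(&pieces);
            uint64_t targets = attacks(pt, sq, pos->all_occupied) & ~own;
            score += weight[pt] * popcount(targets);
        }
    }
    return score;
}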

Then if we decide on any or all of the possible tweaks mentioned - more bonus for forward moves, more bonus for centre-square hits, relative bonus changes for very-low-mobility situations, not including enemy pawn covers, including own piece hits, etc. etc. - in order to take this beyond just tinkering and messing around, i.e. to do it with some degree of accuracy and REAL improvement by proper testing, how much more time do we then need?

I ask because it seems to me that if your program is relatively young, the obvious thing to do is to implement the base and straightforward evaluation subfunctions at a relatively simple level, because the limited time available won't allow you to properly test any more than that?
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Mobility eval

Post by diep »

Don wrote:
diep wrote:
Don wrote:
Mincho Georgiev wrote:Unfortunately, it is heavily weight-dependent.
I've tested the following schemes:

1. "real mobility" - no LVA attacked squares - with increments.
2. -||- - no pawn attacked squares - increments.
2. mobility with increments including capturing squares /all captures/.
3. mobility without occupied squares.
4. "space" - all squares regardless of the attack on each one of them - with no increments, but collective bonus instead.

For me, No.4 works best. It could be just a good balancing regarding the rest of the evaluation terms tough, I'm pretty sure - it's all about the weights and that makes it so hard to find the best one - needs A LOT of testing.
Plus to No.4 - I'm testing for center control and wide center control. It gives me good results, combined with the collective "space" bonuses/penalties.
The problem with any way of doing evaluation is that you cannot be sure that you implemented it the best way. Komodo's is relatively elaborate, but who is to say it's any good? It could be that it works because we are missing other important terms or because we did not set the weights right for previous simpler attempts. All evaluation is like this, it's really a black art.
The problem is measuring whether something works.

Maybe 1 million games at, say, 5 minutes for all moves?
Indeed, that is a big problem. We have to be happy with 50,000 games for most of our measurements at fast levels - and fast levels do not always come out the same. If the change helps significantly you can prove it's better without that many games.
With a huge evaluation function, modifying 1 small thing somewhere - forget about that. If every line of code in Diep's eval gave 1 Elo point, it would be really strong, you know :)

Of course bugs that lobotomize you are another question, yet if you have a range of bugs, fixing just one again doesn't give much improvement :)

That's not 50k games though, start thinking in the millions.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Mobility eval

Post by Ferdy »

hgm wrote:What is the current consensus on mobility evaluation?

I have seen that some programs just count legal moves (weighted by piece type), others count only moves to squares not controlled by enemy Pawns, while still others only count forward moves of some pieces. Is there a way that is considered 'best'?
Not sure if this is best, but hitting the squares near the opponent's king works for me.
hgm wrote: From my piece-value measurements I know that forward moves of a piece are typically worth twice as much as backward or sideways moves, so weighting them differently could make sense. (Only counting forward moves seems a bit extreme, though.) An alternative, which on average would achieve the same, would be to weight by target square. If squares on central ranks are weighted more, this favors forward mobility, as pieces usually can only be safely kept on your own half of the board. (White and black weighting of the same square can of course be made different.)
Agree with this - target squares should be weighted. In addition to center squares, you have 7th-rank squares, squares near the opponent's king, squares on open files (these give access to enemy territory), squares on half-open files, attacks on the f3/c3 and f6/c6 squares, and square holes along the 3rd/6th ranks (squares not protected by pawns).
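
One simple way to fold all of that into a single table is something like this rough sketch (the values and the king-zone bonus are placeholders, not what I actually use):

#include <stdint.h>

/* Per-target-square mobility weights, from White's point of view
   (a1 = 0 ... h8 = 63); Black uses the vertically mirrored square.
   Centre, forward squares and the 7th rank count for more. */
static const int target_weight[64] = {
    1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 2, 2, 2, 2, 1, 1,
    1, 1, 2, 3, 3, 2, 1, 1,
    1, 2, 3, 4, 4, 3, 2, 1,
    2, 2, 3, 4, 4, 3, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3,   /* 7th rank */
    2, 2, 2, 2, 2, 2, 2, 2
};

/* 'moves' is the bitboard of reachable squares for one piece; pop_lsb()
   and the WHITE constant are assumed engine helpers. */
int weighted_mobility(uint64_t moves, uint64_t enemy_king_zone, int side)
{
    int score = 0;
    while (moves) {
        int sq = pop_lsb(&moves);
        int w  = target_weight[side == WHITE ? sq : sq ^ 56];
        if (enemy_king_zone & (1ULL << sq))
            w += 2;                       /* extra for squares near the king */
        score += w;
    }
    return score;
}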
hgm wrote: I also wondered if there is a rationale for excluding only squares controlled by Pawns, as opposed to excluding squares controlled by any enemy piece of lower value. I guess mobility can also be looked upon as 'board control', and attacking a square with a Rook, even if it is protected by a Knight, still increases your control over that square (it prevents the opponent from entering it with a Queen, and it would allow you to enter it with a minor). Squares controlled by enemy Pawns can never be entered by your pieces, however, no matter how often you attack them. But attacking such squares could still prevent the opponent from entering them with a higher piece. So they might deserve to carry some (small) weight.
I am working now on crazyhouse, and am actually counting who has the better control of every square; a small bonus is given if the attackers outnumber the defenders. This is still coarse, but the plan is to consider the value of the attackers and defenders. The drops are crazy.
hgm wrote: I was planning to implement this by taking counts of each piece type that could reach a square (as a sort of material index of the material that reaches it), so that I can use a lookup table to translate that material to a score, and basically use any weighting scheme without requiring any additional computational effort.
Nice :) .
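
For what it's worth, a bare-bones sketch of that kind of per-square attack-count index could look like this (the 2-bit packing, the piece weights and the saturation point are just placeholder choices, not hgm's actual scheme):

/* Per square, pack small attack counts per piece type into an index, then
   translate the index to a score with a precomputed table, so any weighting
   (including over-control penalties) is free at eval time. */
enum { IDX_BITS = 12 };                  /* 2 bits x 6 piece types (P N B R Q K) */
static int control_score[1 << IDX_BITS]; /* filled once at startup               */

void init_control_table(void)
{
    static const int piece_w[6] = { 2, 4, 4, 3, 2, 1 };   /* P N B R Q K */
    for (int idx = 0; idx < (1 << IDX_BITS); idx++) {
        int raw = 0;
        for (int pt = 0; pt < 6; pt++)
            raw += piece_w[pt] * ((idx >> (2 * pt)) & 3);
        /* diminishing returns: over-controlling one square pays less */
        control_score[idx] = raw <= 8 ? raw : 8 + (raw - 8) / 2;
    }
}

static int pack_counts(const unsigned char cnt[6])
{
    int idx = 0;
    for (int pt = 0; pt < 6; pt++)
        idx |= (cnt[pt] > 3 ? 3 : cnt[pt]) << (2 * pt);    /* saturate at 3 */
    return idx;
}

/* During evaluation, per square: */
int square_control(const unsigned char white_cnt[6], const unsigned char black_cnt[6])
{
    return control_score[pack_counts(white_cnt)]
         - control_score[pack_counts(black_cnt)];
}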
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Mobility eval

Post by Ferdy »

chrisw wrote:
hgm wrote:
Don wrote:The table is non-linear so that having some mobility is more important than having a lot. In other words if one bishop is highly mobile and the other has little mobility it is much more important to help the bishop with low mobility.
I have always felt this way too, even to the point where I considered scoring different directions of the same piece that way. E.g. a Rook that has 2 moves along a file and 2 along a rank is better than a Rook with 4 moves along a file and 0 along the rank. Unfortunately the way I plan to implement mobility now makes it kind of difficult to implement such non-linearity.

If you look upon mobility as board control, however, such non-linearity makes less sense. For that purpose it would be more logical to punish 'over-controlling' of squares, e.g. aiming 5 pieces at the same square, amongst which a Pawn, while the opponent only has a single attack on it. It would probably have been better to divert part of the resources to controlling two other, now uncontrolled squares, i.e. leave this one for the Pawn, and direct two pieces to each of two other squares. This kind of non-linearity (per square rather than per piece) would be easier to do in the calculational scheme I have in mind.
Well, yes, but there is the concept of overprotection. It is often the best, and often a game-winning, strategy. I guess the problem is that if we define particular eval subfunctions, e.g. mobility, we can get to the point where the myriad possible tweaks of a particular subfunction bring it within the scope of another, or of an undefined, subfunction.
Somehow overprotection seems to be applicable only to humans.
chrisw wrote: I'm not really that familiar anymore with recent developments, for example multiple-game testing of tweaks. Suppose you were to implement your idea above: how many games, and more importantly, how much time would a developer need to prove the idea and to establish what size of weighting it should be given?
Run a minimum of 5k games at, say, TC 40 moves/20 sec. Run the bayeselo program and compare with the old version; if the new version leads by an Elo amount outside the error bar, that should be it - you may accept the change and stop the test. Say 1 game is completed in 60 sec. The time needed will be 5000 x 60 sec = 83.3 hours. On a quad computer using 3 cores simultaneously, that would take 83.3 hr / 3 = 27.8 hrs, or 1 day and 4 hrs.
If the new version does not lead, it is probably time to tweak the values assigned, or to change a couple of conditions. The best way is to review a couple of games to see whether there are patterns that are not good. A tool could be created to examine the games, pointing out types of blunder moves, heavy losses of material, or quick-loss patterns.
If the lead of the new version is within the error bar, add more games, say another 5k. Sometimes you accept the change even if there is no visible improvement Elo-wise but you are happy with how the new version plays. You may verify the change using a longer TC. If I use a preliminary test of TC 40 moves/5 sec, I will verify using a longer TC.
There is also a program called CLOP by Remi, for optimizing the parameters of anything, say mobility. Used together with cutechess-cli by Ilari and Arto, you just run as many games as you can - the more games, the more reliable the generated parameter values. You can stop the run and continue another day; everything is saved.
PK
Posts: 893
Joined: Mon Jan 15, 2007 11:23 am
Location: Warsza

Re: Mobility eval

Post by PK »

Inspired by this thread I did some tests for Rodent (already using non-linear mobility). Results are somewhat funny and very piece-dependent:

- excluding squares controlled by enemy pawns helps for knights (big time) and bishops (a bit), but hurts for the majors.
- forward mobility proved useful for bishops and rooks, but hurts for knights and queens. As for knights, it might be redundant with my piece-square table.
- there's a gain in making the queen "transparent" to the bishop, but it may be based just on speed (no recalculation for the king-safety eval). There's also a gain in making major pieces transparent along rays. Come to think of it, I blur Don Dailey's distinction between "mobility" and "range" more and more.
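
Spelled out, the "transparent" trick can be as simple as dropping the own queen from the occupancy before the slider lookup - a sketch, where bishop_attacks() is an assumed magic-bitboard style helper and Position an assumed engine type:

#include <stdint.h>

/* "Queen transparent to bishop": compute the bishop's mobility targets with
   the own queen removed from the occupancy, so the bishop sees through it
   along the diagonal.  The same idea works for rooks/queens along rays. */
uint64_t bishop_mobility_targets(const Position *pos, int side, int sq)
{
    uint64_t occ = pos->all_occupied & ~pos->pieces[side][QUEEN];
    return bishop_attacks(sq, occ) & ~pos->occupied[side];
}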
Dan Honeycutt
Posts: 5258
Joined: Mon Feb 27, 2006 4:31 pm
Location: Atlanta, Georgia

Re: Mobility eval

Post by Dan Honeycutt »

diep wrote:
Don wrote:Indeed, that is a big problem. We have to be happy with 50,000 games for most of our measurements at fast levels - and fast levels do not always come out the same. If the change helps significantly you can prove it's better without that many games.
With a huge evaluation function, modifying 1 small thing somewhere - forget about that. If every line of code in Diep's eval gave 1 Elo point, it would be really strong, you know :)

Of course bugs that lobotomize you are another question, yet if you have a range of bugs, fixing just one again doesn't give much improvement :)

That's not 50k games though, start thinking in the millions.
The improvement given depends on the bug. In my first few releases of Bruja I had a bug in my quiescent search where I threw out the good captures and searched the bad ones. It only took about 3 or 4 games to determine that fixing that was a significant improvement.
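
For the record, the intended filter is something like this sketch (see(), make_move(), unmake_move() and qsearch() are assumed helpers; in the buggy version this SEE test was effectively inverted):

/* Inside quiescence search: try only captures that do not lose material
   according to static exchange evaluation (SEE), skip the rest. */
for (int i = 0; i < num_captures; i++) {
    if (see(pos, captures[i]) < 0)   /* losing capture according to SEE */
        continue;                    /* ...so skip it                   */
    make_move(pos, captures[i]);
    int score = -qsearch(pos, -beta, -alpha);
    unmake_move(pos, captures[i]);
    if (score >= beta)
        return beta;
    if (score > alpha)
        alpha = score;
}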

Best
Dan H.