The problem with almost all test positions is that they require both an evaluation-specific feature and a tree search. You want the former to be critical, and the latter to be irrelevant, for this kind of testing. Pushing a pawn to "cement" a black pawn on a weak square is something an evaluation could find on its own. Or, if it is not that clever, it could depend on a short tree search to see that the pawn is sorta-weak: if the opponent can push it and trade it quickly, the weakness goes away, while if we play to prevent its movement, the weakness stays.

MattieShoes wrote:
That's kind of what I was getting at. They went through a lot of work to make their eval tuning work, and the paper details some of the pitfalls, such as culling positions where the chosen move's score is wildly different from the best move's score, and how deeper searches yield better results. The functions they were using to measure the quality of an eval could be used just as easily to rank the quality of different engines.
They also point out that the tuning helped, but that the most "tuned" versions underperformed. I'm guessing the eval was getting the right answers for the wrong reasons, so even with care you're likely to get outliers whose strength is not well represented by their score.
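A rough sketch of the culling step mentioned in the quote might look like the following. The threshold, field names, and sample data are all invented for illustration; the paper's actual filtering criteria may differ.

```python
# Hypothetical sketch: discard training positions where the score of the move
# actually played deviates too far from the score of the engine's best move.

CULL_THRESHOLD = 50  # centipawns; illustrative value only

def cull_positions(positions, threshold=CULL_THRESHOLD):
    """Keep only positions whose played-move score is close to the best-move score."""
    return [p for p in positions
            if abs(p["played_score"] - p["best_score"]) <= threshold]

# Example: the middle position (played move 120 cp worse than best) is dropped.
sample = [
    {"fen": "...", "played_score": 30,  "best_score": 35},
    {"fen": "...", "played_score": -90, "best_score": 30},
    {"fen": "...", "played_score": 10,  "best_score": 10},
]
kept = cull_positions(sample)
print(len(kept))  # 2
```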
If you want to tune your evaluation, you need to tune against positions where the correct evaluation is known and no search is required. Then you can _really_ tune your eval to adjust its score to match what is known to be correct. But test positions rarely allow that: they require search and evaluation together, and search can often compensate for mis-evaluations, which makes evaluation tuning irrelevant in such positions.
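As an illustration of tuning against positions with known-correct scores (no search involved), here is a minimal sketch: a toy linear evaluation whose weights are fit by gradient descent on the squared error against the known scores. The feature vectors, target scores, and learning rate are all invented; a real evaluation has far more terms.

```python
def evaluate(weights, features):
    """Toy linear evaluation: a weighted sum of position features."""
    return sum(w * f for w, f in zip(weights, features))

def tune(weights, data, lr=0.01, epochs=200):
    """Plain per-position gradient descent on squared error vs. known score."""
    w = list(weights)
    for _ in range(epochs):
        for features, target in data:
            err = evaluate(w, features) - target
            for i, f in enumerate(features):
                w[i] -= lr * err * f
    return w

# Invented data: (feature vector, known-correct score in centipawns).
data = [
    ([1.0, 0.0, 2.0], 150.0),
    ([0.0, 1.0, 1.0], -40.0),
    ([2.0, 1.0, 0.0],  80.0),
]
w0 = [0.0, 0.0, 0.0]
w1 = tune(w0, data)

def mse(w):
    return sum((evaluate(w, f) - t) ** 2 for f, t in data) / len(data)

print(mse(w1) < mse(w0))  # True: tuned weights fit the known scores better
```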
I gave up on this type of testing years ago. Even playing fast games can be very misleading about a change. You can make a program very aggressive with passed pawn pushes, and a shallow-searching opponent will get into trouble, so the aggressive pawn pushing looks good. But in longer games, all it does is advance the pawn to where it is easier to capture, and it can look much worse...
That's why I test with fast games and verify with slow games. And occasionally I even retest with slow games when fast games look bad, just to be sure that a change which looks good intuitively, but shows up as bad in fast games, is also bad in longer games.
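The fast-then-slow scheme could be sketched roughly like this. Here play_match() is just a placeholder standing in for a real engine-vs-engine match runner, and its numbers are invented; the point is only the two-stage accept/reject flow.

```python
# Sketch of two-stage testing: screen a change with many fast games,
# then confirm the verdict with fewer slow games before accepting it.

def play_match(games, time_control):
    """Placeholder: return the new version's score out of `games` games.
    Hard-coded for illustration; a real harness launches actual games."""
    # Pretend the change scores 52% at fast TC but only 48% at slow TC.
    return games * (0.52 if time_control == "fast" else 0.48)

def evaluate_change(fast_games=1000, slow_games=200, threshold=0.5):
    fast_score = play_match(fast_games, "fast") / fast_games
    if fast_score <= threshold:
        return "reject (fast games)"
    # Fast games look good -- verify at a slower time control.
    slow_score = play_match(slow_games, "slow") / slow_games
    if slow_score <= threshold:
        return "reject (slow games disagreed)"
    return "accept"

print(evaluate_change())  # reject (slow games disagreed)
```

The flow mirrors the point above: a change that looks good at fast time controls is not trusted until slow games agree.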