Passed Pawns (endgame)

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Passed Pawns (endgame)

Post by bob »

rvida wrote:
bob wrote: I'm not willing to do that any longer. I'd rather play 60,000 games in 1 sec/game as opposed to 1000 games in 1 min/game. That way there are practically _no_ "steps backward"...
With all respect to your testing methods (which may be scientifically accurate) - in the real world, without a university-sponsored cluster, we want to make some progress without waiting through two weeks of self-play just to accept a simple change in the codebase. We must take some shortcuts, and the SF team's progress shows that these shortcuts do indeed work. While I have more relaxed rules than Marco, Critter's progress is pretty evident too. (I only wish I had the SF team's "secret" autotuner :) )

P.S.: sorry for my horrible English
First, give this some thought: how long does it take to play a game at 1 sec/game? That's 60 games per minute, 3,600 per hour, over 80,000 in 24 hours. It is not that hard to do rigorous testing. You might have to compromise on the time control, but apparently (based on Larry Kaufman's comments) Rybka has been extensively tuned at 40K games overnight for over a year...
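To put numbers on that throughput argument, here is a rough sketch; the per-game wall-clock cost and the core count are illustrative assumptions, not measurements from any particular machine:

Code: Select all

#include <cstdio>

int main()
{
    // Illustrative assumptions: a game/1 sec match costs roughly two seconds
    // of wall clock per game, and a single test box has four cores.
    const double secondsPerGame = 2.0;
    const int    cores          = 4;
    const double secondsPerDay  = 24.0 * 3600.0;

    // Total games per day across all cores under these assumptions.
    double gamesPerDay = cores * secondsPerDay / secondsPerGame;
    std::printf("~%.0f games per day\n", gamesPerDay);
    return 0;
}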
rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: Passed Pawns (endgame)

Post by rvida »

bob wrote:
rvida wrote:
bob wrote: I'm not willing to do that any longer. I'd rather play 60,000 games in 1 sec/game as opposed to 1000 games in 1 min/game. That way there are practically _no_ "steps backward"...
With all respect to your testing methods (which may be scientifically accurate) - in the real world, without a university-sponsored cluster, we want to make some progress without waiting through two weeks of self-play just to accept a simple change in the codebase. We must take some shortcuts, and the SF team's progress shows that these shortcuts do indeed work. While I have more relaxed rules than Marco, Critter's progress is pretty evident too. (I only wish I had the SF team's "secret" autotuner :) )

P.S.: sorry for my horrible English
First, give this some thought: how long does it take to play a game at 1 sec/game? That's 60 games per minute, 3,600 per hour, over 80,000 in 24 hours. It is not that hard to do rigorous testing. You might have to compromise on the time control, but apparently (based on Larry Kaufman's comments) Rybka has been extensively tuned at 40K games overnight for over a year...
It depends. For tiny changes in eval I might accept a result from 20,000 games at a game/1 sec TC. But in such fast games some search features would never kick in. For example, in Critter IID behaves differently at depth >= 12 plies, singular extensions are done only at depth >= 10 plies, and null-move pruning behaves differently at high depths. If I want to tune constants such as "IID_Margin" or "SingularMoveMargin", I definitely need to play at a longer TC, something reaching at least depth 14-15 plies.
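To make the depth-threshold point concrete, here is a minimal sketch of a depth-gated search feature; the gate values mirror the description above, but the function and constant names are hypothetical, not Critter's actual code:

Code: Select all

// Hypothetical depth gates mirroring the thresholds described above.
const int IIDDepthLimit      = 12;  // IID switches behavior from this depth on
const int SingularDepthLimit = 10;  // singular extensions only from this depth on

void search_node(int depth)
{
    if (depth >= IIDDepthLimit)
    {
        // deeper IID variant, tuned via a constant like "IID_Margin"
    }

    if (depth >= SingularDepthLimit)
    {
        // singular extension test, tuned via "SingularMoveMargin"
    }

    // At game/1 sec most searches never reach these depths, so changes to the
    // margins above would be invisible to an ultra-fast test.
}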
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Passed Pawns (endgame)

Post by bob »

rvida wrote:
bob wrote:
rvida wrote:
bob wrote: I'm not willing to do that any longer. I'd rather play 60,000 games in 1 sec/game as opposed to 1000 games in 1 min/game. That way there are practically _no_ "steps backward"...
With all respect to your testing methods (which may be scientifically accurate) - in the real world, without a university-sponsored cluster, we want to make some progress without waiting through two weeks of self-play just to accept a simple change in the codebase. We must take some shortcuts, and the SF team's progress shows that these shortcuts do indeed work. While I have more relaxed rules than Marco, Critter's progress is pretty evident too. (I only wish I had the SF team's "secret" autotuner :) )

P.S.: sorry for my horrible English
First, give this some thought: how long does it take to play a game at 1 sec/game? That's 60 games per minute, 3,600 per hour, over 80,000 in 24 hours. It is not that hard to do rigorous testing. You might have to compromise on the time control, but apparently (based on Larry Kaufman's comments) Rybka has been extensively tuned at 40K games overnight for over a year...
It depends. For tiny changes in eval I might accept a result from 20,000 games at a game/1 sec TC. But in such fast games some search features would never kick in. For example, in Critter IID behaves differently at depth >= 12 plies, singular extensions are done only at depth >= 10 plies, and null-move pruning behaves differently at high depths. If I want to tune constants such as "IID_Margin" or "SingularMoveMargin", I definitely need to play at a longer TC, something reaching at least depth 14-15 plies.
First, as I mentioned, if you are making _search_ changes, you probably do want to test at different time controls. Although in the testing I have been doing, more than 80% of the very fast time-control results mimic those of longer games (with respect to search changes). For example, in the past year I reworked null-move search, futility pruning, and reductions, then tested those changes at very fast time controls and again at up to 60+60 games, and the results were consistent. I can't see any reason why IID would depend on depth: it is a recursive algorithm, and what works at one time control should work at all time controls. There are a few things that might violate this, but based on a ton of testing, most search changes can be safely tested at short time controls.

Second, at least for me, _most_ changes are not search changes but evaluation changes, and those are almost perfectly measurable at fast time controls...
Ralph Stoesser
Posts: 408
Joined: Sat Mar 06, 2010 9:28 am

Re: Passed Pawns (endgame)

Post by Ralph Stoesser »

zamar wrote:
Ralph Stoesser wrote: To me it sounds plausible that fast games are good enough for testing tiny eval changes like flipping a few innocent bonuses. But I have no experience; it just seems plausible to me. What you have found is of course far more than this. Thank you for the detailed remarks.

I think Marco will not contradict this, because in the other thread his only comment was
"Thanks, of course you have much more experience than me and everybody else here!".

I'm not sure what it means, but probably it does not mean "I completely agree". ;)
I can't speak for Marco, but the fact is that in the last 1.5 years we have been able to increase Stockfish's strength by around 200 Elo points with our current 1000-game 1+0 system. When you have a testing methodology which is perhaps not in full agreement with statistical theory, but which in practice seems to work very well, you definitely don't want to change it lightly.
I really would like to see a replicable example where a tiny eval change is not detectable by playing (a series of) 20,000 games @ 1 sec, but is detectable by playing (a series of) 1,000 games @ 1 min. I'm far from being experienced at testing, but my very first trials with 1,000 games @ 1 min left me feeling quite misguided. :sad:

At least we are not talking about voodoo, so it should be possible to clarify things.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Passed Pawns (endgame)

Post by zamar »

Ralph Stoesser wrote: I really would like to see a replicable example where a tiny eval change is not detectable by playing (a series of) 20,000 games @ 1 sec, but is detectable by playing (a series of) 1,000 games @ 1 min. I'm far from being experienced at testing, but my very first trials with 1,000 games @ 1 min left me feeling quite misguided. :sad:
I'd be very interested in research in this area, but I'm not prepared to stall the development of Stockfish for weeks or months just to test this thing.

Bob has done a lot of research with Crafty in this area, but things which are true for Crafty's quite homogeneous search trees are not necessarily true for Stockfish's very imbalanced search trees.

Of course I'd expect that for simple things like PSQT, piece values, mobility, and pawn structure, the time control doesn't play a big role.

But when it comes to king safety, passed-pawn evaluation, static threat scoring, and second-order material evaluation, it's far from clear.

And when it comes to search, it's clear that longer time controls are needed. We have many examples of how increased pruning is good at short time controls, but counter-productive at longer time controls.
Joona Kiiski
Ralph Stoesser
Posts: 408
Joined: Sat Mar 06, 2010 9:28 am

Re: Passed Pawns (endgame)

Post by Ralph Stoesser »

zamar wrote:
Ralph Stoesser wrote: I really would like to see a replicable example where a tiny eval change is not detectable by playing (a series of) 20,000 games @ 1 sec, but is detectable by playing (a series of) 1,000 games @ 1 min. I'm far from being experienced at testing, but my very first trials with 1,000 games @ 1 min left me feeling quite misguided. :sad:
I'd be very interested in research in this area, but I'm not prepared to stall the development of Stockfish for weeks or months just to test this thing.

Bob has done a lot of research with Crafty in this area, but things which are true for Crafty's quite homogeneous search trees are not necessarily true for Stockfish's very imbalanced search trees.

Of course I'd expect that for simple things like PSQT, piece values, mobility, and pawn structure, the time control doesn't play a big role.

But when it comes to king safety, passed-pawn evaluation, static threat scoring, and second-order material evaluation, it's far from clear.

And when it comes to search, it's clear that longer time controls are needed. We have many examples of how increased pruning is good at short time controls, but counter-productive at longer time controls.
I remember that after the release of SF 1.7 you (or Marco) said something like: "Don't expect an Elo increase from SF 1.7". Now SF 1.7 is clearly stronger than SF 1.6.3. I guess you did not play 1,000 games @ 1 min/game against SF 1.6.3 to verify SF 1.7's strength?

I don't understand your distinctions w.r.t. eval. Anyway, it's difficult to argue about this issue in a theoretical manner. There are so many testers around; I wonder why we have no 1-sec rating list yet.
lech
Posts: 1169
Joined: Sun Feb 14, 2010 10:02 pm

Re: Passed Pawns (endgame)

Post by lech »

Marco and Ralph, thanks for explaining the assert function. Now I understand it.
I have never used debuggers. 8-)
I am happy that my joke (a catch) with “ELO >> :wink: “ was correctly read. :lol:
Of course, both “assert” calls should be removed (?).

Code: Select all

Bitboard b = ei.pi->passed_pawns() & pos.pieces(PAWN, Us);

while (b)
{
    Square s = pop_1st_bit(&b);

    // Sanity checks: the square must hold one of our pawns, and that pawn must be passed
    assert(pos.piece_on(s) == piece_of_color_and_type(Us, PAWN));
    assert(pos.pawn_is_passed(Us, s));
    ...
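
For what it's worth, in standard C and C++ an assert expands to nothing once NDEBUG is defined, so a release build already drops such checks without touching the source. A minimal stand-alone illustration of that behavior (not Stockfish code):

Code: Select all

// Build normally and the assert fires; build with -DNDEBUG and it is
// compiled out, so the program prints its message and exits normally.
#include <cassert>
#include <cstdio>

int main()
{
    int pawnsOnSquare = 2;           // deliberately inconsistent value
    assert(pawnsOnSquare <= 1);      // aborts in a debug build, vanishes with NDEBUG
    std::printf("assertions disabled, reached the end\n");
    return 0;
}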
BTW: Have there been any attempts to switch the “search” between “score” and “increase of score”?
I think it could work well in endgames. Only sacrifices (captures) seem to be a problem (a separate tree?).
Ralph Stoesser
Posts: 408
Joined: Sat Mar 06, 2010 9:28 am

Re: Passed Pawns (endgame)

Post by Ralph Stoesser »

lech wrote: BTW: Have there been any attempts to switch the “search” between “score” and “increase of score”?
I think it could work well in endgames. Only sacrifices (captures) seem to be a problem (a separate tree?).
Do you mean something like

Code: Select all

move 1
depth  1 score 3.00
depth  2 score 3.00
depth  3 score 3.00
...
depth 30 score 3.00

move 2
depth  1 score 0.10
depth  2 score 0.20
depth  3 score 0.30
...
depth 30 score 2.50
Then we should choose move 2 instead of move 1?
I've thought about it recently. It could be an interesting approach to try to solve the thread issue.
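A toy sketch of how such a selection rule might look, preferring the root move whose score rises the most across iterations rather than the one with the best final score (all names here are hypothetical, not taken from any engine):

Code: Select all

#include <cstddef>
#include <vector>

struct RootMove
{
    std::vector<double> scoreByDepth;  // score reported at each completed iteration
};

// Hypothetical rule: pick the move whose score improved the most between the
// first and the last iteration ("increase of score") instead of the best score.
std::size_t pick_by_score_increase(const std::vector<RootMove>& moves)
{
    std::size_t best = 0;
    double bestGain = -1e9;

    for (std::size_t i = 0; i < moves.size(); ++i)
    {
        const std::vector<double>& s = moves[i].scoreByDepth;
        if (s.size() < 2)
            continue;

        double gain = s.back() - s.front();
        if (gain > bestGain)
        {
            bestGain = gain;
            best = i;
        }
    }
    return best;  // with the example above this picks move 2 (gain 2.40 vs 0.00)
}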
lech
Posts: 1169
Joined: Sun Feb 14, 2010 10:02 pm

Re: Passed Pawns (endgame)

Post by lech »

Ralph Stoesser wrote:
lech wrote: BTW: Have there been any attempts to switch the “search” between “score” and “increase of score”?
I think it could work well in endgames. Only sacrifices (captures) seem to be a problem (a separate tree?).
Do you mean something like

Code: Select all

move 1
depth  1 score 3.00
depth  2 score 3.00
depth  3 score 3.00
...
depth 30 score 3.00

move 2
depth  1 score 0.10
depth  2 score 0.20
depth  3 score 0.30
...
depth 30 score 2.50
Then we should choose move 2 instead of move 1?
I've thought about it recently. It could be an interesting approach to try to solve the thread issue.
Thanks Ralph. So it seems such attempts either were not made or gave bad results. OK. :(
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Passed Pawns (endgame)

Post by bob »

zamar wrote:
Ralph Stoesser wrote: I really would like to see a replicable example where a tiny eval change is not detectable by playing (a series of) 20,000 games @ 1 sec, but is detectable by playing (a series of) 1,000 games @ 1 min. I'm far from being experienced at testing, but my very first trials with 1,000 games @ 1 min left me feeling quite misguided. :sad:
I'd be very interested in research in this area, but I'm not prepared to stall the development of Stockfish for weeks or months just to test this thing.

Bob has done a lot of research with Crafty in this area, but things which are true for Crafty's quite homogeneous search trees are not necessarily true for Stockfish's very imbalanced search trees.

Of course I'd expect that for simple things like PSQT, piece values, mobility, and pawn structure, the time control doesn't play a big role.

But when it comes to king safety, passed-pawn evaluation, static threat scoring, and second-order material evaluation, it's far from clear.

And when it comes to search, it's clear that longer time controls are needed. We have many examples of how increased pruning is good at short time controls, but counter-productive at longer time controls.
I don't follow those comments. Crafty's king safety is "second order"; there is interaction between pieces. The same applies to futility pruning, razoring, LMR, even something beyond futility pruning, null move, qsearch checks, and passed-pawn evaluation. The point is, we didn't just "jump into this fast testing." We ran tens of millions of games to determine whether it would work or not. If one is going to use a testing methodology, one needs to be sure that the methodology is valid.

One thing I can say with absolute certainty: given the choice of 30,000 fast games or 1,000 slow games, I'll take the fast games _every_ time. The +/-10 Elo error bar on 1,000 games is just too large. Given the choice of 30,000 fast or 30,000 slow games, I'd prefer the slow games, all else being equal. But things are not equal in terms of time, which is an important consideration here.

Far be it from me to try to convince you to change your testing approach, but at least don't try to justify 1,000 games as better solely because the games are at a slower time control.
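As a rough illustration of where error bars like that come from, here is a minimal sketch; the draw ratio and the 95% confidence factor are assumptions, and the exact figure depends on both:

Code: Select all

#include <cmath>
#include <cstdio>

// Approximate Elo error bar for an n-game match near a 50% score, given an
// assumed draw ratio. z = 1.96 corresponds to roughly a 95% confidence interval.
double elo_error_bar(int n, double drawRatio, double z = 1.96)
{
    double variance    = 0.25 * (1.0 - drawRatio);         // per-game score variance near 50%
    double scoreStdErr = std::sqrt(variance / n);           // standard error of the match score
    double eloPerScore = 400.0 / (std::log(10.0) * 0.25);   // slope of the Elo curve at 50%
    return z * eloPerScore * scoreStdErr;
}

int main()
{
    std::printf("1,000 games, 40%% draws:  +/- %.1f Elo\n", elo_error_bar(1000, 0.40));
    std::printf("30,000 games, 40%% draws: +/- %.1f Elo\n", elo_error_bar(30000, 0.40));
    return 0;
}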