First, give this some thought. How long does it take to play a game at 1 sec? 60 per minute? 3,600 per hour? Roughly 80,000 in 24 hours? It is not that hard to do rigorous testing. You might have to compromise on the time control, but apparently (based on Larry Kaufman's comments) Rybka has been extensively tuned at 40K games overnight for over a year...

rvida wrote: With all respect to your testing methods (which may be scientifically accurate) - in the real world, without some university-sponsored cluster, we want to make some progress without waiting two weeks' worth of self-play to accept a simple change in the codebase. We must take some shortcuts, and the SF team's progress showed that these shortcuts do indeed work. While I have more relaxed rules than Marco, Critter's progress is pretty evident too. (I only wish I had the SF team's "secret" autotuner.)

bob wrote: I'm not willing to do that any longer. I'd rather play 60,000 games at 1 sec/game than 1,000 games at 1 min/game. That way there are practically _no_ "steps backward"...
P.S.: sorry for my horrible english
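Bob's arithmetic is easy to check: at one second per game the theoretical ceiling is 86,400 games per day, and his "80,000" simply allows for overhead between games. A tiny sketch (the function name is mine, not from any engine):

```cpp
// Back-of-the-envelope throughput of a fast self-play run: how many
// games finish in a day if one game takes `secondsPerGame` of wall time.
long gamesPerDay(long secondsPerGame) {
    const long secondsPerDay = 24L * 3600L;  // 86,400 seconds
    return secondsPerDay / secondsPerGame;
}
```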
Passed Pawns (endgame)
Moderator: Ras
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Passed Pawns (endgame)
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: Passed Pawns (endgame)
It depends. With tiny changes in eval, maybe I will take a result from 20,000 games at a game/1 sec TC. But in such fast games some search features would never kick in. For example, in Critter IID behaves differently at depth >= 12 plies. Singular extensions are done only if depth >= 10 plies. Null-move pruning behaves differently at high depths. If I want to tune constants such as "IID_Margin" or "SingularMoveMargin", I definitely need to play at a longer TC - something reaching at least depth 14-15 plies.

bob wrote: First, give this some thought. How long does it take to play a game at 1 sec? 60 per minute? 3,600 per hour? Roughly 80,000 in 24 hours? It is not that hard to do rigorous testing. You might have to compromise on the time control, but apparently (based on Larry Kaufman's comments) Rybka has been extensively tuned at 40K games overnight for over a year...

rvida wrote: With all respect to your testing methods (which may be scientifically accurate) - in the real world, without some university-sponsored cluster, we want to make some progress without waiting two weeks' worth of self-play to accept a simple change in the codebase. We must take some shortcuts, and the SF team's progress showed that these shortcuts do indeed work. While I have more relaxed rules than Marco, Critter's progress is pretty evident too. (I only wish I had the SF team's "secret" autotuner.)

bob wrote: I'm not willing to do that any longer. I'd rather play 60,000 games at 1 sec/game than 1,000 games at 1 min/game. That way there are practically _no_ "steps backward"...
P.S.: sorry for my horrible english
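The depth thresholds rvida describes amount to simple gates in the search that fast games never reach. A minimal sketch, with the thresholds taken from the post but all names invented for illustration (no engine's actual code):

```cpp
// Hypothetical depth gates modelled on the thresholds rvida quotes
// for Critter. Constant and function names here are illustrative only.
const int IID_DEPTH_MIN = 12;       // IID switches behaviour at >= 12 plies
const int SINGULAR_DEPTH_MIN = 10;  // singular extensions only at >= 10 plies

bool useDeepIID(int depth)           { return depth >= IID_DEPTH_MIN; }
bool trySingularExtension(int depth) { return depth >= SINGULAR_DEPTH_MIN; }

// At game/1 sec an engine may never reach depth 12, so these branches
// are simply never exercised -- which is rvida's point about tuning
// "IID_Margin" or "SingularMoveMargin" at fast time controls.
```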
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Passed Pawns (endgame)
First, as I mentioned, if you are doing _search_ changes, you probably do want to test at different time controls. Although in the testing I have been doing, >80% of the very fast time control games mimic longer games (with respect to testing search changes). For example, in the past year I reworked null-move search, futility pruning, and reductions, then tested those changes at very fast time controls and then at up to 60+60 games, and the results were consistent. I can't see any reason why IID would depend on depth. It's a recursive algorithm, and what works at one time control should work at all time controls. There are a few things that might violate this, but based on a ton of testing, most search things can be safely tested at short time controls.

rvida wrote: It depends. With tiny changes in eval, maybe I will take a result from 20,000 games at a game/1 sec TC. But in such fast games some search features would never kick in. For example, in Critter IID behaves differently at depth >= 12 plies. Singular extensions are done only if depth >= 10 plies. Null-move pruning behaves differently at high depths. If I want to tune constants such as "IID_Margin" or "SingularMoveMargin", I definitely need to play at a longer TC - something reaching at least depth 14-15 plies.

bob wrote: First, give this some thought. How long does it take to play a game at 1 sec? 60 per minute? 3,600 per hour? Roughly 80,000 in 24 hours? It is not that hard to do rigorous testing. You might have to compromise on the time control, but apparently (based on Larry Kaufman's comments) Rybka has been extensively tuned at 40K games overnight for over a year...

rvida wrote: With all respect to your testing methods (which may be scientifically accurate) - in the real world, without some university-sponsored cluster, we want to make some progress without waiting two weeks' worth of self-play to accept a simple change in the codebase. We must take some shortcuts, and the SF team's progress showed that these shortcuts do indeed work. While I have more relaxed rules than Marco, Critter's progress is pretty evident too. (I only wish I had the SF team's "secret" autotuner.)

P.S.: sorry for my horrible english

bob wrote: I'm not willing to do that any longer. I'd rather play 60,000 games at 1 sec/game than 1,000 games at 1 min/game. That way there are practically _no_ "steps backward"...
Second, at least for me, _most_ changes are not search changes, but are evaluation changes. These are almost perfectly measurable with fast time controls...
-
- Posts: 408
- Joined: Sat Mar 06, 2010 9:28 am
Re: Passed Pawns (endgame)
I really would like to see a replicable example where a tiny eval change is not detectable by playing (a series of) 20,000 games @ 1 sec, but is by playing (a series of) 1,000 games @ 1 min. I'm far from being experienced at testing, but from my very first trials with 1,000 games @ 1 min I felt quite misguided.

zamar wrote: I can't speak for Marco, but the fact is that in the last 1.5 years we have been able to increase Stockfish's strength by around 200 Elo points with our current 1,000-games 1+0 system. Now, when you have a testing methodology which perhaps is not in full agreement with statistical theory, but which in practice seems to work very well, you definitely don't want to change it easily.

Ralph Stoesser wrote: To me it sounds plausible that fast games are good enough for testing tiny eval changes like flipping a few innocent bonuses. But I have no experience; it's only that it's plausible to me. What you have found is of course far more than this. Thank you for the detailed remarks.
I think Marco will not contradict, because in the other thread his comment was only
"Thanks, of course you have much more experience than me and everybody else here!".
I'm not sure what it means, but it probably does not mean "I completely agree".

At least we are not talking about voodoo, so it should be possible to clarify things.
-
- Posts: 613
- Joined: Sun Jan 18, 2009 7:03 am
Re: Passed Pawns (endgame)
I'd be very interested in research in this area, but I'm not prepared to stall the development of Stockfish for weeks or months just to test this thing.

Ralph Stoesser wrote: I really would like to see a replicable example where a tiny eval change is not detectable by playing (a series of) 20,000 games @ 1 sec, but is by playing (a series of) 1,000 games @ 1 min. I'm far from being experienced at testing, but from my very first trials with 1,000 games @ 1 min I felt quite misguided.
Bob has done a lot of research with Crafty in this area, but things which are true for Crafty's quite homogeneous search trees are not necessarily true for Stockfish's very imbalanced search trees.
Of course I'd expect that for simple things like PSQT, piece values, mobility, and pawn structure, the time control doesn't play a big role.
But when it comes to king safety, passed-pawn evaluation, static threat scoring, and second-order material evaluation, it's far from clear.
And when it comes to search, it's clear that longer time controls are needed. We have many examples of increased pruning being good at short time controls but counter-productive at longer time controls.
Joona Kiiski
-
- Posts: 408
- Joined: Sat Mar 06, 2010 9:28 am
Re: Passed Pawns (endgame)
I remember after the release of SF 1.7 you (or Marco) said something like: "Don't expect an Elo increase from SF 1.7". Now SF 1.7 is clearly stronger than SF 1.6.3. I guess you have not played 1,000 games @ 1 min/game against SF 1.6.3 to verify SF 1.7's strength?

zamar wrote: I'd be very interested in research in this area, but I'm not prepared to stall the development of Stockfish for weeks or months just to test this thing.

Ralph Stoesser wrote: I really would like to see a replicable example where a tiny eval change is not detectable by playing (a series of) 20,000 games @ 1 sec, but is by playing (a series of) 1,000 games @ 1 min. I'm far from being experienced at testing, but from my very first trials with 1,000 games @ 1 min I felt quite misguided.
Bob has done a lot of research with Crafty in this area, but things which are true for Crafty's quite homogeneous search trees are not necessarily true for Stockfish's very imbalanced search trees.
Of course I'd expect that for simple things like PSQT, piece values, mobility, and pawn structure, the time control doesn't play a big role.
But when it comes to king safety, passed-pawn evaluation, static threat scoring, and second-order material evaluation, it's far from clear.
And when it comes to search, it's clear that longer time controls are needed. We have many examples of increased pruning being good at short time controls but counter-productive at longer time controls.
I don't understand your distinctions w.r.t. eval. Anyway, it's difficult to argue about this issue in a theoretical manner. There are so many testers around; I wonder why we have no 1-sec rating list yet.
-
- Posts: 1169
- Joined: Sun Feb 14, 2010 10:02 pm
Re: Passed Pawns (endgame)
Marco and Ralph, thanks for the explanations of the assert function. Now I understand it.
I never used debuggers.
I am happy that my joke (a catch) with "ELO >>" was read correctly.
Of course, both "assert" functions should be removed (?).
BTW: Were there any attempts to switch the search between "score" and "increase of score"?
I think it could work well in endings. Only sacrifices (captures) seem to be a problem (a separate tree?).
Code: Select all
Bitboard b = ei.pi->passed_pawns() & pos.pieces(PAWN, Us);

while (b)
{
    Square s = pop_1st_bit(&b);

    assert(pos.piece_on(s) == piece_of_color_and_type(Us, PAWN));
    assert(pos.pawn_is_passed(Us, s));
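A side note on the "should be removed (?)" question: in C and C++ such checks are usually left in the source, because assert() compiles away in release builds. A minimal sketch of that behaviour (toy code, not Stockfish):

```cpp
// assert() expands to nothing when NDEBUG is defined, which is how
// release builds of engines are normally compiled -- so leaving the
// two asserts in costs nothing in the shipped binary.
// #define NDEBUG   // uncomment before the include to strip the checks
#include <cassert>

int decrementPositive(int x) {
    assert(x > 0);   // debug-only sanity check; vanishes under NDEBUG
    return x - 1;
}
```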
-
- Posts: 408
- Joined: Sat Mar 06, 2010 9:28 am
Re: Passed Pawns (endgame)
lech wrote: BTW: Were there any attempts to switch the search between "score" and "increase of score"? I think it could work well in endings. Only sacrifices (captures) seem to be a problem (a separate tree?).

Do you mean something like
Code: Select all
move 1
depth 1 score 3.00
depth 2 score 3.00
depth 3 score 3.00
...
depth 30 score 3.00
move 2
depth 1 score 0.10
depth 2 score 0.20
depth 3 score 0.30
...
depth 30 score 2.50
Then we should choose move 2 instead of move 1?
I've thought about it recently. It could be an interesting approach to try to solve the thread issue.
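The table above can be read as: prefer the move whose score is still rising across iterations over the one whose higher score has gone flat. A toy sketch of that selection rule (my own formulation of lech's idea; all names are invented, and no engine does exactly this):

```cpp
#include <vector>

// Score trend of one root move across iterative-deepening iterations.
struct Trend {
    double finalScore;  // score at the deepest iteration
    double growth;      // final score minus first-iteration score
};

Trend trendOf(const std::vector<double>& scores) {
    return { scores.back(), scores.back() - scores.front() };
}

// Prefer `b` over `a` when b's score is climbing faster and its
// momentum projects past a's flat score.
bool preferRising(const Trend& a, const Trend& b) {
    return b.growth > a.growth && b.finalScore + b.growth > a.finalScore;
}
```

With Ralph's numbers, move 2 (0.10 rising to 2.50) would be preferred over move 1 (flat 3.00) under this rule.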
-
- Posts: 1169
- Joined: Sun Feb 14, 2010 10:02 pm
Re: Passed Pawns (endgame)
Thanks Ralph, it means that such tries were either not done, or gave a bad result. OK.

Ralph Stoesser wrote:
lech wrote: BTW: Were there any attempts to switch the search between "score" and "increase of score"? I think it could work well in endings. Only sacrifices (captures) seem to be a problem (a separate tree?).
Do you mean something like

Code: Select all
move 1
depth 1 score 3.00
depth 2 score 3.00
depth 3 score 3.00
...
depth 30 score 3.00
move 2
depth 1 score 0.10
depth 2 score 0.20
depth 3 score 0.30
...
depth 30 score 2.50

Then we should choose move 2 instead of move 1?
I've thought about it recently. It could be an interesting approach to try to solve the thread issue.

-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Passed Pawns (endgame)
I don't follow those comments. Crafty's king safety is "2nd order". There is interaction between pieces. Futility pruning. Razoring. LMR. Even something beyond "futility pruning". Null-move. qsearch checks. Passed-pawn evaluation. The point is, we didn't just "jump into this fast testing". We ran tens of millions of games to determine whether this would work or not. If one is going to use a testing methodology, one needs to be sure that the methodology is valid.

zamar wrote: I'd be very interested in research in this area, but I'm not prepared to stall the development of Stockfish for weeks or months just to test this thing.

Ralph Stoesser wrote: I really would like to see a replicable example where a tiny eval change is not detectable by playing (a series of) 20,000 games @ 1 sec, but is by playing (a series of) 1,000 games @ 1 min. I'm far from being experienced at testing, but from my very first trials with 1,000 games @ 1 min I felt quite misguided.
Bob has done a lot of research with Crafty in this area, but things which are true for Crafty's quite homogeneous search trees are not necessarily true for Stockfish's very imbalanced search trees.
Of course I'd expect that for simple things like PSQT, piece values, mobility, and pawn structure, the time control doesn't play a big role.
But when it comes to king safety, passed-pawn evaluation, static threat scoring, and second-order material evaluation, it's far from clear.
And when it comes to search, it's clear that longer time controls are needed. We have many examples of increased pruning being good at short time controls but counter-productive at longer time controls.
One thing I can say with absolute certainty, given the choice of 30,000 fast games or 1,000 slow games, I'll take the fast games _every_ time. The +/-10 error bar on 1,000 games is just too large. Given the choice of 30,000 fast or 30,000 slow games, I'd prefer the slow games, all else being equal. But things are not equal in terms of time, which is an important consideration here.
Far be it from me to try to convince you to change your testing approach, but at least don't try to justify 1,000 games as better solely because the games are at a slower time control.
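Bob's "+/-10 error bar on 1,000 games" matches a rough one-sigma binomial estimate mapped through the standard logistic Elo curve. A hedged sketch (the formula and function name are mine; draws, which tighten the bar in practice, are ignored):

```cpp
#include <cmath>

// Rough 1-sigma Elo uncertainty after n games with mean score p,
// treating games as independent win/loss trials. This is a sketch of
// the magnitudes involved, not proper LOS mathematics.
double eloErrorBar(int n, double p) {
    double sigma = std::sqrt(p * (1.0 - p) / n);  // s.d. of the mean score
    // Elo(p) = -400 * log10(1/p - 1); its slope at p converts a
    // score-fraction error into an Elo error.
    double slope = 400.0 / (std::log(10.0) * p * (1.0 - p));
    return sigma * slope;
}
```

With p = 0.5 this gives about +/-11 Elo at 1,000 games and about +/-2 Elo at 30,000, which is the whole argument for the larger fast-game sample.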