Ralph Stoesser wrote:
I think this is a good example where a test with many more games at an unreasonably fast time control is more reliable than a test with few games at a reasonable time control.

mcostalba wrote:
Ralph, no offence, but for me the definition of reliable is not: reliable = what I would like to see in the results.
I consider Joona's comment much more to the point: probably that single piece of code has no effect at all, in either direction.
P.S.: Regarding testing at unreasonably fast time controls, you have not proved that they work. To prove that they work, you need to verify the changes at longer time controls and check that the results are the same. So IMHO tests at unreasonable time controls are not validated; they can be useful to quickly filter out some bad patches, but not to validate candidate good ones.

bob wrote:
This +has+ been verified to work. I have played millions of very fast games, dealing primarily with evaluation changes or changes that make the program faster, and then verified that with longer and longer time controls the results remain consistent.
Search changes are a bit more difficult, but at least 80% of those changes have been verified with both fast and slow games...
All you have to avoid is time controls where a program loses too many games due to flag falling, rather than by getting beat.

Ralph Stoesser wrote:
If this is true, and I'm a firm believer, it should be rather pointless to outvote a test with 20000 games using a test with 1000 games, especially in this case where opposite-side castling must be possible, or must have happened, for eval differences to show up at all. I still assume 1000 games are probably not enough to reveal an effect, regardless of the time control used.
Also, it would be easy for Marco to verify my result. For a 20000-game test @1 sec we don't exactly need a cluster.
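The statistical intuition behind "1000 games is not enough" is easy to make concrete. A minimal sketch (the helper name `elo_error_bar` is mine, and it treats each game as an independent win/loss trial near a 50% score; draws shrink the variance, so the real error bar is somewhat narrower):

```python
import math

def elo_error_bar(n_games, score=0.5, z=1.96):
    """Approximate 95% confidence half-width, in Elo, for a match of
    n_games with the given mean score (fraction of points won).
    Models each game as an independent Bernoulli trial, which gives a
    conservative (wide) estimate since draws reduce the variance."""
    se_score = math.sqrt(score * (1.0 - score) / n_games)
    # Slope of the logistic Elo curve at this score:
    # elo(s) = -400*log10(1/s - 1), so d(elo)/ds = 400 / (ln(10)*s*(1-s))
    elo_per_point = 400.0 / (math.log(10) * score * (1.0 - score))
    return z * se_score * elo_per_point

print(round(elo_error_bar(1000)))   # → 22
print(round(elo_error_bar(20000)))  # → 5
```

So a 1000-game match can only resolve differences on the order of ±20 Elo, while 20000 games get the bar down to about ±5 Elo; a few-Elo eval tweak is invisible in the smaller sample no matter what time control was used.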
The only exceptions I have found to this process revolve around two themes:
(1) time allocation. Changes where you use more time, or alter the way time is allocated to each move, extending the time on fail-lows, or other considerations, etc. You have to check representative time controls to be sure that things work the same when you have more time vs less time.
(2) search changes where you might see an exponential problem pop up. For example, in 1994 we had some odd failures in Cray Blitz due to singular extensions. Our testing was on slow hardware, but we played on hardware that would peak at about 7M nps. The significant extra depth led to many more (than expected) singular extensions. So things that might well cause tree explosion at deeper depths (extensions, reductions, etc.) need to be verified at longer time controls. But in our testing, which is now beyond the 100M game level, most of these changes remain consistent.
Note that by consistent, I mean that A and A' (original and modified) show about the same "gap" (in terms of Elo) across all time controls. I have lots of examples where program A does worse against B (two different programs) at different time controls. But with the same program, just two versions, this has not been a problem. A and A' might both do worse against B at fast time controls than at slow ones, but if A' is better than A, it will be consistently better, so that measuring the Elo gap between them produces a near-constant number.
1000 games have such a high error bar that, unless the change is dramatic in nature, such a match will produce noise but no usable results. If you think a change is in the 2-3-4 Elo range, you need to produce between 40,000 and 100,000 games.
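That 40,000-100,000 figure can be sanity-checked by inverting the error-bar formula. A back-of-envelope sketch (the helper name `games_needed` is mine; it again assumes scores near 50% and no draws, so the true counts would be somewhat lower):

```python
import math

def games_needed(elo_diff, z=1.96):
    """Games needed so that the z-sigma error bar in Elo just equals
    the Elo difference being measured, for scores near 50%."""
    # Slope of the logistic Elo curve at a 50% score:
    # d(elo)/d(score) = 400 / (ln(10) * 0.5 * 0.5)
    elo_per_point = 400.0 / (math.log(10) * 0.25)
    se_target = elo_diff / (z * elo_per_point)          # required std error in score
    return math.ceil(0.25 / se_target ** 2)             # var per game / se^2

for d in (2, 3, 4):
    print(d, games_needed(d))
```

This gives roughly 116,000 games to resolve 2 Elo, 52,000 for 3 Elo, and 29,000 for 4 Elo, in line with the 40,000-100,000 range once the variance-reducing effect of draws is taken into account.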
By the way, the 1,000 slow games vs 20,000 fast games myth is worthy of "The Myth Busters" TV program. It sounds plausible, but is far from it in reality. Yet we will continue to see this over and over.