another interesting cluster test result

Discussion of chess software programming and technical issues.

Moderator: Ras

Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: another interesting cluster test result

Post by Zach Wegner »

Gian-Carlo Pascutto wrote:
jwes wrote:So doing null-move at PV nodes saves an insignificant number of nodes while very rarely causing a bad move from a null-move false positive, i.e. in the last round of an important tournament.
This reasoning is completely ignorant of the statistics Bob already presented.

The results from Bob show that even if this change randomly threw away one game in five, the strength increase it brings in the other four would be enough to completely offset that loss.

What's really happening is probably not quite so extreme, but the reasoning you give is flawed nevertheless: the results already show that the program is no weaker with this change.
Well, I am thinking along the same lines as he is. Since null-moving in PV nodes doesn't matter for strength either way, I would leave it out for aesthetic reasons. I think that's all he means.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: another interesting cluster test result

Post by bob »

mjlef wrote:Who are the opponents? What are the time controls? Perhaps a link to the testing criteria/setup would be useful, especially if it does not change.
The current set of opponents is Stockfish, glaurung 2.x (last released version whatever that is), Toga (most recent version), fruit 2.something and glaurung 1.

Stockfish is significantly stronger, glaurung 2.x/toga are just about +10 Elo above Crafty, the other two are significantly behind. I do not change opponents very frequently because I want the testbed to remain reasonably consistent because the measurements I am trying to do are for very small Elo gains/losses.

Time control varies. I run some tests (a test is 4000 positions x 2 games to alternate colors x 5 opponents = 40,000 games) with fast time controls so that a complete test can be done in 1 hour. I run some at 1+1 which takes around 12 hours to complete. I run some that are much longer. Mostly I use the short games for sanity checks, then 1+1 to really measure the results. These were all roughly in the middle of that so that a single run takes on the average about 4 hours.
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: another interesting cluster test result

Post by mhull »

bob wrote:
bob wrote:A while back, there was a discussion about where to skip doing null-moves. I had been using (for several years) an exclusion where, if alpha == beta - 1, I do a null-move; otherwise I do not. Several had reported that adding this gave them a boost in Elo. While testing other things, I decided to test that as well. Turns out that with or without that exclusion, the Elo in cluster testing is _identical_. Not surprising, since almost all nodes are searched with alpha == beta-1 anyway, so the overhead this adds is almost nil. A few extra nodes here, a quick refutation there; it all seems to wash out. I will post the results once all of the tests have been repeated.
More data. The above tests have completed. Here are the results. Crafty-23.1R07 is the best 23.1 so far, with the standard "only try null-move if alpha == beta-1" restriction. Crafty-23.1R09 has that restriction removed, so that a null is tried everywhere in the tree, every last node, PV or not. (Still the same other restrictions, of course: none at depth=1, nor if in check, etc.)

Code:

   4 Crafty-23.1R09-2     2614    5    4 40000   51%  2605   23%
   5 Crafty-23.1R09-6     2613    4    4 40000   51%  2605   22%
   6 Crafty-23.1R07-4     2612    4    4 40000   51%  2605   22%
   7 Crafty-23.1R07-5     2612    4    4 40000   51%  2605   23%
   8 Crafty-23.1R07-6     2611    5    4 40000   51%  2605   22%
   9 Crafty-23.1R09-1     2611    4    5 40000   51%  2605   23%
  10 Crafty-23.1R07-1     2611    4    5 40000   51%  2605   23%
  11 Crafty-23.1R09-5     2611    4    5 40000   51%  2605   23%
  12 Crafty-23.1R09-4     2610    4    4 40000   51%  2605   23%
  13 Crafty-23.1R09-3     2610    3    4 40000   51%  2605   22%
  14 Crafty-23.1R07-3     2610    4    4 40000   51%  2605   23%
  15 Crafty-23.1R07-2     2610    4    4 40000   51%  2605   22%
  16 Crafty-23.1-4        2594    4    5 40000   48%  2605   22%
  17 Crafty-23.1-1        2593    4    4 40000   48%  2605   22%
  18 Crafty-23.1-2        2593    4    4 40000   48%  2605   22%
  19 Crafty-23.1-3        2592    4    4 40000   48%  2605   23%
  20 Crafty-23.0-4        2567    4    3 40000   45%  2605   21%
  21 Crafty-23.0-2        2567    4    4 40000   45%  2605   21%
  22 Crafty-23.0-1        2566    4    5 40000   45%  2605   21%
  23 Crafty-23.0-3        2565    4    4 40000   45%  2605   20%
Nothing significant going on there. They seem to be equally strong. I included 23.0 which is the released version, and 23.1 was our previous "best" before we added some optimizations and other changes over the past couple of weeks to produce 23.1R07. They also all lie within the usual error margins. So for this discussion, the answer is "makes no difference".
On a side note: R07 is 45 or so Elo better than 23.0. How many "tweaks" are involved in the increase, or is there one that stands out?
Matthew Hull
adieguez

Re: another interesting cluster test result

Post by adieguez »

jwes wrote:So doing null-move at PV nodes saves an insignificant number of nodes while very rarely causing a bad move from a null-move false positive, i.e. in the last round of an important tournament.
Looking at the percentage of draws, the version with null-move in PV nodes seems to have more draws. With more instability I would expect the opposite.

BTW, amazing test. Playing 40,000 games is like getting god's answer.
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: another interesting cluster test result

Post by mjlef »

bob wrote:
mjlef wrote:Who are the opponents? What are the time controls? Perhaps a link to the testing criteria/setup would be useful, especially if it does not change.
The current set of opponents is Stockfish, glaurung 2.x (last released version whatever that is), Toga (most recent version), fruit 2.something and glaurung 1.

Stockfish is significantly stronger, glaurung 2.x/toga are just about +10 Elo above Crafty, the other two are significantly behind. I do not change opponents very frequently because I want the testbed to remain reasonably consistent because the measurements I am trying to do are for very small Elo gains/losses.

Time control varies. I run some tests (a test is 4000 positions x 2 games to alternate colors x 5 opponents = 40,000 games) with fast time controls so that a complete test can be done in 1 hour. I run some at 1+1 which takes around 12 hours to complete. I run some that are much longer. Mostly I use the short games for sanity checks, then 1+1 to really measure the results. These were all roughly in the middle of that so that a single run takes on the average about 4 hours.
Thanks. How fast are the "fast time controls so that a complete test can be done in 1 hour"?

Also, one suggestion. You are doing real scientific tests here, and I think they should be documented. Might I suggest a 1-2 line description of each test and copying the results to the chessprogramming.wikispaces.com site? This would help organize and share your test results with others, plus you would not need to keep answering so many annoying questions from people like me! If you are too busy to edit the wiki yourself, I would be glad to do this for you if you send me the test descriptions and results.

Do the repeated runs use the same starting positions, or a different set?

Mark
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: another interesting cluster test result

Post by bob »

mjlef wrote:
bob wrote:
mjlef wrote:Who are the opponents? What are the time controls? Perhaps a link to the testing criteria/setup would be useful, especially if it does not change.
The current set of opponents is Stockfish, glaurung 2.x (last released version whatever that is), Toga (most recent version), fruit 2.something and glaurung 1.

Stockfish is significantly stronger, glaurung 2.x/toga are just about +10 Elo above Crafty, the other two are significantly behind. I do not change opponents very frequently because I want the testbed to remain reasonably consistent because the measurements I am trying to do are for very small Elo gains/losses.

Time control varies. I run some tests (a test is 4000 positions x 2 games to alternate colors x 5 opponents = 40,000 games) with fast time controls so that a complete test can be done in 1 hour. I run some at 1+1 which takes around 12 hours to complete. I run some that are much longer. Mostly I use the short games for sanity checks, then 1+1 to really measure the results. These were all roughly in the middle of that so that a single run takes on the average about 4 hours.
Thanks. How fast are the "fast time controls so that a complete test can be done in 1 hour"?
10 seconds on the clock, 0.1 second increment.

Also, one suggestion. You are doing real scientific tests here, and I think they should be documented. Might I suggest a 1-2 line description of each test and copying the results to the chessprogramming.wikispaces.com site? This would help organize and share your test results with others, plus you would not need to keep answering so many annoying questions from people like me! If you are too busy to edit the wiki yourself, I would be glad to do this for you if you send me the test descriptions and results.

Do the repeated runs use the same starting positions, or a different set?

Mark
Same set. But as discussed last year, if you vary the size of the tree by even one node, you won't get two identical games. Using time, the size of the tree varies by tens of thousands of nodes at a minimum, usually much more.

However, as a general rule, although the games change, there are enough of them that the results are _very_ stable, Elo-wise.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: another interesting cluster test result

Post by bob »

mhull wrote:
bob wrote:
bob wrote:A while back, there was a discussion about where to skip doing null-moves. I had been using (for several years) an exclusion where, if alpha == beta - 1, I do a null-move; otherwise I do not. Several had reported that adding this gave them a boost in Elo. While testing other things, I decided to test that as well. Turns out that with or without that exclusion, the Elo in cluster testing is _identical_. Not surprising, since almost all nodes are searched with alpha == beta-1 anyway, so the overhead this adds is almost nil. A few extra nodes here, a quick refutation there; it all seems to wash out. I will post the results once all of the tests have been repeated.
More data. The above tests have completed. Here are the results. Crafty-23.1R07 is the best 23.1 so far, with the standard "only try null-move if alpha == beta-1" restriction. Crafty-23.1R09 has that restriction removed, so that a null is tried everywhere in the tree, every last node, PV or not. (Still the same other restrictions, of course: none at depth=1, nor if in check, etc.)

Code:

   4 Crafty-23.1R09-2     2614    5    4 40000   51%  2605   23%
   5 Crafty-23.1R09-6     2613    4    4 40000   51%  2605   22%
   6 Crafty-23.1R07-4     2612    4    4 40000   51%  2605   22%
   7 Crafty-23.1R07-5     2612    4    4 40000   51%  2605   23%
   8 Crafty-23.1R07-6     2611    5    4 40000   51%  2605   22%
   9 Crafty-23.1R09-1     2611    4    5 40000   51%  2605   23%
  10 Crafty-23.1R07-1     2611    4    5 40000   51%  2605   23%
  11 Crafty-23.1R09-5     2611    4    5 40000   51%  2605   23%
  12 Crafty-23.1R09-4     2610    4    4 40000   51%  2605   23%
  13 Crafty-23.1R09-3     2610    3    4 40000   51%  2605   22%
  14 Crafty-23.1R07-3     2610    4    4 40000   51%  2605   23%
  15 Crafty-23.1R07-2     2610    4    4 40000   51%  2605   22%
  16 Crafty-23.1-4        2594    4    5 40000   48%  2605   22%
  17 Crafty-23.1-1        2593    4    4 40000   48%  2605   22%
  18 Crafty-23.1-2        2593    4    4 40000   48%  2605   22%
  19 Crafty-23.1-3        2592    4    4 40000   48%  2605   23%
  20 Crafty-23.0-4        2567    4    3 40000   45%  2605   21%
  21 Crafty-23.0-2        2567    4    4 40000   45%  2605   21%
  22 Crafty-23.0-1        2566    4    5 40000   45%  2605   21%
  23 Crafty-23.0-3        2565    4    4 40000   45%  2605   20%
Nothing significant going on there. They seem to be equally strong. I included 23.0 which is the released version, and 23.1 was our previous "best" before we added some optimizations and other changes over the past couple of weeks to produce 23.1R07. They also all lie within the usual error margins. So for this discussion, the answer is "makes no difference".
On a side note: R07 is 45 or so Elo better than 23.0. How many "tweaks" are involved in the increase, or is there one that stands out?
There are about 6 months' worth of tweaks. Pruning (actual pruning, as in futility, extended futility, etc.) has been completely rewritten. Code has been sped up in places. Some evaluation changes. The changes really are scattered all over. Nothing major, but together they add up.
Greg Strong
Posts: 388
Joined: Sun Dec 21, 2008 6:57 pm
Location: Washington, DC

Re: another interesting cluster test result

Post by Greg Strong »

bob wrote:There are about 6 months' worth of tweaks. Pruning (actual pruning, as in futility, extended futility, etc.) has been completely rewritten. Code has been sped up in places. Some evaluation changes. The changes really are scattered all over. Nothing major, but together they add up.
Awesome! Can't wait to see the new version, particularly the new futility code. Thanks for all the hard work!
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: another interesting cluster test result

Post by xsadar »

bob wrote:The current set of opponents is Stockfish, glaurung 2.x (last released version whatever that is), Toga (most recent version), fruit 2.something and glaurung 1.
I have always been under the impression that you were using 5 unrelated engines rather than a few versions of 2 unrelated engines. Wouldn't you expect there to be a correlation between the games against the three different versions of glaurung/stockfish? And wouldn't you expect that to affect your overall results? Of course the fact that you use 4000 starting positions helps a lot, but it still makes me wonder if your results are as accurate as you think they are.

To me it seems a little like doing cancer research with 500 participants where 300 are related to me and 200 are related to you, then trying to generalize the results to everybody. It doesn't make sense, and I can't imagine any scientist ever doing that. They want the participants to be as diverse as possible.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: another interesting cluster test result

Post by bob »

xsadar wrote:
bob wrote:The current set of opponents is Stockfish, glaurung 2.x (last released version whatever that is), Toga (most recent version), fruit 2.something and glaurung 1.
I have always been under the impression that you were using 5 unrelated engines rather than a few versions of 2 unrelated engines. Wouldn't you expect there to be a correlation between the games against the three different versions of glaurung/stockfish? And wouldn't you expect that to affect your overall results? Of course the fact that you use 4000 starting positions helps a lot, but it still makes me wonder if your results are as accurate as you think they are.

To me it seems a little like doing cancer research with 500 participants where 300 are related to me and 200 are related to you, then trying to generalize the results to everybody. It doesn't make sense, and I can't imagine any scientist ever doing that. They want the participants to be as diverse as possible.
Possibly. Fruit and Toga play significantly differently. Glaurung 1 and 2 are significantly different. And Stockfish plays nothing like the version of G2 I am using.

My primary concern is that the program(s) I use have to be reliable. A few can't deal with fast time controls. A few misbehave in other ways. I would be more concerned if I were using just one program, of course. And I have a few others I have thrown into the mix from time to time, but I do not want too many that are significantly weaker than Crafty, as that doesn't provide much useful information. The results to date have clearly shown improvement in every testing tournament I have seen. So is it optimal? Probably not. But working? Yep.

Finding reliable opponents that work correctly on Unix is a problem. Most programs are Windows-based, and I can't run Windows applications on our Linux cluster. I have used Arasan, GNU Chess, the infamous Ippolit, etc. Ippolit would be a good opponent, but it is beyond unreliable and I completely removed it.