another interesting cluster test result

Gian-Carlo Pascutto wrote:
jwes wrote: So doing null-move at PV nodes saves an insignificant number of nodes while very rarely causing a bad move from a null-move false positive, i.e. in the last round of an important tournament.
This reasoning is completely ignorant of the statistics Bob already presented. The results from Bob show that even if this change randomly threw one game in 5, it causes a strength increase in the other 4 games large enough to completely offset and overcome that loss. What's really happening is probably not quite so extreme, but the reasoning you give is flawed nevertheless: the results already show that the program is no weaker with this change.

Well, I am thinking along the same lines as he is. Since whether you null-move or not in PV nodes doesn't matter for strength, I would not do it, for aesthetic reasons. I think that's all he means.
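A quick back-of-the-envelope sketch of that break-even argument, with purely illustrative numbers (not Bob's actual data): using the standard logistic score-to-Elo conversion, a change that randomly throws one game in five breaks even if it lifts the score in the remaining four-fifths of the games to 62.5%, and the 51% vs. 48% scores reported later in the thread amount to roughly a 21-Elo swing.

Code:

#include <math.h>
#include <stdio.h>

/* Standard logistic conversion from an expected score to an Elo
   difference.  Illustrative numbers only, not Bob's actual data. */
static double elo_diff(double score) {
    return -400.0 * log10(1.0 / score - 1.0);
}

int main(void) {
    /* If a change randomly "throws" a fraction f of games (scored 0)
       but lifts the score in the remaining games to s, the net
       expected score is (1 - f) * s. */
    double f = 0.20;               /* one game in five thrown        */
    double s = 0.625;              /* score needed in the other 4/5  */
    double net = (1.0 - f) * s;    /* = 0.500: exactly break-even    */
    printf("net score %.3f -> %+.1f Elo\n", net, elo_diff(net));

    /* The 51%% vs 48%% scores in the table later in the thread
       correspond to a swing of roughly 21 Elo. */
    printf("51%% vs 48%% = %+.1f Elo\n", elo_diff(0.51) - elo_diff(0.48));
    return 0;
}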
Re: another interesting cluster test result
mjlef wrote: Who are the opponents? What are the time controls? Perhaps a link to the testing criteria/setup would be useful, especially if it does not change.

The current set of opponents is Stockfish, Glaurung 2.x (the last released version, whatever that is), Toga (most recent version), Fruit 2.something, and Glaurung 1.

Stockfish is significantly stronger, Glaurung 2.x and Toga are just about +10 Elo above Crafty, and the other two are significantly behind. I do not change opponents very frequently because I want the testbed to remain reasonably consistent; the measurements I am trying to make involve very small Elo gains/losses.

Time control varies. I run some tests (a test is 4000 positions x 2 games to alternate colors x 5 opponents = 40,000 games) with fast time controls so that a complete test can be done in 1 hour. I run some at 1+1, which takes around 12 hours to complete. I run some that are much longer. Mostly I use the short games for sanity checks, then 1+1 to really measure the results. These were all roughly in the middle of that, so a single run takes about 4 hours on average.
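The arithmetic behind those test sizes, spelled out (the games count is from the post; the throughput figure is just derived from the 1-hour case):

Code:

#include <stdio.h>

int main(void) {
    /* 4000 positions x 2 games (alternating colors) x 5 opponents */
    int games = 4000 * 2 * 5;
    printf("games per test: %d\n", games);          /* 40000 */
    /* Finishing 40000 games in one hour means about 11 games
       completing every second, i.e. hundreds of games in flight
       at once on the cluster. */
    printf("implied throughput: %.1f games/sec\n", games / 3600.0);
    return 0;
}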
Re: another interesting cluster test result
bob wrote:
bob wrote: A while back, there was a discussion about where to skip doing null-moves. I had been using (for several years) an exclusion: if alpha == beta - 1 I do a null-move, otherwise I do not. Several had reported that adding this gave them a boost in Elo. While testing other things, I decided to test that as well. It turns out that with or without that exclusion, the Elo in cluster testing is _identical_. Not surprising, since almost all nodes are searched with alpha == beta-1 anyway, so the overhead this adds is almost nil. A few extra nodes here, a quick refutation there; it all seems to wash out. I will post the results once all of the tests have been repeated.

More data. The above tests have completed. Here are the results. Crafty-23.1R07 is the best 23.1 so far, with the standard "only try null-move if alpha == beta-1" restriction. Crafty-23.1R09 has that restriction removed, so a null move is tried everywhere in the tree, at every last node, PV or not (still with the same other restrictions, of course: none at depth=1, none when in check, etc.).

Code:

Rank Name              Elo    +    - games score  oppo. draws
   4 Crafty-23.1R09-2 2614    5    4 40000   51%   2605   23%
   5 Crafty-23.1R09-6 2613    4    4 40000   51%   2605   22%
   6 Crafty-23.1R07-4 2612    4    4 40000   51%   2605   22%
   7 Crafty-23.1R07-5 2612    4    4 40000   51%   2605   23%
   8 Crafty-23.1R07-6 2611    5    4 40000   51%   2605   22%
   9 Crafty-23.1R09-1 2611    4    5 40000   51%   2605   23%
  10 Crafty-23.1R07-1 2611    4    5 40000   51%   2605   23%
  11 Crafty-23.1R09-5 2611    4    5 40000   51%   2605   23%
  12 Crafty-23.1R09-4 2610    4    4 40000   51%   2605   23%
  13 Crafty-23.1R09-3 2610    3    4 40000   51%   2605   22%
  14 Crafty-23.1R07-3 2610    4    4 40000   51%   2605   23%
  15 Crafty-23.1R07-2 2610    4    4 40000   51%   2605   22%
  16 Crafty-23.1-4    2594    4    5 40000   48%   2605   22%
  17 Crafty-23.1-1    2593    4    4 40000   48%   2605   22%
  18 Crafty-23.1-2    2593    4    4 40000   48%   2605   22%
  19 Crafty-23.1-3    2592    4    4 40000   48%   2605   23%
  20 Crafty-23.0-4    2567    4    3 40000   45%   2605   21%
  21 Crafty-23.0-2    2567    4    4 40000   45%   2605   21%
  22 Crafty-23.0-1    2566    4    5 40000   45%   2605   21%
  23 Crafty-23.0-3    2565    4    4 40000   45%   2605   20%

Nothing significant going on there. They seem to be equally strong. I included 23.0, which is the released version; 23.1 was our previous "best" before we added some optimizations and other changes over the past couple of weeks to produce 23.1R07. They all lie within the usual error margins. So for this discussion, the answer is "makes no difference".

On a side note: R07 is 45 or so Elo better than 23.0. How many "tweaks" are involved in the increase, or is there one that stands out?
Matthew Hull
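For readers following along, here is a minimal generic sketch (not Crafty's actual source) of the guard being compared: R07 keeps the alpha == beta - 1 restriction, so a null move is skipped at PV nodes; R09 drops it so a null move is tried everywhere, with the usual other exclusions kept in both.

Code:

/* Generic sketch of the null-move guard under discussion; this is
   not Crafty's actual code.  With restrict_to_null_window set (the
   R07 behavior), a null move is tried only at null-window nodes,
   i.e. where alpha == beta - 1; clearing it gives the R09 behavior
   of trying the null move at PV nodes as well. */
int try_null_move(int alpha, int beta, int depth, int in_check,
                  int restrict_to_null_window) {
    if (in_check || depth <= 1)
        return 0;                 /* the usual exclusions, kept in both */
    if (restrict_to_null_window && alpha != beta - 1)
        return 0;                 /* R07: no null move at PV nodes */
    return 1;                     /* R09: try it everywhere */
}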
Re: another interesting cluster test result
Looking at the % of draws, the version with null-move in PV nodes seems to get more draws. With more instability I would expect the contrary.
BTW amazing test. Playing 40,000 games is like getting god's answer.
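Roughly why 40,000 games is "god's answer": the standard error of the mean score shrinks with the square root of the game count. A sketch with an assumed per-game standard deviation of 0.4 (a common rule of thumb for chess scores with draws; not Bob's exact methodology):

Code:

#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 40000.0;
    double sd_game = 0.4;                 /* assumed per-game std dev */
    double se_score = sd_game / sqrt(n);  /* std error of mean score  */
    /* Near a 50% score, one percentage point is about 7 Elo. */
    double se_elo = se_score * 100.0 * 7.0;
    printf("std error: %.4f of a point, ~%.1f Elo\n", se_score, se_elo);
    /* ~0.002 and ~1.4 Elo, so a 2-sigma band is under 3 Elo --
       consistent with the +/- 4-5 the rating tool reports above. */
    return 0;
}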
Re: another interesting cluster test result
bob wrote: Time control varies. I run some tests (a test is 4000 positions x 2 games to alternate colors x 5 opponents = 40,000 games) with fast time controls so that a complete test can be done in 1 hour. I run some at 1+1, which takes around 12 hours to complete. I run some that are much longer. Mostly I use the short games for sanity checks, then 1+1 to really measure the results.

Thanks. How fast are the "fast time controls so that a complete test can be done in 1 hour"?
Also, one suggestion. You are doing real scientific tests here, and I think they should be documented. Might I suggest a 1-2 line description of each test, with the results copied to the chessprogramming.wikispaces.com site? This would help organize and share your test results with others, plus you would not need to keep answering so many annoying questions from people like me! If you are too busy to edit the wiki yourself, I would be glad to do it for you if you send me the test descriptions and results.
Do the repeated runs use the same starting positions, or a different set?
Mark
Re: another interesting cluster test result
mjlef wrote: Thanks. How fast are the "fast time controls so that a complete test can be done in 1 hour"?

10 seconds on the clock, 0.1 second increment.

mjlef wrote: Do the repeated runs use the same starting positions, or a different set?

Same set. But as the discussion last year showed, if you vary the size of the tree by only one node, you won't get two identical games. Using time, the size of the tree varies by tens of thousands of nodes at a minimum, usually much more. However, as a general rule, although the games change, there are enough of them that the results are _very_ stable, Elo-wise.
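A minimal illustration of why timed searches never repeat exactly (generic code, not Crafty's): the search is cut off on wall-clock time, polled every few thousand nodes, so the node count at which it stops varies from run to run, and with it the hash contents, move ordering, and eventually the game.

Code:

#include <stdio.h>
#include <time.h>

static long nodes = 0;
static clock_t deadline;

static int time_is_up(void) {
    /* Poll the clock only every 4096 nodes to keep overhead low. */
    return (nodes & 4095) == 0 && clock() > deadline;
}

int main(void) {
    deadline = clock() + CLOCKS_PER_SEC / 10;  /* a 0.1-second "search" */
    while (!time_is_up())
        nodes++;                               /* stand-in for search work */
    /* The count printed here differs between otherwise identical runs. */
    printf("searched %ld nodes\n", nodes);
    return 0;
}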
Re: another interesting cluster test result
mhull wrote: On a side note: R07 is 45 or so Elo better than 23.0. How many "tweaks" are involved in the increase, or is there one that stands out?

There are about 6 months worth of tweaks. Pruning (actual pruning, as in futility, extended futility, etc.) has been completely rewritten. Code has been sped up in places. There are some evaluation changes. The changes really are scattered all over. Nothing major, but together they add up.
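Since the rewritten pruning comes up here, this is the general shape of futility pruning for readers unfamiliar with the term (a generic sketch with made-up margins, not Crafty's rewritten code): near the leaves, if the static evaluation plus an optimistic margin still cannot reach alpha, quiet moves are skipped rather than searched.

Code:

/* Generic sketch of futility pruning, not Crafty's actual code.
   The margins are illustrative values only, indexed by remaining
   depth. */
static const int futility_margin[3] = { 0, 125, 300 };

int prune_futile(int depth, int static_eval, int alpha,
                 int in_check, int is_capture, int gives_check) {
    /* Only quiet, non-checking moves near the leaves are candidates. */
    if (depth >= 3 || in_check || is_capture || gives_check)
        return 0;
    /* If even an optimistic bound cannot reach alpha, skip the move. */
    return static_eval + futility_margin[depth] <= alpha;
}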
Re: another interesting cluster test result
bob wrote: There are about 6 months worth of tweaks. Pruning (actual pruning, as in futility, extended futility, etc.) has been completely rewritten. Code has been sped up in places. There are some evaluation changes. The changes really are scattered all over. Nothing major, but together they add up.

Awesome! Can't wait to see the new version, particularly the new futility code. Thanks for all the hard work!
Re: another interesting cluster test result
bob wrote: The current set of opponents is Stockfish, Glaurung 2.x (the last released version, whatever that is), Toga (most recent version), Fruit 2.something, and Glaurung 1.

I have always been under the impression that you were using 5 unrelated engines rather than a few versions of 2 unrelated engines. Wouldn't you expect there to be a correlation between the games against the three different versions of Glaurung/Stockfish? And wouldn't you expect that to affect your overall results? Of course, the fact that you use 4000 starting positions helps a lot, but it still makes me wonder whether your results are as accurate as you think they are.

It seems a little like doing cancer research with 500 participants where 300 are related to me and 200 are related to you, then trying to generalize the results to everybody. It doesn't make sense, and I can't imagine any scientist doing that. They want the participants to be as diverse as possible.
Re: another interesting cluster test result
xsadar wrote: I have always been under the impression that you were using 5 unrelated engines rather than a few versions of 2 unrelated engines. Wouldn't you expect there to be a correlation between the games against the three different versions of Glaurung/Stockfish? And wouldn't you expect that to affect your overall results?

Possibly. Fruit and Toga play significantly differently. Glaurung 1 and 2 are significantly different. And Stockfish plays nothing like the version of G2 I am using.

My primary concern is that the program(s) I use have to be reliable. A few can't deal with fast time controls. A few misbehave in other ways. I would be more concerned if I were using just one program, of course. I have a few others that I have thrown into the mix from time to time, but I do not want too many that are significantly weaker than Crafty, as that doesn't provide much useful information. The results to date have clearly shown improvement on every testing tournament I have seen. So, optimal? Probably not. But working? Yep.
Finding reliable opponents that work correctly on Unix is a problem. Most programs are Windows-based, and I can't run Windows applications on our Linux cluster. I have used Arasan, GNU Chess, the infamous Ippolit, etc. Ippolit would be a good opponent, but it is beyond unreliable, so I removed it completely.