Significant modification of Komodo SMP implementation

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Significant modification of Komodo SMP implementation

Post by bob »

early returns from the 24.1 (mt=8) vs 24.1x (mt=0) are looking slightly worse than the 25.0 tests. Which might further point to the history info since 25.0 uses a new old history counter approach, a saturating counter idea I tested and reported on here back when I tried every possible history counter implementation I could think of right after fruit 2.1 came out and "history pruning" was the topic of the day. The current approach gets rid of the "aging" idea which might trigger significant history counter changes in 24.x and earlier, but 25.0 doesn't need to age with a saturating counter.

Early results:

1 Crafty-24.1x 2604 8 8 826 51% 2596 53%
2 Crafty-24.1 2596 8 8 826 49% 2604 53%

results are now closer to the 25.0 results after over 800 games...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Significant modification of Komodo SMP implementation

Post by bob »

I stopped here:

1 Crafty-24.1x 2608 2 2 15737 52% 2592 51%
2 Crafty-24.1 2592 2 2 15737 48% 2608 51%

24.1 is 8 thread crafty, 24.1x is 1 core crafty. sd=10 search, everything else dead even, ponder=off.

So there is a 16 Elo difference by my testing, which is pretty close to yours (I played way more games, obviously, and we use different starting positions (I play with no book, also).

I'm going to run a few tests to see if I might spot where that difference is...
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Significant modification of Komodo SMP implementation

Post by Laskos »

bob wrote:I stopped here:

1 Crafty-24.1x 2608 2 2 15737 52% 2592 51%
2 Crafty-24.1 2592 2 2 15737 48% 2608 51%

24.1 is 8 thread crafty, 24.1x is 1 core crafty. sd=10 search, everything else dead even, ponder=off.

So there is a 16 Elo difference by my testing, which is pretty close to yours (I played way more games, obviously, and we use different starting positions (I play with no book, also).

I'm going to run a few tests to see if I might spot where that difference is...
Yes, that's consistent. The shared history counter array seems plausible explanation, but it needs to be verified. The overall effect seems to be a decrease in effective speedup compared to TTD speedup by maybe 5% for Crafty 25.0 and 10%, if not more, for Crafty 24.1.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Significant modification of Komodo SMP implementation

Post by bob »

Laskos wrote:
bob wrote:I stopped here:

1 Crafty-24.1x 2608 2 2 15737 52% 2592 51%
2 Crafty-24.1 2592 2 2 15737 48% 2608 51%

24.1 is 8 thread crafty, 24.1x is 1 core crafty. sd=10 search, everything else dead even, ponder=off.

So there is a 16 Elo difference by my testing, which is pretty close to yours (I played way more games, obviously, and we use different starting positions (I play with no book, also).

I'm going to run a few tests to see if I might spot where that difference is...
Yes, that's consistent. The shared history counter array seems plausible explanation, but it needs to be verified. The overall effect seems to be a decrease in effective speedup compared to TTD speedup by maybe 5% for Crafty 25.0 and 10%, if not more, for Crafty 24.1.
Rest assured I am studying the issue. There must be positions where the serial search finds something and the parallel search to same depth overlooks it. I am searching for such a position as then it is debuggable..
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Significant modification of Komodo SMP implementation

Post by Laskos »

bob wrote:
Laskos wrote:
bob wrote:I stopped here:

1 Crafty-24.1x 2608 2 2 15737 52% 2592 51%
2 Crafty-24.1 2592 2 2 15737 48% 2608 51%

24.1 is 8 thread crafty, 24.1x is 1 core crafty. sd=10 search, everything else dead even, ponder=off.

So there is a 16 Elo difference by my testing, which is pretty close to yours (I played way more games, obviously, and we use different starting positions (I play with no book, also).

I'm going to run a few tests to see if I might spot where that difference is...
Yes, that's consistent. The shared history counter array seems plausible explanation, but it needs to be verified. The overall effect seems to be a decrease in effective speedup compared to TTD speedup by maybe 5% for Crafty 25.0 and 10%, if not more, for Crafty 24.1.
Rest assured I am studying the issue. There must be positions where the serial search finds something and the parallel search to same depth overlooks it. I am searching for such a position as then it is debuggable..
The problem is the parallel search to the same depth is not reproducible. I can give positions on which they disagree (to depth=12, for example), and "narrowing" is confirmed (35 to 31 solved). These positions have high probability to give disagreements between 1 thread and 4 threads.

STS 15
100 test positions to depth=12
1 thread: solved 35/100
4 thread: solved 31/100

Crafty 24.1

hash=1024m
hashp=128m
mt=1
sd=12

and

hash=1024m
hashp=128m
mt=4
sd=12

Code: Select all

STS 15 (100 positions) to depth=12:

1 thread: 35/100

         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 -------------------------------------------------------------------------------------
   0 |   -   0   -   -   0   -   -   -   -   0   -   -   -   -   -   -   0   0   0   -
  20 |   -   0   -   0   0   -   0   -   -   -   -   -   -   -   -   0   -   -   0   -
  40 |   -   -   0   -   0   0   -   -   0   -   0   0   0   -   -   -   -   0   0   0
  60 |   -   -   -   0   -   0   -   -   0   -   -   -   0   -   0   0   -   -   -   0
  80 |   -   -   -   -   -   -   -   0   -   -   0   0   -   -   -   -   -   0   0   0



4 threads: 31/100

         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 -------------------------------------------------------------------------------------
   0 |   -   -   -   -   0   -   -   -   -   0   -   -   -   -   0   -   0   0   0   -
  20 |   -   0   -   0   0   -   0   -   -   -   -   -   -   -   -   0   -   -   0   -
  40 |   -   -   0   -   0   -   -   0   0   -   0   0   0   -   -   -   -   0   -   0
  60 |   -   -   -   0   -   -   -   -   0   -   -   -   0   0   0   -   -   -   -   0
  80 |   -   -   -   -   -   -   -   -   -   -   -   0   -   -   -   -   -   0   0   0




Different: 2, 15, 46, 48, 59, 66, 74, 76, 88, 91

rn2kbnr/pp3pp1/q1p1p3/3pP2p/3P4/2NQ1N2/PPP2PPP/R1B2RK1 w kq - bm Qd1;
r3k2r/5p2/p3p2p/3pPb1P/1q6/8/1PPQNP2/2KR3R b kq - bm Qa4;
6k1/1p3pp1/p1rpq2p/3Q4/P1P2P2/1P6/3R1KPP/8 w - - bm Qd4;
5rk1/1p1r1pp1/2pNp2p/p1PnP2P/3P4/1Pq2Q2/P4P1K/3RR3 w - - bm Qe2;
4r1k1/p4pp1/4r1b1/q1pp2P1/5B1P/8/PPPQ1P2/1K1R2R1 b - - bm Qb5;
3r2rk/1p3Rpb/2nBp2p/p1P5/1pB1q3/1P2Q2P/5RPK/8 w - - bm Qg3;
2rr2k1/pp1bppbp/2n3p1/q7/2BPPN2/4BP2/P2Q2PP/2RR2K1 w - - bm Qb2;
2rr2k1/4bppp/p3p1n1/4P3/2q1Q1P1/2N3B1/PP3R1P/5R1K b - - bm Rd4;
1r6/8/1ppp1qkp/5pn1/2P2B2/2Q3R1/1P3PP1/6K1 w - - bm Qd2;
1r4k1/1brp1pp1/1p3n1p/2q1p3/2P1P3/1PNBQ3/P3RP1P/3R3K w - - bm Qh3;
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Significant modification of Komodo SMP implementation

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:I stopped here:

1 Crafty-24.1x 2608 2 2 15737 52% 2592 51%
2 Crafty-24.1 2592 2 2 15737 48% 2608 51%

24.1 is 8 thread crafty, 24.1x is 1 core crafty. sd=10 search, everything else dead even, ponder=off.

So there is a 16 Elo difference by my testing, which is pretty close to yours (I played way more games, obviously, and we use different starting positions (I play with no book, also).

I'm going to run a few tests to see if I might spot where that difference is...
Yes, that's consistent. The shared history counter array seems plausible explanation, but it needs to be verified. The overall effect seems to be a decrease in effective speedup compared to TTD speedup by maybe 5% for Crafty 25.0 and 10%, if not more, for Crafty 24.1.
Rest assured I am studying the issue. There must be positions where the serial search finds something and the parallel search to same depth overlooks it. I am searching for such a position as then it is debuggable..
The problem is the parallel search to the same depth is not reproducible. I can give positions on which they disagree (to depth=12, for example), and "narrowing" is confirmed (35 to 31 solved). These positions have high probability to give disagreements between 1 thread and 4 threads.

STS 15
100 test positions to depth=12
1 thread: solved 35/100
4 thread: solved 31/100

Crafty 24.1

hash=1024m
hashp=128m
mt=1
sd=12

and

hash=1024m
hashp=128m
mt=4
sd=12

Code: Select all

STS 15 (100 positions) to depth=12:

1 thread: 35/100

         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 -------------------------------------------------------------------------------------
   0 |   -   0   -   -   0   -   -   -   -   0   -   -   -   -   -   -   0   0   0   -
  20 |   -   0   -   0   0   -   0   -   -   -   -   -   -   -   -   0   -   -   0   -
  40 |   -   -   0   -   0   0   -   -   0   -   0   0   0   -   -   -   -   0   0   0
  60 |   -   -   -   0   -   0   -   -   0   -   -   -   0   -   0   0   -   -   -   0
  80 |   -   -   -   -   -   -   -   0   -   -   0   0   -   -   -   -   -   0   0   0



4 threads: 31/100

         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 -------------------------------------------------------------------------------------
   0 |   -   -   -   -   0   -   -   -   -   0   -   -   -   -   0   -   0   0   0   -
  20 |   -   0   -   0   0   -   0   -   -   -   -   -   -   -   -   0   -   -   0   -
  40 |   -   -   0   -   0   -   -   0   0   -   0   0   0   -   -   -   -   0   -   0
  60 |   -   -   -   0   -   -   -   -   0   -   -   -   0   0   0   -   -   -   -   0
  80 |   -   -   -   -   -   -   -   -   -   -   -   0   -   -   -   -   -   0   0   0




Different: 2, 15, 46, 48, 59, 66, 74, 76, 88, 91

rn2kbnr/pp3pp1/q1p1p3/3pP2p/3P4/2NQ1N2/PPP2PPP/R1B2RK1 w kq - bm Qd1;
r3k2r/5p2/p3p2p/3pPb1P/1q6/8/1PPQNP2/2KR3R b kq - bm Qa4;
6k1/1p3pp1/p1rpq2p/3Q4/P1P2P2/1P6/3R1KPP/8 w - - bm Qd4;
5rk1/1p1r1pp1/2pNp2p/p1PnP2P/3P4/1Pq2Q2/P4P1K/3RR3 w - - bm Qe2;
4r1k1/p4pp1/4r1b1/q1pp2P1/5B1P/8/PPPQ1P2/1K1R2R1 b - - bm Qb5;
3r2rk/1p3Rpb/2nBp2p/p1P5/1pB1q3/1P2Q2P/5RPK/8 w - - bm Qg3;
2rr2k1/pp1bppbp/2n3p1/q7/2BPPN2/4BP2/P2Q2PP/2RR2K1 w - - bm Qb2;
2rr2k1/4bppp/p3p1n1/4P3/2q1Q1P1/2N3B1/PP3R1P/5R1K b - - bm Rd4;
1r6/8/1ppp1qkp/5pn1/2P2B2/2Q3R1/1P3PP1/6K1 w - - bm Qd2;
1r4k1/1brp1pp1/1p3n1p/2q1p3/2P1P3/1PNBQ3/P3RP1P/3R3K w - - bm Qh3;
I have found a very subtle timing condition that has been around in Crafty for about as long as I can imagine. Has to do with an unexpected exit from Search() that I had not noticed. If I am splitting at the root, and one thread fails high, it stops the others and then they exit, get me back to Iterate() which changes the bound, and off to search again. But several threads were busily searching root moves and as one is grabbed to search, it is flagged as "searched already" so it won't be searched by another thread. I thought I had everything covered so that any threads that were stopped prematurely would reset that "searched" flag so that after the fail high is completed, they will be picked up and searched completely before we are done. But there was a way out where that was not done. And that most definitely "narrows" the tree since some stuff is now not going to be searched. I found I can make it happen with a 5 ply search using 20 cores, with WAC 141. Real problem happens when there are several moves that are going to fail high. Which ever fails high first (assuming all are "in progress") will be the only one that has a chance, the rest won't get searched until the next iteration, if there is one.

I've decided to rewrite that part of the search from the ground up as it was too messy. For example, the "make move, call alpha/beta, fail high and return" is a problem because the move has to be unmade before the return. But it has to be unmade after that is finished to prepare for the next move. There were too many special cases to like, so eventually it will be something that is at least bearable to look at...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Significant modification of Komodo SMP implementation

Post by bob »

Finally got the search rewritten and significantly cleaned up. Still have some testing to do, but one thing is for sure, I learned something new I had not considered in the past. Namely that testing 1 core vs N cores to a fixed depth is actually an interesting debugging tool. I don't expect my parallel search to explode and search a lot more, so I don't expect it to have any significant Elo advantage over one core to the same depth. But by the same token, I don't expect it to give up anything whatsoever, and do expect either a break-even or better result.

This helped me find a problem (described in previous post) that has been around for quite a while. What I did was to write a script that fed a zillion positions to mt=0 and mt=20 crafty, starting at depth=5 which is about the minimum that will work as it takes at least 5 remaining plies before a split can be done. This test splits only at the root, and I found two WAC positions that failed. But not eery time. Just every now and then. When I say failed, I mean simply chose the wrong move, where the right move is a clear tactical move and not some minor positional nuance that an SMP search can alter.

Kai gets an honorable mention here for a testing strategy I had never considered. I regularly compare a serial and parallel search, but I didn't try to burn it up with shallow searches, as this test (original sd=10 is near instant moves, playing a game in under one second).

I wrote my script to start at 5, because when ac41 failed to work correctly every now and then, it produced such a small tree that I could actually dump it, and look to see who was doing what and what was not searched correctly. As I mentioned, an early move failed high, but a later move was going to fail even higher. Except that the first move was searched and failed before the better move was completed. That thread stopped as directed, but failed to note that the move it was searching had not been completed, so it had no chance to fail high until the next iteration. Not a real common thing unless you search these impossibly fast/small trees. I have now run this position 1,000,000 times without a failure so this bug seems to be gone.

However, after this experiment, I am going to go through the search with not just a fine-toothed comb, but with a microscopic comb, to make sure everything is as it should be. I've already cleaned it up to get rid of many of the multiple exits that haunted me here...

I am going to re-run the mt=1 vs mt=8 test again, to see if anything else odd shows up. More on this later...

Here's the early results:

Code: Select all

 
   1 Crafty-25.0    2601    7    7  1209   50%  2599   49% 
   2 Crafty-25.0x  2599    7    7  1209   50%  2601   49% 
25.0 is mt=8, 25.0x is mt=0.

This is more in line with what I was expecting, not the -10 or -18 stuff I was seeing previously. I'll post the results tomorrow morning once this gets beyond 20K games...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

more

Post by bob »

Seems to be stabilizing:

Code: Select all

   1 Crafty-25.0   2606    5    5  2265   52%  2594   48% 
   2 Crafty-25.0x  2594    5    5  2265   48%  2606   48% 
I think tomorrow I am going to hack 25.0 and 25.0x to count the total nodes searched in a game. Then sum them all up and compare to see how much overhead the 8 CPU version adds compared to a pure 1 cpu search to fixed depth.

Never thought of this kind of a test. Ought to be able to do the same for 1cpu, 2cpus, 4cpus, and 8 cpus (going beyond 8 means using the 20 core box and I only have one node until later this summer).

Be interesting to have both speedup and overhead numbers. I've tried to measure overhead in the past but it is not easy. This is actually an easy way to measure this. Also be interesting to see what happens with deeper searches. I can probably play sd=20 fast enough to get some decent numbers...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: final results

Post by bob »

This ran overnight and here's the final numbers after 30K games:

Code: Select all

   1 Crafty-25.0   2603    1    1 30016   51%  2597   48% 
   2 Crafty-25.0x  2597    1    1 30016   49%  2603   48% 
So the current version appears to unintentionally widen just a bit, but since move ordering is never perfect, search overhead is a part of the game at present, although it has a remarkably small effect, although only measured at depth=10.

Let me also add this is a sub-optimal tuning. I am letting Crafty split everywhere in the search, with the usual min split depth set to zero, so the real search will widen less. A quick test suggested a 2 Elo change, 2602 vs 2598.

I ran another test as this represents something someone might do in a parallel search. In Crafty, when I split, a fail high at a split point instantly terminates all threads helping at that point or deeper in the same sub-tree (split here and results farther down one of the branches from this point). I disabled this, which means that every split has to complete normally, whether there is a fail high or not. This definitely widens the tree by a slight amount. But the Elo effect is almost nil. In a real search it would make a difference, since waiting on everything to finish after a fail high simply wastes time and slows the search down. But here, time doesn't matter (fixed depth search limit).

Just for fun, I repeated with null-move disabled. Since this is still to a fixed depth, I thought it would be interesting to see what effect this might have as this will definitely widen the tree with no apparent cost thanks to sd=10... This was a bit more interesting in that it resulted in a 2622-2578 (+44 Elo!) change. Of course we know null-move makes mistakes, but to compensate for those mistakes we get a significantly deeper search with more than compensates. Previous tests with Crafty showed removing null-move was an 80 Elo loss, which means that overall it is a 120 Elo gain due to extra depth and a 40 elo loss due to zugzwang/reduced depth errors.

Previous tests also showed that removing LMR by itself was about the same Elo loss, which means that in the above testing, it would likely produce about the same result as null-move removal.

It seems there are roughly four ways to "widen" realistically.

(1) search overhead caused by choosing poor split points and/or failing to stop/recover quickly when an unwanted fail-high occurs at a split point.

(2) back off on null-move reductions which will increase the size of the tree.

(3) back off on LMR reductions ditto.

(4) limit forward pruning (futility/LMP/etc) by increasing the margins, which again will make the tree larger and the depth lower.

I'm calling this test done (note that removing null-move greatly increases the length of the test games since sd=10 now takes a lot longer for one opponent.)

Next step is to re-run the parallel search tests. I don't expect major changes, but the results will tell the tale...
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: final results

Post by Laskos »

bob wrote:This ran overnight and here's the final numbers after 30K games:

Code: Select all

   1 Crafty-25.0   2603    1    1 30016   51%  2597   48% 
   2 Crafty-25.0x  2597    1    1 30016   49%  2603   48% 
So the current version appears to unintentionally widen just a bit, but since move ordering is never perfect, search overhead is a part of the game at present, although it has a remarkably small effect, although only measured at depth=10.

Let me also add this is a sub-optimal tuning. I am letting Crafty split everywhere in the search, with the usual min split depth set to zero, so the real search will widen less. A quick test suggested a 2 Elo change, 2602 vs 2598.

I ran another test as this represents something someone might do in a parallel search. In Crafty, when I split, a fail high at a split point instantly terminates all threads helping at that point or deeper in the same sub-tree (split here and results farther down one of the branches from this point). I disabled this, which means that every split has to complete normally, whether there is a fail high or not. This definitely widens the tree by a slight amount. But the Elo effect is almost nil. In a real search it would make a difference, since waiting on everything to finish after a fail high simply wastes time and slows the search down. But here, time doesn't matter (fixed depth search limit).

Just for fun, I repeated with null-move disabled. Since this is still to a fixed depth, I thought it would be interesting to see what effect this might have as this will definitely widen the tree with no apparent cost thanks to sd=10... This was a bit more interesting in that it resulted in a 2622-2578 (+44 Elo!) change. Of course we know null-move makes mistakes, but to compensate for those mistakes we get a significantly deeper search with more than compensates. Previous tests with Crafty showed removing null-move was an 80 Elo loss, which means that overall it is a 120 Elo gain due to extra depth and a 40 elo loss due to zugzwang/reduced depth errors.

Previous tests also showed that removing LMR by itself was about the same Elo loss, which means that in the above testing, it would likely produce about the same result as null-move removal.

It seems there are roughly four ways to "widen" realistically.

(1) search overhead caused by choosing poor split points and/or failing to stop/recover quickly when an unwanted fail-high occurs at a split point.

(2) back off on null-move reductions which will increase the size of the tree.

(3) back off on LMR reductions ditto.

(4) limit forward pruning (futility/LMP/etc) by increasing the margins, which again will make the tree larger and the depth lower.

I'm calling this test done (note that removing null-move greatly increases the length of the test games since sd=10 now takes a lot longer for one opponent.)

Next step is to re-run the parallel search tests. I don't expect major changes, but the results will tell the tale...
Would be curious now to see the speedup numbers with the new Crafty 25. I guess time-to-depth (TTD) speedup will go down a bit. Effective speedup in the narrowing Crafty 25 is maybe 5% smaller than TTD speedup, while in the corrected, a little bit widening Crafty 25, effective speedup is maybe 5% higher than TTD speedup. All in all, the corrected Crafty will probably gain on multicore several ELO points compared to the buggy one.