petero2 wrote: ↑Sun Oct 27, 2019 7:16 am
* Each cluster node has its own local transposition table, which works the same as in a normal SMP engine. In Texel's case I use your lockless hashing algorithm. Thanks for inventing that.
What's this lockless hashing algorithm? I'm definitely interested in it.
petero2 wrote: ↑Sun Oct 27, 2019 11:02 pm
So the interesting question to me is: If you have a cluster of modern computers (e.g. Intel Core CPUs) connected with Ethernet networking, what is the best way to use this cluster to maximize playing strength of your chess engine?
One thing that I like about my "Lazy Futures" approach is that it's basically a message-passing algorithm, which would probably work decently well in an Ethernet cluster setting. I've got a topic somewhere else on the subject, but the short story is that alpha/beta bounds are expression trees of futures instead of plain ints. A drawing from my last topic is as follows:
The full tree assumes 3 children per position.
It's quite big, but you can trace how the alpha-beta bounds were calculated from the root if you prefer. The root position "P" starts with alpha-beta bounds of [-inf, +inf], and each row can be evaluated in parallel without any loss in work efficiency.
The alpha/beta expressions are trees written in reverse Polish notation. For example, P31's alpha is "inf - F1 max F2 max", which evaluates to max(F2, max(F1, -infinity)). Negate (or "-") is a unary operator in the reverse-Polish string, affecting the single item immediately to its left, so the leading "inf -" is a negative infinity in P31's alpha.
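To make that concrete: a bound expression can literally be stored as a token array and run through a tiny stack machine. Here's a minimal C++ sketch (all names invented for illustration, not from any actual engine): constants and bound futures get pushed, "max" pops two values, and negate flips the top of the stack. An unbound future simply makes the whole expression "not evaluable yet".

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical token of an RPN alpha/beta bound expression.
struct Token {
    enum Kind { Const, Future, Max, Neg } kind;
    int32_t value = 0;    // used when kind == Const (e.g. +infinity)
    uint64_t future = 0;  // used when kind == Future (the future's id)
};

// Try to evaluate an RPN expression with the bindings known so far.
// Returns nullopt if any future in the expression is still unbound.
std::optional<int32_t> evalRpn(const std::vector<Token>& expr,
                               const std::unordered_map<uint64_t, int32_t>& bound) {
    std::vector<int32_t> stack;
    for (const Token& t : expr) {
        switch (t.kind) {
        case Token::Const:
            stack.push_back(t.value);
            break;
        case Token::Future: {
            auto it = bound.find(t.future);
            if (it == bound.end())
                return std::nullopt;  // still waiting on this future
            stack.push_back(it->second);
            break;
        }
        case Token::Max: {
            int32_t top = stack.back(); stack.pop_back();
            stack.back() = std::max(stack.back(), top);
            break;
        }
        case Token::Neg:
            stack.back() = -stack.back();
            break;
        }
    }
    return stack.back();
}
```

P31's alpha "inf - F1 max F2 max" is then just the token sequence {Const(+inf), Neg, Future(F1), Max, Future(F2), Max}.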
"F" values (such as F1 or F2) are futures, values that are not known yet because they are being evaluated in parallel. F1 represents the future-evaluation of position P1. (While F131 would be the future-evaluation of position P131). A future-binding is simply a message, such as "F1 equals +55 centipawns", possibly "LowerBound: F1 > +20 centipawns" and "Upperbound: F1 < +90 centipawns". Such a message would only need a 64-bit future-identifier + 16-bit centipawn evaluation: 10 bytes broadcast. I don't know if lower-bound or upper-bound messages would be useful in my scheme, but its possible.
"Beta-blocked" is only relative to work-efficiency. If you wanted a 100% work efficient algorithm, you'd have to wait until Score vs Beta were both fully bound. For example,Beta-Blocked P32 would need F31, F1, and F2 to be bound to a evaluation if you wanted an equivalent work-efficiency to sequential alpha-beta. Its pretty natural to execute P32 speculatively however.
Anyway, a speculative P32 would look like: alpha: [inf - F1 max F2 max] (identical to P31, its "older brother", i.e. negate(beta) from its father P3), while P32's beta would be [inf - F31 max -], that is: negate(max(parent.alpha, older brother's future)). All of P32's children, such as P321, P3211, and P32111, will remember that F31 is "still pending", and a sanity check on the alpha-beta bounds at a later time (when more values are bound) will let you know if the positions have entered beta-cutoff and can be discarded early.
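That sanity check is cheap to sketch too. Reusing the hypothetical Token / evalRpn / FutureBinding pieces from above: every incoming binding message re-evaluates the bounds of the locally-known speculative positions, and anything whose window has closed gets thrown away.

```cpp
// Hypothetical speculative position: its bounds are RPN expressions
// over futures rather than plain integers.
struct SpeculativePos {
    std::vector<Token> alphaExpr;
    std::vector<Token> betaExpr;
    bool discarded = false;
};

// Called whenever a FutureBinding message arrives. Once both bounds of
// a position evaluate to concrete scores, check whether its window has
// closed, in which case the whole speculative subtree can be dropped.
void onBinding(std::vector<SpeculativePos>& speculative,
               std::unordered_map<uint64_t, int32_t>& bound,
               const FutureBinding& msg) {
    bound[msg.futureId] = msg.scoreCp;
    for (SpeculativePos& p : speculative) {
        auto a = evalRpn(p.alphaExpr, bound);
        auto b = evalRpn(p.betaExpr, bound);
        if (a && b && *a >= *b)
            p.discarded = true;  // beta cutoff: the speculation was wasted
    }
}
```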
--------
So in effect:
1. Sort all the positions on your cluster in terms of work efficiency. For example, P33 would be speculating on two values, F32 and F31, while P32 only speculates on F31, and P31 has no speculation at all. As such, P31 is fully work efficient (sequential alpha-beta must, and always will, visit P31), while P32, P33, P34... are more and more speculative. (A toy version of this cost function is sketched after the list.)
2. If you run out of memory, focus on the greatest-depth positions to "clean up" memory. This will force some degree of speculation.
3. Load balance by message-passing positions to other computers in your cluster. However, every position you pass to another computer increases the Ethernet bandwidth required. For example, if you load-balance P32 to another computer, the bindings of F1, F2, and F31 will need to be broadcast to all computers in the cluster (by the time F1, F2, and F31 are complete, you don't know whether the other computer has load-balanced P32's children to yet more computers, so you should probably do an expensive full broadcast).
With that being said, by having the futures explicitly laid out like this, you can load-balance in a way that minimizes communication and saves Ethernet bandwidth. For example, P31 is only waiting on F1 and F2, so the "cost" of load-balancing P31 is 33% less than broadcasting P32 (two pending futures instead of three).
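Putting points 1-3 together, the priority/cost heuristic is basically "count the unbound futures in a position's bound expressions". A toy version, again reusing the hypothetical sketches from above:

```cpp
#include <set>

// Count the distinct futures a position is still waiting on. Fewer
// pending futures = less speculative (P31 waits on 2, P32 on 3, ...),
// and also cheaper to ship to another machine, since every pending
// future it carries must eventually be broadcast there.
int pendingFutures(const SpeculativePos& p,
                   const std::unordered_map<uint64_t, int32_t>& bound) {
    std::set<uint64_t> pending;
    for (const std::vector<Token>* expr : {&p.alphaExpr, &p.betaExpr})
        for (const Token& t : *expr)
            if (t.kind == Token::Future && !bound.count(t.future))
                pending.insert(t.future);
    return (int)pending.size();
}
```

Sorting the local work queue by this count ascending gives the work-efficiency order of point 1, breaking ties toward greater depth roughly handles the memory cleanup of point 2, and the same count doubles as the broadcast cost of point 3.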
--------
So yeah... it seems like this scheme would work out very well for message passing. Not that I've done it or anything; I'm focusing on a GPU implementation for now. The GPU is what I've got at home, so that's what I'm going for. But I've spent some time hypothesizing how I'd generalize the approach to 4x GPUs instead of 1x GPU... so that's why I know what to type up here, in theory.