Another take on DTS?

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

mphuget
Posts: 4
Joined: Fri Mar 01, 2019 11:46 am
Full name: Marc-Philippe HUGET

Another take on DTS?

Post by mphuget » Fri Oct 25, 2019 8:31 am

Hello everyone,

I am surprised to see so little interest currently in Dynamic Tree Splitting (DTS) compared with Lazy SMP. Is it because DTS seems suited to multiprocessors or large-scale networks, while Lazy SMP is for multithreading? Is there any scientific or empirical comparison between the two?

Thanks for your feedback

mph

dragontamer5788
Posts: 134
Joined: Thu Jun 06, 2019 6:05 pm
Full name: Percival Tiglao

Re: Another take on DTS?

Post by dragontamer5788 » Fri Oct 25, 2019 9:41 pm

The main issue right now is that multicore architectures are only now becoming affordable. Most people probably only have a 4-core / 8-thread machine, which is too small to see a major benefit from a more robust parallelism methodology.

The exception is the GPU, which is widely deployed but poorly understood and hard to program. Most programmers have access to a GPU with at least 256 SIMD-threads of execution... but very, very few programmers even know how to work with it. Intel, AMD, and NVidia GPUs all have different performance characteristics, as well as different programming environments. Even if you used OpenCL, you would need a significant amount of effort to port performance between the different GPU architectures.

---------

With AMD Zen2 processors coming out with 12-cores / 24-threads for only $500, maybe more people will join the parallelism discussion. But unless people have access to a large multiprocessor computer, most parallelism talks are just the stuff of theory.

------

With regards to me personally: I researched Lazy SMP, YBWC, and ABDADA then started to work on my own code. There comes a point where you have to stop researching and start writing, and that's more or less where I'm at. (And honestly, I probably should be writing / testing more code with my methodology...)

Daniel Shawul
Posts: 3758
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: Another take on DTS?

Post by Daniel Shawul » Sat Oct 26, 2019 12:09 am

I am pretty sure DTS was tested up to at least 16 cores when it was first published, and YBW up to a thousand cores with its distributed scheme.
Also, 8-core testers have been around here for at least 10 years, I believe. A lot of programmers were using YBW for SMP, and a few used DTS, since
it requires an iterative search and is more difficult to code than recursive YBW. There have been wars fought here over who has the best parallelized program (Vincent Diepeveen comes to mind), since the days of Deep Blue, CilkChess, etc.

However, once TCEC started using 44 cores, programmers realized that lazy parallelization schemes (Lazy SMP, ABDADA, etc.) worked better,
because they scale better in NPS and slightly better in performance too. And people love to be lazy when they can.

With regard to CUDA, a few of us have worked on it too and realized it is not suited to the current crop of parallelization schemes.
By the way, how is your GPU work coming along?

dragontamer5788
Posts: 134
Joined: Thu Jun 06, 2019 6:05 pm
Full name: Percival Tiglao

Re: Another take on DTS?

Post by dragontamer5788 » Sat Oct 26, 2019 5:20 am

Daniel Shawul wrote:
Sat Oct 26, 2019 12:09 am
I am pretty sure DTS was tested up to at least 16 cores when it was first published, and YBW up to a thousand cores with its distributed scheme.
Also, 8-core testers have been around here for at least 10 years, I believe. A lot of programmers were using YBW for SMP, and a few used DTS, since
it requires an iterative search and is more difficult to code than recursive YBW. There have been wars fought here over who has the best parallelized program (Vincent Diepeveen comes to mind), since the days of Deep Blue, CilkChess, etc.
Researchers and professors have had access to those kinds of machines for decades. But I'm just a humble home-programmer on a budget. I really do think that the hobby-level, less-academic people have something to contribute... but hobbyists have only had so long to play with multicore systems.

Just 5 years ago, the best consumer-grade CPU was the Intel i7-5960X Extreme Edition: 8 cores / 16 threads for $999 for the CPU alone (plus probably a $300 motherboard... the total system cost for an HEDT would be $3000+ from Dell or HP, maybe $2500 if you built it yourself. Above most people's hobby budgets). Otherwise, most people were probably buying something like the Intel i7-4790K (4 cores / 8 threads for a total system cost of $1500 or so). Obviously, the chess community as a whole has included some pretty fancy machines (see Deep Blue, or the 8-socket Xeon Platinums), but the general hobby programming community only really started to get access to good multithreaded hardware within the past 5 years.

And many programmers I know still stick around on 2-core / 4-thread laptops. Perfectly fine programmers, who don't even bother buying decent equipment to program with. It will be years before these laptop users will be able to run good 16-thread experiments.

But yes, obviously the people who wrote parallel algorithms and tested them in the 90s had access to highly multithreaded systems. But there's a big difference between reading the work of others and testing things for yourself on your own hardware! Some things can only be learned when you play with them on your own time.
Daniel Shawul wrote:
Sat Oct 26, 2019 12:09 am
With regard to CUDA, a few of us have worked on them too and realized it is not suited to the current crop of parallelization schemes.
Btw how is your GPU work coming along ?
Exactly: none of the parallelization schemes seem to work well on a GPU.

With regards to my scheme itself: I've done a bunch of micro-benchmarks (actual GPU code) plus tiny simulations (written in Python, just conceptual stuff) to prove to myself that I'm on the right track. I got myself into "writer's block" for a bit, but just this past week I came up with some new ways to organize the code in a more GPU-friendly way.

I don't really want to write about it until I write a bit more code, but the main concepts I discussed in my other topic remain valid. I'll write more into that topic when more obvious progress is made.

jdart
Posts: 3825
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Another take on DTS?

Post by jdart » Sat Oct 26, 2019 4:56 pm

I realize a lot of end users don't have large multicore systems but I think many chess programmers do. I have 160 cores total, including four computers with 32 cores each.

There is also TCEC, which is running on very big x86 hardware: 256 threads, I think, in the current edition. I don't get to do testing on that box, but I have submitted code to them over the years and found that Lazy SMP scales pretty well on this hardware. Lazy SMP is also easy to implement. I am sure there is room for improvement, but it is a decent start. DTS might do better, but I am not sure.
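For reference, the core of Lazy SMP really does fit in a few lines: every thread runs the same iterative-deepening search on the same root position, sharing one transposition table, with helper threads offset by a ply so they desynchronize and seed the TT for each other. A toy sketch, with illustrative names and a stand-in for the actual search (not any particular engine's code):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Shared TT: one atomic 64-bit word per entry (imagine a packed
// move/score/depth), so concurrent threads need no locks.
struct SharedTT {
    std::vector<std::atomic<uint64_t>> entries;
    explicit SharedTT(size_t n) : entries(n) {}
    uint64_t probe(uint64_t key) const {
        return entries[key % entries.size()].load(std::memory_order_relaxed);
    }
    void store(uint64_t key, uint64_t packed) {
        entries[key % entries.size()].store(packed, std::memory_order_relaxed);
    }
};

// Stand-in for a real iterative-deepening alpha-beta search: it only
// touches the TT and records the deepest iteration completed.
void searchRoot(SharedTT& tt, uint64_t rootKey, int maxDepth,
                std::atomic<int>& deepest) {
    for (int d = 1; d <= maxDepth; ++d) {
        tt.store(rootKey + d, static_cast<uint64_t>(d));
        int cur = deepest.load();
        while (d > cur && !deepest.compare_exchange_weak(cur, d)) {}
    }
}

// Lazy SMP: every thread searches the same root; helper threads get a
// simple depth skew so the threads desynchronize.
int lazySmp(int nThreads, int targetDepth) {
    SharedTT tt(1 << 16);
    std::atomic<int> deepest{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t) {
        int depth = targetDepth + (t % 2);  // odd helpers search one ply deeper
        pool.emplace_back(searchRoot, std::ref(tt), 0x9E3779B97F4A7C15ULL,
                          depth, std::ref(deepest));
    }
    for (auto& th : pool) th.join();
    return deepest.load();
}
```

All coordination happens implicitly through the shared TT; that is the whole trick, and why the approach is so much easier to get right than explicit split points.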

--Jon

bob
Posts: 20570
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Another take on DTS?

Post by bob » Sat Oct 26, 2019 5:36 pm

My take has a couple of points.

(1) DTS is really pretty complex. And making it work beyond 16 cores really takes some effort, as lock/unlock operations can cause a lot of contention and kill performance. And the contention increases more than linearly with the number of cores.

(2) The shared transposition table idea (Lazy SMP) is significantly easier to implement. It gives up significant performance at first, but the simplicity of this approach begins to look better as the lock contention overhead in DTS climbs. I.e., DTS is really hard to make work efficiently once you go beyond 64 cores.

The nice thing about the shared transposition (lazy SMP) approach is that as you throw more cores at it, you keep improving performance as the search gets deeper. With DTS it is quite easy to see the opposite, in that going to 128 or 256 cores can actually reduce overall performance due to lock contention. There are solutions to mitigate lock contention, but it only makes the DTS algorithm more complex.
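One such mitigation is the lockless hashing trick Hyatt and Mann published for shared transposition tables: each slot stores key XOR data, so a torn entry (two threads interleaving their two stores) fails the probe check and is simply ignored, with no lock taken at all. A minimal sketch, with illustrative names rather than Crafty's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Lockless hashing (Hyatt & Mann): the key slot holds key ^ data.
// A torn or overwritten entry fails the XOR check on probe and is
// treated as a miss, so concurrent stores need no lock.
struct LocklessTT {
    std::vector<uint64_t> keys;   // each slot holds key ^ data
    std::vector<uint64_t> data;   // packed score/move/depth
    explicit LocklessTT(size_t n) : keys(n), data(n) {}

    void store(uint64_t key, uint64_t d) {
        size_t i = key % keys.size();
        keys[i] = key ^ d;        // two plain stores, no lock taken
        data[i] = d;
    }

    bool probe(uint64_t key, uint64_t& out) const {
        size_t i = key % keys.size();
        uint64_t d = data[i];
        if ((keys[i] ^ d) != key) // torn, overwritten, or empty slot
            return false;
        out = d;
        return true;
    }
};
```

A real engine would also avoid ever using a hash key of zero, since an empty slot would otherwise validate for it.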

When I originally designed and wrote DTS, starting around 1986, I spent a whole year, full-time, developing and debugging the code. I was working on my Ph.D. so fortunately I was a full-time student at the time and could apply myself full-time since that was my dissertation topic. It really is complex, and when you go beyond 4-8 cores it only gets more complex as you try to solve lock contention problems.

syzygy
Posts: 4455
Joined: Tue Feb 28, 2012 10:56 pm

Re: Another take on DTS?

Post by syzygy » Sat Oct 26, 2019 11:09 pm

mphuget wrote:
Fri Oct 25, 2019 8:31 am
Hello everyone,

I am surprised to see currently so few interests in Dynamic Tree Splitting in comparison with Lazy SMP. Is it because DTS seems to be adapted to multiprocessors or large-scale network while Lazy SMP is for multithreading? Is there any scientific or empirical comparison between the two?
The problem with DTS and other tree-splitting approaches such as YBWC seems to be that the very low branching factor of modern chess engines makes it difficult to keep all the threads busy.

Of course lazy SMP is useless on a cluster that lacks a shared TT, but DTS and YBWC will also suffer without a shared TT.

petero2
Posts: 587
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Re: Another take on DTS?

Post by petero2 » Sun Oct 27, 2019 1:13 am

syzygy wrote:
Sat Oct 26, 2019 11:09 pm
Of course lazy SMP is useless on a cluster that lacks a shared TT, but DTS and YBWC will also suffer without a shared TT.
But it is possible to implement a (partially) shared TT even without shared memory, by using message passing, as demonstrated here. So lazy SMP is not useless on a cluster that lacks shared memory, even though it is certainly less efficient than on a shared memory computer.

bob
Posts: 20570
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Another take on DTS?

Post by bob » Sun Oct 27, 2019 4:45 am

Shared transposition tables on a cluster have been discussed many times. It is something that can work in theory, but in practice it has some serious issues, the most important of which is the incredible latency of inter-node message passing. Even if you use the very fastest interconnects, i.e. InfiniBand at the upper end, the latency is still a major issue. And InfiniBand is not something you buy at your local Walmart or Best Buy; it is expensive, and it is complex to set up, interconnect, and manage. So much so that the normal user is simply not going to go that route due to cost.

So, yes it works, but the latency is a killer. Trans/Ref tables are accessed frequently. Accessing high-latency data can bog performance to the point it becomes useless. Will not say it can't work. But it certainly can't "easily" work... If I were 40 years younger, this would be a topic high on my interest list. Alas, not so much today. :)

petero2
Posts: 587
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Re: Another take on DTS?

Post by petero2 » Sun Oct 27, 2019 6:16 am

bob wrote:
Sun Oct 27, 2019 4:45 am
Shared transposition tables on a cluster have been discussed many times. It is something that can work in theory, but in practice it has some serious issues. The most important of which is the incredible latency for inter-nodal message-passing.
Maybe "(partially) shared TT" is not the correct terminology for what I do, but my method has no extra latency for TT access compared to SMP. It does have extra latency for TT distribution though. It basically works like this:

* Each cluster node has its own local transposition table, which works the same as in a normal SMP engine. In Texel's case I use your lockless hashing algorithm. Thanks for inventing that.

* TT lookup works the same as in SMP (just look in the local TT and ignore what may be stored in the TTs on other cluster nodes.)

* TT stores do a normal store in the local TT and also, if the depth is large enough, store the TT entry in an "output queue". The output queue is just a local vector, so no cluster communication is performed during TT store. The output queue storing has negligible impact on TT store performance since it only happens if the search depth is big enough.

* Periodically the entries stored in the output queue are sent to neighboring cluster nodes, and the receiving nodes insert the entries into their local TTs.
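The store-plus-queue logic above might look roughly like this (a sketch with hypothetical names, not Texel's actual code); it also folds in the dynamic minimum-depth adjustment that tunes the threshold to the available bandwidth:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct TTEntry { uint64_t key; int depth; int score; };

// Per-neighbor output queue: TT stores stay local and cheap; only
// sufficiently deep entries are appended here, and the queue is flushed
// to the network periodically.
struct OutputQueue {
    std::vector<TTEntry> pending;   // entries awaiting broadcast
    size_t capacity;
    int minSendDepth;               // adjusted dynamically in flush()
    OutputQueue(size_t cap, int d0) : capacity(cap), minSendDepth(d0) {}

    // Called from the TT store path: no network I/O, just a local append.
    void maybeEnqueue(const TTEntry& e) {
        if (e.depth >= minSendDepth && pending.size() < capacity)
            pending.push_back(e);
    }

    // Called periodically: hands back the batch to send to the neighbor
    // and tunes the depth threshold to the available bandwidth.
    std::vector<TTEntry> flush() {
        if (pending.size() < capacity / 2)
            --minSendDepth;         // spare bandwidth: distribute more
        else if (pending.size() >= capacity)
            ++minSendDepth;         // queue filled up: distribute less
        std::vector<TTEntry> batch;
        batch.swap(pending);        // cheap handoff, queue is now empty
        return batch;
    }
};
```

The point of the design is that the search thread never blocks on the network: the expensive part (actually transmitting the batch) happens outside the TT store path.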

This scheme works pretty well over regular Ethernet. As a rule of thumb, according to my measurements, multiplying the number of cluster nodes by 4 gives roughly the same Elo increase as doubling the thinking time.

At one point I even measured the Elo increase on a cluster of 2 nodes connected by a Wi-Fi network, each node using a 4-core i7 CPU. I did not post this result before because Wi-Fi performance is irregular where I live (there are like 20 other networks in range). Ping time was often around 2 ms, sometimes 15 ms. In a 1000-game match I measured +17 Elo with a LOS of 99.6%.


Here is a quote from the post where I described this "lazy cluster" algorithm:
petero2 wrote:
Sun Aug 06, 2017 9:15 pm
Each processor has one transposition table shared between all threads in the process. Whenever a search thread stores a result in the transposition table, it also checks if the stored depth is large enough to make it worthwhile to distribute this information to other cluster nodes. If so, the TT entry is stored in output queues containing entries to be sent to cluster node neighbors. Each cluster neighbor has its own output queue. Spare bandwidth between cluster nodes is used to transmit the queued entries to neighboring nodes. When a node receives a TT entry, it inserts it in its local transposition table, and, if the depth is large enough, also stores it in the output queues for all neighbors except the neighbor the entry came from. Since the cluster nodes form a tree structure, this ensures that each TT entry update can be distributed to all other cluster nodes, but distribution can stop at any cluster node if the TT entry depth is not large enough.

What TT entry depth is considered "large enough" for distribution to other cluster nodes is determined dynamically using a simple procedure. Each output queue is periodically transmitted to the receiving node. If the queue is less than half full when it is about to be sent, the minimum depth for that queue is decreased by 1. If on the other hand the queue becomes full before it was possible to send it to the receiving node, the minimum depth for that queue is increased by 1.

...

The transposition table design makes TT access from search threads very fast compared to more traditional distributed hash table systems where a thread has to ask another cluster node for the result of a TT probe. This is because no cluster communication is required to access a TT entry. On the other hand the asynchronous TT entry broadcasting mechanism means that there will be a delay between the time when a TT entry is created and the time when it has been distributed to all cluster nodes.
