trying to understand mcts

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: trying to understand mcts

Post by Rémi Coulom »

smatovic wrote: Wed Oct 02, 2019 10:25 pm I tried to reuse the current tree for the next move, but in Zeta v097/v098 I had
to copy the tree back n forth between cpu/gpu, and the memory got quickly
filled, so I kept it as disabled option. But in theory it should give you an
boost.
You store the tree in the GPU? I am very surprised. My intuition is that managing the search tree with the CPU should be much more efficient, but I may be wrong.
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: trying to understand mcts

Post by smatovic »

Rémi Coulom wrote: Thu Oct 03, 2019 11:56 am
smatovic wrote: Wed Oct 02, 2019 10:25 pm I tried to reuse the current tree for the next move, but in Zeta v097/v098 I had
to copy the tree back n forth between cpu/gpu, and the memory got quickly
filled, so I kept it as disabled option. But in theory it should give you an
boost.
You store the tree in the GPU? I am very surprised. My intuition is that managing the search tree with the CPU should be much more efficient, but I may be wrong.
Yes, in GPU.

Zeta v097 and v098 were an attempt to make use of thousands of parallel gpu-
threads via an parallel Best-First-MiniMax search and classic evaluation.

The aim was to avoid host-device latencies, so the whole movegen, search and
evaluation was done on gpu, only board and history and params transmitted forth,
and bestmove back.

It was experimental, and buggy, but played decent chess...

https://github.com/smatovic/Zeta/tree/v098

--
Srdja
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: trying to understand mcts

Post by Rémi Coulom »

smatovic wrote: Thu Oct 03, 2019 12:30 pm Zeta v097 and v098 were an attempt to make use of thousands of parallel gpu-
threads via an parallel Best-First-MiniMax search and classic evaluation.
Very interesting, thanks.

When using large neural networks, evaluation is so slow that data transfers between CPU and GPU have very little cost.

To go back to the main topic: I have never tried MCTS with random playouts in chess. It might be interesting and fun to try, but it probably won't be very strong. But MCTS works well with the Alpha Zero approach.

I recommend not using the UCB formula for MCTS. The formula used by Alpha Zero is different, and better.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: trying to understand mcts

Post by dragontamer5788 »

Rémi Coulom wrote: Thu Oct 03, 2019 2:43 pm
smatovic wrote: Thu Oct 03, 2019 12:30 pm Zeta v097 and v098 were an attempt to make use of thousands of parallel gpu-
threads via an parallel Best-First-MiniMax search and classic evaluation.
Very interesting, thanks.

When using large neural networks, evaluation is so slow that data transfers between CPU and GPU have very little cost.
GPUs have far higher-bandwidth RAM than the CPU. Case in point, a $400-class GPU like Vega64 has 480 GB/s bandwidth, while the NVidia 2060 SUPER (also $400) has 448 GB/s bandwidth. In contrast, typical Desktop CPUs only have 50 GB/s or less (Dual Channel DDR4 2666 == dual channel for 42GB-per-second theoretical). A Quad-channel HEDT (Threadripper, Cascade Lake X) may reach 100GB/s, but in any case, is far less bandwidth than a GPU.

Note: the latency of GPUs is much worse than CPUs, hundreds of cycles (and slower-cycles at that) to access L2 cache, let alone its main memory. The only way you actually benefit from this increased bandwidth is if you can parallelize the traversal. I have seen some papers with various parallel MCTS traversals proposed, but I think its probably a research problem worth exploring.

-------

I agree with you however, that large neural nets are too slow for MCTS to achieve much benefit from this. But perhaps "lighter" scale MCTS (kind of a fast-and-light approach like Stockfish: targeting Mega-Nodes/second or Giga-nodes/second speeds) would become possible if MCTS scaled up in speed a bit better.

With only 100s Kilo-nodes/second of typical neural nets, that's slow enough for PCIe to work out, so you can do MCTS on CPUs just fine.