Multithreaded batching on GPU for montecarlo and also alpha-beta

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Multithreaded batching on GPU for montecarlo and also alpha-beta

Post by Daniel Shawul »

I was experimenting on multi-threaded batching for inference on the GPU and it works really well.
It can be used with any multi-threaded search algorithm without forcing you to rewrite your algorithm.

Note that Lc0 currently uses a single-threaded batching method that forces you to rewrite your search, while A0 probably uses
tensorflow-serving with multi-threaded batching. The AlphaGo Zero paper mentions they only need to batch 8 positions on a 20x256 or 40x256 network
to be efficient. What Lc0 does is the following: a single search thread "starts" multiple Monte Carlo simulations without actually finishing them (i.e. the NN eval at the tip is not yet complete). This is done for the specified mini-batch size (128 or so), followed by a network evaluation
of all positions simultaneously, and finally an update of the tree with the results of the evaluations at the tips.
This is very involved and requires rewriting the MCTS algorithm, and in fact I got tired of trying to make this work in my code,
which has all sorts of MCTS + alpha-beta algorithms.
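For illustration, the single-threaded scheme described above can be sketched roughly like this (hypothetical helper names, not the actual Lc0 code): one thread descends the tree `batch_size` times, parking each leaf instead of evaluating it, then runs one batched forward pass and backs the results up.

```python
# Sketch of single-threaded batching for MCTS (illustrative only).
# select_leaf is assumed to apply virtual loss on the way down, and
# backup to undo it while updating node statistics.

def search_batched(root, select_leaf, nn_eval_batch, backup, batch_size=128):
    pending = []
    for _ in range(batch_size):
        leaf = select_leaf(root)      # start a simulation, defer the NN eval
        pending.append(leaf)
    values = nn_eval_batch(pending)   # one batched forward pass for all tips
    for leaf, value in zip(pending, values):
        backup(leaf, value)           # finish each simulation with its result
    return root
```

The rewrite burden the post describes comes from having to split a simulation into a "start" half and a "finish" half around the deferred evaluation.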

On the other hand, multi-threaded batching doesn't require rewriting the algorithm and can be used even with alpha-beta,
while also being more efficient at collecting batches for evaluation. Each searcher thread requests evaluation of a single
position (eval()) and then blocks until it gets the result. The server (which in my case is egbbdll) batches requests from multiple
threads (one from each thread), does the batch evaluation, and returns the result to each thread. The thread blocking is
actually done by egbbdll, so the chess-playing program doesn't have to do anything special.
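A minimal sketch of this server-side batching, assuming a fixed batch size (illustrative only; `BatchServer` and its methods are hypothetical names, not the egbbdll API): each searcher thread submits one position and blocks; the thread that completes the batch runs the forward pass and wakes the others.

```python
import threading

class _Request:
    def __init__(self, pos):
        self.pos = pos
        self.done = threading.Event()
        self.value = None

class BatchServer:
    """Collects one eval request per searcher thread; the last arrival
    evaluates the whole batch in a single forward pass."""

    def __init__(self, nn_eval_batch, batch_size):
        self.nn_eval_batch = nn_eval_batch
        self.batch_size = batch_size
        self.lock = threading.Lock()
        self.pending = []

    def eval(self, pos):
        req = _Request(pos)
        batch = None
        with self.lock:
            self.pending.append(req)
            if len(self.pending) == self.batch_size:
                batch, self.pending = self.pending, []
        if batch is not None:
            # this thread completed the batch: evaluate everything at once
            values = self.nn_eval_batch([r.pos for r in batch])
            for r, v in zip(batch, values):
                r.value = v
                r.done.set()
        req.done.wait()   # block until our batch has been evaluated
        return req.value
```

A real server would also need a timeout so a partially filled batch still gets evaluated when fewer than `batch_size` threads are searching.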

It turns out I can launch significantly more threads than the number of available cores and be about as efficient as when
each searcher thread has its own core. For example, on a network of size 12x128, a search with 4 cores oversubscribed with
128 threads gives the same nps as 32 cores with 128 threads. The GPU is a Tesla P100, and the CPU is a 32-core Intel Xeon (two 16-core CPUs in two sockets).

Results:

Playouts per second on the CPU (batching doesn't help here)

Code: Select all

1-thread = 33
n-threads = <33
Playouts per second on the GPU

Code: Select all

1-thread = 226
128-threads using all 64 cores = 2872
128-threads using just 4 cores = 2575
I get about a 2872/33 = 87x speedup relative to a single-core CPU, of which about 2872/226 = 12x comes from batching.

Detailed results
============

1-core CPU

Code: Select all

$ ./scorpio use_nn 1 mt 1 montecarlo 1 frac_alphabeta 0 backup_type 4 book off go quit
feature done=0
ht 4194304 X 16 = 64.0 MB
eht 524288 X 8 = 8.0 MB
pht 32768 X 24 = 0.8 MB
treeht 419430400 X 32 = 12800.0 MB
processors [1]
processors [1]
EgbbProbe 4.1 by Daniel Shawul
180 egbbs loaded !      
Loading neural network....
Neural network loaded !      
loading_time = 1s
[st = 11114ms, mt = 29250ms , hply = 0 , moves_left 10]
63 0 111 1071  e2-e4 e7-e5
64 0 225 2707  e2-e4 e7-e5
65 0 339 4479  d2-d4 d7-d5 Nb1-c3
66 0 452 6164  d2-d4 d7-d5 Nb1-c3 Ng8-f6
67 0 565 8093  d2-d4 d7-d5 Nb1-c3 Ng8-f6
68 0 677 11687  e2-e4 e7-e5 Ng1-f3 Nb8-c6
69 0 788 16331  d2-d4 d7-d5 Nb1-c3 Ng8-f6 e2-e3
70 0 899 20810  e2-e4 e7-e5 Ng1-f3 Ng8-f6 d2-d3
71 0 1012 26998  e2-e4 e7-e5 Ng1-f3 Ng8-f6 Nb1-c3 Nb8-c6

#  1      0      62 e2-e4 e7-e5 d2-d4 d7-d5 Ng1-f3 d5xe4 Nf3xe5 Ng8-f6 Nb1-c3
#  2      0      62 d2-d4 d7-d5 Nb1-c3 Ng8-f6 Ng1-f3 Nb8-c6 h2-h3 e7-e6 e2-e3
#  3      0      24 Nb1-c3 d7-d5 e2-e4 d5xe4 Nc3xe4 e7-e5 Ng1-f3 Nb8-c6
#  4      0      32 Ng1-f3 d7-d5 d2-d3 e7-e6 e2-e4 Nb8-c6 Nb1-c3
#  5      0      23 e2-e3 d7-d5 Ng1-f3 Ng8-f6 Nb1-c3 e7-e6 d2-d4 Nb8-c6 h2-h3
#  6      0      28 d2-d3 e7-e5 Ng1-f3 Nb8-c6 e2-e4 Ng8-f6 Nb1-c3 d7-d5 Nc3xd5
#  7      0      23 g2-g3 e7-e5 e2-e4 d7-d5 e4xd5 Qd8xd5 f2-f3 Nb8-c6 Nb1-c3 Qd5-d4 d2-d3
#  8     -3      12 b2-b3 d7-d5 Ng1-f3 Ng8-f6 d2-d3
#  9     -5      10 f2-f3 d7-d5 d2-d4 Ng8-f6
# 10     -9       8 h2-h3 d7-d5 d2-d4 Ng8-f6
# 11     -7       9 c2-c3 d7-d5 d2-d4
# 12    -11       8 a2-a3 d7-d5 d2-d4
# 13    -11       8 Nb1-a3 d7-d5 d2-d4 e7-e6
# 14     -6      10 Ng1-h3 e7-e5 e2-e4 Ng8-f6
# 15    -11       8 f2-f4 Ng8-f6 Ng1-f3 d7-d5
# 16      0      37 c2-c4 e7-e5 d2-d3 Ng8-f6 Ng1-f3 Nb8-c6 Nb1-c3 d7-d6 e2-e4 Bc8-e6 Bc1-e3
# 17    -15       7 h2-h4 d7-d5 d2-d4
# 18    -13       7 a2-a4 d7-d5 d2-d4
# 19    -15       7 g2-g4 d7-d5 e2-e3
# 20    -22       6 b2-b4 e7-e5 Bc1-b2

nodes = 37778 <95% qnodes> time = 11149ms nps = 3388 eps = 2865 nneps = 30
Tree: nodes = 10070 depth = 10 pps = 33 visits = 372 
      qsearch_calls = 9702 search_calls = 0
move e2e4
Bye Bye
1-thread GPU

Code: Select all

$ ./scorpio use_nn 1 mt 1 montecarlo 1 frac_alphabeta 0 backup_type 4 go quit
feature done=0
ht 4194304 X 16 = 64.0 MB
eht 524288 X 8 = 8.0 MB
pht 32768 X 24 = 0.8 MB
treeht 419430400 X 32 = 12800.0 MB
processors [1]
processors [1]
EgbbProbe 4.1 by Daniel Shawul
0 egbbs loaded !      
Loading neural network....
Neural network loaded !      
loading_time = 6s
[st = 11114ms, mt = 29250ms , hply = 0 , moves_left 10]
63 0 111 8032  d2-d4 d7-d5 e2-e3
64 0 222 16376  d2-d4 d7-d5 Nb1-c3 Ng8-f6
65 0 333 28681  d2-d4 d7-d5 Nb1-c3 Ng8-f6
66 0 445 41847  e2-e4 e7-e5 Ng1-f3 Nb8-c6
67 0 557 55091  d2-d4 d7-d5 Nb1-c3 Ng8-f6
68 0 668 74011  d2-d4 d7-d5 Nb1-c3 Ng8-f6 e2-e3
69 0 779 95658  e2-e4 e7-e5 Ng1-f3 Ng8-f6 Nb1-c3
70 0 891 123039  e2-e4 e7-e5 Ng1-f3 Ng8-f6 Nb1-c3
71 0 1002 166551  e2-e4 e7-e5 Ng1-f3 Ng8-f6 d2-d3

#  1      0     127 e2-e4 e7-e5 d2-d4 d7-d5 Ng1-f3 d5xe4 Nf3xe5 Ng8-f6 Nb1-c3
#  2      0     127 d2-d4 d7-d5 Nb1-c3 Ng8-f6 Ng1-f3 Nb8-c6 h2-h3 e7-e6 e2-e3 h7-h6 a2-a3
#  3      0     127 Nb1-c3 d7-d5 d2-d4 Ng8-f6 Ng1-f3 Nb8-c6 h2-h3 e7-e6 e2-e3 h7-h6
#  4      0     127 Ng1-f3 d7-d5 d2-d3 e7-e6 Nb1-c3 Nb8-c6 e2-e4 Ng8-f6 h2-h3 Bf8-d6
#  5      0     127 e2-e3 d7-d5 Ng1-f3 Ng8-f6 Nb1-c3 e7-e6 d2-d4 Nb8-c6 h2-h3 h7-h6 a2-a3 a7-a6 Bf1-d3
#  6      0     127 d2-d3 e7-e5 Ng1-f3 Nb8-c6 e2-e4 Ng8-f6 Nb1-c3 d7-d5
#  7      0     127 g2-g3 e7-e5 Ng1-f3 Nb8-c6 d2-d3 Ng8-f6 Bf1-g2 d7-d5 Ke1-g1 Bc8-e6 Bc1-d2 h7-h6 Nb1-c3 a7-a6 e2-e4 d5-d4 Nc3-e2 Bf8-d6
#  8      0     127 b2-b3 d7-d5 d2-d4 Ng8-f6 Ng1-f3 Nb8-c6 e2-e3 e7-e6 Bc1-b2
#  9      0     127 f2-f3 e7-e5 e2-e4 Nb8-c6 Nb1-c3 Ng8-f6 a2-a3 d7-d5 Nc3xd5
# 10      0     127 h2-h3 e7-e5 Ng1-f3 Nb8-c6 e2-e4 d7-d5 e4xd5 Qd8xd5 Nb1-c3 Qd5-d8 Bf1-d3
# 11      0     127 c2-c3 d7-d5 d2-d4 Ng8-f6 Ng1-f3 Nb8-c6 h2-h3 Bc8-f5 e2-e3 e7-e6 Nb1-d2
# 12      0     127 a2-a3 e7-e5 e2-e4 d7-d6 Nb1-c3 Ng8-f6 d2-d4 e5xd4 Qd1xd4 Nb8-c6 Qd4-d1
# 13      0     126 Nb1-a3 e7-e5 e2-e4 d7-d5 e4xd5 Qd8xd5 d2-d3 Bf8xa3 b2xa3 Nb8-c6 Bc1-e3
# 14      0     126 Ng1-h3 e7-e5 e2-e4 Ng8-f6 Nb1-c3 d7-d5 e4xd5 Bc8xh3 g2xh3 Nf6xd5 d2-d4 Nd5xc3 b2xc3
# 15      0     126 f2-f4 Ng8-f6 Nb1-c3 d7-d5 e2-e3 Nb8-c6 d2-d4 a7-a6 a2-a3 e7-e6 Ng1-f3 h7-h6 Bf1-d3 Bf8-d6 Ke1-g1 Ke8-g8 Bc1-d2
# 16      0     126 c2-c4 e7-e5 Ng1-f3 e5-e4 Nf3-e5 d7-d6 Qd1-a4 Nb8-d7 Ne5-g4
# 17      0     126 h2-h4 e7-e5 Ng1-f3 Bf8-d6 d2-d3 Nb8-c6 e2-e4 Ng8-f6 Nb1-c3 Ke8-g8
# 18      0     126 a2-a4 e7-e5 e2-e4 Ng8-f6 d2-d3 d7-d5 e4xd5 Nf6xd5 Ng1-f3 Bf8-b4 Bc1-d2 f7-f6
# 19      0     126 g2-g4 d7-d5 d2-d4 Nb8-c6 e2-e3 e7-e5 Nb1-c3 Ng8-f6 d4xe5 Nf6xg4 Qd1xd5 Nc6xe5 Qd5-d4 Qd8-f6 Qd4-f4
# 20      0     126 b2-b4 e7-e5 Bc1-b2 Ng8-f6 e2-e3 d7-d5 Bb2xe5 Bf8xb4 Ng1-f3 Nb8-d7 Bf1-b5 a7-a6 a2-a3 a6xb5 a3xb4 Ra8-a4 Nb1-c3 Ra4xa1 Qd1xa1

nodes = 231367 <94% qnodes> time = 11118ms nps = 20810 eps = 17422 nneps = 191
Tree: nodes = 68047 depth = 18 pps = 226 visits = 2513 
      qsearch_calls = 65549 search_calls = 0
move e2e4
Bye Bye
128 threads on 32 cores using GPU

Code: Select all

 
$ ./scorpio use_nn 1 mt 128 montecarlo 1 frac_alphabeta 0 backup_type 4 go quit
feature done=0
ht 4194304 X 16 = 64.0 MB
eht 524288 X 8 = 8.0 MB
pht 32768 X 24 = 0.8 MB
treeht 419430400 X 32 = 12800.0 MB
processors [1]
processors [128]
EgbbProbe 4.1 by Daniel Shawul
0 egbbs loaded !      
Loading neural network....
Neural network loaded !      
loading_time = 7s
[st = 11114ms, mt = 29250ms , hply = 0 , moves_left 10]
63 0 148 115  e2-e4 e7-e5 d2-d3
64 0 261 1583  e2-e4 Ng8-f6 e4-e5 Nf6-e4 Nb1-c3
65 0 374 2702  e2-e4 d7-d5 d2-d3 Ng8-f6 Nb1-c3 d5xe4 d3xe4 Qd8xd1 Ke1xd1
66 0 489 4878  e2-e4 e7-e5 Ng1-f3 Ng8-f6 Nb1-c3
67 0 605 7938  e2-e4 e7-e5 d2-d3 Nb8-c6 Ng1-f3
68 0 719 10648  e2-e4 e7-e5 d2-d3 Nb8-c6 Ng1-f3
69 0 831 13836  e2-e4 e7-e5 d2-d4 d7-d5 Ng1-f3 d5xe4 Nf3xe5 Ng8-f6
70 0 945 18123  e2-e4 e7-e5 d2-d4 d7-d5 Ng1-f3 d5xe4 Nf3xe5 Ng8-f6 Nb1-c3
71 0 1061 22783  e2-e4 e7-e5 d2-d4 d7-d5 Ng1-f3 d5xe4 Nf3xe5 Ng8-f6 Bf1-c4 Bf8-b4 c2-c3 Bb4-d6 Bc4xf7 Ke8-e7

#  1      0    1626 e2-e4 e7-e5 d2-d4 d7-d5 Ng1-f3 d5xe4 Nf3xe5 Ng8-f6 Bf1-c4 Bf8-b4 c2-c3 Bb4-d6 Ne5xf7
#  2      0    1588 d2-d4 d7-d5 Nb1-c3 Ng8-f6 Ng1-f3 Nb8-c6 h2-h3 e7-e6 e2-e3 h7-h6 a2-a3 a7-a6
#  3      0    1655 Nb1-c3 d7-d5 d2-d4 Ng8-f6 Ng1-f3 Nb8-c6 h2-h3 e7-e6 e2-e3 h7-h6 a2-a3 a7-a6 Bf1-d3 Bf8-d6 Ke1-g1
#  4      0    1577 Ng1-f3 d7-d5 d2-d4 Ng8-f6 c2-c3 Nb8-c6 h2-h3 Bc8-f5 e2-e3 e7-e6 Nb1-d2
#  5      0    1610 e2-e3 d7-d5 Ng1-f3 Nb8-c6 Nb1-c3 e7-e5 d2-d4 e5-e4 Nf3-d2 Ng8-f6 Bf1-e2
#  6      0    1606 d2-d3 e7-e5 e2-e4 Nb8-c6 Bc1-e3 d7-d6 Nb1-c3 Ng8-f6 Ng1-f3 Bc8-e6 a2-a3 h7-h6 h2-h3
#  7      0    1645 g2-g3 d7-d5 d2-d4 Nb8-c6 Ng1-f3 Ng8-f6 Bf1-g2 h7-h6 Nb1-c3 e7-e6 Ke1-g1 a7-a6 a2-a3 Bf8-d6
#  8      0    1621 b2-b3 e7-e5 e2-e4 d7-d6 Bc1-b2 Ng8-f6 Nb1-c3 Nb8-c6 Ng1-f3
#  9      0    1645 f2-f3 e7-e5 e2-e4 Bf8-c5 d2-d3 d7-d5
# 10      0    1645 h2-h3 e7-e5 e2-e4 d7-d5 d2-d4 e5xd4 e4xd5 Bf8-b4 Nb1-d2 Qd8xd5 Ng1-e2 Qd5-e4
# 11      0    1631 c2-c3 d7-d5 d2-d4 Ng8-f6 Ng1-f3 Nb8-c6 h2-h3 e7-e6
# 12      0    1620 a2-a3 e7-e5 e2-e4 d7-d6 Nb1-c3 Ng8-f6 Ng1-f3 Nb8-c6
# 13      0    1607 Nb1-a3 e7-e5 e2-e4 d7-d5 e4xd5 Qd8xd5 d2-d3 Bf8-b4 c2-c3
# 14      0    1566 Ng1-h3 d7-d5 d2-d4 Nb8-c6 Nh3-g5 Ng8-f6 Nb1-c3 e7-e6
# 15      0    1619 f2-f4 d7-d5 d2-d3 g7-g6 e2-e4
# 16      0    1634 c2-c4 e7-e5 d2-d3 Ng8-f6 Nb1-c3 d7-d6 Ng1-f3 Nb8-c6
# 17      0    1544 h2-h4 e7-e5 e2-e4 Ng8-f6 Ng1-f3 d7-d6
# 18      0    1631 a2-a4 e7-e5 e2-e4 d7-d5 e4xd5 Qd8xd5 Nb1-c3 Qd5-d8
# 19      0    1561 g2-g4 e7-e5 e2-e4 d7-d5 e4xd5 Qd8xd5 f2-f3 Bc8xg4 Nb1-c3
# 20      0    1635 b2-b4 d7-d5 d2-d3 Ng8-f6 Ng1-f3 e7-e5 b4-b5 Nb8-d7 Bc1-b2 Bf8-b4 Nb1-c3 Ke8-g8 h2-h3 c7-c6 b5xc6

nodes = 10380412 <29% qnodes> time = 11226ms nps = 924675 eps = 234964 nneps = 2857
Tree: nodes = 884071 depth = 21 pps = 2872 visits = 32247 
      qsearch_calls = 6731 search_calls = 0
move e2e4
Bye Bye
128 threads on 4 cores using GPU. I use taskset to restrict the process to just 4 cores

Code: Select all

$ taskset f0000000 ./scorpio use_nn 1 mt 128 montecarlo 1 frac_alphabeta 0 backup_type 4 go quit
feature done=0
ht 4194304 X 16 = 64.0 MB
eht 524288 X 8 = 8.0 MB
pht 32768 X 24 = 0.8 MB
treeht 419430400 X 32 = 12800.0 MB
processors [1]
processors [128]
EgbbProbe 4.1 by Daniel Shawul
0 egbbs loaded !      
Loading neural network....
Neural network loaded !      
loading_time = 6s
[st = 11114ms, mt = 29250ms , hply = 0 , moves_left 10]
63 0 112 96  Nb1-c3 d7-d5 Ng1-f3
64 0 226 1150  e2-e4 e7-e5 Ng1-f3 Nb8-c6
65 0 344 2541  e2-e4 d7-d5 d2-d3 Ng8-f6 Nb1-c3
66 0 466 5004  e2-e4 d7-d5 d2-d3 Ng8-f6 Nb1-c3 d5xe4 d3xe4 Qd8xd1 Ke1xd1
67 0 580 6356  e2-e4 e7-e5 d2-d3 Nb8-c6 Ng1-f3
68 0 693 8331  e2-e4 e7-e5 d2-d3 Nb8-c6 Ng1-f3
69 0 808 10945  e2-e4 e7-e5 d2-d3 Nb8-c6 Ng1-f3 Ng8-f6
70 0 920 14370  e2-e4 e7-e5 d2-d4 d7-d5 Ng1-f3 d5xe4 Nf3xe5
71 0 1031 18653  e2-e4 e7-e5 d2-d4 e5xd4 Qd1xd4 Nb8-c6 Qd4-d1 d7-d6 Ng1-f3 Ng8-f6 Nb1-c3

#  1      0    1371 e2-e4 e7-e5 d2-d4 e5xd4 Qd1xd4 Nb8-c6 Qd4-d1 Bf8-d6 Ng1-f3 Ng8-f6 Nb1-c3 Ke8-g8 Bc1-e3 a7-a6 a2-a3 Rf8-e8 Bf1-d3 b7-b6
#  2      0    1091 d2-d4 d7-d5 Nb1-c3 Ng8-f6 Ng1-f3 Nb8-c6 h2-h3 e7-e6 e2-e3 h7-h6 a2-a3 a7-a6 Bf1-d3 Bf8-d6
#  3      0     968 Nb1-c3 d7-d5 d2-d4 Ng8-f6 Ng1-f3 Nb8-c6 h2-h3
#  4      0    1700 Ng1-f3 d7-d5 d2-d4 Ng8-f6 h2-h3 Nb8-c6 Nb1-c3 e7-e6 e2-e3 h7-h6 a2-a3 a7-a6 Bf1-d3 Bf8-d6 Ke1-g1 Ke8-g8 Bc1-d2 Bc8-d7 e3-e4 Nf6xe4
#  5      0    1595 e2-e3 d7-d5 d2-d4 Ng8-f6 Bf1-d3 Nb8-c6 Ng1-f3 e7-e6
#  6      0    1085 d2-d3 e7-e5 e2-e4 Nb8-c6 Bc1-e3 d7-d6 Nb1-c3 Ng8-f6 Ng1-f3 Bc8-e6
#  7      0    1684 g2-g3 d7-d5 d2-d4 Nb8-c6 Ng1-f3 Ng8-f6 Bf1-g2 h7-h6 Nb1-c3 e7-e6 Ke1-g1 a7-a6
#  8      0    1970 b2-b3 e7-e5 e2-e4 d7-d6 Bc1-b2 Ng8-f6 Nb1-c3 Nb8-c6 Ng1-f3 Bc8-e6 d2-d4 Nc6xd4
#  9      0    1364 f2-f3 e7-e5 e2-e4 Bf8-c5 d2-d3 d7-d5
# 10      0    1678 h2-h3 e7-e5 e2-e4 d7-d5 e4xd5 Qd8xd5 Nb1-c3
# 11      0    1692 c2-c3 d7-d5 d2-d4 Ng8-f6 Ng1-f3 Nb8-c6 h2-h3
# 12      0    1600 a2-a3 e7-e5 e2-e4 d7-d6 Nb1-c3 Ng8-f6 Ng1-f3 Nb8-c6 d2-d4
# 13      0    1932 Nb1-a3 e7-e5 e2-e4 d7-d5 e4xd5 Qd8xd5 d2-d3 Ng8-f6 Bc1-d2 Bf8xa3 b2xa3 Ke8-g8 Ng1-e2 Bc8-g4 f2-f3
# 14      0    1478 Ng1-h3 d7-d5 d2-d4 Nb8-c6 Nb1-c3 e7-e6 e2-e3 a7-a6 Nh3-f4
# 15      0    1745 f2-f4 d7-d5 d2-d3 Ng8-f6 Ng1-f3 Nb8-c6
# 16      0    1383 c2-c4 e7-e5 Ng1-f3 e5-e4 Nf3-e5 d7-d6 Qd1-a4 Nb8-d7 Ne5-g4 h7-h5
# 17      0    1323 h2-h4 e7-e5 e2-e4 Ng8-f6 Ng1-f3 Nb8-c6 Nb1-c3
# 18      0    1352 a2-a4 e7-e5 e2-e4 d7-d5 d2-d4 d5xe4 d4xe5
# 19      0    1839 g2-g4 d7-d5 d2-d4 Nb8-c6 e2-e3 e7-e5 Nb1-c3 Ng8-f6 d4xe5 Nf6xg4 Qd1xd5 Nc6xe5 e3-e4 Qd8-f6 Qd5-d4 c7-c6
# 20      0    1356 b2-b4 e7-e5 Bc1-b2 e5-e4 e2-e3 d7-d5 Ng1-e2 Ng8-f6 b4-b5 Bf8-d6 d2-d4 Ke8-g8 Nb1-c3 c7-c6 f2-f3 Rf8-e8 f3xe4 d5xe4

nodes = 8396172 <35% qnodes> time = 11722ms nps = 716274 eps = 218633 nneps = 2538
Tree: nodes = 821310 depth = 22 pps = 2575 visits = 30187 
      qsearch_calls = 6038 search_calls = 0
move e2e4
Bye Bye
I have noted that using sched_yield() versus usleep(0) to block threads results in different behaviour, though
I expected them to behave the same. If I use the former, I don't get any benefit from batching. Does anybody
know the exact difference between sched_yield() and usleep(0), or on Windows between Sleep(0) and SwitchToThread()?

regards,
Daniel
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Multithreaded batching on GPU for montecarlo and also alpha-beta

Post by mar »

Daniel Shawul wrote: Mon Sep 03, 2018 4:31 pm Does anybody
know the exact difference between sched_yield() and usleep(0) or in windows Sleep(0) and SwitchToThread()?
I don't know about Linux, but there's an attempt to recreate an open-source version of Windows (including the kernel) called ReactOS.
I'm not sure how closely it resembles the Windows kernel/scheduler at a low level, but here are some links to how SwitchToThread and Sleep are implemented,
if you feel adventurous enough to dig through low-level stuff:
SwitchToThread:
https://doxygen.reactos.org/d0/d85/dll_ ... ource.html
Sleep:
https://doxygen.reactos.org/d1/d0d/synch_8c_source.html
Martin Sedlak
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Multithreaded batching on GPU for montecarlo and also alpha-beta

Post by Joost Buijs »

Daniel Shawul wrote: Mon Sep 03, 2018 4:31 pm
I have noted that using sched_yield() versus usleep(0) to block threads results in different behaviour, though
I expected that they will be same. If I use the former, I don't get any benefits from batching. Does anybody
know the exact difference between sched_yield() and usleep(0) or in windows Sleep(0) and SwitchToThread()?
For Windows Microsoft says this about it:

Sleep()

A value of zero causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run. If there are no other threads of equal priority ready to run, the function returns immediately, and the thread continues execution.

SwitchToThread()

Causes the calling thread to yield execution to another thread that is ready to run on the current processor. The operating system selects the next thread to be executed. If there are no other threads ready to execute, the operating system does not switch execution to another thread.

The difference seems to be that Sleep(0) switches only to threads with the same priority (on any processor), while SwitchToThread() switches only to a thread on the same processor (in case of a multiprocessor setup), and that can be a thread of any priority, as determined by the OS.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Multithreaded batching on GPU for montecarlo and also alpha-beta

Post by Daniel Shawul »

Thanks Martin & Joost. Indeed, that slight difference seems to be what was causing the problem.
Ideally I would like any thread on any core to take up the time slice given up by a blocked thread.
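That kind of blocking can be sketched with a condition variable (illustrative only, not what egbbdll actually does): a waiter on a condition variable is descheduled by the kernel, so its time slice is available to any runnable thread on any core, which neither sched_yield() (which only yields to threads that are already runnable) nor usleep(0)/Sleep(0) (whose semantics vary by OS and priority) reliably guarantees.

```python
import threading

class ResultSlot:
    """One blocking result slot per searcher thread. wait() parks the
    thread in the kernel until post() delivers a value, freeing the
    core for whichever thread the scheduler picks next."""

    def __init__(self):
        self._cond = threading.Condition()
        self._ready = False
        self._value = None

    def wait(self):
        with self._cond:
            self._cond.wait_for(lambda: self._ready)  # true blocking, no spinning
            return self._value

    def post(self, value):
        with self._cond:
            self._value = value
            self._ready = True
            self._cond.notify_all()
```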

Btw, here is an example of a heavily pruned alpha-beta search tree with a 12x128 NN eval. I need to prune
heavily because, even with a GPU, full-width search is infeasible. I get similar evaluations/sec as an MCTS search.

Using SHT parallelization I get 3k nneps, which is even higher than the MCTS pps (probably due to luck)

Code: Select all

$./scorpio use_nn 1 st 20 mt 128 smp_type=SHT montecarlo 0 go quit
feature done=0
ht 4194304 X 16 = 64.0 MB
eht 524288 X 8 = 8.0 MB
pht 32768 X 24 = 0.8 MB
treeht 419430400 X 32 = 12800.0 MB
processors [1]
processors [128]
EgbbProbe 4.1 by Daniel Shawul
0 egbbs loaded !      
Loading neural network....
Neural network loaded !      
loading_time = 7s
[st = 20000ms, mt = 20000ms , hply = 0]
2 32 7 236  e2-e4 e7-e5
3 -50 16 778  e2-e4 e7-e5 d2-d4 e5xd4 Qd1xd4
4 33 20 1202  e2-e4 e7-e5 d2-d4 e5xd4 Qd1xd4
5 -6 46 4078  e2-e4 d7-d5 e4xd5 Qd8xd5 d2-d4 Qd5-e4 Ng1-e2
6 28 82 7133  e2-e4 d7-d5 e4xd5 Qd8xd5 d2-d4 Qd5-e4 Ng1-e2
6 75 102 12736  d2-d4 d7-d5 e2-e4 d5xe4 Ng1-e2 e7-e5 d4xe5
7 5 146 15984  d2-d4 d7-d5 e2-e4 d5xe4 Ng1-e2 e7-e5 d4xe5
7 13 172 16013  e2-e4 e7-e5 d2-d4 Ng8-f6 d4xe5 Nf6xe4 Ng1-e2
8 30 225 28583  e2-e4 e7-e5 Ng1-e2 Ng8-f6 Nb1-c3 d7-d5 e4xd5 Nf6xd5 Nc3xd5 Qd8xd5
9 -10 424 59729  e2-e4 Ng8-f6 d2-d4 Nf6xe4 Nb1-c3 Ne4xc3 b2xc3 d7-d5 Ng1-e2
9 -2 498 59837  d2-d4 Nb8-c6 e2-e4 Ng8-f6 Ng1-e2 d7-d5 e4xd5 Nf6xd5 Nb1-c3 Nd5xc3 b2xc3
10 28 868 136530  d2-d4 Nb8-c6 f2-f3 d7-d5 e2-e4 Ng8-f6 Ng1-e2 d5xe4 f3xe4 Nf6xe4
11 20 1166 173392  d2-d4 Nb8-c6 Nb1-c3 Ng8-f6 e2-e4 d7-d5 e4xd5 Nf6xd5 Nc3xd5 Qd8xd5 Ng1-e2 e7-e5
12 8 1978 251268  d2-d4 Nb8-c6 e2-e3 d7-d5 f2-f3 Ng8-f6 Ng1-e2 e7-e5 e3-e4 d5xe4 f3xe4 e5xd4
splits = 0 badsplits = 0 egbb_probes = 0
nodes = 405295 <25 qnodes> time = 23414ms nps = 17309 eps = 21689  nneps = 3168
move d2d4
Bye Bye
The story is different for YBW, because there are dependencies between search threads, so we never really get to evaluate 128
positions simultaneously. It doesn't scale well above 2 threads, where nneps = 307 -- nearly a 10x slowdown from SHT.
Although I think this can be improved by batching multiple evaluation requests from each thread.

Code: Select all

$ ./scorpio use_nn 1 st 20 mt 1 smp_type=YBW montecarlo 0 go quit
feature done=0
ht 4194304 X 16 = 64.0 MB
eht 524288 X 8 = 8.0 MB
pht 32768 X 24 = 0.8 MB
treeht 419430400 X 32 = 12800.0 MB
processors [1]
processors [1]
EgbbProbe 4.1 by Daniel Shawul
0 egbbs loaded !      
Loading neural network....
Neural network loaded !      
loading_time = 6s
[st = 20000ms, mt = 20000ms , hply = 0]
2 32 1 23  e2-e4 e7-e5
3 -50 1 36  e2-e4 e7-e5 d2-d4 e5xd4 Qd1xd4
4 33 2 41  e2-e4 e7-e5 d2-d4 e5xd4 Qd1xd4
5 -6 4 79  e2-e4 d7-d5 e4xd5 Qd8xd5 d2-d4 Qd5-e4 Ng1-e2
6 28 7 114  e2-e4 d7-d5 e4xd5 Qd8xd5 d2-d4 Qd5-e4 Ng1-e2
7 -13 11 168  e2-e4 d7-d5 e4xd5 e7-e5 d5xe6 f7xe6 Ng1-e2
7 22 14 205  d2-d4 e7-e5 d4xe5 d7-d5 e5xd6 Bf8xd6 e2-e4
8 26 23 307  d2-d4 e7-e5 d4xe5 d7-d5 e5xd6 Bf8xd6 e2-e4 Ng8-f6
9 -16 70 973  d2-d4 d7-d5 Nb1-c3 Nb8-c6 e2-e4 d5xe4 Nc3xe4 Nc6xd4 Ne4-c3
9 11 86 1232  e2-e4 e7-e5 Nb1-c3 d7-d5 Nc3xd5 Nb8-c6 d2-d4 e5xd4 Nd5-c3
10 9 221 3058  e2-e4 e7-e5 Nb1-c3 Ng8-f6 Ng1-e2 Bf8-e7 d2-d4 e5xd4 Ne2xd4 Nb8-c6 Nd4xc6 d7xc6
11 0 518 6711  e2-e4 d7-d6 d2-d4 Ng8-f6 Ng1-e2 Nf6xe4 Nb1-c3 Ne4xc3 b2xc3 Nb8-c6 Bc1-b2
11 3 571 7404  Ng1-f3 d7-d5 Nb1-c3 Bc8-d7 Nc3xd5 Nb8-c6 d2-d4 Ng8-f6 Nd5xf6 g7xf6 Bc1-d2 e7-e5
12 20 740 9711  Ng1-f3 Ng8-f6 Nb1-c3 d7-d6 d2-d4 d6-d5 e2-e3 Nb8-c6 h2-h3 e7-e5 d4xe5 Bf8-e7
13 16 975 13068  Ng1-f3 d7-d5 d2-d4 Nb8-c6 Nb1-c3 e7-e6 Bc1-d2 Ng8-f6 h2-h3 Bf8-e7 e2-e4 d5xe4 Nc3-e2
14 28 2242 29468  Ng1-f3 Ng8-f6 d2-d4 e7-e6 Nb1-c3 h7-h6 g2-g3 d7-d5 Bc1-d2 Bf8-e7 Bf1-g2 Ke8-g8 Ke1-g1 Nb8-c6
splits = 0 badsplits = 0 egbb_probes = 0
nodes = 30014 <34 qnodes> time = 22840ms nps = 1314 eps = 1768  nneps = 196
move g1f3
Bye Bye
$ ./scorpio use_nn 1 st 20 mt 2 smp_type=YBW montecarlo 0 go quit
feature done=0
ht 4194304 X 16 = 64.0 MB
eht 524288 X 8 = 8.0 MB
pht 32768 X 24 = 0.8 MB
treeht 419430400 X 32 = 12800.0 MB
processors [1]
processors [2]
EgbbProbe 4.1 by Daniel Shawul
0 egbbs loaded !      
Loading neural network....
Neural network loaded !      
loading_time = 6s
[st = 20000ms, mt = 20000ms , hply = 0]
2 32 1 23  e2-e4 e7-e5
3 -50 2 36  e2-e4 e7-e5 d2-d4 e5xd4 Qd1xd4
4 33 2 41  e2-e4 e7-e5 d2-d4 e5xd4 Qd1xd4
5 -6 4 79  e2-e4 d7-d5 e4xd5 Qd8xd5 d2-d4 Qd5-e4 Ng1-e2
6 28 7 114  e2-e4 d7-d5 e4xd5 Qd8xd5 d2-d4 Qd5-e4 Ng1-e2
7 -13 12 168  e2-e4 d7-d5 e4xd5 e7-e5 d5xe6 f7xe6 Ng1-e2
7 22 15 205  d2-d4 e7-e5 d4xe5 d7-d5 e5xd6 Bf8xd6 e2-e4
8 26 24 307  d2-d4 e7-e5 d4xe5 d7-d5 e5xd6 Bf8xd6 e2-e4 Ng8-f6
9 -16 76 973  d2-d4 d7-d5 Nb1-c3 Nb8-c6 e2-e4 d5xe4 Nc3xe4 Nc6xd4 Ne4-c3
9 11 95 1232  e2-e4 e7-e5 Nb1-c3 d7-d5 Nc3xd5 Nb8-c6 d2-d4 e5xd4 Nd5-c3
10 18 178 2916  e2-e4 Nb8-c6 Ng1-e2 Ng8-f6 d2-d4 Nf6xe4 Nb1-c3 Ne4xc3 b2xc3 d7-d5
11 5 298 5117  e2-e4 e7-e5 Ng1-e2 Ng8-f6 Nb1-c3 Nb8-c6 d2-d4 e5xd4 Ne2xd4 Nc6xd4 Qd1xd4 Bf8-e7
11 22 437 7036  Ng1-f3 d7-d6 h2-h3 Nb8-c6 e2-e4 d6-d5 e4xd5 Qd8xd5 Nb1-c3 Qd5-e6 Nc3-e2 Bc8-d7
12 29 671 11429  Ng1-f3 d7-d6 Nb1-c3 Nb8-c6 e2-e4 Ng8-f6 Bf1-d3 e7-e5 Ke1-g1 Bf8-e7 Nc3-e2 Bc8-d7
13 15 900 16142  Ng1-f3 d7-d6 d2-d4 Nb8-c6 Nb1-c3 e7-e5 d4xe5 d6xe5 Bc1-d2 Ng8-f6 e2-e4 Bc8-d7 Nc3-e2
splits = 88 badsplits = 19 egbb_probes = 0
nodes = 41162 <38 qnodes> time = 21115ms nps = 1949 eps = 2632  nneps = 307
move g1f3
Bye Bye

DustyMonkey
Posts: 61
Joined: Wed Feb 19, 2014 10:11 pm

Re: Multithreaded batching on GPU for montecarlo and also alpha-beta

Post by DustyMonkey »

All you guys trying to optimize for the GPU: I don't know where exactly to look, but you need an in-depth view of the target architectures. All the latest GPUs have multiple domain-specific chips, one of which is a controller that commands the very wide SIMD vector processor(s). This controller chip and how it talks to the vector processor(s) is important, but it is rarely, if ever, detailed or even talked about.

What is needed is something like Agner Fog's detailed optimization analysis for CPUs, but for GPUs.

It is hard to make reasonable optimization decisions without...