Pigeon is now running on the GPU

StuartRiffle · Post by **StuartRiffle** » Tue Nov 08, 2016 2:20 am

Yes, negamax w/ alpha-beta pruning.

Right now the CPU is doing most of the tree, but the lowest couple of levels are lopped off and done asynchronously on the GPU (the quiescence search too). It's an iterative search implementation, so the "callstack" on the CPU side can be easily changed, and when the GPU results come back, I just need to modify the search in progress to reflect what has been learned.
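For readers who have not seen negamax written without recursion, here is a minimal sketch of the idea, assuming a toy tree of `Node` objects. This is not Pigeon's code: `Node`, `Frame`, and both function names are made up for illustration. The explicit `stack` list plays the role of the "callstack", and because it is ordinary data, it could in principle be inspected or patched between passes through the loop.

```python
from dataclasses import dataclass, field

INF = 10**9

@dataclass
class Node:
    # Hypothetical toy tree node; score is only used at leaves and is
    # given from the point of view of the side to move at that leaf.
    score: int = 0
    children: list = field(default_factory=list)

@dataclass
class Frame:
    # One entry of the explicit "callstack".
    node: Node
    alpha: int
    beta: int
    best: int = -INF
    next_child: int = 0

def negamax_recursive(node, alpha, beta):
    """Reference implementation, used only to check the iterative one."""
    if not node.children:
        return node.score
    best = -INF
    for child in node.children:
        best = max(best, -negamax_recursive(child, -beta, -alpha))
        alpha = max(alpha, best)
        if alpha >= beta:
            break
    return best

def negamax_iterative(root, alpha=-INF, beta=INF):
    """Same search with an explicit stack; returns (score, loop passes)."""
    stack = [Frame(root, alpha, beta)]
    result = None   # value handed back by the frame that was just popped
    steps = 0
    while stack:
        steps += 1
        frame = stack[-1]
        if result is not None:
            # A child just finished: fold its negated score into this frame.
            frame.best = max(frame.best, -result)
            frame.alpha = max(frame.alpha, frame.best)
            result = None
        kids = frame.node.children
        if not kids:
            result = frame.node.score          # leaf: static score
            stack.pop()
        elif frame.alpha >= frame.beta or frame.next_child == len(kids):
            result = frame.best                # cutoff, or children exhausted
            stack.pop()
        else:
            child = kids[frame.next_child]
            frame.next_child += 1
            stack.append(Frame(child, -frame.beta, -frame.alpha))
    return result, steps

def leaf(score):
    return Node(score=score)

root = Node(children=[
    Node(children=[leaf(3), leaf(5)]),
    Node(children=[leaf(2), leaf(9)]),
    Node(children=[leaf(6), leaf(4)]),
])
score, steps = negamax_iterative(root)
```

On this toy tree both versions agree on the root score. Every node is entered once and popped once, so the loop count in this sketch comes out at roughly two passes per node.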

Turns out that's trickier than it sounds. :/

My code is doing something with it, but it's not working yet. Scores are getting munged somewhere, and the end result is poor.

StuartRiffle · Post by **StuartRiffle** » Wed Nov 09, 2016 3:28 am

Still not all working on the CPU side, but I've got a debug trace with some numbers to share, for those who like that sort of thing.

This run is using blocks of 128 threads to process batches of 4096 searches, so each GPU thread is doing 32 searches sequentially. Each search here is one level deep, plus quiescence. "Most steps" is the number of passes through the iterated negamax loop made by the longest-running thread (it usually takes ~2 steps per node). After doing so many searches, the thread runtimes tend to balance out, but sometimes one or two threads get unlucky and end up doing a lot more work, while the rest of the warp is already done and sitting idle. That shows up as horrible spikes:

4096 jobs,  87632 nodes, GPU time  74.7ms, CPU latency 384.3ms, most steps  416, nps 1173k
4096 jobs, 106149 nodes, GPU time  92.2ms, CPU latency 472.9ms, most steps  524, nps 1151k
4096 jobs,  70716 nodes, GPU time  41.0ms, CPU latency 429.4ms, most steps  299, nps 1722k
4096 jobs,  80498 nodes, GPU time  49.9ms, CPU latency 400.9ms, most steps  434, nps 1614k
4096 jobs,  67654 nodes, GPU time  58.9ms, CPU latency 431.0ms, most steps  614, nps 1148k
4096 jobs,  78862 nodes, GPU time  68.6ms, CPU latency 458.0ms, most steps  753, nps 1150k
4096 jobs,  79425 nodes, GPU time  63.5ms, CPU latency 486.6ms, most steps  422, nps 1250k
4096 jobs,  80506 nodes, GPU time  86.5ms, CPU latency 528.6ms, most steps 1204, nps  930k
4096 jobs,  92748 nodes, GPU time  68.1ms, CPU latency 518.8ms, most steps  624, nps 1362k
4096 jobs,  82805 nodes, GPU time 106.5ms, CPU latency 532.4ms, most steps 1636, nps  777k
4096 jobs,  84516 nodes, GPU time  33.5ms, CPU latency 522.8ms, most steps  287, nps 2525k
4096 jobs, 107581 nodes, GPU time  64.1ms, CPU latency 523.3ms, most steps  660, nps 1679k
4096 jobs, 114540 nodes, GPU time  61.5ms, CPU latency 525.2ms, most steps  373, nps 1861k
4096 jobs, 109486 nodes, GPU time  49.0ms, CPU latency 501.7ms, most steps  344, nps 2234k
4096 jobs,  70930 nodes, GPU time  36.2ms, CPU latency 495.9ms, most steps  507, nps 1956k
4096 jobs,  90644 nodes, GPU time 222.1ms, CPU latency 630.8ms, most steps 1989, nps  408k  <-- :(
4096 jobs,  84284 nodes, GPU time  54.9ms, CPU latency 612.0ms, most steps  809, nps 1534k
4096 jobs,  73404 nodes, GPU time  39.6ms, CPU latency 554.2ms, most steps  222, nps 1853k
4096 jobs,  61860 nodes, GPU time  29.0ms, CPU latency 548.1ms, most steps  172, nps 2135k
4096 jobs,  84171 nodes, GPU time  65.7ms, CPU latency 549.8ms, most steps  527, nps 1280k
4096 jobs,  85119 nodes, GPU time  53.0ms, CPU latency 529.1ms, most steps  433, nps 1605k
4096 jobs,  59988 nodes, GPU time  33.3ms, CPU latency 513.0ms, most steps  292, nps 1800k
4096 jobs,  73250 nodes, GPU time  61.9ms, CPU latency 552.2ms, most steps  561, nps 1183k
4096 jobs, 103216 nodes, GPU time  66.6ms, CPU latency 381.6ms, most steps  842, nps 1549k
4096 jobs,  89989 nodes, GPU time  31.3ms, CPU latency 597.0ms, most steps  203, nps 2872k
4096 jobs,  75139 nodes, GPU time  54.3ms, CPU latency 593.3ms, most steps  352, nps 1384k
4096 jobs,  76158 nodes, GPU time  33.7ms, CPU latency 527.8ms, most steps  325, nps 2258k
4096 jobs,  98841 nodes, GPU time  45.9ms, CPU latency 433.1ms, most steps  530, nps 2155k
4096 jobs,  73504 nodes, GPU time  53.4ms, CPU latency 449.4ms, most steps  450, nps 1377k
4096 jobs,  87670 nodes, GPU time  37.0ms, CPU latency 370.6ms, most steps  248, nps 2371k
4096 jobs,  60932 nodes, GPU time  28.1ms, CPU latency 351.3ms, most steps  420, nps 2167k
4096 jobs,  83921 nodes, GPU time  93.4ms, CPU latency 363.2ms, most steps  712, nps  898k
4096 jobs,  71281 nodes, GPU time 188.0ms, CPU latency 520.2ms, most steps  901, nps  379k
4096 jobs,  61097 nodes, GPU time  48.7ms, CPU latency 516.4ms, most steps  516, nps 1255k
4096 jobs,  73361 nodes, GPU time  69.2ms, CPU latency 554.3ms, most steps  635, nps 1060k
4096 jobs,  61694 nodes, GPU time  45.7ms, CPU latency 553.1ms, most steps  467, nps 1348k
4096 jobs,  44942 nodes, GPU time  35.8ms, CPU latency 536.2ms, most steps  409, nps 1255k
4096 jobs,  67500 nodes, GPU time  66.7ms, CPU latency 569.8ms, most steps  640, nps 1012k
4096 jobs,  65230 nodes, GPU time  47.9ms, CPU latency 586.5ms, most steps  368, nps 1361k
4096 jobs,  21984 nodes, GPU time  36.1ms, CPU latency 533.9ms, most steps  576, nps  609k
4096 jobs,  46154 nodes, GPU time  96.3ms, CPU latency 438.3ms, most steps  605, nps  479k
4096 jobs,  70606 nodes, GPU time 124.4ms, CPU latency 515.4ms, most steps  683, nps  567k
4096 jobs,  51403 nodes, GPU time  53.8ms, CPU latency 494.8ms, most steps  660, nps  954k
4096 jobs,  60762 nodes, GPU time 123.3ms, CPU latency 573.2ms, most steps 2673, nps  492k  <-- :(
4096 jobs,  40601 nodes, GPU time  69.6ms, CPU latency 608.5ms, most steps  457, nps  583k
4096 jobs,  58776 nodes, GPU time  41.8ms, CPU latency 583.9ms, most steps  247, nps 1405k
4096 jobs,  56152 nodes, GPU time  35.1ms, CPU latency 570.4ms, most steps  235, nps 1601k
4096 jobs,  56891 nodes, GPU time  27.1ms, CPU latency 565.2ms, most steps  217, nps 2099k
4096 jobs,  61622 nodes, GPU time  44.5ms, CPU latency 513.6ms, most steps  428, nps 1384k
4096 jobs,  56363 nodes, GPU time  47.0ms, CPU latency 434.2ms, most steps  369, nps 1198k
4096 jobs,  63393 nodes, GPU time  39.0ms, CPU latency 420.3ms, most steps  306, nps 1626k
4096 jobs,  73533 nodes, GPU time  77.5ms, CPU latency 374.5ms, most steps  570, nps  948k
4096 jobs,  39181 nodes, GPU time  30.7ms, CPU latency 347.3ms, most steps  240, nps 1277k
4096 jobs,  52712 nodes, GPU time  39.0ms, CPU latency 334.4ms, most steps  316, nps 1351k
4096 jobs,  46191 nodes, GPU time  41.7ms, CPU latency 346.8ms, most steps  415, nps 1106k
4096 jobs,  44395 nodes, GPU time  45.4ms, CPU latency 368.5ms, most steps  456, nps  977k
4096 jobs,  40598 nodes, GPU time  23.5ms, CPU latency 334.4ms, most steps  274, nps 1727k
4096 jobs,  58922 nodes, GPU time  38.3ms, CPU latency 326.5ms, most steps  380, nps 1539k
4096 jobs,  63757 nodes, GPU time  66.8ms, CPU latency 350.6ms, most steps  786, nps  954k
4096 jobs,  79772 nodes, GPU time 115.8ms, CPU latency 352.1ms, most steps  698, nps  688k
4096 jobs,  55011 nodes, GPU time  61.3ms, CPU latency 408.4ms, most steps  540, nps  896k
4096 jobs,  51189 nodes, GPU time  54.5ms, CPU latency 331.1ms, most steps  590, nps  939k
4096 jobs,  85050 nodes, GPU time  75.0ms, CPU latency 402.3ms, most steps  410, nps 1133k
4096 jobs,  73920 nodes, GPU time  90.3ms, CPU latency 486.9ms, most steps  595, nps  818k
4096 jobs,  41490 nodes, GPU time  73.7ms, CPU latency 556.6ms, most steps  850, nps  563k
4096 jobs,  42782 nodes, GPU time  58.2ms, CPU latency 588.3ms, most steps  726, nps  734k
4096 jobs,  52909 nodes, GPU time  39.0ms, CPU latency 556.2ms, most steps  196, nps 1358k
4096 jobs,  40576 nodes, GPU time  92.5ms, CPU latency 537.6ms, most steps  456, nps  438k
4096 jobs,  61760 nodes, GPU time  73.9ms, CPU latency 549.1ms, most steps  845, nps  835k
4096 jobs,  44468 nodes, GPU time  44.3ms, CPU latency 537.0ms, most steps  443, nps 1004k
4096 jobs,  49727 nodes, GPU time  50.1ms, CPU latency 513.3ms, most steps  354, nps  992k
4096 jobs,  44698 nodes, GPU time  32.9ms, CPU latency 455.7ms, most steps  386, nps 1360k
4096 jobs,  53483 nodes, GPU time  45.1ms, CPU latency 427.3ms, most steps  500, nps 1184k
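A quick sanity check on how to read the trace, as a sketch: the nps column appears to be nodes divided by GPU time (not CPU latency). That is an inference from the numbers, not something stated in the post, and the helper name below is made up. It reproduces the first line and the first flagged spike, and other lines agree to within the rounding of the printed GPU time.

```python
def nodes_per_second(nodes, gpu_time_ms):
    """Assumed definition of the nps column: nodes / GPU time."""
    return nodes / (gpu_time_ms / 1000.0)

# Batch layout described in the post: 4096 searches over 128 threads.
jobs, threads = 4096, 128
searches_per_thread = jobs // threads          # 32

# First trace line: 87632 nodes in 74.7 ms.
first_knps = round(nodes_per_second(87632, 74.7) / 1000)

# First flagged spike: 90644 nodes in 222.1 ms, most steps 1989.
spike_knps = round(nodes_per_second(90644, 222.1) / 1000)

print(searches_per_thread, first_knps, spike_knps)   # 32 1173 408
```

The spike line has a fairly typical node count, so the drop in nps there comes from the long GPU time, which goes along with a high "most steps" figure.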

Working on it...

StuartRiffle · Post by **StuartRiffle** » Wed Nov 09, 2016 4:13 pm

Some progress... below is a cut-and-paste of what Nsight shows for one kernel launch:

[image not preserved]

tttony · Post by **tttony** » Fri Nov 11, 2016 12:59 am

Excellent!!

But I can't test it, I have an AMD card

I remember testing ZetaDva, but Srdja has put the project on stand-by

AdminX · Post by **AdminX** » Fri Oct 02, 2020 3:10 pm

Any more news on Pigeon 1.6.0?
