
Generate EGTB with graphics cards?

Posted: Tue Jan 01, 2019 4:29 am
by phhnguyen
We have been working on EGTB generators for a while. One of the main problems, besides the need for huge memory, is that those generators are always hungry for computing power: the more CPUs/cores, the better. One idea we want to try is to use graphics cards for generating. AFAIK, a good graphics card may have a few hundred or a few thousand simple cores, and for simple/suitable tasks they are significantly faster than an expensive desktop CPU with a few tens of cores.

From my experience, a generator has to do a few simple tasks at very high frequency:
  • convert an index (a 64-bit integer) to a chess position and vice versa
  • generate moves for a given position and/or test whether it is in check
  • get/set numbers from/to huge buffers
Accessing hard disks is not a problem, since the generator can read only what it needs at initialisation and write out only when it has completed an endgame.

Using multiple threads is not a problem either, since the generator can easily divide the work among any number of threads.

Questions:
  • Has anyone tried that idea (using graphics cards for EGTB generators)?
  • Any potential problems? Say, GPUs are not good at processing integers / generating moves / testing for check? Is the latency between the GPU and memory too large? (I have almost no knowledge about GPUs.)
Thanks

Re: Generate EGTB with graphics cards?

Posted: Tue Jan 01, 2019 3:15 pm
by grahamj
I am learning how to program GPUs, in particular, NVIDIA GPUs. I think other GPUs are similar, but I'll stick to what I (half) know. On terminology, a single NVIDIA GPU has 100s or 1000s of 'CUDA cores'.

1. You will be launching LOTS of threads. It depends on the calculation, but to fully utilise the GPU you may need tens of thousands of threads.

2. For time-critical code, you want at least 32 threads executing the same instructions in lockstep (on different data). Search 'warp divergence'.

3. There is extremely limited fast memory (registers, 'shared memory', caches) per thread, and there is big latency to the main GPU memory. Scattered memory accesses are bad. If you can 'stream through' memory, as in a lot of linear algebra, it's much better.

4. CUDA cores have 32-bit registers, but otherwise there's no problem processing integers.

I don't know how EGTB generators work, but I would be optimistic about 'convert an index (a 64-bit integer) to a chess position and vice versa' and 'generate moves for a given position and/or check incheck', and worried about 'get/set numbers from/to huge buffers'.

I recommend this video as an introduction to CUDA programming.
https://www.youtube.com/watch?v=KHa-OSrZPGo&t=2571s

Re: Generate EGTB with graphics cards?

Posted: Thu Jan 03, 2019 1:13 pm
by phhnguyen
Thanks a lot for the useful information!

I have been reading about accessing RAM from graphics cards. I just read that AMD has been working on that issue for a while. However, I am not sure whether they have succeeded (randomly accessing RAM with low latency from the GPU). Does anyone have information/confirmation on that?

Re: Generate EGTB with graphics cards?

Posted: Thu Jan 03, 2019 4:32 pm
by Evert
From what I recall, tablebase generation is generally I/O bound, not CPU bound.

A tablebase generator typically uses an algorithm like this:
1. Allocate a bitlist for all possible positions.
2. Mark all “lost” positions.
3. Retro-move from all positions found in steps 2/5: these positions are “won”.
4. Retro-move from all positions found in step 3: these positions are “potentially lost”.
5. Verify: “potentially lost” positions for which all children are “won” are “lost”. Mark those.
6. Repeat from step 3 until no new lost positions are found.

Typically, moving a piece causes a major change in the index, which means the new position is probably not in local storage.

So the key to doing tablebases efficiently with a GPU is finding a smart way to index the table.

Re: Generate EGTB with graphics cards?

Posted: Thu Jan 03, 2019 6:21 pm
by hgm
I know very little of GPUs, but I suppose they also have a very wide data path to the video memory. OTOH, it is not likely the video memory would be large enough to hold the entire EGT. So then you would become dependent on the main DRAM, which communicates rather slowly with video memory through a narrow data path. That doesn't look good.

In the EGT generator on my old website (basically FairyGen), profiling showed that the single statement data=EGT[index] was responsible for 99.5% of the execution time. The entire remainder of the calculation only took 0.5% of the time. So the potential gain from parallelizing that seems very small. I must admit the storage scheme of that generator is not very compact, though: it uses 1 byte per position (to hold DTM for black to move, plus a 'won' bit flag for white to move). If you used 1 bit per position, that would reduce the memory bottleneck by a factor of 8.

Another question would be: "why do it at all?" For orthodox chess all 7-men tables have already been done, and 8-men tables are too big to store the result.

Re: Generate EGTB with graphics cards?

Posted: Thu Jan 03, 2019 8:54 pm
by Dann Corbit
The biggest video RAM on a single card {that I am aware of} is 32 GB.

A reason for doing it on a GPU would be for fun.

I think that the GPU would be a very natural place for a mate solver that uses proof search.
That is because the world's fastest move generator runs on GPU.
Ankan made a mate solver that runs on GPU but it does not look like proof search to me.

Re: Generate EGTB with graphics cards?

Posted: Fri Jan 04, 2019 12:41 am
by phhnguyen
Evert wrote: Thu Jan 03, 2019 4:32 pm From what I recall, tablebase generation is generally I/O bound, not CPU bound.
hgm wrote: Thu Jan 03, 2019 6:21 pm In the EGT generator on my old website (basically FairyGen) profiling showed that the single statement data=EGT[index] was responsible for 99.5% of the execution time. The entire remainder of the calculation only took 0.5% of the time. So the potential gain of parallellizing that seems very little.
I am surprised, since in your generators the calculation is faster than the I/O. For mine, I/O is not an issue since it is designed to work in memory only: everything it needs is loaded into memory (in uncompressed format) when initialising an endgame (I focus on building time, and I got help from some volunteers whose computers have huge RAM). In any generating loop, the generator frequently reads and writes scores (DTM) in some huge memory buffers. Sure, it is not really fast, since it always misses all caches due to random access to huge memory buffers (say, over 60 GB) with multiple threads (say, over 10 threads). However, it is not that bad.

On the other hand, to process an index (typically an item of 1 byte of data) in those huge buffers, my generator needs to convert that index into a chess position, then generate moves for that position; for each move, make it to get a new position, test for check, convert the new position into a new index, get the score, and take the move back... That is a lot of calculation, and I am sure it takes much more time than reading/writing from/to the (even huge) buffers. Multiple threads are really helpful for my generator (proven in practice).

That is why I am looking for more, and cheaper, computing power from GPUs.
Evert wrote: Thu Jan 03, 2019 4:32 pm So the key to doing tablebases efficiently with a GPU is finding a smart way to index the table.
At the moment I am out of ideas for using the small GPU memory (a few GB only) to cache/speed up access to huge buffers (say, over 60 GB) in RAM with a large number of threads. Do you have any ideas (ones that would work even with your generator's design)?

Re: Generate EGTB with graphics cards?

Posted: Fri Jan 04, 2019 12:57 am
by phhnguyen
hgm wrote: Thu Jan 03, 2019 6:21 pm Another question would be: "why do it at all"? For orthodox Chess all 7-men have already been done. And 8-men are too big to store the result.
Dann Corbit wrote: Thu Jan 03, 2019 8:54 pm A reason for doing it on a GPU would be for fun.

I think that the GPU would be a very natural place for a mate solver that uses proof search.
That is because the world's fastest move generator runs on GPU.
Ankan made a mate solver that runs on GPU but it does not look like proof search to me.
I agree with Dann. First, it is (very) fun. Second, it is still useful!

I think if we could generate a full 6-men set within 12 hours, many people would generate it themselves instead of downloading it via slow Internet providers.

I am sure someone will start looking into generating some (maybe not all) 8-men endgames.

As for me, I have been working on and generating EGTBs for some chess variants such as Xiangqi and Jeiqi.


Dann Corbit wrote: Thu Jan 03, 2019 8:54 pm The biggest video RAM on a single card {that I am aware of} is 32 GB.
In the Mac world, Apple may charge me an arm and a leg for an eGPU (the Blackmagic eGPU). It has only 8 GB of VRAM but costs about $1000.

Re: Generate EGTB with graphics cards?

Posted: Fri Jan 04, 2019 8:53 am
by hgm
phhnguyen wrote: Fri Jan 04, 2019 12:41 amFor mine I/O is not an issue since it is designed to work in memory only: everything it needs will be loaded into memory ...
That also holds for FairyGen. Yet it takes 2-5 min to generate a pawnless (i.e. 8-fold symmetric) 5-men EGT. Marcel van Kervinck's "Pretty Fast" generator (which is bit-based rather than byte-based) is a lot faster: a totally unsymmetric version of it needs about 110 msec to generate a 4-men EGT like KBNK or KQKR (IIRC). That is more than an order of magnitude faster. So speeds can vary a lot, even between algorithms that in themselves appear well-optimized.

Extrapolating from the 110 msec would give ~6 sec for a 5-men, and about 6 min for a 6-men... Even for FairyGen a 6-men Pawnless EGT should only take 2-5 hours. (But I never tried that, as the memory requirement would exceed what you can address in a 32-bit environment, and I did not have a 64-bit compiler at that time, and only 4GB DRAM in my laptop anyway.)
Sure it is not really fast since it always misses all caches due to access randomly huge memory buffers (say, over 60 GB) and with multi threads (say over 10 threads).
The number of cache misses can be strongly reduced by treating moves in a smart order. (E.g. through the leap-frog algorithm.)
On the other hand, to process an index (typically an item of 1 byte data) in huge buffers, my generator needs to convert that index into a chess position, then generate moves for that position, for each move, do make into a new position, do incheck, then convert the new position into new index, get score, take back...
That sounds very bad. My generators do almost none of that. And doing nothing is always far faster than doing something, even on a GPU. I never convert indices to positions or vice versa, and I never set up entire positions. E.g. the work involved in setting up positions only consists of moving a single piece once every 64 positions. (There is no need to place the black King, as it cannot block any moves without being in check, and the next group of 64 positions in 63 out of 64 cases has only a single other piece in a different location.)
that is a lot calculating and I am sure it (calculating) takes much more time than reading/writing from/to (even huge) buffers. Multi threads are really helpful for my generator (proved via practising).
Well, if you would require several hundred times as much calculation per position as is really needed, I would have no difficulty believing that. So the key question is: how fast is your generator (on a single-threaded CPU)? If that would be orders of magnitude slower than optimum, trying to cure it with faster hardware IMO would not be the best strategy. I would first take the best conceivable algorithm, and then address the hardware requirements of that.


Note that for Xiangqi efficient creation of EGT is pretty much an unsolved problem. The rules for perpetual checking and chasing are basically incompatible with the idea of retrograde generation. So although it is possible to do some end-games (namely those where one side only has defenders) in the conventional way, the more interesting ones (such as KRPKR, KRCKR) would need a completely different treatment, and I am not sure whether it is actually public knowledge how exactly to do that, and how much the solution hurts speedwise compared to not having to deal with perpetuals.

Re: Generate EGTB with graphics cards?

Posted: Fri Jan 04, 2019 9:28 am
by noobpwnftw
hgm wrote: Fri Jan 04, 2019 8:53 am Note that for Xiangqi efficient creation of EGT is pretty much an unsolved problem. The rules for perpetual checking and chasing are basically incompatible with the idea of retrograde generation. So although it is possible to do some end-games (namely those where one side only has defenders) in the conventional way, the more interesting ones (such as KRPKR, KRCKR) would need a completely different treatment, and I am not sure whether it is actually public knowledge how exactly to do that, and how much the solution hurts speedwise compared to not having to deal with perpetuals.
http://www.chessdb.cn/query_en/?9/4C2R1 ... /9/5K3%20w

http://lpforth.forthfreak.net/cg2002.pdf