yet another attempt on Perft(14)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: yet another attempt on Perft(14)

Post by sje »

Ankan:

Would it be possible to use your revised 128 bit signature program to check Symbolic's output for the first 400 work units (000-399)?

The results I have for these are at https://dl.dropboxusercontent.com/u/316 ... /PerftSums
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

The bottom bits

Post by sje »

Again:

Code:

Peter: 110101101011010011110000110011000101100111100000100110000110010101
Ankan: 110101101011010011110000110011000101100111100000100011100110010101
1. Note that the bottom eleven bits are the same. This strongly suggests that the difference of the totals is NOT due to false positive matches (p ~= 1 - 2^-11).

2. Note that the three different bits in the XOR sum are contained in four consecutive positions. The odds of this happening by chance are about one in a hundred; this suggests a single cosmic ray event on one machine causing the entire difference. But if so, then which is the correct result?
ankan
Posts: 77
Joined: Sun Apr 21, 2013 3:29 pm
Full name: Ankan Banerjee

Re: yet another attempt on Perft(14)

Post by ankan »

sje wrote:Ankan:

Would it be possible to use your revised 128 bit signature program to check Symbolic's output for the first 400 work units (000-399)?

The results I have for these are at https://dl.dropboxusercontent.com/u/316 ... /PerftSums
Ok, will do.
ankan
Posts: 77
Joined: Sun Apr 21, 2013 3:29 pm
Full name: Ankan Banerjee

Re: The bottom bits

Post by ankan »

sje wrote:Again:

Code:

Peter: 110101101011010011110000110011000101100111100000100110000110010101
Ankan: 110101101011010011110000110011000101100111100000100011100110010101
1. Note that the bottom eleven bits are the same. This strongly suggests that the difference of the totals is NOT due to false positive matches (p ~= 1 - 2^-11).

2. Note that the three different bits in the XOR sum are contained in four consecutive positions. The odds of this happening by chance are about one in a hundred; this suggests a single cosmic ray event on one machine causing the entire difference. But if so, then which is the correct result?
Very interesting observation! So a random bit flip due to a cosmic ray now seems more probable (than a bug in the program or false positive matches).

When reading from the transposition tables, my program could be more prone to errors from random bit flips, as I don't use any locks or the lockless hashing XOR trick when accessing the transposition tables for shallow levels. Shallow levels are depths 2, 3 and 4, where I can fit the perft value in the index bits, making the entire entry just 128 bits in size. CUDA provides 128-bit atomic reads/writes, so I don't need any protection for parallel accesses. However, this was a totally useless optimization, as the random memory access to the transposition table is orders of magnitude slower than an XOR.
I will get rid of this optimization for my next runs, adding the XOR trick back. The XOR trick provides very good protection against random bit flips by discarding the entry in such cases.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Cosmic rays

Post by sje »

Well, be careful when making any changes which might somehow introduce a bug. Whenever I made changes in Symbolic, I then ran many regression tests to build confidence that I hadn't messed up.

Independent verification is the only way to detect cosmic ray events. If the verification is done using the work unit perft(7) format, then simple file comparison can identify any differences. This has worked well for the work unit range 400-550 so far.

I would be VERY interested if you could detect any differences for work units 000-399 between your revised program's output and Symbolic's output. I would immediately use Symbolic to re-run any positions which had different results.

The suspected cosmic ray could have hit either memory or a processor. My first idea here is that the GPU memory is the weakest link, as it is reportedly not as reliable as mainboard RAM. Also, a CPU is more resistant to ray events than memory. A momentary flaw in GPU RAM will generally not kill a user or system program; it may just cause a transient visual glitch instead.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: The bottom bits

Post by petero2 »

ankan wrote:
sje wrote:Again:

Code:

Peter: 110101101011010011110000110011000101100111100000100110000110010101
Ankan: 110101101011010011110000110011000101100111100000100011100110010101
1. Note that the bottom eleven bits are the same. This strongly suggests that the difference of the totals is NOT due to false positive matches (p ~= 1 - 2^-11).

2. Note that the three different bits in the XOR sum are contained in four consecutive positions. The odds of this happening by chance are about one in a hundred; this suggests a single cosmic ray event on one machine causing the entire difference. But if so, then which is the correct result?
Very interesting observation! So a random bit flip due to a cosmic ray now seems more probable (than a bug in the program or false positive matches).

When reading from the transposition tables, my program could be more prone to errors from random bit flips, as I don't use any locks or the lockless hashing XOR trick when accessing the transposition tables for shallow levels. Shallow levels are depths 2, 3 and 4, where I can fit the perft value in the index bits, making the entire entry just 128 bits in size. CUDA provides 128-bit atomic reads/writes, so I don't need any protection for parallel accesses. However, this was a totally useless optimization, as the random memory access to the transposition table is orders of magnitude slower than an XOR.
I will get rid of this optimization for my next runs, adding the XOR trick back. The XOR trick provides very good protection against random bit flips by discarding the entry in such cases.
Very interesting indeed.

If your next run gives the same value as your first run, it would suggest my calculation is wrong, especially if you also use a new set of Zobrist random values. I have started a verification run where I intend to calculate perft 13 for all positions after white's first move. Early estimates indicate that computing perft 13 after a2-a3 will take a little less than 2 days. I have no estimate yet for how long the whole verification run will take.

I agree using the XOR trick is a good idea and basically free in terms of computation time. It gives protection from random bit errors for most of the memory used by the perft calculation.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: yet another attempt on Perft(14)

Post by sje »

ankan wrote:
sje wrote:Ankan:

Would it be possible to use your revised 128 bit signature program to check Symbolic's output for the first 400 work units (000-399)?

The results I have for these are at https://dl.dropboxusercontent.com/u/316 ... /PerftSums
Ok, will do.
Thank you.

More file links:

On Google Drive:

Raw work units, ten folders:

https://docs.google.com/folderview?id=0 ... =drive_web

Completed work units, ten folders:

https://docs.google.com/folderview?id=0 ... =drive_web

Other files:

https://docs.google.com/folderview?id=0 ... =drive_web
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: yet another attempt on Perft(14)

Post by smatovic »

Gee :-)

Do you have some occupancy profile data? (ALU vs Memory busy)

How many threads per GPU do you run?

How does Pascal perform on 64 bit integers compared to Maxwell/Kepler?

--
Srdja
ankan
Posts: 77
Joined: Sun Apr 21, 2013 3:29 pm
Full name: Ankan Banerjee

Re: yet another attempt on Perft(14)

Post by ankan »

smatovic wrote:Gee :-)

Do you have some occupancy profile data? (ALU vs Memory busy)

How many threads per GPU do you run?

How does Pascal perform on 64 bit integers compared to Maxwell/Kepler?

--
Srdja
From the CPU, I launch a kernel with just a single thread. It uses CUDA dynamic parallelism to launch multiple kernels to explore the tree in breadth-first fashion (see perft_bb_gpu_simple_hash in https://github.com/ankan-ban/perft_gpu/ ... perft_bb.h).

Some of the kernels are ALU bound (e.g. the kernel that counts moves at the leaf nodes) and some are memory bound (e.g. hash table lookups and updates).
The profiler shows that my ALU-bound kernel for counting child moves has quite high divergence, but it still performs well enough in practice :-)
In most of my memory-bound kernels, memory bandwidth utilization is really poor due to the random nature of hash table accesses, so I don't use a hash table for the last level (i.e. perft(1) values are never stored in a hash table). Even for depth 2, transposition tables provide only a very small performance benefit.

I try to fit as big a tree in memory as possible, but it's limited by available memory (1.5 GB for storing the tree). The maximum number of threads that can be launched is about 100 million (for the kernel that counts moves at leaf nodes), but it's normally less, depending on the position. I launch sub-trees of max depth 7 on the GPU. The rest of the tree, i.e. the deeper levels, is explored with depth-first search on the CPU. In case a depth-7 launch fails with out of memory, it's relaunched as multiple depth-6 sub-trees.

From what I have seen, Pascal GP104/GP102 has exactly the same throughput per SM per clock as Maxwell as far as my perft program is concerned; in other words, nothing special for 64-bit integers. Of course the increased number of SMs and higher clocks help a lot.
I believe the Pro/Tesla Pascal GP100 could be more interesting, but I don't have access to it right now.

Maxwell is a lot better than Kepler, though. E.g. the GTX 970 is 40% faster than GK110 for perft even when their GFLOPS ratings are similar. I have some benchmarks here:
http://ankan.blogspot.in/2016_07_03_arc ... 1803576271
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: yet another attempt on Perft(14)

Post by smatovic »

Gee :-)

Thanks for the insights.

I'd rather not think about what happens once you start to code the actual GPU chess engine...

--
Srdja