yet another attempt on Perft(14)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: yet another attempt on Perft(14)

Post by sje »

Ankan:

Would it be possible to use your revised 128 bit signature program to check Symbolic's output for the first 400 work units (000-399)?

The results I have for these are at https://dl.dropboxusercontent.com/u/316 ... /PerftSums
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

The bottom bits

Post by sje »

Again:

Code:

Peter: 110101101011010011110000110011000101100111100000100110000110010101
Ankan: 110101101011010011110000110011000101100111100000100011100110010101
1. Note that the bottom eleven bits are the same. This strongly suggests that the difference of the totals is NOT due to false positive matches (p ~= 1 - 2^-11).

2. Note that the three different bits in the XOR sum are contained in four consecutive positions. The odds of this happening by chance are about one in a hundred; this suggests a single cosmic ray event on one machine causing the entire difference. But if so, then which is the correct result?
ankan
Posts: 77
Joined: Sun Apr 21, 2013 3:29 pm
Full name: Ankan Banerjee

Re: yet another attempt on Perft(14)

Post by ankan »

sje wrote:Ankan:

Would it be possible to use your revised 128 bit signature program to check Symbolic's output for the first 400 work units (000-399)?

The results I have for these are at https://dl.dropboxusercontent.com/u/316 ... /PerftSums
Ok, will do.
ankan
Posts: 77
Joined: Sun Apr 21, 2013 3:29 pm
Full name: Ankan Banerjee

Re: The bottom bits

Post by ankan »

sje wrote:Again:

Code:

Peter: 110101101011010011110000110011000101100111100000100110000110010101
Ankan: 110101101011010011110000110011000101100111100000100011100110010101
1. Note that the bottom eleven bits are the same. This strongly suggests that the difference of the totals is NOT due to false positive matches (p ~= 1 - 2^-11).

2. Note that the three different bits in the XOR sum are contained in four consecutive positions. The odds of this happening by chance are about one in a hundred; this suggests a single cosmic ray event on one machine causing the entire difference. But if so, then which is the correct result?
Very interesting observation! So a random bit flip due to a cosmic ray now seems more probable (than a bug in the program or false positive matches).

When reading from the transposition tables, my program could be more prone to errors from random bit flips, as I don't use any locks or the lockless hashing XOR trick when accessing the transposition tables for shallow levels. Shallow levels are depths 2, 3 and 4, where I can fit the perft value in the index bits, making the entire entry just 128 bits in size. CUDA provides 128-bit atomic reads/writes, so I don't need any protection for parallel accesses. However, this was a totally useless optimization, as the random memory access to the transposition table is orders of magnitude slower than an XOR.
I will get rid of this optimization for my next runs, adding the XOR trick back. The XOR trick provides very good protection against random bit flips by discarding the entry in such cases.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Cosmic rays

Post by sje »

Well, be careful when making any changes which might somehow introduce a bug. Whenever I made changes in Symbolic, I then ran many regression tests to build confidence that I hadn't messed up.

Independent verification is the only way to detect cosmic ray events. If the verification is done using the work unit perft(7) format, then simple file comparison can identify any differences. This has worked well for the work unit range 400-550 so far.

I would be VERY interested if you could detect any differences for work units 000-399 between your revised program's output and Symbolic's output. I would immediately use Symbolic to re-run any positions which had different results.

The suspected cosmic ray could have hit either memory or a processor. My first idea here is that the GPU memory is the weakest link, as it is reportedly not as reliable as mainboard RAM. Also, a CPU is more resistant to ray events than memory. A momentary flaw in GPU RAM will generally not kill a user or system program; it may just cause a transient visual glitch instead.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: The bottom bits

Post by petero2 »

ankan wrote:
sje wrote:Again:

Code:

Peter: 110101101011010011110000110011000101100111100000100110000110010101
Ankan: 110101101011010011110000110011000101100111100000100011100110010101
1. Note that the bottom eleven bits are the same. This strongly suggests that the difference of the totals is NOT due to false positive matches (p ~= 1 - 2^-11).

2. Note that the three different bits in the XOR sum are contained in four consecutive positions. The odds of this happening by chance are about one in a hundred; this suggests a single cosmic ray event on one machine causing the entire difference. But if so, then which is the correct result?
Very interesting observation! So a random bit flip due to a cosmic ray now seems more probable (than a bug in the program or false positive matches).

When reading from the transposition tables, my program could be more prone to errors from random bit flips, as I don't use any locks or the lockless hashing XOR trick when accessing the transposition tables for shallow levels. Shallow levels are depths 2, 3 and 4, where I can fit the perft value in the index bits, making the entire entry just 128 bits in size. CUDA provides 128-bit atomic reads/writes, so I don't need any protection for parallel accesses. However, this was a totally useless optimization, as the random memory access to the transposition table is orders of magnitude slower than an XOR.
I will get rid of this optimization for my next runs, adding the XOR trick back. The XOR trick provides very good protection against random bit flips by discarding the entry in such cases.
Very interesting indeed.

If your next run gives the same value as your first run, it would suggest my calculation is wrong, especially if you also use a new set of Zobrist random values. I have started a verification run where I intend to calculate perft 13 for all positions after white's first move. Early estimates indicate that computing perft 13 after a2-a3 will take a little less than 2 days. I have no estimate yet for how long the whole verification run will take.

I agree using the XOR trick is a good idea and basically free in terms of computation time. It gives protection from random bit errors for most of the memory used by the perft calculation.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: yet another attempt on Perft(14)

Post by sje »

ankan wrote:
sje wrote:Ankan:

Would it be possible to use your revised 128 bit signature program to check Symbolic's output for the first 400 work units (000-399)?

The results I have for these are at https://dl.dropboxusercontent.com/u/316 ... /PerftSums
Ok, will do.
Thank you.

More file links:

On Google Drive:

Raw work units, ten folders:

https://docs.google.com/folderview?id=0 ... =drive_web

Completed work units, ten folders:

https://docs.google.com/folderview?id=0 ... =drive_web

Other files:

https://docs.google.com/folderview?id=0 ... =drive_web
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: yet another attempt on Perft(14)

Post by smatovic »

Gee :-)

Do you have some occupancy profile data? (ALU vs Memory busy)

How many threads per GPU do you run?

How does Pascal perform on 64 bit integers compared to Maxwell/Kepler?

--
Srdja
ankan
Posts: 77
Joined: Sun Apr 21, 2013 3:29 pm
Full name: Ankan Banerjee

Re: yet another attempt on Perft(14)

Post by ankan »

smatovic wrote:Gee :-)

Do you have some occupancy profile data? (ALU vs Memory busy)

How many threads per GPU do you run?

How does Pascal perform on 64 bit integers compared to Maxwell/Kepler?

--
Srdja
From the CPU, I launch a kernel with just a single thread. It uses CUDA dynamic parallelism to launch multiple kernels to explore the tree in breadth-first fashion (see perft_bb_gpu_simple_hash in https://github.com/ankan-ban/perft_gpu/ ... perft_bb.h).

Some of the kernels are ALU bound (e.g. the kernel that counts moves at the leaf nodes) and some are memory bound (e.g. hash table lookups and updates).
The profiler shows that my ALU-bound kernel for counting child moves has quite high divergence, but it still performs well enough in practice :-)
In most of my memory-bound kernels, memory bandwidth utilization is really poor due to the random nature of hash table accesses, so I don't use a hash table for the last level (i.e. perft(1) values are never stored in a hash table). Even for depth 2, transposition tables provide only a very small performance benefit.

I try to fit as big a tree in memory as possible, but it's limited by available memory (1.5 GB for storing the tree). The maximum number of threads that can be launched is about 100 million (for the kernel that counts moves at leaf nodes), but it's normally less, depending on the position. I launch sub-trees of max depth 7 on the GPU. The rest of the tree, i.e. the deeper levels, is explored with depth-first search on the CPU. In case a depth-7 launch fails with out of memory, it's relaunched as multiple depth-6 sub-trees.

From what I have seen, Pascal GP104/GP102 has exactly the same throughput per SM per clock as Maxwell as far as my perft program is concerned; in other words, nothing special for 64-bit integers. Of course the increased number of SMs and higher clocks help a lot.
I believe the Pro/Tesla Pascal GP100 could be more interesting, but I don't have access to it right now.

Maxwell is a lot better than Kepler, though. E.g. the GTX 970 is 40% faster than GK110 for perft even when their GFLOPS ratings are similar. I have some benchmarks here:
http://ankan.blogspot.in/2016_07_03_arc ... 1803576271
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: yet another attempt on Perft(14)

Post by smatovic »

Gee :-)

Thanks for the insights.

I'd rather not think about what happens once you start to code the actual GPU chess engine...

--
Srdja