pawn enumeration

Discussion of chess software programming and technical issues.

Moderator: Ras

Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: price of complex indexing

Post by Daniel Shawul »

syzygy wrote:
Daniel Shawul wrote:
4. loop through each table (white, black) to find mates (set to loss). A position is mate if its value is unknown, the value of the corresponding position in the other table is illegal, and all non-capture moves lead to a position with value illegal. Set parents to newwin.
I also work on two sides in the same pass. But I am not sure I understand the rest. later
The test for whether a position is mate is as follows: if white is to move, the value in the white table is unknown, the value in the black position is illegal, and all non-capture moves bring you to a position (in the black table) with value illegal, then the position is mate. Otherwise, it is not.

So what I do is loop through white and black at the same time. If one value equals unknown and the other equals illegal, then I loop through the moves for the appropriate side and stop once I find a legal move. If I don't find legal moves, I set the position from unknown to loss. (I know that there are no legal captures either, since otherwise the value of the position would have been potloss, newwin or draw and not unknown.)

As you see, I never need to verify capture moves. Not when testing for mate, not when verifying potloss cases. That might be different in your generator (I'm not sure). This is the reason for the draw state.
I also do not verify with captures. All the hard work of captures/promotions, possibly with disk access, is done in the first pass. From then on the backpass/verification steps work only with non-capture moves that transpose within the TB being generated. So we do the same thing. I tell you, if we work out the difference in our handling of the bits (2 + 1 vs 3), which seems to imply different methods, we are talking about the same algorithm ...
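For readers following the thread, a minimal sketch of the two-table mate test described in the quote above, assuming one byte per position and a placeholder successor generator; the state names and helpers are hypothetical, not either generator's actual code:

Code: Select all

#include <stdint.h>

/* Hypothetical state encoding; the real generators encode this differently. */
enum { ILLEGAL, UNKNOWN, DRAW, POTLOSS, NEWWIN, LOSS, WIN };

extern uint8_t *wtm_tbl, *btm_tbl;   /* one byte per position, white/black to move */
extern uint64_t tbl_size;

/* Placeholder: write the indices reached by the non-capture moves of the side
   to move into succ[] and return how many there are.  Moves that leave the own
   king in check simply show up as ILLEGAL successors and are skipped below. */
extern int noncapture_successors(uint64_t idx, int white_to_move, uint64_t *succ);

static void find_white_mates(void)
{
    uint64_t succ[256];

    for (uint64_t idx = 0; idx < tbl_size; idx++) {
        /* White to move, still unmarked (so no legal capture exists, or the
           first pass would already have set potloss/newwin/draw), and white
           is in check (the same diagram with black to move is ILLEGAL). */
        if (wtm_tbl[idx] != UNKNOWN || btm_tbl[idx] != ILLEGAL)
            continue;

        int n = noncapture_successors(idx, 1, succ);
        int escape = 0;
        for (int i = 0; i < n; i++)
            if (btm_tbl[succ[i]] != ILLEGAL) { escape = 1; break; }

        if (!escape)
            wtm_tbl[idx] = LOSS;   /* mate; the retro step then marks the parents as wins */
    }
    /* The black-to-move side is handled symmetrically in the same pass. */
}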
syzygy
Posts: 5840
Joined: Tue Feb 28, 2012 11:56 pm

Re: price of complex indexing

Post by syzygy »

Daniel Shawul wrote:Draws can be gotten in iteration 0 from captures and promotions but are never backpassed after that.
I don't either.
When I do verification it of course updates with draws too. But I don't include captures in the verification (it does non-captures only, and so does the retro move). The captures make their contribution in the first pass and are never done again. Maybe this is wrong somehow, but all I know is that it gets the same numbers, at least for 4 men.
I said you did not have a draw state, but of course you do have one. Your draw state seems to be identical to my draw state. You use it to mark positions that are at least a draw because of a capture that draws. This way you don't have to look at capture moves when verifying a potential loss.
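To make the role of the draw state concrete, here is a minimal sketch of a potential-loss verification that never touches captures, reusing the hypothetical state names and successor helper from the sketch earlier in the thread; which exact states count as "won for the opponent" differs per generator:

Code: Select all

/* A position can only still be POTLOSS if the first (capture/promotion) pass
   found no capture that saves it: a drawing capture would have set DRAW, a
   winning capture a win state.  So confirming the loss only needs the
   non-capture successors; captures are never examined again. */
static int confirmed_loss(uint64_t idx, int white_to_move, const uint8_t *opp_tbl)
{
    uint64_t succ[256];
    int n = noncapture_successors(idx, white_to_move, succ);

    for (int i = 0; i < n; i++) {
        if (opp_tbl[succ[i]] == ILLEGAL)
            continue;                /* not actually a legal move */
        if (opp_tbl[succ[i]] != WIN)
            return 0;                /* a non-losing non-capture reply exists */
    }
    return 1;                        /* every legal non-capture reply is won for the opponent */
}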
syzygy
Posts: 5840
Joined: Tue Feb 28, 2012 11:56 pm

Re: price of complex indexing

Post by syzygy »

Daniel Shawul wrote:I am curious what your L1 and L2 miss rates are for 3-men and 4-men generation, to see if there is hope for cache optimizations anyway. I don't think we can compare numbers from different programs, but just the effect of the exponential increase in tablebase size would be interesting to know.
Generation of KQvKR, using 8 hyperthreads:

Code: Select all

       1528.957251 task-clock                #    6.631 CPUs utilized          
             1,591 context-switches          #    0.001 M/sec                  
               120 CPU-migrations            #    0.078 K/sec                  
               746 page-faults               #    0.488 K/sec                  
     4,194,069,138 cycles                    #    2.743 GHz                     [24.87%]
     2,549,534,087 stalled-cycles-frontend   #   60.79% frontend cycles idle    [24.64%]
     1,281,299,091 stalled-cycles-backend    #   30.55% backend  cycles idle    [24.82%]
     4,334,027,196 instructions              #    1.03  insns per cycle        
                                             #    0.59  stalled cycles per insn [31.12%]
       939,402,282 branches                  #  614.407 M/sec                   [31.34%]
        35,407,002 branch-misses             #    3.77% of all branches         [31.82%]
     1,104,497,694 L1-dcache-loads           #  722.386 M/sec                   [32.66%]
        42,527,465 L1-dcache-load-misses     #    3.85% of all L1-dcache hits   [32.56%]
         1,840,449 LLC-loads                 #    1.204 M/sec                   [25.86%]
         1,077,595 LLC-load-misses           #   58.55% of all LL-cache hits    [26.09%]
   <not supported> L1-icache-loads         
         2,013,521 L1-icache-load-misses     #    0.00% of all L1-icache hits   [25.57%]
     1,142,266,718 dTLB-loads                #  747.089 M/sec                   [25.49%]
         1,478,513 dTLB-load-misses          #    0.13% of all dTLB cache hits  [25.60%]
         1,064,172 iTLB-loads                #    0.696 M/sec                   [25.26%]
           334,924 iTLB-load-misses          #   31.47% of all iTLB cache hits  [25.15%]
   <not supported> L1-dcache-prefetches    
         9,061,017 L1-dcache-prefetch-misses #    5.926 M/sec                   [25.04%]

       0.230572191 seconds time elapsed
Generation of KQRvKQ, using 8 hyperthreads:

Code: Select all

     100967.414333 task-clock                #    7.786 CPUs utilized          
            13,903 context-switches          #    0.138 K/sec                  
               193 CPU-migrations            #    0.002 K/sec                  
             2,841 page-faults               #    0.028 K/sec                  
   279,682,778,294 cycles                    #    2.770 GHz                     [25.00%]
   178,183,358,936 stalled-cycles-frontend   #   63.71% frontend cycles idle    [24.99%]
    82,284,227,902 stalled-cycles-backend    #   29.42% backend  cycles idle    [24.98%]
   309,427,033,521 instructions              #    1.11  insns per cycle        
                                             #    0.58  stalled cycles per insn [31.24%]
    72,417,633,288 branches                  #  717.238 M/sec                   [31.24%]
     1,272,324,811 branch-misses             #    1.76% of all branches         [31.25%]
    96,098,230,306 L1-dcache-loads           #  951.775 M/sec                   [31.26%]
     3,172,712,692 L1-dcache-load-misses     #    3.30% of all L1-dcache hits   [31.28%]
       170,081,246 LLC-loads                 #    1.685 M/sec                   [25.03%]
        51,718,899 LLC-load-misses           #   30.41% of all LL-cache hits    [25.02%]
   <not supported> L1-icache-loads         
       128,304,130 L1-icache-load-misses     #    0.00% of all L1-icache hits   [25.01%]
    96,046,752,523 dTLB-loads                #  951.265 M/sec                   [25.00%]
       699,025,050 dTLB-load-misses          #    0.73% of all dTLB cache hits  [25.00%]
        73,309,711 iTLB-loads                #    0.726 M/sec                   [25.01%]
         1,554,304 iTLB-load-misses          #    2.12% of all iTLB cache hits  [25.01%]
   <not supported> L1-dcache-prefetches    
       864,940,310 L1-dcache-prefetch-misses #    8.567 M/sec                   [25.01%]

      12.968096780 seconds time elapsed
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: price of complex indexing

Post by Daniel Shawul »

Unless I misread the results, it is somewhat surprising that the LLC misses actually decreased from 58.5% to 30%. You would expect more misses for KQRvKQ than for KQvKR, but it could just be that the measurement is not accurate for the 4-men generation, since less time is spent on it. I am running cachegrind on KQkr, which is extremely slow. KQRkq will take very long, so I don't think I will do that.
Edit
Well, my numbers for KQkr are not very telling, as I don't seem to have considerable cache misses anyway. Maybe it is just cachegrind. Anyway, more if/when KQRkq finishes.
Code: Select all

==31677==
==31677== I refs: 47,992,652,545
==31677== I1 misses: 4,832,577
==31677== L2i misses: 3,980
==31677== I1 miss rate: 0.01%
==31677== L2i miss rate: 0.00%
==31677==
==31677== D refs: 17,189,436,209 (12,765,330,115 rd + 4,424,106,094 wr )
==31677== D1 misses: 11,891,827 ( 8,838,240 rd + 3,053,587 wr )
==31677== L2d misses: 3,335,584 ( 3,155,709 rd + 179,875 wr )
==31677== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==31677== L2d miss rate: 0.0% ( 0.0% + 0.0% )
==31677==
==31677== L2 refs: 16,724,404 ( 13,670,817 rd + 3,053,587 wr )
==31677== L2 misses: 3,339,564 ( 3,159,689 rd + 179,875 wr )
==31677== L2 miss rate: 0.0% ( 0.0% + 0.0% )
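For completeness, the kind of invocation this output comes from (the binary name and argument are placeholders); cachegrind simulates the cache hierarchy in software, which is why it slows generation down so much compared to perf's hardware counters:

Code: Select all

valgrind --tool=cachegrind ./generate KQkr
cg_annotate cachegrind.out.<pid>      # per-function breakdown of the simulated misses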
Last edited by Daniel Shawul on Sat Jul 28, 2012 5:50 pm, edited 2 times in total.
syzygy
Posts: 5840
Joined: Tue Feb 28, 2012 11:56 pm

Re: price of complex indexing

Post by syzygy »

Hmm, for the above I used "perf stat -d -d -d". If I use "perf stat -d -d" I get very similar numbers, except for:

Code: Select all

        65,703,566 LLC-loads                 #    0.647 M/sec
            73,131 LLC-load-misses           #    0.11% of all LL-cache hits
The reported number of last-level cache loads (that would be L3 cache loads) goes down considerably. The reported number of last-level cache misses goes down enormously.

I am not sure what "-d -d" and "-d -d -d" are supposed to do exactly. I just copied the "-d -d -d" from a webpage.

edit:
Just using "-d" I get:

Code: Select all

        71,495,603 LLC-loads                 #    0.711 M/sec                   
        18,642,996 LLC-load-misses           #   26.08% of all LL-cache hits
Weird.
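For what it is worth, each extra -d asks perf stat for another group of "detailed" events (it can be given up to three times), and with that many groups the counters are time-multiplexed, which is what the bracketed percentages in the earlier output indicate; that multiplexing may well be part of why the LLC numbers jump around. A possible side-by-side comparison (the generator name is only a placeholder):

Code: Select all

perf stat -d ./generate KQRvKQ          # one group of detailed cache events
perf stat -d -d ./generate KQRvKQ       # two groups
perf stat -d -d -d ./generate KQRvKQ    # three groups (what produced the output above)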
syzygy
Posts: 5840
Joined: Tue Feb 28, 2012 11:56 pm

Re: price of complex indexing

Post by syzygy »

syzygy wrote:
hgm wrote:That is a lot, and seems much more than could be explained from the execution time of the few extra instructions. This suggests it actually decreases cache efficiency.
Well, one problem is that I did the transformation also in the main loop when looping over the table looking for work to do. When I loop through the table linearly, and do the reverse transformation before doing the work, things get a bit faster. Still, it is 10% slower.

A second problem is that my transformation might not sufficiently solve the problem, so I replaced (idx ^ (idx >> 12)) by (idx + (idx >> 12)) (with appropriate masking to make things work correctly). However, this only slows down things further.

I've measured cache-misses and they do indeed increase.
That was with "perf stat -e cache-misses".

Measuring just the number of L1 dcache misses shows that twisting of the index lowers the miss rate from 3.18% to 2.35% for KQRvKQ. However, this does not offset the extra computations.

Measuring LLC-loads and LLC-load-misses using perf seems to be broken.
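For readers following along, a sketch of the XOR transform quoted above and one way to invert it when walking the transformed table linearly and undoing the transform before doing the work. The 12-bit shift is the one from the post; everything else is illustrative, and a table whose size is not a power of two needs the extra masking the post alludes to:

Code: Select all

#include <stdint.h>

static inline uint64_t twist(uint64_t idx)
{
    return idx ^ (idx >> 12);        /* XOR the higher index bits into the lower ones */
}

/* Inverse of twist(): the top 12 bits pass through unchanged, so the rest can
   be recovered by repeating the shift with doubled distances (12, 24, 48 is
   enough for a 64-bit index). */
static inline uint64_t untwist(uint64_t t)
{
    t ^= t >> 12;
    t ^= t >> 24;
    t ^= t >> 48;
    return t;
}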
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: price of complex indexing

Post by Daniel Shawul »

Maybe it is calculating the percentages from different references.
I am still waiting on the KQRKQ cache statistics, as cachegrind slows down generation by more than 20x. KQKR fits in about 4 MB of RAM while KQRKQ uses about 249 MB, so I expect the LLC misses to increase from the non-telling 0% figure I got previously. In the meantime I did some tests to compare the effect of verification. In general I get:
a) Using a single-bit visit indicator is 3x slower than using a full byte (sketched below).
b) The beta cutoff on non-visited positions speeds up verification so much that I got a 3x slowdown when I look for WINs/DRAWs all the way through the verification list anyway. So I guess my overhead is the move generation, since it runs into so many beta cutoffs. Incremental move generation could even be used here, since there will be too many cutoffs. I will calculate direct cutoff statistics and see if I can improve it further. The 3x slowdown indicates there is potential for some optimization in this area.
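A small sketch of the two bookkeeping schemes from point (a), with placeholder arrays: the single-bit version touches an eighth of the memory, but every mark becomes a read-modify-write with shifts and masks, which is one plausible contributor to the slowdown reported above:

Code: Select all

#include <stdint.h>

extern uint8_t *visited_byte;      /* one byte per position */
extern uint8_t *visited_bits;      /* one bit per position  */

static inline void mark_byte(uint64_t idx) { visited_byte[idx] = 1; }
static inline int  test_byte(uint64_t idx) { return visited_byte[idx]; }

static inline void mark_bit(uint64_t idx)
{
    visited_bits[idx >> 3] |= (uint8_t)(1u << (idx & 7));   /* read-modify-write */
}
static inline int test_bit(uint64_t idx)
{
    return (visited_bits[idx >> 3] >> (idx & 7)) & 1;
}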
syzygy
Posts: 5840
Joined: Tue Feb 28, 2012 11:56 pm

Re: price of complex indexing

Post by syzygy »

Daniel Shawul wrote:Incremental move generation could even be used here, since there will be too many cutoffs. I will calculate direct cutoff statistics and see if I can improve it further. The 3x slowdown indicates there is potential for some optimization in this area.
You shouldn't generate and store lists of moves at all. Just generate destination squares and use those to calculate the indices you need to check. Don't re-use a move generator from a normal chess engine.
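A hedged sketch of what that could look like with bitboards: walk the attack set of the moving piece, pop destination squares, and turn each one directly into a table index to probe, stopping as soon as a probe returns the value being looked for. attacks(), make_index() and the squares array are placeholders for whatever the generator already has; this is not syzygy's actual code:

Code: Select all

#include <stdint.h>

extern uint64_t attacks(int piece_type, int from, uint64_t occ);   /* placeholder */
extern uint64_t make_index(const int *squares);                    /* placeholder */

/* Probe the non-capture successors reached by moving piece p and stop as soon
   as one of them has the value we are looking for; no move list is ever built. */
static int probe_piece_moves(int *squares, int p, int piece_type, uint64_t occ,
                             const uint8_t *opp_tbl, uint8_t want)
{
    int from = squares[p];
    uint64_t dests = attacks(piece_type, from, occ) & ~occ;   /* non-captures only */

    while (dests) {
        int to = __builtin_ctzll(dests);    /* lowest set bit = next destination */
        dests &= dests - 1;
        squares[p] = to;
        if (opp_tbl[make_index(squares)] == want) {
            squares[p] = from;
            return 1;                       /* cutoff */
        }
    }
    squares[p] = from;
    return 0;
}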
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: price of complex indexing

Post by Daniel Shawul »

syzygy wrote:
Daniel Shawul wrote:Incremental move generation could even be used here, since there will be too many cutoffs. I will calculate direct cutoff statistics and see if I can improve it further. The 3x slowdown indicates there is potential for some optimization in this area.
You shouldn't generate and store lists of moves at all. Just generate destination squares and use those to calculate the indices you need to check. Don't re-use a move generator from a normal chess engine.
Well, that would be like the Rybka-like optimization that does delta-pruned captures while generating moves, IIRC, or scoring during move generation, etc. ... Very bad code for my style, as I would like to keep it as simple as generating and storing the moves. Also, my test showed the 3x slowdown even though I did the move generation in _both_ cases. That indicates move generation is not the problem at all, but rather whether I check the results of all the positions or not ... That is why I want to calculate the cutoff statistics.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Cutoff rates

Post by Daniel Shawul »

I measured the percentage of cases where we do not get a cutoff when asking for verification. The number of cutoffs is pretty high, especially for some endgames, and those are the difficult ones. The miss rate is worst when one side is significantly stronger. But there you have it: more than 95% cutoffs, except in the cases where forward generation would have worked best.

Code: Select all

           Miss rate  = 100 - cutoff rate
KQkq   0.06%
KQkr   4.18%
KQkb   8.04%
KQkn   11.54%
KQkp   23.67%
KRkr   0.17%
KRkb   0.53%
etc
I am going to squeeze in whatever I can get from using incremental move generation now, as the cutoff rate is too high to ignore.
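As a follow-up to the table above, a sketch of how incremental generation could exploit those cutoff rates: handle one piece at a time and probe its destinations immediately (reusing the per-piece helper sketched after syzygy's suggestion), so that in the >95% of cases where a cutoff comes early, the moves of the remaining pieces are never generated at all. The piece bookkeeping helpers are placeholders:

Code: Select all

/* Returns 1 on a cutoff (some successor already has the value `want`),
   0 if the full verification is really needed. */
static int incremental_cutoff(int *squares, int stm_white, uint64_t occ,
                              const uint8_t *opp_tbl, uint8_t want)
{
    for (int i = 0; i < piece_count(stm_white); i++) {        /* placeholder bookkeeping */
        int p = piece_of(stm_white, i);
        if (probe_piece_moves(squares, p, piece_type_of(p), occ, opp_tbl, want))
            return 1;    /* early exit: the remaining pieces never generate anything */
    }
    return 0;
}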