Daniel Shawul
Re: price of complex indexing

syzygy wrote:
The test for whether a position is mate is as follows: if white is to move, the value in the white table is unknown, the value in the black position is illegal, and all non-capture moves bring you to a position (in the black table) with value illegal, then the position is mate. Otherwise, it is not.

Daniel Shawul wrote:
I also work on two sides in the same pass. But I am not sure I understand the rest. later

4. loop through each table (white, black) to find mates (set to loss). A position is mate if its value is unknown, the value of the corresponding position in the other table is illegal, and all non-capture moves lead to a position with value illegal. Set parents to newwin.

So what I do is loop through white and black at the same time. If one value equals unknown and the other equals illegal, then I loop through the moves for the appropriate side and stop once I find a legal move. If I don't find a legal move, I set the position from unknown to loss. (I know that there are no legal captures either, since otherwise the value of the position would have been potloss, newwin or draw and not unknown.)

As you see, I never need to verify capture moves: not when testing for mate, not when verifying potloss cases. That might be different in your generator (I'm not sure). This is the reason for the draw state.

I also do not verify with captures. All the hard work of captures/promotions, probably with disk access, is done in the first pass. From then on the backpass-verification steps work only with non-capture moves, transposing between positions within the tablebase that is being generated. So we do the same thing. I tell you, if we work out the difference in our handling of the bits (2 + 1 vs 3) that seems to imply different methods, we are talking about the same algorithm ...
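A minimal C sketch of the mate test quoted above, with illustrative names (tb_white, tb_black, noncapture_successors); the value names follow the states mentioned in the thread. The generators described here handle both sides in a single pass over the index; the sketch splits the two directions into separate calls for clarity.

Code:
/* tb_white/tb_black hold the values with white/black to move. */
enum { UNKNOWN, ILLEGAL, LOSS };

extern long tb_size;
extern unsigned char tb_white[], tb_black[];

/* fills succ[] with the indices (in the other side's table) reached by
   non-capture moves from position idx; returns how many there are */
extern int noncapture_successors(long idx, int white_to_move, long *succ);

static void find_mates(unsigned char *self, unsigned char *other, int wtm)
{
    long succ[128];

    for (long idx = 0; idx < tb_size; idx++) {
        /* candidate: no value yet, and the same position with the other
           side to move is illegal, i.e. the side to move is in check */
        if (self[idx] != UNKNOWN || other[idx] != ILLEGAL)
            continue;

        int n = noncapture_successors(idx, wtm, succ);
        int escape = 0;
        for (int i = 0; i < n; i++) {
            if (other[succ[i]] != ILLEGAL) {    /* a legal move exists */
                escape = 1;
                break;
            }
        }
        /* no legal non-capture move; a legal capture is impossible too,
           because it would already have set a value in the capture pass */
        if (!escape)
            self[idx] = LOSS;                   /* mate */
    }
}

Calling find_mates(tb_white, tb_black, 1) and find_mates(tb_black, tb_white, 0) covers both tables.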
-
syzygy
Re: price of complex indexing
Daniel Shawul wrote:
Draws can be gotten from iteration 0 from caps and promotions but are never backpassed after that.

I don't either.

Daniel Shawul wrote:
When I do verification, of course it updates with draws too. But I don't include captures in the verification (it does non-captures only, and so does the retro move). The captures make their contribution in the first pass and they are never done again. Maybe this is wrong somehow, but all I know is that it gets the same numbers, at least for 4 men.

I said you did not have a draw state, but of course you do have one. Your draw state seems to be identical to my draw state. You use it to mark positions that are at least a draw because of a capture that draws. This way you don't have to look at capture moves when verifying a potential loss.
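A rough C sketch of that first capture/promotion pass, using the state names mentioned in the thread (draw, potloss, newwin); best_capture_result() and the CAP_* values are illustrative stand-ins for probing the already finished sub-tables.

Code:
enum cap_result { CAP_NONE = -1, CAP_LOSS, CAP_DRAW, CAP_WIN };
enum value { UNKNOWN, ILLEGAL, DRAW, POTLOSS, NEWWIN };

/* best result over all legal captures/promotions, CAP_NONE if there are none */
extern enum cap_result best_capture_result(long idx, int wtm);

static void capture_pass(unsigned char *self, long tb_size, int wtm)
{
    for (long idx = 0; idx < tb_size; idx++) {
        if (self[idx] != UNKNOWN)
            continue;
        switch (best_capture_result(idx, wtm)) {
        case CAP_WIN:  self[idx] = NEWWIN;  break; /* winning capture: decided for good */
        case CAP_DRAW: self[idx] = DRAW;    break; /* at least a draw; captures never revisited */
        case CAP_LOSS: self[idx] = POTLOSS; break; /* only losing captures: verify later */
        case CAP_NONE: break;                      /* no captures: stays UNKNOWN */
        }
    }
}

After this pass, the back-propagation and verification loops only ever follow non-capture moves inside the table being generated.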
-
syzygy
Re: price of complex indexing
Daniel Shawul wrote:
I am curious what your miss rate for L1 and L2 is for 3-men and 4-men generation, to see if there is hope for cache optimizations anyway. I don't think we can compare numbers of different programs, but just the effect of the exponential increase of tablebase sizes would be interesting to know.

Generation of KQvKR, using 8 hyperthreads:

Code:
1528.957251 task-clock # 6.631 CPUs utilized
1,591 context-switches # 0.001 M/sec
120 CPU-migrations # 0.078 K/sec
746 page-faults # 0.488 K/sec
4,194,069,138 cycles # 2.743 GHz [24.87%]
2,549,534,087 stalled-cycles-frontend # 60.79% frontend cycles idle [24.64%]
1,281,299,091 stalled-cycles-backend # 30.55% backend cycles idle [24.82%]
4,334,027,196 instructions # 1.03 insns per cycle
# 0.59 stalled cycles per insn [31.12%]
939,402,282 branches # 614.407 M/sec [31.34%]
35,407,002 branch-misses # 3.77% of all branches [31.82%]
1,104,497,694 L1-dcache-loads # 722.386 M/sec [32.66%]
42,527,465 L1-dcache-load-misses # 3.85% of all L1-dcache hits [32.56%]
1,840,449 LLC-loads # 1.204 M/sec [25.86%]
1,077,595 LLC-load-misses # 58.55% of all LL-cache hits [26.09%]
<not supported> L1-icache-loads
2,013,521 L1-icache-load-misses # 0.00% of all L1-icache hits [25.57%]
1,142,266,718 dTLB-loads # 747.089 M/sec [25.49%]
1,478,513 dTLB-load-misses # 0.13% of all dTLB cache hits [25.60%]
1,064,172 iTLB-loads # 0.696 M/sec [25.26%]
334,924 iTLB-load-misses # 31.47% of all iTLB cache hits [25.15%]
<not supported> L1-dcache-prefetches
9,061,017 L1-dcache-prefetch-misses # 5.926 M/sec [25.04%]
       0.230572191 seconds time elapsed

Code:
100967.414333 task-clock # 7.786 CPUs utilized
13,903 context-switches # 0.138 K/sec
193 CPU-migrations # 0.002 K/sec
2,841 page-faults # 0.028 K/sec
279,682,778,294 cycles # 2.770 GHz [25.00%]
178,183,358,936 stalled-cycles-frontend # 63.71% frontend cycles idle [24.99%]
82,284,227,902 stalled-cycles-backend # 29.42% backend cycles idle [24.98%]
309,427,033,521 instructions # 1.11 insns per cycle
# 0.58 stalled cycles per insn [31.24%]
72,417,633,288 branches # 717.238 M/sec [31.24%]
1,272,324,811 branch-misses # 1.76% of all branches [31.25%]
96,098,230,306 L1-dcache-loads # 951.775 M/sec [31.26%]
3,172,712,692 L1-dcache-load-misses # 3.30% of all L1-dcache hits [31.28%]
170,081,246 LLC-loads # 1.685 M/sec [25.03%]
51,718,899 LLC-load-misses # 30.41% of all LL-cache hits [25.02%]
<not supported> L1-icache-loads
128,304,130 L1-icache-load-misses # 0.00% of all L1-icache hits [25.01%]
96,046,752,523 dTLB-loads # 951.265 M/sec [25.00%]
699,025,050 dTLB-load-misses # 0.73% of all dTLB cache hits [25.00%]
73,309,711 iTLB-loads # 0.726 M/sec [25.01%]
1,554,304 iTLB-load-misses # 2.12% of all iTLB cache hits [25.01%]
<not supported> L1-dcache-prefetches
864,940,310 L1-dcache-prefetch-misses # 8.567 M/sec [25.01%]
      12.968096780 seconds time elapsed

-
Daniel Shawul
Re: price of complex indexing
Unless I misread the results, it is somewhat surprising that the LLC miss rate actually decreased, from 58.5% to 30%. You would expect more misses for KQR.KQ than for KQ.KR, but it could just be that the measurement is not accurate for the 4-men run, since so little time is spent generating it. I am running cachegrind on KQkr, which is extremely slow. KQR.kr will take very long, so I don't think I will do that.
Edit
Well, my numbers for KQkr are not very telling, as I don't seem to have considerable cache misses anyway. Maybe it is just cachegrind. Anyway, more if/when KQRkr finishes.
==31677==
==31677== I refs: 47,992,652,545
==31677== I1 misses: 4,832,577
==31677== L2i misses: 3,980
==31677== I1 miss rate: 0.01%
==31677== L2i miss rate: 0.00%
==31677==
==31677== D refs: 17,189,436,209 (12,765,330,115 rd + 4,424,106,094 wr )
==31677== D1 misses: 11,891,827 ( 8,838,240 rd + 3,053,587 wr )
==31677== L2d misses: 3,335,584 ( 3,155,709 rd + 179,875 wr )
==31677== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==31677== L2d miss rate: 0.0% ( 0.0% + 0.0% )
==31677==
==31677== L2 refs: 16,724,404 ( 13,670,817 rd + 3,053,587 wr )
==31677== L2 misses: 3,339,564 ( 3,159,689 rd + 179,875 wr )
==31677== L2 miss rate: 0.0% ( 0.0% + 0.0% )
-
syzygy
Re: price of complex indexing
Hmm, for the above I used "perf stat -d -d -d". If I use "perf stat -d -d" I get very similar numbers, except for the following: the reported number of last-level cache loads (that would be L3 cache loads) goes down considerably, and the reported number of last-level cache misses goes down enormously.

Code:
        65,703,566 LLC-loads                 #    0.647 M/sec
            73,131 LLC-load-misses           #    0.11% of all LL-cache hits

I am not sure what "-d -d" and "-d -d -d" are supposed to do exactly. I just copied the "-d -d -d" from a webpage.

edit:
Just using "-d" I get:

Code:
        71,495,603 LLC-loads                 #    0.711 M/sec
        18,642,996 LLC-load-misses           #   26.08% of all LL-cache hits

Weird.

-
syzygy
Re: price of complex indexing
syzygy wrote:
Well, one problem is that I did the transformation also in the main loop, when looping over the table looking for work to do. When I loop through the table linearly and do the reverse transformation before doing the work, things get a bit faster. Still, it is 10% slower.

hgm wrote:
That is a lot, and seems much more than could be explained by the execution time of the few extra instructions. This suggests it actually decreases cache efficiency.

That was with "perf stat -e cache-misses".

A second problem is that my transformation might not sufficiently solve the problem, so I replaced (idx ^ (idx >> 12)) by (idx + (idx >> 12)) (with appropriate masking to make things work correctly). However, this only slows things down further.

I've measured cache-misses and they do indeed increase.

Measuring just the number of L1 dcache misses shows that twisting the index lowers the miss rate from 3.18% to 2.35% for KQRvKQ. However, this does not offset the extra computations.

Measuring LLC-loads and LLC-load-misses using perf seems to be broken.
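A minimal sketch of the index twist discussed above (the XOR variant with a shift of 12); the additive variant works analogously but needs masking to stay invertible. Names are illustrative.

Code:
#define TWIST_SHIFT 12

/* storage slot for work index idx */
static inline unsigned long twist(unsigned long idx)
{
    return idx ^ (idx >> TWIST_SHIFT);
}

/* exact inverse: fold in progressively larger shifts */
static inline unsigned long untwist(unsigned long t)
{
    t ^= t >> TWIST_SHIFT;
    t ^= t >> (2 * TWIST_SHIFT);
    t ^= t >> (4 * TWIST_SHIFT);   /* enough for any 64-bit index */
    return t;
}

As described above, the table can then be scanned linearly over the twisted index, calling untwist() once per entry to recover the position to work on.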
-
Daniel Shawul
Re: price of complex indexing
Maybe it is calculating the percentage from different references.

I am still waiting on the KQRKQ cache statistics, as cachegrind slows down generation by more than 20x. KQKR is done in 4 MB of RAM, while KQRKQ uses about 249 MB, so I expect the LLC miss rate to increase from the non-telling 0% figure I got previously. In the meantime I did some tests to compare the effect of verification. In general I get:
a) Using a single-bit visit indicator is slower by 3x compared to one that uses a full byte.
b) The beta-cutoff on non-visited positions speeds up verification so much that I got a 3x slowdown if I look for WINs/DRAWs anyway throughout the verification list. So I guess my overhead is the move generation, since it hits so many beta cutoffs. Incremental move generation could even be used here, since there will be so many cutoffs. I will calculate direct cutoff statistics and see if I can improve it further. The 3x slowdown indicates there is potential for some optimization in this area.
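For comparison, a tiny sketch (illustrative names) of the two representations of the per-position visit flag contrasted in (a); the extra shift-and-mask on every probe, and the read-modify-write needed for updates, are plausible reasons why the bit variant loses, though only the 3x factor itself is reported above.

Code:
#include <stdint.h>

/* one byte per position: a single indexed load */
static inline int visited_byte(const uint8_t *flags, long idx)
{
    return flags[idx];
}

/* one bit per position: 1/8 of the memory, extra shift and mask per probe */
static inline int visited_bit(const uint8_t *bits, long idx)
{
    return (bits[idx >> 3] >> (idx & 7)) & 1;
}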
-
syzygy
Re: price of complex indexing
Daniel Shawul wrote:
Incremental move generation could even be used here, since there will be so many cutoffs. I will calculate direct cutoff statistics and see if I can improve it further. The 3x slowdown indicates there is potential for some optimization in this area.

You shouldn't generate and store lists of moves at all. Just generate destination squares and use those to calculate the indices you need to check. Don't re-use a move generator from a normal chess engine.
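A small C sketch of this suggestion (illustrative names): iterate over destination squares from a precomputed attack bitboard and probe the table index for each one, returning as soon as one reply refutes the loss, instead of building and storing a move list first.

Code:
#include <stdint.h>

extern uint64_t knight_attacks[64];          /* precomputed attack sets */
extern unsigned char other_table[];          /* values with the other side to move */
extern long index_delta(int from, int to);   /* hypothetical: index change when the
                                                moving piece goes from -> to */
extern int is_win(unsigned char v);          /* value is a win for the side to move there */

/* return 1 as soon as one non-capture knight move reaches a position that is
   not won for the opponent, i.e. the earliest possible cutoff */
static int knight_has_escape(int from, uint64_t occupied, long base_idx)
{
    uint64_t dests = knight_attacks[from] & ~occupied;   /* non-captures only */
    while (dests) {
        int to = __builtin_ctzll(dests);                 /* pop one destination */
        dests &= dests - 1;
        if (!is_win(other_table[base_idx + index_delta(from, to)]))
            return 1;                                    /* cutoff */
    }
    return 0;
}

The same pattern extends to the sliding pieces with the usual attack-table lookups.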
-
Daniel Shawul
Re: price of complex indexing
syzygy wrote:
You shouldn't generate and store lists of moves at all. Just generate destination squares and use those to calculate the indices you need to check. Don't re-use a move generator from a normal chess engine.

Daniel Shawul wrote:
Incremental move generation could even be used here, since there will be so many cutoffs. I will calculate direct cutoff statistics and see if I can improve it further. The 3x slowdown indicates there is potential for some optimization in this area.

Well, that would be like the Rybka-like optimization that does delta-pruned captures while generating moves, IIRC, or scoring during move generation, etc. Very bad code for my style, as I would like to keep it as simple as generating and storing the moves. Also, my test showed the 3x slowdown even though I did the move generation in _both_ cases. That indicates move generation is not the problem at all, but whether I check the results of all the positions or not ... That is why I want to calculate the cutoff statistics.
-
Daniel Shawul
Cutoff rates
I measured the percentage of verification requests where we don't get a cutoff. The number of cutoffs is pretty high, especially for some endgames, and those are the difficult ones. The miss rate is worst when one side is significantly stronger. But there you have it: the cutoff rate is more than 95%, except in the cases where forward generation would have worked best.
I am going to squeeze out whatever I can from incremental move generation now, as the cutoff rate is too high to ignore; see the small counting sketch after the table.
Code:
Miss rate = 100 - cutoff rate
KQkq 0.06%
KQkr 4.18%
KQkb 8.04%
KQkn 11.54%
KQkp 23.67%
KRkr 0.17%
KRkb 0.53%
etc
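The counting behind these numbers can be as simple as two counters bumped in the verification routine (a sketch with illustrative names):

Code:
static unsigned long verif_calls;   /* verification requests */
static unsigned long verif_misses;  /* requests that ran through all moves without a cutoff */

static void count_verification(int got_cutoff)
{
    verif_calls++;
    if (!got_cutoff)
        verif_misses++;
}

/* miss rate = 100.0 * verif_misses / verif_calls; cutoff rate = 100 - miss rate */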