I think this can be beneficial when you are trying to access disk in a systematic way. The L1 cache can hold so few positions (one cache line holds 64 bytes = 256 positions). A full slice cannot possibly be held in cache. The best you can achieve is to have the moves of the most mobile piece alone (64 < 256) in cache when you do verification or forward move generation. The moves of the 2nd piece are 64 positions apart, the third's 64x64 apart, etc., so caching can only work for the least mobile piece. This is a bit different from generating 6- or 7-men tables there, because you access much more data in one go, i.e. a page size of 4k/8k compared to a cache line, and a larger "cache" which is RAM itself. Also, if you loop over the indices as I do, with a complex indexing function, and then make moves, the access will be very random except for whatever piece is put at the bottom of the index (which had better be the most mobile one). As you mentioned, it may be possible to make a systematic pass over the indices that re-uses children of a position probed by the first pass, but then the indexing has to be plain and simple, like a multi-dimensional array.
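To make the stride argument concrete, here is a minimal sketch (my illustration, not Daniel's actual code) assuming a plain base-64 index where each piece occupies one 6-bit field: only moves of the piece in the bottom field land near each other, while each higher piece's moves stride by a factor of 64 more.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 3-piece index: the most mobile piece occupies the low
 * 6 bits, so positions that differ only in that piece's square are at
 * most 63 entries apart and tend to share cache lines. */
static uint32_t index3(int sq_king, int sq_rook, int sq_mobile)
{
    return ((uint32_t)sq_king << 12) | ((uint32_t)sq_rook << 6)
         | (uint32_t)sq_mobile;
}

/* Distance in table entries when the piece in slot k moves one square
 * (slot 0 = bottom of the index). */
static uint32_t stride_of_piece(int k)
{
    return 1u << (6 * k);
}
```

With one byte per entry, slot 1 already strides a full 64-byte cache line per move, and slot 2 strides 4096 entries, which is why only the bottom piece profits from caching.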
Even an easy-looking operation like multiplying two matrices poses lots of problems for the cache. With A x B = C, A is accessed superbly but B is accessed with strides. So I do not expect something that involves making moves and calculating a new index, which can change the access pattern significantly, to gain much from this. The usual cache solution is to avoid strided access (i.e. transpose the second matrix) and then use tiles that fit in L1. The latter could be tried in tablebase generation too, by working on similar groups together, but a queen move and a pawn move result in positions very far apart in the index. I think the fact that I had to calculate indices from scratch after making a move is an indication that my indexing needs to change for better cache utilization. But I agree that for disk access it could help a lot, since there we are talking about larger page sizes.
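For reference, the transpose-plus-tiles fix mentioned above looks roughly like this (an illustrative sketch with arbitrary small sizes, not production code):

```c
#include <assert.h>
#include <string.h>

#define N 8      /* matrix dimension (illustrative) */
#define TILE 4   /* tile edge chosen so a tile fits in L1 */

/* C = A * B. B is transposed up front so the inner loop walks both
 * operands sequentially (no strided access), and the output is
 * processed in TILE x TILE blocks for cache reuse. */
static void matmul_tiled(double A[N][N], double B[N][N], double C[N][N])
{
    double BT[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            BT[j][i] = B[i][j];              /* avoid strided access to B */

    memset(C, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++) {
                    double s = 0.0;
                    for (int k = 0; k < N; k++)
                        s += A[i][k] * BT[j][k];  /* both rows sequential */
                    C[i][j] = s;
                }
}
```

The analogy to tablebase work breaks down exactly where the post says: a matrix tile maps to a contiguous index range, but the children of one position do not.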
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
-
syzygy
- Posts: 5895
- Joined: Tue Feb 28, 2012 11:56 pm
Re: price of complex indexing
Daniel Shawul wrote:
> Yes, bitboards are compact, but the problem is not the code size but that the data we are working on is big and randomly accessed. So it doesn't matter much if piece lists consume more memory. However, the bitscan (even with hardware) should make things slower for the non-capture type of move generation as is required for egtbs.

As I already said, at least for 5-piece tables any optimisation of the move/index generation does pay off. Memory and cpu cache apparently can keep up, even with 8 hyperthreads.
Daniel Shawul wrote:
> Bitboards are faster only for captures, which was also the case for my chess engine.

Again, going from 0x88 to bitboards for move generation increased generation speed quite a lot for my generator (at least 25%). And my generator is not slow. Magic bitboards are really fast. Bitscan is really fast. No need to check whether a square is already occupied. Bitboards allow you to get rid of the board array that you do need for 0x88.
Generating captures is of course not needed at all when doing backward generation.
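For illustration, the bitscan-driven serialization loop that makes per-square occupancy checks unnecessary can be sketched like this (a generic example using the GCC/Clang `__builtin_ctzll` intrinsic, not syzygy's actual generator; the target bitboard is assumed to have occupied squares already masked out):

```c
#include <assert.h>
#include <stdint.h>

/* Turn a bitboard of target squares into a move list using a hardware
 * bitscan (count trailing zeros). No board[] array is consulted:
 * illegal destinations were removed from `targets` by masking. */
static int serialize_moves(int from, uint64_t targets, int moves[64][2])
{
    int n = 0;
    while (targets) {
        int to = __builtin_ctzll(targets);  /* index of lowest set bit */
        moves[n][0] = from;
        moves[n][1] = to;
        n++;
        targets &= targets - 1;             /* clear lowest set bit */
    }
    return n;
}
```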
-
syzygy
Re: price of complex indexing
Daniel Shawul wrote:
> syzygy wrote:
> > Daniel Shawul wrote:
> > > Well, the point is that move generation speed didn't matter even when I was doing everything in RAM. You make so many random accesses to RAM (pretty much like a hash table in chess engines) that they will eventually be your bottleneck.
> >
> > Actually, the probing is not so random at all. If idx and idx+1 are in the same cache line, indices reached from position idx generally are in cache lines also reached from position idx+1. (There might be a problem though with such cache lines mapping to the same cache entry, since they are at distances of (a multiple of) a power of two.)
>
> No, it is very random even after you make a single move. Note that you are not iterating over the indices, but over positions you get from a one-ply search... Think of the forward generator that checks the values of each of the resulting positions to get the parent's score.

Maybe you should read a bit better what I write.
I believe the TLB thrashing problem mentioned by Vincent is not much of a problem on Linux, which will allocate huge pages for the tables.
-
Daniel Shawul
Re: price of complex indexing
syzygy wrote:
> Daniel Shawul wrote:
> > Yes, bitboards are compact, but the problem is not the code size but that the data we are working on is big and randomly accessed. So it doesn't matter much if piece lists consume more memory. However, the bitscan (even with hardware) should make things slower for the non-capture type of move generation as is required for egtbs.
>
> As I already said, at least for 5-piece tables any optimisation of the move/index generation does pay off. Memory and cpu cache apparently can keep up, even with 8 hyperthreads.

Well, I don't know how you did it, but using magic move generation fills up your L1/L2 already. Why would you want to do that when you have something with no tables like 0x88, and no bitscans as for bitboards? Your program and its stack will be in cache, bitboards or not; simply put, a speed-up from getting rid of a board[128] is bullshit. We are not running multiple copies of the program. Your large data, however, requires smart accessing. My move generation may start to have an impact when I have naive indexing.
-
syzygy
Re: price of complex indexing
Daniel Shawul wrote:
> Well, I don't know how you did it, but using magic move generation fills up your L1/L2 already. Why would you want to do that when you have something with no tables like 0x88, and no bitscans as for bitboards?

Because it is significantly faster?
Daniel Shawul wrote:
> Your program and its stack will be in cache, bitboards or not; simply put, a speed-up from getting rid of a board[128] is bullshit.

How about: I do know what I am talking about? I'm not just blabbing without trying it first.
-
Daniel Shawul
Re: price of complex indexing
syzygy wrote:
> Daniel Shawul wrote:
> > syzygy wrote:
> > > Daniel Shawul wrote:
> > > > Well, the point is that move generation speed didn't matter even when I was doing everything in RAM. You make so many random accesses to RAM (pretty much like a hash table in chess engines) that they will eventually be your bottleneck.
> > >
> > > Actually, the probing is not so random at all. If idx and idx+1 are in the same cache line, indices reached from position idx generally are in cache lines also reached from position idx+1. (There might be a problem though with such cache lines mapping to the same cache entry, since they are at distances of (a multiple of) a power of two.)
> >
> > No, it is very random even after you make a single move. Note that you are not iterating over the indices, but over positions you get from a one-ply search... Think of the forward generator that checks the values of each of the resulting positions to get the parent's score.
>
> Maybe you should read a bit better what I write.

The positions for the most mobile piece can be in cache, but the rest are not, so I do understand what you write. But in a single move generation you have accesses everywhere (king moves, pawn moves, and all the rest of the pieces except the last one), which is close to random. After I do a retro move I go forward again, so how can the cache help in this case? I can only see a pattern in simple forward TB generation, which nobody uses anyway. It is not as random as a randomly hashed TT table, of course. Yes, you can get a good cache effect from placing the last piece in the least significant part of the index, but it is not the only piece moving. Also, this is L1/L2 we are talking about, which may already be busy caching the program's other needs, e.g. the magic tables.
-
Daniel Shawul
Re: price of complex indexing
syzygy wrote:
> Daniel Shawul wrote:
> > Well, I don't know how you did it, but using magic move generation fills up your L1/L2 already. Why would you want to do that when you have something with no tables like 0x88, and no bitscans as for bitboards?
>
> Because it is significantly faster?

Just because you say so doesn't mean it is true for everyone. How about giving a reason why it is? I mentioned that an int[128] will be in cache anyway, so it is wrong to put forward the reason that 0x88 is bulky. You are running a single instance of the program. Move generation in chess engines is fast with 0x88, especially when captures are not involved. I can only see it helping attacks(), which could be the cause if you see speed-ups...
-
syzygy
Re: price of complex indexing
Daniel Shawul wrote:
> syzygy wrote:
> > Maybe you should read a bit better what I write.
>
> The positions for the most mobile piece can be in cache, but the rest are not, so I do understand what you write. But in a single move generation you have accesses everywhere (king moves, pawn moves, and all the rest of the pieces except the last one), which is close to random.

No, you do not understand. Starting from a position with index idx, all parent positions (in the other table) have an index idx + A * 2^(6*k) for some A and some k. The parent positions for the next position, with index idx+1, are then of the form (idx+1) + A * 2^(6*k), often for the same A and k. Those have a high probability of being in cache. Access is not random at all.
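Syzygy's index arithmetic can be demonstrated directly, assuming a plain base-64 index with one 6-bit field per piece (the function names here are mine, for illustration only): the offset to a parent depends only on which piece moved and between which squares, so consecutive child indices probe consecutive parent addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Plain base-64 index: piece k occupies bits [6k, 6k+6).
 * (Illustrative; real indexing schemes are more compact.) */
static int64_t parent_offset(int k, int from_sq, int to_sq)
{
    /* This is the A * 2^(6*k) term from the post: it depends only on
     * the moved piece and its squares, not on the rest of the index. */
    return (int64_t)(to_sq - from_sq) * (1LL << (6 * k));
}

/* Parent reached from idx by retracting a move of piece k. */
static uint64_t parent_index(uint64_t idx, int k, int from_sq, int to_sq)
{
    return idx + (uint64_t)parent_offset(k, from_sq, to_sq);
}
```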
Daniel Shawul wrote:
> After I do a retro move, I go forward again

You shouldn't. Just mark the position as a potential loss and leave the verification to the next pass. That will also reduce the total number of verifications needed, since there will often be multiple moves that lose in the same number of plies.
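The deferred-verification idea can be sketched like this (a toy illustration, not syzygy's actual generator); it also shows why work is deduplicated: several winning children may mark the same parent, but the sequential sweep verifies it only once.

```c
#include <assert.h>

enum { UNKNOWN = 0, POTENTIAL_LOSS = 1, LOSS = 2 };

#define TB_SIZE 16   /* toy table; a real one has billions of entries */

static unsigned char tb[TB_SIZE];
static int verifications;   /* count of (expensive) forward searches */

/* Phase 1: a newly found win just marks each retro-parent. */
static void mark_parents(const int *parents, int n)
{
    for (int i = 0; i < n; i++)
        tb[parents[i]] = POTENTIAL_LOSS;   /* idempotent: no re-work */
}

/* Stand-in for the forward one-ply verification search. */
static int verify_is_loss(int idx) { verifications++; (void)idx; return 1; }

/* Phase 2: one sequential sweep verifies each candidate exactly once,
 * no matter how many winning children marked it during phase 1. */
static void verify_pass(void)
{
    for (int idx = 0; idx < TB_SIZE; idx++)
        if (tb[idx] == POTENTIAL_LOSS)
            tb[idx] = verify_is_loss(idx) ? LOSS : UNKNOWN;
}
```

The sweep in phase 2 is also the cache-friendly part: it walks the table in index order instead of jumping to wherever a retro move landed.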
-
syzygy
Re: price of complex indexing
Daniel Shawul wrote:
> Just because you say so doesn't mean it is true for everyone. How about giving a reason why it is?

Clearly the reason is that it is more efficient! Fewer instructions executed, including fewer branches. Magic move generation must have good cache locality, because it is unbeatable by other techniques that use much smaller arrays.
Daniel Shawul wrote:
> I mentioned that an int[128] will be in cache anyway, and it is wrong to put forward the reason that 0x88 is bulky.

That's not the reason I gave... You need to maintain it. That takes time. If you make that unnecessary, you save time. Saving time means the generation speeds up. Q.E.D.?
Look, it's fine if you don't think it will work for your program, but just saying that it is all wrong and bullshit doesn't look too good on you, at least not in my eyes.
Daniel Shawul wrote:
> You are running a single instance of the program. Move generation in chess engines is fast with 0x88, especially when captures are not involved. I can only see it helping attacks(), which could be the cause if you see speed-ups...

Again, if you generate captures for tablebase generation you are doing something wrong. I'm not sure what you mean by "you are running a single instance of the program".
My generator generates and compresses all 3-4-5 piece WDL50+ and DTZ50+ tables in 2 hours 20 minutes on my laptop. Most of that time is spent on compression. Actual generation time is less than an hour (maybe 45 minutes or so). I have no numbers yet for 6 pieces.
-
Daniel Shawul
Re: price of complex indexing
syzygy wrote:
> Daniel Shawul wrote:
> > syzygy wrote:
> > > Maybe you should read a bit better what I write.
> >
> > The positions for the most mobile piece can be in cache, but the rest are not, so I do understand what you write. But in a single move generation you have accesses everywhere (king moves, pawn moves, and all the rest of the pieces except the last one), which is close to random.
>
> No, you do not understand. Starting from a position with index idx, all parent positions (in the other table) have an index idx + A * 2^(6*k) for some A and some k. The parent positions for the next position, with index idx+1, are then of the form (idx+1) + A * 2^(6*k), often for the same A and k. Those have a high probability of being in cache. Access is not random at all.

First, cache works if you access the same data right away; that is lost by the random access you make after each move generation. Doing a 2-ply search like I do adds more trouble to that. Also, this kind of sequential access can only help in the initial pass, since after the 2nd iteration or so you have holes in the array for most tablebases, unlike the case where one side has many pieces, e.g. KQQK. In short, caching works best where forward generation works best, which is usually the simplest tablebases. On the first iteration I do captures/promotions, which could incur significant slowdowns anyway due to possible disk access.
syzygy wrote:
> Daniel Shawul wrote:
> > After I do a retro move, I go forward again
>
> You shouldn't. Just mark the position as a potential loss and leave the verification to the next pass. That will also reduce the total number of verifications needed, since there will often be multiple moves that lose in the same number of plies.

I use two methods: one does no verification at all but uses counters and a backward pass only. I use the verification search only when generating 6-men, when I can't afford a byte for the counters. So I mark the position as visited and do two passes.
First pass: lost-for-black -> win-for-white; same as with counters, no verification needed.
Second pass: win-for-black -> potential-loss-for-white, for which I do the verification right away. Verification is fast most of the time because it takes only one unexamined position to get a beta cutoff; the overhead in that case is usually the move generation. There are no counters to decrement, which makes adding a third pass inappropriate. In my case, a node which has generated retro-moves to mark positions as potential losses will be marked not to generate retro-moves again, as it has already made its contribution. This sounds like a good way of doing the backward generation method with a 1-bit counter/visit bitmap.
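The counter-based backward method with no verification that Daniel's first method uses can be sketched generically like this (a toy illustration; the names and sizes are mine): one counter per position tracks how many moves are not yet known to lose, so a loss is detected by decrementing to zero rather than by a forward search.

```c
#include <assert.h>

#define TBN 8
enum { UNRESOLVED = 0, WIN = 1, LOSS = 2 };

static unsigned char val[TBN];
static unsigned char moves_left[TBN];  /* moves not yet known to lose */

/* Counter method: when a child becomes a LOSS (for the side to move
 * there), every retro-parent wins at once. When a child becomes a WIN,
 * each parent just decrements its counter; reaching zero means every
 * move is refuted, so the parent is a LOSS - no verification search. */
static void child_resolved(int parent, int child)
{
    if (val[parent] != UNRESOLVED)
        return;
    if (val[child] == LOSS) {
        val[parent] = WIN;                 /* one winning move suffices */
    } else if (val[child] == WIN) {
        if (--moves_left[parent] == 0)
            val[parent] = LOSS;            /* all moves refuted */
    }
}
```

The trade-off discussed in the thread is exactly the byte per position that `moves_left` costs; the verification-search variant replaces it with the 1-bit visited/marked map.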