pawn enumeration

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: price of complex indexing

Post by diep »

Daniel Shawul wrote:

Code:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 22.96      5.08     5.08 80073496     0.00     0.00  ENUMERATOR::get_index(unsigned long&, bool)
 21.67      9.87     4.79        8     0.60     1.94  ENUMERATOR::backward_pass(unsigned char*, unsigned char*, unsigned char*, unsigned char*, unsigned long const&, unsigned long const&, bool)
 10.04     12.09     2.22 12024640     0.00     0.00  ENUMERATOR::get_retro_score(unsigned char*, unsigned char*, unsigned char*, unsigned char*, int, bool)
  9.64     14.22     2.13 195361391     0.00     0.00  SEARCHER::do_move(int const&)
  7.80     15.94     1.73 19882129     0.00     0.00  ENUMERATOR::get_pos(unsigned long)
  7.01     17.49     1.55 195361391     0.00     0.00  SEARCHER::undo_move(int const&)
  5.56     18.72     1.23 14874174     0.00     0.00  ENUMERATOR::get_init_score(unsigned char&, int, int, bool)
  5.29     19.89     1.17 210660367     0.00     0.00  SEARCHER::attacks(int, int) const
  3.44     20.65     0.76        8     0.10     0.77  ENUMERATOR::initial_pass(unsigned char*, unsigned char*, unsigned char*, unsigned char*, unsigned long const&, unsigned long const&, bool)
  2.96     21.31     0.66 14874174     0.00     0.00  SEARCHER::gen_all()
  1.36     21.61     0.30 12024640     0.00     0.00  SEARCHER::gen_retro()
  1.18     21.87     0.26        2     0.13    10.96  generate_slice(ENUMERATOR*, _IO_FILE*, _IO_FILE*)
  0.41     21.96     0.09                             print_sq(int const&)
  0.36     22.04     0.08                             SEARCHER::SEARCHER()
  0.27     22.10     0.06  4439741     0.00     0.00  SEARCHER::probe_bitbases(int&)
  0.05     22.11     0.01        7     0.00     0.00  ENUMERATOR::init()
  0.02     22.11     0.01                             SEARCHER::gen_noncaps()
I did a profile on generation of 4-men bitbases and guess what comes out at the top. Yes, it is the complex indexing, which is the price I pay for elegance. After making non-capture moves (forward or retro), I calculate the index of the position from _scratch_. This is because I used all kinds of tricks for indexing. If it was a simple 64x64x.. etc. indexing, it would have been possible to update incrementally by just taking out the contribution from the from-square and then adding the to-square contribution. I guess I will have to try naive indexing methods for speedy generation. I think it may even be possible to update incrementally with the indexing I have, which could give me some speedup. OTOH move generation is so fast despite what someone here might claim ;) If I did generation of 5 or 6 men involving disk access then move generation speed would be pretty irrelevant..
What I do is a tad more work. I have a function that loops in the same manner as the indexing works. So the way it puts pieces on the board is exactly the same as how the indexing works. That's WAY FASTER. Then for each lookup you do, you figure out where it is incrementally. The thing you don't want to do is sophisticated tricks. What you can't avoid is mirroring and handling n of the same men, as the savings from those are too huge. Koistinen wanted to save also on the 63 * 62 * etc. factors, and just use either 48 for pawns or 64 there,

but that's something you have to test yourself. With a few lookup tables you already get far there; you just replace the tables.

Note that most profilers do not correctly reflect how much you would lose to RAM once the code is optimized. Please remember this remark some time from now. The GCC/valgrind junk is notorious there; Intel's stuff does a tad better job.

p.s. Doing things backwards is too slow - do things forward, it's easier, faster and less bug-prone.
Last edited by diep on Fri Jul 27, 2012 4:07 pm, edited 3 times in total.
syzygy
Posts: 5554
Joined: Tue Feb 28, 2012 11:56 pm

Re: price of complex indexing

Post by syzygy »

Daniel Shawul wrote:I guess I will have to try naive indexing methods for speedy generation. I think it may even be possible to update incrementally with the indexing I have, which could give me some speedup. OTOH move generation is so fast despite what someone here might claim ;)
You don't really need to generate moves. Just loop through the pieces, generate the attacks to empty squares in a bitboard, loop through those squares and shift them into the correct position of the index. (Of course pawn moves are different, but those you do by slicing.)
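A minimal sketch of this "no move generation" idea for a single knight; the attack helper and the base/stride index layout are illustrative assumptions, not syzygy's actual generator code:

```cpp
#include <cstdint>
#include <vector>

// Knight attacks computed on the fly; only an occupancy bitboard is needed,
// no full board representation.
uint64_t knight_attacks(int sq) {
    static const int dr[] = { 2, 2, 1, 1, -1, -1, -2, -2 };
    static const int df[] = { 1, -1, 2, -2, 2, -2, 1, -1 };
    int r = sq / 8, f = sq % 8;
    uint64_t a = 0;
    for (int i = 0; i < 8; i++) {
        int nr = r + dr[i], nf = f + df[i];
        if (nr >= 0 && nr < 8 && nf >= 0 && nf < 8)
            a |= 1ULL << (nr * 8 + nf);
    }
    return a;
}

// For one knight in a naive 64^n-style index: every non-capture move changes
// only that piece's contribution, so each successor index is base + to*stride,
// with no move structs and no make/unmake at all.
std::vector<uint64_t> successor_indices(uint64_t occupied, int knight_sq,
                                        uint64_t base, uint64_t stride) {
    std::vector<uint64_t> out;
    uint64_t targets = knight_attacks(knight_sq) & ~occupied; // empty squares only
    while (targets) {
        int to = __builtin_ctzll(targets); // bitscan: extract next target square
        targets &= targets - 1;
        out.push_back(base + (uint64_t)to * stride);
    }
    return out;
}
```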
syzygy
Posts: 5554
Joined: Tue Feb 28, 2012 11:56 pm

Re: pawn enumeration

Post by syzygy »

diep wrote:Note there is overhead for that method, so it's not exactly 64 times faster. Let's say effectively a factor of 60.
A factor 60 compared to a naive implementation of forward generation.

The problem with forward generation is that you need to process each and every position in each iteration, whereas the overwhelming majority of those positions don't need any processing at all when using backward generation.

For KQQQvK, forward generation might be close to competitive. For KBNvKN, forward generation will perform awfully.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: price of complex indexing

Post by Daniel Shawul »

syzygy wrote:
Daniel Shawul wrote:I guess I will have to try naive indexing methods for speedy generation. I think it may even be possible to update incrementally with the indexing I have, which could give me some speedup. OTOH move generation is so fast despite what someone here might claim ;)
You don't really need to generate moves. Just loop through the pieces, generate the attacks to empty squares in a bitboard, loop through those squares and shift them into the correct position of the index. (Of course pawn moves are different, but those you do by slicing.)
Well, the point is that move generation speed didn't matter even when I was doing everything in RAM. You make so many random accesses to RAM (pretty much like a hash table in chess engines) that this will eventually be your bottleneck. For example, going from 3 men to 4 men, indexing became less of a bottleneck: it dropped from 40% to the 22% you see in the above post. Cache works better in the 3-men case, so my slow get_index has more effect there. I would presume for 5 men and 6 men its effect will go down further. BTW bitboards will be slow for movegen, as you have to do a bitscan. I use 0x88 to generate all moves or retro moves. For non-captures, piece lists are faster I think. We don't generate captures alone in EGTBs..

A clever indexing may help to reduce RAM access costs. From Vincent's posts I understand Koistinen's method can get a reduction in bandwidth by working on positions in the same cache line, i.e. if the algorithm is used for in-RAM generation too. Having the most mobile pieces take the lower bits of the index could help in generation.
I changed the streaming order so that queens (most mobile) are placed last. Earlier I just optimized for compression by trying different streaming orders and taking the one that gave the best compression overall. It turns out to be the opposite of what I had. Now I use this general streaming order.

Code:

static const int piece_order[2][12] = {
	{bpawn,wpawn,bking,wking,bknight,wknight,bbishop,wbishop,brook,wrook,bqueen,wqueen}, 
	{wpawn,bpawn,wking,bking,wknight,bknight,wbishop,bbishop,wrook,brook,wqueen,bqueen}
};
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: price of complex indexing

Post by Daniel Shawul »

diep wrote: What I do is a tad more work. I have a function that loops in the same manner as the indexing works. So the way it puts pieces on the board is exactly the same as how the indexing works. That's WAY FASTER. Then for each lookup you do, you figure out where it is incrementally. The thing you don't want to do is sophisticated tricks. What you can't avoid is mirroring and handling n of the same men, as the savings from those are too huge. Koistinen wanted to save also on the 63 * 62 * etc. factors, and just use either 48 for pawns or 64 there,

but that's something you have to test yourself. With a few lookup tables you already get far there; you just replace the tables.
For 6 men with no like pieces I have 462x62x61x60x59 ≈ 6.2G positions, while naive is 64^6 ≈ 68.7G, which is a saving of about 11x, most of it coming from the 8-fold symmetry. So you are right that mirroring is a must, but the rest is not. I think I will try to use his method for in-RAM generation, but that would need 8G of RAM. My goal was to get 6 men on 32-bit computers with less than 4G of RAM, but if I can get a lot of speed out of generation it will be worth it.
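The arithmetic above can be checked directly; 462 is the number of symmetry-reduced king placements, and the constant names here are just illustrative:

```cpp
#include <cstdint>

// 6 men, no like pieces: the two kings take one of the 462 symmetry-reduced
// placements, and the remaining four pieces fill 62*61*60*59 squares.
constexpr uint64_t kReduced = 462ULL * 62 * 61 * 60 * 59; // 6,185,385,360 (~6.2G)
constexpr uint64_t kNaive   = 1ULL << 36;                 // 64^6 = 68,719,476,736 (~68.7G)
constexpr uint64_t kSaving  = kNaive / kReduced;          // about 11x
```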
Note that most profilers do not correctly reflect how much you would lose to RAM once the code is optimized. Please remember this remark some time from now. The GCC/valgrind junk is notorious there; Intel's stuff does a tad better job.

p.s. Doing things backwards is too slow - do things forward, it's easier, faster and less bug-prone.
Even if I do forward, the RAM access would still be random. If Koistinen's method does the 8-fold symmetry and still "congregates" the positions after we make a move, then forward may be better, but I doubt it. Also, since retro moves are almost the same as forward moves, they would still benefit from whatever tricks you have for the latter. I really don't see how it could be faster. What I do is a combination of both actually. You can save a lot of effort by retro-moving lost positions to won, as that needs no verification at all. I also do retro moves from won positions to lost followed by verification, which is time consuming.
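The two propagation directions described above can be sketched on an abstract game graph; the graph representation and value names are illustrative, not taken from any of the generators discussed:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

enum Value : uint8_t { UNKNOWN, WIN, LOSS };

// Abstract game graph: succ[i] are positions reachable from i in one move,
// and pred[j] are the retro moves of j (the i with j in succ[i]).
struct Graph {
    std::vector<std::vector<int>> succ;
    std::vector<std::vector<int>> pred;
};

// One retrograde sweep. A predecessor of a LOSS is WIN outright, with no
// verification; a predecessor of a WIN becomes LOSS only after verifying
// that *all* its successors are WIN, which is the expensive step.
// Positions never resolved stay UNKNOWN (draws).
std::vector<Value> solve(const Graph& g, std::vector<Value> v) {
    std::queue<int> q;
    for (int i = 0; i < (int)v.size(); i++)
        if (v[i] != UNKNOWN) q.push(i);
    while (!q.empty()) {
        int j = q.front(); q.pop();
        for (int i : g.pred[j]) {
            if (v[i] != UNKNOWN) continue;
            if (v[j] == LOSS) {                 // cheap: any losing child wins
                v[i] = WIN; q.push(i);
            } else {                            // verification pass
                bool all_win = true;
                for (int s : g.succ[i]) all_win &= (v[s] == WIN);
                if (all_win) { v[i] = LOSS; q.push(i); }
            }
        }
    }
    return v;
}
```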
syzygy
Posts: 5554
Joined: Tue Feb 28, 2012 11:56 pm

Re: price of complex indexing

Post by syzygy »

Daniel Shawul wrote:Well, the point is that move generation speed didn't matter even when I was doing everything in RAM. You make so many random accesses to RAM (pretty much like a hash table in chess engines) that this will eventually be your bottleneck.
My experience is that optimisations in move generation (or rather "index generation") do have a noticeable impact on generation speed, at least for 5-piece tables.
BTW bitboards will be slow for movegen, as you have to do a bitscan. I use 0x88 to generate all moves or retro moves.
Going from 0x88 to bitboards improved the speed of my generator quite a lot.
For non-captures, piece lists are faster I think. We don't generate captures alone in EGTBs..
My generator does not use a full-blown board representation as required by chess engines. For attack generation you only need a bitboard with the occupied squares, so I generate this from the index by looping and shifting. I have no board[64] or board[128] array to maintain, nor bitboards for each piece type.
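A sketch of that decode loop for a naive 6-bits-per-piece index; the layout is an assumed example, not the indexing syzygy actually uses:

```cpp
#include <cstdint>

// Decode only the piece squares and the occupied bitboard from an index in
// which each piece takes 6 bits. No board[64] or board[128] array and no
// per-piece bitboards are maintained -- the occupancy alone is enough for
// attack generation.
struct Decoded {
    int sq[6];
    uint64_t occupied;
};

Decoded decode(uint64_t index, int npieces) {
    Decoded d{};
    for (int i = 0; i < npieces; i++) {
        d.sq[i] = (int)(index & 63);   // low 6 bits = this piece's square
        index >>= 6;
        d.occupied |= 1ULL << d.sq[i];
    }
    return d;
}
```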
A clever indexing may help to reduce RAM access costs. From vincent's posts I understand Koinstein's method can get a reduction on bandwidth by working on positions at the same cache line, i.e if the algorithm is used for in RAM generation too. The most mobile pieces taking the lower bits of the index could help in generation.
Koistinen proposes to use a 64^n-type indexing, but with the bits ordered differently within the index, and the squares permuted. I think this could potentially benefit my generator as well, but I haven't investigated yet if this is possible.
I changed the streaming order so that queens (most mobile) are placed last. Earlier I just optimized for compression by trying different streaming orders and taking the one that gave the best compression overall.
After generation, I look for a good permutation by compressing various permutations of subsets of the 5-piece table, then convert the table according to that permutation and compress. (Latency of memory lookups during this conversion can easily be overlapped with (de-)indexing computations by prefetching.)
syzygy
Posts: 5554
Joined: Tue Feb 28, 2012 11:56 pm

Re: price of complex indexing

Post by syzygy »

Daniel Shawul wrote:Well, the point is that move generation speed didn't matter even when I was doing everything in RAM. You make so many random accesses to RAM (pretty much like a hash table in chess engines) that this will eventually be your bottleneck.
Actually, the probing is not so random at all. If idx and idx+1 are in the same cache line, indices reached from position idx are generally in cache lines also reached from position idx+1. (There might be a problem, though, with such cache lines mapping to the same cache entry, since they are at distances of (a multiple of) a power of two.)

I believe the TLB thrashing problem mentioned by Vincent is not much of a problem on Linux, which will allocate huge pages for the tables.
hgm
Posts: 27701
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: price of complex indexing

Post by hgm »

Daniel Shawul wrote:You make so many random accesses to RAM (pretty much like a hash table in chess engines) that this will eventually be your bottleneck.
The trick is to localize the accesses and do as much as you can in cache. Like bringing a slice with a fixed constellation of all black pieces, but containing all possible white-piece constellations, into cache when you are making a pass doing white moves. Then all accesses will be in that slice, until you start processing the next black constellation, after which you are completely done with the slice until the next (black) pass.

And if that does not fit, split the pass in two, where you do moves of two white pieces during one pass, and those of the other two white pieces in the other pass, so you can also freeze the white pieces whose moves you are not handling in that pass. Note that there always is one piece that you cannot freeze, but that has to be cached in all possible locations even when you are not moving it, because cache lines are 64 bytes. Usually I take this to be the black King. So slices with two movable white pieces are in fact 3-men slices, containing 256K positions.

Kings can also help in slicing: although slices with a fixed King position will not be invariant under King moves, the Kings move pretty locally, so you can treat their moves in an order that never requires you to have more than 4 King slices in cache at any time, when you do half the King moves in one white pass, and the other half in the other white pass.
syzygy wrote:(There might be a problem, though, with such cache lines mapping to the same cache entry, since they are at distances of (a multiple of) a power of two.)
This can be avoided by using 64-byte spacer regions in between the higher slices. If you don't want to do that in the index (e.g. to keep it easy to mirror the board by bit manipulation on the index), you can use an index->address translation just before every access, like

address = index*01000101 >> 18;

This is still much faster than the access itself, so it will not cause a noticeable slowdown (as it will be computed in parallel).
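One simple way to realize such a spacer translation; this is not hgm's multiplier formula, just an assumed alternative with an explicit slice-size parameter k:

```cpp
#include <cstdint>

// Insert a 64-byte spacer after every 2^k index positions, so that slices
// which would otherwise sit exactly a power of two apart no longer map to
// the same cache set. A cheap shift-and-add computed before each access.
inline uint64_t spaced_address(uint64_t index, unsigned k) {
    return index + (index >> k) * 64;
}
```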
Last edited by hgm on Fri Jul 27, 2012 6:23 pm, edited 1 time in total.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: price of complex indexing

Post by Daniel Shawul »

syzygy wrote:
Daniel Shawul wrote:Well, the point is that move generation speed didn't matter even when I was doing everything in RAM. You make so many random accesses to RAM (pretty much like a hash table in chess engines) that this will eventually be your bottleneck.
My experience is that optimisations in move generation (or rather "index generation") do have a noticeable impact on generation speed, at least for 5-piece tables.
BTW bitboards will be slow for movegen, as you have to do a bitscan. I use 0x88 to generate all moves or retro moves.
Going from 0x88 to bitboards improved the speed of my generator quite a lot.
For non-captures, piece lists are faster I think. We don't generate captures alone in EGTBs..
My generator does not use a full-blown board representation as required by chess engines. For attack generation you only need a bitboard with the occupied squares, so I generate this from the index by looping and shifting. I have no board[64] or board[128] array to maintain, nor bitboards for each piece type.
Yes, bitboards are compact, but the problem is not the code size; it is that the data we are working on is big and randomly accessed. So it doesn't matter much if piece lists consume more memory. However, the bitscan (even with hardware support) should make things slower for the non-capture type of move generation required for EGTBs. Bitboards are faster only for captures, which was also the case in my chess engine. There I use a hybrid approach that generates captures with bitboards and the rest with piece lists. Once you get a direction, 0x88 is pretty fast, and I can't see why bitboards would be faster. I never tried bitboards alone, but it should be slower than what I use now...
A clever indexing may help to reduce RAM access costs. From Vincent's posts I understand Koistinen's method can get a reduction in bandwidth by working on positions in the same cache line, i.e. if the algorithm is used for in-RAM generation too. Having the most mobile pieces take the lower bits of the index could help in generation.
Koistinen proposes to use a 64^n-type indexing, but with the bits ordered differently within the index, and the squares permuted. I think this could potentially benefit my generator as well, but I haven't investigated yet if this is possible.
I think if it is possible to exploit the 8-fold symmetry and still have locality for the most mobile pieces, I will be tempted to use it at least for generation. My get_index() function is just like the evaluation in a search tree. The searches are 1 ply deep, so get_index is called branching-factor (BF) times more often than move generation, which is why it is critical to make it faster. I think if I used a simple indexing scheme then move generation would have more effect, but first the indexing has to be smart (by that I mean naive :) )...
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: price of complex indexing

Post by Daniel Shawul »

syzygy wrote:
Daniel Shawul wrote:Well, the point is that move generation speed didn't matter even when I was doing everything in RAM. You make so many random accesses to RAM (pretty much like a hash table in chess engines) that this will eventually be your bottleneck.
Actually, the probing is not so random at all. If idx and idx+1 are in the same cache line, indices reached from position idx are generally in cache lines also reached from position idx+1. (There might be a problem, though, with such cache lines mapping to the same cache entry, since they are at distances of (a multiple of) a power of two.)

I believe the TLB thrashing problem mentioned by Vincent is not much of a problem on Linux, which will allocate huge pages for the tables.
No, it is very random, even after you make a single move. Note that you are not iterating over the indices, but over the positions you get from a one-ply search... Think of the forward generator that checks the values of each of the resulting positions to get the parent's score. The same holds for retro moves, which is why I wonder why Vincent thinks forward will be better.

Code:

for (each move) {
    make_move();
    get_index();    // very critical
    undo_move();
    update_score();
}
Of course the parent positions you iterate over are in the same cache line, but that is not the issue. I do the get_index() from scratch since I use complex stuff (63x62x61 indexing, like pieces, legal king positions, etc.). If I just used 6 bits for every piece, it would be very fast of course, as all I need to do is concatenate the bits to get an index...
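For the simple concatenated index mentioned in the last sentence, the incremental update amounts to subtracting the from-square contribution and adding the to-square one; the 6-bits-per-piece layout here is an assumed example, not any generator's actual scheme:

```cpp
#include <cstdint>

// Naive concatenated index: piece i contributes sq[i] << (6*i). Moving
// piece i from 'from' to 'to' then needs no re-indexing from scratch:
// remove the old contribution and add the new one.
inline uint64_t update_index(uint64_t index, int piece, int from, int to) {
    return index - ((uint64_t)from << (6 * piece))
                 + ((uint64_t)to   << (6 * piece));
}
```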